Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-09-20 Thread Dave Chinner
On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
> On Mon, Sep 10, 2018 at 07:06:46PM +1000, Dave Chinner wrote:
> > On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> > > On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > > For future development I've abandoned the entire dedupe_file_range
> > > > > approach.  I need to be able to read and dedupe the data blocks of
> > > > > the filesystem directly without having to deal with details like which
> > > > > files those blocks belong to, especially on filesystems with lots of
> > > > > existing deduped blocks and snapshots. 
> > > > 
> > > > IOWs, your desired OOB dedupe algorithm is:
> > > > 
> > > > a) ask the filesystem where all it's file data is
> > > 
> > > Actually, it's "ask the filesystem where all the *new* file data is"
> > > since we don't want to read any unique data twice on subsequent runs.
> > 
> > Sorry, how do you read "unique data" twice? By definition, unique
> > data only occurs once
> 
> ...but once it has been read, we don't want to read it again.  Ever.
> Even better would be to read unique data less than 1.0 times on average.
> 
> > Oh, and you still need to hash the old data so you can find
> > collisions with the new data that got written. Unless, of course,
> > you are keeping your hash tree in a persistent database 
> 
> I do that.

OK, so you're slowly re-inventing the typical HSM implementation
that keeps a userspace database to keep track of what is happening in
the filesystem. They do this to make better decisions when moving
less frequently accessed data out to slower tiers of storage to
keep space in the primary storage available as it fills up. You're
approaching dedupe in essentially the same way, for very similar
reasons.

> One big difference I am noticing in our approaches is latency.  ZFS (and
> in-kernel btrfs dedupe) provides minimal dedupe latency (duplicate
> data occupies disk space for zero time as it is never written to disk
> at all) but it requires more RAM for a given dedupe hit rate than any
> other dedupe implementation I've seen.  What you've written tells me
> XFS saves RAM by partitioning the data and relying on an existing but
> very large source of iops (sharing scrub reads with dedupe), but then
> the dedupe latency is the same as the scrub interval (the worst so far).

That's just the background maintenance implementation. xfs_fsr can
run this way, but it can also be run as a complete immediate scan,
or it can be pointed at a defined set of files to run on. We can
(and probably will) do exactly the same thing with dedupe.

> bees aims to have latency of a few minutes (ideally scanning data while
> it's still dirty in cache, but there's no good userspace API for that)
> though it's obviously not there yet.

Yup, this is pretty much the "I need immediate notification that a
file changed" problem that HSMs face. That's one of the things DMAPI
was used for - XFS has (now unused) dmapi event notification fields in
its inode structure that were used by DMAPI to configure the
notifications the filesystem was supposed to send userspace.

With no DMAPI in the future, people with custom HSM-like interfaces
based on dmapi are starting to turn to fanotify and friends to
provide them with the change notifications they require.
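
For illustration, here's a minimal fanotify sketch of the kind of change
notification a dedupe scanner could consume instead of rescanning - my
sketch, not code from bees or XFS; it assumes CAP_SYS_ADMIN and a
hypothetical mount point:

#include <fcntl.h>
#include <unistd.h>
#include <sys/fanotify.h>

int main(void)
{
        struct fanotify_event_metadata buf[64], *ev;
        ssize_t len;
        int fd;

        /* needs CAP_SYS_ADMIN; "/mnt/scratch" is a hypothetical mount */
        fd = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);
        if (fd < 0)
                return 1;

        /* watch the whole mount for data modifications */
        if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_MOUNT,
                          FAN_MODIFY | FAN_CLOSE_WRITE, AT_FDCWD,
                          "/mnt/scratch") < 0)
                return 1;

        while ((len = read(fd, buf, sizeof(buf))) > 0) {
                for (ev = buf; FAN_EVENT_OK(ev, len);
                     ev = FAN_EVENT_NEXT(ev, len)) {
                        /*
                         * ev->fd is open on the changed file; queue it for
                         * (re)hashing rather than rescanning everything.
                         */
                        if (ev->fd >= 0)
                                close(ev->fd);
                }
        }
        return 0;
}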

> > e.g. a soft requirement is that we need to scan the entire fs at
> > least once a month. 
> 
> I have to scan and dedupe multiple times per hour.  OK, the first-ever
> scan of a non-empty filesystem is allowed to take much longer, but after
> that, if you have enough spare iops for continuous autodefrag you should
> also have spare iops for continuous dedupe.

Yup, but using notifications avoids the need for even these scans -
you'd know exactly what data has changed, when it changed, and
exactly what you needed to read to calculate the new hashes.

> > A simple piece-wise per-AG scanning algorithm (like we use in
> > xfs_repair) could easily work within a 3GB RAM per AG constraint and
> > would scale very well. We'd only need to scan 30-40 AGs in the hour,
> > and a single AG at 1GB/s will only take 2 minutes to scan. We can
> > then do the processing while the next AG gets scanned. If we've got
> > 10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
> > then we can scan 5-10AGs at once to keep th

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-09-10 Thread Dave Chinner
On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > For future development I've abandoned the entire dedupe_file_range
> > > approach.  I need to be able to read and dedupe the data blocks of
> > > the filesystem directly without having to deal with details like which
> > > files those blocks belong to, especially on filesystems with lots of
> > > existing deduped blocks and snapshots. 
> > 
> > IOWs, your desired OOB dedupe algorithm is:
> > 
> > a) ask the filesystem where all it's file data is
> 
> Actually, it's "ask the filesystem where all the *new* file data is"
> since we don't want to read any unique data twice on subsequent runs.

Sorry, how do you read "unique data" twice? By definition, unique
data only occurs once

Oh, and you still need to hash the old data so you can find
collisions with the new data that got written. Unless, of course,
you are keeping your hash tree in a persistent database and can work
out how to prune stale entries out of it efficiently

And, well, see my comments about media scrubbing below.

> > b) read that used space to build a data hash index
> > c) on all the data hash collisions find the owners of the
> >colliding blocks
> > d) if the block data is the same dedupe it
> > 
> > I agree - that's a simple and effective algorithm. It's also the
> > obvious solution to an expert in the field.
> 
> It's obvious, but if you literally implement that, you end up with a
> filesystem shattered into billions of fragments.

Well, yes, that's obvious. I thought that "details omitted for
reasons of brevity" would be understood, not require omitted details
to be explained to me.

> At step c) I need to figure out _which_ colliding block I want to keep.
> That's where the algorithm needs to inspect the higher-level file
> structure to determine the physical and logical situation for each block.
> That leads to one of:
> 
>   a) store the hashes and block locations in the hash table
>   b) relocate the blocks (defrag)
>   c) replace some blocks with a reference to other blocks (dedupe)
> 
> This sounds a bit like what xfs_fsr does (or will do?).

xfs_fsr currently does the defrag bit w/ the swapext ioctl(*). It'll
get extended to also do dedupe scans eventually, as defrag and
dedupe obviously need to be integrated to prevent them from
duelling.

(*) swapext - in case you might not know about it - atomically
swaps a range of extents between two files without copying data.  So
you remember those dedupe selection problems you were talking about
(details omitted for brevity):

inode 1:ABCDEFG|H|I|JK|L|M|N|O|P
inode 2:a|b|c|d|e|f|g|hijklmop

The XFS solution will be to first defrag the master file (inode 1)
with the larger extents from inode 2:

swapext(H-P, h-p)

inode 1:ABCDEFG|hijklmop
inode 2:a|b|c|d|e|f|g|H|I|JK|L|M|N|O|P

And then we dedupe to the defragged file:

dedupe(inode 1, inode 2)

inode 1:ABCDEFG|hijklmop
inode 2:ABCDEFG|hijklmop
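
For reference, that dedupe(inode 1, inode 2) step is what the existing
FIDEDUPERANGE ioctl already expresses. A rough userspace sketch - the
paths, length and error handling are illustrative only:

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* share len bytes of src into dst, both starting at offset 0 */
static int dedupe_whole_range(const char *src, const char *dst, __u64 len)
{
        struct file_dedupe_range *r;
        int src_fd = open(src, O_RDONLY);
        int dst_fd = open(dst, O_RDWR);
        int ret = -1;

        if (src_fd < 0 || dst_fd < 0)
                goto out;

        r = calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
        if (!r)
                goto out;
        r->src_offset = 0;
        r->src_length = len;
        r->dest_count = 1;
        r->info[0].dest_fd = dst_fd;
        r->info[0].dest_offset = 0;

        ret = ioctl(src_fd, FIDEDUPERANGE, r);
        if (!ret && r->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
                ret = -1;       /* data differed, nothing was shared */
        free(r);
out:
        if (src_fd >= 0)
                close(src_fd);
        if (dst_fd >= 0)
                close(dst_fd);
        return ret;
}

The swapext defrag step has no generic VFS equivalent; it's the
XFS-specific interface xfs_fsr already uses.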

> Bees also operates under a constant-RAM constraint, so it doesn't operate
> in two distinct "collect data" and "act on data collected" passes,
> and cannot spend memory to store data about more than a few extents at
> any time.

I suspect that I'm thinking at a completely different scale to you.
I don't really care for highly constrained or optimal dedupe
algorithms because those last few dedupe percentages really don't
matter that much to me. I care much more about using all the
resources we can and running as fast as we possibly can, then
providing the admin with means to throttle performance back to what
they need.

i.e. I'm concerned about how to effectively scan and dedupe PBs of
data, where scan rates may need to be measured in tens of GB/s.  In
these cases the limiting factor is either suboptimal IO patterns
(low IO throughput), or not having enough CPU and memory bandwidth
for hashing the data the IO engine is streaming through memory.
Once we get past a certain point, we'll gain far more by
parallelising the dedupe scan over lots of CPU than we ever will by
optimising an algorithm that cannot be parallelised. Hence we need
to start with a parallelisable algorithm, not an artificially
constrained one.

e.g. a soft requirement is that we need to scan the entire fs at
least once a month. That will need to be broken up into nightly
work, with a full 1PB filesystem needing 33+TB a night to be
scanned. We typically get 1

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-09-06 Thread Dave Chinner
On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > > >   - is documenting rejection on request alignment grounds
> > > > > (i.e. EINVAL) in the man page sufficient for app
> > > > > developers to understand what is going on here?
> > > > 
> > > > I think so.  The manpage says: "The filesystem does not support
> > > > reflinking the ranges of the given files", which (to my mind) covers
> > > > this case of not supporting dedupe of EOF blocks.
> > > 
> > > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> > > this; however, on btrfs, not supporting dedupe of EOF blocks means small
> > > files (one extent) cannot be deduped at all, because the EOF block holds
> > > a reference to the entire dst extent.  If a dedupe app doesn't go all the
> > > way to EOF on btrfs, then it should not attempt to dedupe any part of the
> > > last extent of the file as the benefit would be zero or slightly negative.
> > 
> > That's a filesystem implementation issue, not an API or application
> > issue.
> 
> The API and application issue remains even if btrfs is not considered.
> btrfs is just the worst case outcome.  Other filesystems still have
> fragmentation issues, and applications have efficiency-vs-capability
> tradeoffs to make if they can't rely on dedupe-to-EOF being available.
> 
> Tools like 'cp --reflink=auto' work by trying the best case, then falling
> back to a second choice if the first choice returns an error.

Well, yes. That's necessary for the "cp" tool to behave according to
user expectations.  That's not a kernel API issue - that's just an
implementation of an *application requirement*.  Indeed, this is
identical to the behaviour of rename() in "mv" - if rename fails
with -EXDEV, mv needs to fall back to a manual copy because the user
expects the file to be moved.

IOWs, these application level requirements you talk about are just
not relevant to the kernel API for dedupe/clone operations.

[snip]

> It is something that naive dedupe apps will do.  "naive" here meaning
> "does not dive deeply into the filesystem's physical structure (or survey
> the entire filesystem with FIEMAP) to determine that the thousand-refs
> situation does not exist at dst prior to invoking the dedupe() call."

/me sighs and points at FS_IOC_GETFSMAP

$ man ioctl_getfsmap

DESCRIPTION
   This ioctl(2) operation retrieves physical extent mappings
   for a filesystem.  This information can be used to discover
   which files are mapped to a physical block, examine free
   space, or find known bad blocks, among other things.
.

I don't really care about "enabling" naive, inefficient
applications. I care about applications that scale to huge
filesystems and can get the job done quickly and efficiently.
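
For anyone who hasn't used it, a sketch of what a GETFSMAP-driven scan
loop looks like, based on the ioctl_getfsmap(2) description - the mount
point is hypothetical and error handling is trimmed:

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fsmap.h>

#define NR_RECS 128

int main(void)
{
        /* needs CAP_SYS_ADMIN; "/mnt" is a hypothetical mount point */
        int fd = open("/mnt", O_RDONLY);
        struct fsmap_head *head;
        unsigned int i;

        if (fd < 0)
                return 1;

        head = calloc(1, sizeof(*head) + NR_RECS * sizeof(struct fsmap));
        if (!head)
                return 1;
        head->fmh_count = NR_RECS;

        /* low key is all zeroes (calloc); high key covers everything */
        head->fmh_keys[1].fmr_device = UINT_MAX;
        head->fmh_keys[1].fmr_flags = UINT_MAX;
        head->fmh_keys[1].fmr_physical = ULLONG_MAX;
        head->fmh_keys[1].fmr_owner = ULLONG_MAX;
        head->fmh_keys[1].fmr_offset = ULLONG_MAX;

        for (;;) {
                if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0)
                        return 1;
                if (!head->fmh_entries)
                        break;
                for (i = 0; i < head->fmh_entries; i++) {
                        struct fsmap *r = &head->fmh_recs[i];

                        /* skip free space/metadata; only file data gets hashed */
                        if (r->fmr_flags & FMR_OF_SPECIAL_OWNER)
                                continue;
                        printf("dev %u owner %llu phys %llu len %llu\n",
                               r->fmr_device,
                               (unsigned long long)r->fmr_owner,
                               (unsigned long long)r->fmr_physical,
                               (unsigned long long)r->fmr_length);
                }
                if (head->fmh_recs[head->fmh_entries - 1].fmr_flags & FMR_OF_LAST)
                        break;
                /* restart the query after the last record returned */
                head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
        }
        free(head);
        return 0;
}

Each record gives a physical extent and its owner, which is the
"where is all the file data" input an OOB dedupe scanner needs without
ever walking the directory structure.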

> > XFS doesn't have partial overlaps, we don't have nodatacow hacks,
> > and the subvol snapshot stuff I'm working on just uses shared data
> > extents so it's 100% compatible with dedupe.
> 
> If you allow this sequence of operations, you get partial overlaps:
> 
>   dedupe(fd1, 0, fd2, 0, 1048576);
> 
>   dedupe(fd2, 16384, fd3, 0, 65536);

Oh, I misunderstood - I thought you were referring to sharing partial
filesystem blocks (like at EOF) because that's what this discussion
was originally about. XFS supports the above just fine.

[snip]

tl;dr we don't need a new clone or dedupe API

> For future development I've abandoned the entire dedupe_file_range
> approach.  I need to be able to read and dedupe the data blocks of
> the filesystem directly without having to deal with details like which
> files those blocks belong to, especially on filesystems with lots of
> existing deduped blocks and snapshots. 

IOWs, your desired OOB dedupe algorithm is:

a) ask the filesystem where all its file data is
b) read that used space to build a data hash index
c) on all the data hash collisions find the owners of the
   colliding blocks
d) if the block data is the same dedupe it

I agree - that's a simple and effective algorithm. It's also the
obvious solution to an expert in the field.

> The file structure is frankly
> just noise for dedupe on large filesystems.

We learnt that lesson back in the late 1990s. xfsdump, xfs_fsr, all
the SGI^WHPE HSM scanning tools, etc all avoid the directory
structure

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-30 Thread Dave Chinner
On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > >   - is documenting rejection on request alignment grounds
> > > (i.e. EINVAL) in the man page sufficient for app
> > > developers to understand what is going on here?
> > 
> > I think so.  The manpage says: "The filesystem does not support
> > reflinking the ranges of the given files", which (to my mind) covers
> > this case of not supporting dedupe of EOF blocks.
> 
> Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> this; however, on btrfs, not supporting dedupe of EOF blocks means small
> files (one extent) cannot be deduped at all, because the EOF block holds
> a reference to the entire dst extent.  If a dedupe app doesn't go all the
> way to EOF on btrfs, then it should not attempt to dedupe any part of the
> last extent of the file as the benefit would be zero or slightly negative.

That's a filesystem implementation issue, not an API or application
issue.

> The app developer would need to be aware that such a restriction could
> exist on some filesystems, and be able to distinguish this from other
> cases that could lead to EINVAL.  Portable code would have to try a dedupe
> up to EOF, then if that failed, round down and retry, and if that failed
> too, the app would have to figure out which filesystem it's running on
> to know what to do next.  Performance demands the app know what the FS
> will do in advance, and avoid a whole class of behavior.

Nobody writes "portable" applications like that. They read the man
page first, and work out what the common subset of functionality is
and then code from that. Man page says:

"Disk filesystems generally require the offset and length arguments
to be aligned to the fundamental block size."

IOWs, portable code starts with supporting the general case, i.e. a
range rounded to filesystem block boundaries (the app has already
run fstat() on the files it wants to dedupe to find their size,
yes?), hence ignoring the partial EOF block. That will just work on
everything.

Code that then wants to optimise for btrfs/xfs/ocfs quirks runs
fstatvfs to determine what fs it's operating on and applies the
necessary quirks. For btrfs it can extend the range to include the
partial EOF block, and hence will handle the implementation quirks
btrfs has with single extent dedupe.

Simple, reliable, and doesn't require any sort of flailing
about with offsets and lengths to avoid unexpected EINVAL errors.
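
A sketch of that "generic case first, quirks second" request sizing -
not code from this thread; it uses fstatfs(2)'s f_type (rather than
fstatvfs) to identify the filesystem, treats st_blksize as the
filesystem block size for simplicity, and BTRFS_SUPER_MAGIC comes from
<linux/magic.h>:

#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/vfs.h>
#include <linux/magic.h>

/* how many bytes of src_fd to ask the kernel to dedupe */
static uint64_t dedupe_request_len(int src_fd)
{
        struct stat st;
        struct statfs sfs;
        uint64_t len;

        if (fstat(src_fd, &st) < 0 || fstatfs(src_fd, &sfs) < 0)
                return 0;

        /* general case: round down, ignoring the partial EOF block */
        len = st.st_size - (st.st_size % st.st_blksize);

        /* btrfs quirk: extend to EOF or single-extent files never dedupe */
        if (sfs.f_type == BTRFS_SUPER_MAGIC)
                len = st.st_size;

        return len;
}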

> btrfs dedupe reports success if the src extent is inline and the same
> size as the dst extent (i.e. file is smaller than one page).  No dedupe
> can occur in such cases--a clone results in a simple copy, so the best
> a dedupe could do would be a no-op.  Returning EINVAL there would break
> a few popular tools like "cp --reflink".  Returning OK but doing nothing
> seems to be the best option in that case.

Again, those are a filesystem implementation issues, not problems
with the API itself.

> > >   - should we just round down the EOF dedupe request to the
> > > block before EOF so dedupe still succeeds?
> > 
> > I've often wondered if the interface should (have) be(en) that we start
> > at src_off/dst_off and share as many common blocks as possible until we
> > find a mismatch, then tell userspace where we stopped... instead of like
> > now where we compare the entire extent and fail if any part of it
> > doesn't match.
> 
> The usefulness or harmfulness of that approach depends a lot on what
> the application expects the filesystem to do.
> 
> In btrfs, the dedupe operation acts on references to data, not the
> underlying data blocks.  If there are 1000 slightly overlapping references
> to a single contiguous range of data blocks in dst on disk, each dedupe
> operation acts on only one of those, leaving the other 999 untouched.
> If the app then submits 999 other dedupe requests, no references to the
> dst blocks remain and the underlying data blocks can be deleted.

Assuming your strawman is valid, if you have a thousand separate
references across the same set of data blocks on disk, then that data is
already heavily deduplicated.  Trying to optimise that further
seems misguided, way down the curve of diminishing returns.

> In a parallel universe (or a better filesystem, or a userspace emulation
  ^^^
> built out of dedupe and other ioctls), dedupe could work at the extent
> data (physical) level.  The app points at src and dst extent references
> (inode/offset/length tuples), and the filesystem figures out which
> physical blocks t

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-20 Thread Dave Chinner
On Mon, Aug 20, 2018 at 08:17:18PM -0500, Eric Sandeen wrote:
> 
> 
> On 8/20/18 7:49 PM, Dave Chinner wrote:
> > Upon successful completion of this ioctl, the number of
> > bytes successfully deduplicated is returned in bytes_deduped
> > and a status code for the deduplication operation is
> > returned in status.  If even a single byte in the range does
> > not match, the deduplication request will be ignored  and
> > status set  to FILE_DEDUPE_RANGE_DIFFERS.
> > 
> > This implies we can dedupe less than the entire range as long as the
> > entire range matches. If the entire range does not match, we have
> > to return FILE_DEDUPE_RANGE_DIFFERS, but in this case it does match
> > so we can pick and choose how much we deduplicate. How much we
> > dedupe is then returned as a byte count. In this case, it will be a
> > few bytes short of the entire length requested because we aligned
> > the dedupe inwards
> > 
> > Does that sound reasonable?
> 
> I had hoped that dedupe was advisory as Darrick wished for, but TBH my
> reading of that is no, if you ask for a range to be deduped and any of
> it differs, "even a single byte," you fail it all.

Yes, if the data differs, then it fails.  But that's not what I'm
questioning nor what I care about in this case. I'm asking whether
the filesystem gets to choose how much of the range of same data is
deduped when the file data is apparently the same.

> Why else would that last part be present, if the interface is free to
> ignore later parts that don't match and truncate the range to the
> matching portion?

I think there is a clear differentiation in the man page text
between "all the bytes are the same" and "how much of that range the
filesystem deduped". i.e. the man page does not say that the
filesystem *must* dedupe the entire range if all the data is the
same - it says the filesystem will return /how many bytes it
successfully deduplicated/. IOWs, "do nothing and return zero" is a
valid ->dedupe_file_range implementation.
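
The flip side for the caller is that it can't assume the whole range
was deduped just because the ioctl returned zero - it has to look at
the per-destination status and bytes_deduped. A sketch of that
handling, reusing the request structure from the sketch further up
(treating negative status values as errnos, as the current uapi header
describes):

#include <sys/ioctl.h>
#include <linux/fs.h>

/* returns 1 if the whole requested range was shared, 0 if not, -1 on error */
static int dedupe_and_check(int src_fd, struct file_dedupe_range *r)
{
        if (ioctl(src_fd, FIDEDUPERANGE, r) < 0)
                return -1;              /* hard error: EINVAL, EPERM, ... */

        if (r->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
                return 0;               /* data differed, nothing shared */
        if (r->info[0].status < 0)
                return -1;              /* per-destination error */

        /*
         * Success only means info[0].bytes_deduped bytes were shared; that
         * may be less than src_length (or even zero), so a caller that
         * wants more has to advance by bytes_deduped and resubmit.
         */
        return r->info[0].bytes_deduped == r->src_length;
}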

AFAIC, it's the fact that we do data comparison from the page cache
before calling into the filesystem to check on-disk formats that is
the problem here. Having identical data in the page cache is not the
same thing as "the blocks on disk containing the user data are
identical". i.e. Filesystems deduplicate disk blocks, but they often
perform transformations on the data as it passes between the page
cache and disk blocks or have constraints on the format of the data
on disk. For example:

- encrypted files. unencrypted data in the page cache is the
  same, therefore we can't return FILE_DEDUPE_RANGE_DIFFERS
  because the user will be able to see that they are the
  same. But they will differ on disk after encryption
  (e.g. because different nonce/seed or completely different
  keys) and so the filesystem should not dedupe them.

- the two files differ in on-disk format e.g. compressed
  vs encrypted.

- data might be inline with metadata

- tail packing

- dedupe request might be block aligned at file offsets, but
  file offsets are not aligned to disk blocks due to, say,
  multi-block data compression.

- on disk extent tracking granularity might be larger than
  filesystem block size (e.g. ext4's bigalloc mode) so can't
  be deduped on filesystem block size alignment.

- the source or destination inode is marked "NODATACOW" so
  can't contain shared extents

I'm sure there's others, but I think this is enough to demonstrate
that "user visible file data is the same" does not mean the
filesystem can dedupe it. The wording in the man page appears to
understand these limitations and hence it allows us to silently
ignore the partial EOF block when it comes to dedupe.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-20 Thread Dave Chinner
On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:

> > So why was this dedupe request even accepted by the kernel? Well,
> > I think there's a bug in the check just above this:
> > 
> > /* If we're linking to EOF, continue to the block boundary. */
> > if (pos_in + *len == isize)
> > blen = ALIGN(isize, bs) - pos_in;
> > else
> > blen = *len;
> > 
> > basically, when the "in" file dedupe/reflink range is to EOF, it
> > expands the range to include /all/ of the block that contains the
> > EOF byte. IOWs, it now requests us to dedupe /undefined data beyond
> > EOF/. But when we go to compare the data in these ranges, it
> > truncates the comparison to the length that the user asked for
> > (i.e. *len) and not the extended block length.
> > 
> > IOWs, it doesn't compare the bytes beyond EOF in the source block to
> > the data in the destination block it would replace, and so doesn't
> > fail the compare like it should.
..
> > i.e. the dedupe request should fail as we cannot dedupe the EOF
> > block.
> > 
> > The patch below does this for the VFS dedupe code. it's not a final
> > patch, it's just a demonstration of how this should never have got
> > past the range checks. Open questions:
> 
> (Here's my five minutes of XFS that I'm allowed per day... :/)
> 
> > - is documenting rejection on request alignment grounds
> >   (i.e. EINVAL) in the man page sufficient for app
> >   developers to understand what is going on here?
> 
> I think so.  The manpage says: "The filesystem does not support
> reflinking the ranges of the given files", which (to my mind) covers
> this case of not supporting dedupe of EOF blocks.

Ok.

> > - should we just round down the EOF dedupe request to the
> >   block before EOF so dedupe still succeeds?
> 
> I've often wondered if the interface should (have) be(en) that we start
> at src_off/dst_off and share as many common blocks as possible until we
> find a mismatch, then tell userspace where we stopped... instead of like
> now where we compare the entire extent and fail if any part of it
> doesn't match.

I think that the ioctl_fideduperange() man page description gives us
the flexibility to do this. That is, this text:

Upon successful completion of this ioctl, the number of
bytes successfully deduplicated is returned in bytes_deduped
and a status code for the deduplication operation is
returned in status.  If even a single byte in the range does
not match, the deduplication request will be ignored  and
status set  to FILE_DEDUPE_RANGE_DIFFERS.

This implies we can dedupe less than the entire range as long as the
entire range matches. If the entire range does not match, we have
to return FILE_DEDUPE_RANGE_DIFFERS, but in this case it does match
so we can pick and choose how much we deduplicate. How much we
dedupe is then returned as a byte count. In this case, it will be a
few bytes short of the entire length requested because we aligned
the dedupe inwards

Does that sound reasonable?

> > - should file cloning (i.e. reflink) allow the
> >   EOF block to be shared somewhere inside EOF in the
> >   destination file?
> 
> I don't think we should.
> 
> > That's stale data exposure, yes?
> 
> Haven't tested that, but seems likely.

Yeah, I haven't tested it either, but I can't see how the current
behaviour ends well.

> > - should clone only allow sharing of the EOF block if it's
> >   either a full file clone or a matching range clone and
> >   isize is the same for both src and dest files?
> 
> The above sound reasonable for clone, though I would also allow clone to
> share the EOF block if we extend isize of the dest file.

Hmmm. That covers the only valid rule for sharing the EOF block,
right? i.e. That the source EOF lands exactly at or beyond the
destination EOF, so it remains the EOF block in both files?

FWIW, I just noticed that the ioctl_ficlonerange man page doesn't
document the return value on success of a clone.

> The above also nearly sound reasonable for dedupe too.  If we encounter
> EOF at the same places in the src range and the dest range, should we
> allow sharing there?  The post-eof bytes are undefined by definition;
> does undefined == undefined?

It's undefined, especially as different fs implementations may be
doing different things with partial blocks (e.g. tail packing).

From an XFS perspective, I don't think we should physically share
partial EOF blocks on dedupe because it means ex

[patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-19 Thread Dave Chinner
[cc linux-fsdevel now, too]

On Mon, Aug 20, 2018 at 09:11:26AM +1000, Dave Chinner wrote:
> [cc linux-...@vger.kernel.org]
> 
> On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> > 
> > Test that deduplication of an entire file that has a size that is not
> > aligned to the filesystem's block size into a different file does not
> > corrupt the destination's file data.

Ok, I've looked at this now. My first question is where did all the
magic offsets in this test come from? i.e. how was this bug
found and who is it affecting?

> > This test is motivated by a bug found in Btrfs which is fixed by the
> > following patch for the linux kernel:
> > 
> >   "Btrfs: fix data corruption when deduplicating between different files"
> > 
> > XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
> > with the same corruption as in Btrfs - some bytes of a block get replaced
> > with zeroes after the deduplication.
> 
> Filipe, in future can please report XFS bugs you find to the XFS
> list the moment you find them. We shouldn't ever find out about a
> data corruption bug we need to fix via a "oh, by the way" comment in
> a commit message for a regression test

This becomes much more relevant because of what I've just found

.

> > +# The first byte with a value of 0xae starts at an offset (2518890) which 
> > is not
> > +# a multiple of the block size.
> > +$XFS_IO_PROG -f \
> > +   -c "pwrite -S 0x6b 0 2518890" \
> > +   -c "pwrite -S 0xae 2518890 102398" \
> > +   $SCRATCH_MNT/foo | _filter_xfs_io
> > +
> > +# Create a second file with a length not aligned to the block size, whose 
> > bytes
> > +# all have the value 0x6b, so that its extent(s) can be deduplicated with 
> > the
> > +# first file.
> > +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | 
> > _filter_xfs_io
> > +
> > +# The file is filled with bytes having the value 0x6b from offset 0 to 
> > offset
> > +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
> > +echo "File content before deduplication:"
> > +od -t x1 $SCRATCH_MNT/foo

Please use "od -Ad -t x1 " so the file offsets reported by od
match the offsets used in the test (i.e. in decimal, not octal).

> > +
> > +# Now deduplicate the entire second file into a range of the first file 
> > that
> > +# also has all bytes with the value 0x6b. The destination range's end 
> > offset
> > +# must not be aligned to the block size and must be less then the offset of
> > +# the first byte with the value 0xae (byte at offset 2518890).
> > +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" 
> > $SCRATCH_MNT/foo \
> > +   | _filter_xfs_io

Ok, now it gets fun. dedupe to non-block aligned ranges is supposed
to be rejected by the kernel in vfs_clone_file_prep_inodes(). i.e.
this check:

/* Only reflink if we're aligned to block boundaries */
if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
!IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
return -EINVAL;

And it's pretty clear that a length of 557771 is not block aligned
(being an odd number).

So why was this dedupe request even accepted by the kernel? Well,
I think there's a bug in the check just above this:

/* If we're linking to EOF, continue to the block boundary. */
if (pos_in + *len == isize)
blen = ALIGN(isize, bs) - pos_in;
else
blen = *len;

basically, when the "in" file dedupe/reflink range is to EOF, it
expands the range to include /all/ of the block that contains the
EOF byte. IOWs, it now requests us to dedupe /undefined data beyond
EOF/. But when we go to compare the data in these ranges, it
truncates the comparison to the length that the user asked for
(i.e. *len) and not the extended block length.

IOWs, it doesn't compare the bytes beyond EOF in the source block to
the data in the destination block it would replace, and so doesn't
fail the compare like it should.
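
The demonstration patch isn't reproduced here, but the shape of the
change being described is roughly this - a sketch only, assuming the
function's is_dedupe flag, not the fix that was eventually committed:

        /* If we're linking to EOF, continue to the block boundary. */
        if (pos_in + *len == isize && !is_dedupe)
                blen = ALIGN(isize, bs) - pos_in;
        else
                blen = *len;

i.e. only clone gets to round the EOF block out to the block boundary;
a dedupe request keeps the caller's length, so an unaligned
dedupe-to-EOF range now trips the IS_ALIGNED() checks and fails with
EINVAL instead of silently skipping the comparison of the post-EOF
bytes.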

And, well, btrfs has the same bug. extent_same_check_offsets()
extends the range for alignment so it passes alignment checks, but
then /explicitly/ uses the original length for the data compare
and dedupe. i.e:

   /* pass original length for comparison so we stay within i_size */
ret = btrfs_cmp_data(olen, cmp);
if (ret == 0)
ret = btrfs_clone(src, dst, loff, olen, len, dst_loff, 1);

This is what we should see if someone tried to dedupe the EOF block
of a file:

generic/505 - output mismatch (see 
/home/dave/src/xfstests-dev/resul

Re: [PATCH] generic: test for deduplication between different files

2018-08-19 Thread Dave Chinner
> +od -t x1 $SCRATCH_MNT/foo
> +
> +status=0
> +exit
> diff --git a/tests/generic/505.out b/tests/generic/505.out
> new file mode 100644
> index ..7556b9fb
> --- /dev/null
> +++ b/tests/generic/505.out
> @@ -0,0 +1,33 @@
> +QA output created by 505
> +wrote 2518890/2518890 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 102398/102398 bytes at offset 2518890
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 557771/557771 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +File content before deduplication:
> +000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> +*
> +11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
> +11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
> +*
> +11777540 ae ae ae ae ae ae ae ae
> +11777550
> +deduped 557771/557771 bytes at offset 1957888
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +File content after deduplication and before unmounting:
> +000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> +*
> +11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
> +11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
> +*
> +11777540 ae ae ae ae ae ae ae ae
> +11777550
> +File content after unmounting:
> +000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> +*
> +11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
> +11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
> +*
> +11777540 ae ae ae ae ae ae ae ae
> +11777550
> diff --git a/tests/generic/group b/tests/generic/group
> index 55155de8..2ff1bd7e 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -507,3 +507,4 @@
>  502 auto quick log
>  503 auto quick dax punch collapse zero
>  504 auto quick locks
> +505 auto quick clone dedupe
> -- 
> 2.11.0
> 
> 

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH V4] test online label ioctl

2018-05-17 Thread Dave Chinner
On Thu, May 17, 2018 at 10:28:26AM -0500, Eric Sandeen wrote:
> This tests the online label ioctl that btrfs has, which has been
> recently proposed for XFS.
> 
> To run, it requires an updated xfs_io with the label command and a
> filesystem that supports it
> 
> A slight change here to _require_xfs_io_command as well, so that tests
> which simply fail with "Inappropriate ioctl" can be caught in the
> common case.
> 
> Signed-off-by: Eric Sandeen <sand...@redhat.com>
> ---
> 
> Now with new and improved sequential V4 versioning!

Looks good now.

Reviewed-by: Dave Chinner <dchin...@redhat.com>
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH V3] test online label ioctl

2018-05-15 Thread Dave Chinner
On Tue, May 15, 2018 at 10:22:37AM -0500, Eric Sandeen wrote:
> This tests the online label ioctl that btrfs has, which has been
> recently proposed for XFS.
> 
> To run, it requires an updated xfs_io with the label command and a
> filesystem that supports it
> 
> A slight change here to _require_xfs_io_command as well, so that tests
> which simply fail with "Inappropriate ioctl" can be caught in the
> common case.
> 
> Signed-off-by: Eric Sandeen <sand...@redhat.com>
> ---
> 
> (urgh send as proper new thread, sorry)
> 
> This passes on btrfs, _notruns on xfs/ext4 of yore, and passes
> on xfs w/ my online label patchset (as long as xfs_io has the new
> capability)
> 
> V2: Add a max label length helper
> Set the proper btrfs max label length o_O oops
> Filter trailing whitespace from blkid output
> 
> V3: lowercase local vars, simplify max label len function

Looks good now, but I wondered about one thing the test doesn't
cover: can you clear the label by setting it to a null string?
i.e. you check max length bounds, but don't check empty string
behaviour...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH V2] test online label ioctl

2018-05-14 Thread Dave Chinner
On Mon, May 14, 2018 at 06:26:07PM -0500, Eric Sandeen wrote:
> On 5/14/18 6:11 PM, Dave Chinner wrote:
> > On Mon, May 14, 2018 at 12:09:16PM -0500, Eric Sandeen wrote:
> >> This tests the online label ioctl that btrfs has, which has been
> >> recently proposed for XFS.
> >>
> >> To run, it requires an updated xfs_io with the label command and a
> >> filesystem that supports it
> >>
> >> A slight change here to _require_xfs_io_command as well, so that tests
> >> which simply fail with "Inappropriate ioctl" can be caught in the
> >> common case.
> >>
> >> Signed-off-by: Eric Sandeen <sand...@redhat.com>
> >> ---
> 
> 
>  
> >> +# The maximum filesystem label length, not including terminating NULL
> >> +_label_get_max()
> >> +{
> >> +  case $FSTYP in
> >> +  xfs)
> >> +  MAXLEN=12
> >> +  ;;
> >> +  btrfs)
> >> +  MAXLEN=255
> >> +  ;;
> >> +  *)
> >> +  MAXLEN=0
> > 
> > Why not just _notrun here?
> 
> do we want to go through the other steps only to get here and notrun
> halfway through?
> 
> >> +  ;;
> >> +  esac
> >> +
> >> +  echo $MAXLEN
> >> +}
> >> +
> >> +_require_label_get_max()
> >> +{
> >> +  if [ $(_label_get_max) -eq 0 ]; then
> >> +  _notrun "$FSTYP does not define maximum label length"
> >> +  fi
> > 
> > And this check can go away?
> 
> We'd like to know if all the validations in this can complete before we
> get started, right?  That's why I did it this way.

Sure, just trying to be a bit defensive as people often forget
_requires directives when writing new tests.

> > Also, shouldn't it be _require_online_label_change() ? And then
> > maybe you can move the xfs_io label command check inside it?
> 
> Well, we want to know a lot of things:
> 
> 1) can the kernel code for this filesystem support online label
> 2) does xfs_io support the label command
> 3) does this test know the maximum label length to test for this fs
> 
> the above stuff is for 3)

What I was suggesting was doing all of these in one function similar
to _require_xfs_sparse_inodes, _require_meta_uuid, etc:

_require_online_label_change()
{
# need xfs_io support
_require_xfs_io_command "label"

# need fstests knowledge of label size
[ $(_label_get_max) -eq 0 ] && _notrun "$FSTYP does not define maximum label length"

# need kernel support
$XFS_IO_PROG -c label $TEST_DIR > /dev/null 2>&1
[ $? -ne 0 ] && _notrun "Kernel does not support FS_IOC_GETLABEL"
}

Which also means that the label_f command in xfs_io needs to set the
exitcode to non-zero when the ioctl fails so that xfs_io returns
non-zero on failure...
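
For reference, a userspace sketch of what the label command boils down
to - the generic get/set label ioctls on an fd for the mount point. The
names in current <linux/fs.h> are FS_IOC_GETFSLABEL/FS_IOC_SETFSLABEL
(the paragraph above says FS_IOC_GETLABEL), and the mount path is
hypothetical:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
        char label[FSLABEL_MAX] = { 0 };
        int fd = open("/mnt/scratch", O_RDONLY);

        if (fd < 0)
                return 1;

        if (argc > 1) {
                /* set: the label must fit, including the NUL terminator */
                if (strlen(argv[1]) >= FSLABEL_MAX)
                        return 1;
                strcpy(label, argv[1]);
                /* fails with ENOTTY on filesystems without support */
                if (ioctl(fd, FS_IOC_SETFSLABEL, label) < 0)
                        return 1;
        }

        if (ioctl(fd, FS_IOC_GETFSLABEL, label) < 0)
                return 1;
        printf("label = \"%s\"\n", label);
        close(fd);
        return 0;
}

The per-filesystem maximum is smaller than FSLABEL_MAX (12 characters
for XFS), which is exactly why the test needs _label_get_max.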

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH V2] test online label ioctl

2018-05-14 Thread Dave Chinner
On Mon, May 14, 2018 at 12:09:16PM -0500, Eric Sandeen wrote:
> This tests the online label ioctl that btrfs has, which has been
> recently proposed for XFS.
> 
> To run, it requires an updated xfs_io with the label command and a
> filesystem that supports it
> 
> A slight change here to _require_xfs_io_command as well, so that tests
> which simply fail with "Inappropriate ioctl" can be caught in the
> common case.
> 
> Signed-off-by: Eric Sandeen <sand...@redhat.com>
> ---
> 
> (urgh send as proper new thread, sorry)
> 
> This passes on btrfs, _notruns on xfs/ext4 of yore, and passes
> on xfs w/ my online label patchset (as long as xfs_io has the new
> capability)
> 
> V2: Add a max label length helper
> Set the proper btrfs max label length o_O oops
> Filter trailing whitespace from blkid output
> 
> 
> diff --git a/common/rc b/common/rc
> index 9ffab7fd..88a99cff 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -2158,6 +2158,9 @@ _require_xfs_io_command()
>   echo $testio | grep -q "Inappropriate ioctl" && \
>   _notrun "xfs_io $command support is missing"
>   ;;
> + "label")
> + testio=`$XFS_IO_PROG -c "label" $TEST_DIR 2>&1`
> + ;;
>   "open")
>   # -c "open $f" is broken in xfs_io <= 4.8. Along with the fix,
>   # a new -C flag was introduced to execute one shot commands.
> @@ -2196,7 +2199,7 @@ _require_xfs_io_command()
>   rm -f $testfile 2>&1 > /dev/null
>   echo $testio | grep -q "not found" && \
>   _notrun "xfs_io $command support is missing"
> - echo $testio | grep -q "Operation not supported" && \
> + echo $testio | grep -q "Operation not supported\|Inappropriate ioctl" 
> && \
>   _notrun "xfs_io $command failed (old kernel/wrong fs?)"
>   echo $testio | grep -q "Invalid" && \
>   _notrun "xfs_io $command failed (old kernel/wrong fs/bad args?)"
> @@ -3802,6 +3805,31 @@ _require_scratch_feature()
>   esac
>  }
>  
> +# The maximum filesystem label length, not including terminating NULL
> +_label_get_max()
> +{
> + case $FSTYP in
> + xfs)
> + MAXLEN=12
> + ;;
> + btrfs)
> + MAXLEN=255
> + ;;
> + *)
> + MAXLEN=0

Why not just _notrun here?

> + ;;
> + esac
> +
> + echo $MAXLEN
> +}
> +
> +_require_label_get_max()
> +{
> + if [ $(_label_get_max) -eq 0 ]; then
> + _notrun "$FSTYP does not define maximum label length"
> + fi

And this check can go away?

Also, shouldn't it be _require_online_label_change() ? And then
maybe you can move the xfs_io label command check inside it?

> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch
> +_require_xfs_io_command "label"
> +_require_label_get_max
> +
> +_scratch_mkfs > $seqres.full 2>&1
> +_scratch_mount
> +
> +# Make sure we can set & clear the label
> +$XFS_IO_PROG -c "label label.$seq" $SCRATCH_MNT
> +$XFS_IO_PROG -c "label" $SCRATCH_MNT
> +
> +# And that userspace can see it now, while mounted
> +# NB: some blkid has trailing whitespace, filter it out here
> +blkid -s LABEL $SCRATCH_DEV | _filter_scratch | sed -e "s/ $//g"
> +
> +# And that the it is still there when it's unmounted
> +_scratch_unmount
> +blkid -s LABEL $SCRATCH_DEV | _filter_scratch | sed -e "s/ $//g"

Ok, so "LABEL" here is a special blkid match token

> +# And that it persists after a remount
> +_scratch_mount
> +$XFS_IO_PROG -c "label" $SCRATCH_MNT
> +
> +# And that a too-long label is rejected, beyond the interface max:
> +LABEL=$(perl -e "print 'l' x 257;")

And now you use it as a variable. Nasty and confusing. Using lower
case for local variables is the norm, right? I thought we were only
supposed to use upper case for global test harness variables...

But even making it "label" is problematic:

> +$XFS_IO_PROG -c "label $LABEL" $SCRATCH_MNT

because "label" is an xfs_io command. Perhaps call it "fs_label"?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root

2018-05-09 Thread Dave Chinner
On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote:
> On 5/8/18 7:38 PM, Dave Chinner wrote:
> > On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
> >> Hi,
> >>
> >> The VFS's super_block covers a variety of filesystem functionality. In
> >> particular we have a single structure representing both I/O and
> >> namespace domains.
> >>
> >> There are requirements to de-couple this functionality. For example,
> >> filesystems with more than one root (such as btrfs subvolumes) can
> >> have multiple inode namespaces. This starts to confuse userspace when
> >> it notices multiple inodes with the same inode/device tuple on a
> >> filesystem.
> > 
> > Devil's Advocate - I'm not looking at the code, I'm commenting on
> > architectural issues I see here.
> > 
> > The XFS subvolume work I've been doing explicitly uses a superblock
> > per subvolume. That's because subvolumes are designed to be
> > completely independent of the backing storage - they know nothing
> > about the underlying storage except to share a BDI for writeback
> > purposes and write to whatever block device the remapping layer
> > gives them at IO time.  Hence XFS subvolumes have (at this point)
> > their own unique s_dev, on-disk format configuration, journal, space
> > accounting, etc. i.e. They are fully independent filesystems in
> > their own right, and as such we do not have multiple inode
> > namespaces per superblock.
> 
> That's a fundamental difference between how your XFS subvolumes work and
> how btrfs subvolumes do.

Yup, you've just proved my point: this is not a "subvolume problem",
but rather a "multiple namespace per root" problem.

> There is no independence among btrfs
> subvolumes.  When a snapshot is created, it has a few new blocks but
> otherwise shares the metadata of the source subvolume.  The metadata
> trees are shared across all of the subvolumes and there are several
> internal trees used to manage all of it.

I don't need btrfs 101 stuff explained to me. :/

> a single transaction engine.  There are housekeeping and maintenance
> tasks that operate across the entire file system internally.  I
> understand that there are several problems you need to solve at the VFS
> layer to get your version of subvolumes up and running, but trying to
> shoehorn one into the other is bound to fail.

Actually, the VFS has provided everything I need for XFS subvolumes
so far without requiring any sort of modifications. That's the
perspective I'm approaching this from - if the VFS can do what we
need for XFS subvolumes, as well as overlay (which are effectively
VFS-based COW subvolumes), then let's see if we can make that work
for btrfs too.

> > So this doesn't sound like a "subvolume problem" - it's a "how do we
> > sanely support multiple independent namespaces per superblock"
> > problem. AFAICT, this same problem exists with bind mounts and mount
> > namespaces - they are effectively multiple roots on a single
> > superblock, but it's done at the vfsmount level and so the
> > superblock knows nothing about them.
> 
> In this case, you're talking about the user-visible file system
> hierarchy namespace that has no bearing on the underlying file system
> outside of per-mount flags.

Except that it tracks and provides infrastructure that allows user
visible  "multiple namespace per root" constructs. Subvolumes - as a
user visible namespace construct - are little different to bind
mounts in behaviour and functionality. 

How the underlying filesystem implements subvolumes is really up to
the filesystem, but we should be trying to implement a clean model
for "multiple namespaces on a single root" at the VFS so we have
consistent behaviour across all filesystems that implement similar
functionality.

FWIW, bind mounts and overlay also have similar inode number
namespace problems to what Mark describes for btrfs subvolumes.
e.g. overlay recently introduced the "xino" mount option to separate
the user presented inode number namespace for overlay inode from the
underlying parent filesystem inodes. How is that different to btrfs
subvolumes needing to present different inode number namespaces from
the underlying parent?

This sort of "namespace shifting" is needed for several different
pieces of information the kernel reports to userspace. The VFS
replacement for shiftfs is an example of this. So is inode number
remapping. I'm sure there's more.

My point is that if we are talking about infrastructure to remap
what userspace sees from different mountpoint views into a
filesystem, then it should be done above the filesystem layers in
the VFS so all filesystems behave th

Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root

2018-05-08 Thread Dave Chinner
On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
> Hi,
> 
> The VFS's super_block covers a variety of filesystem functionality. In
> particular we have a single structure representing both I/O and
> namespace domains.
> 
> There are requirements to de-couple this functionality. For example,
> filesystems with more than one root (such as btrfs subvolumes) can
> have multiple inode namespaces. This starts to confuse userspace when
> it notices multiple inodes with the same inode/device tuple on a
> filesystem.

Devil's Advocate - I'm not looking at the code, I'm commenting on
architectural issues I see here.

The XFS subvolume work I've been doing explicitly uses a superblock
per subvolume. That's because subvolumes are designed to be
completely independent of the backing storage - they know nothing
about the underlying storage except to share a BDI for writeback
purposes and write to whatever block device the remapping layer
gives them at IO time.  Hence XFS subvolumes have (at this point)
their own unique s_dev, on-disk format configuration, journal, space
accounting, etc. i.e. They are fully independent filesystems in
their own right, and as such we do not have multiple inode
namespaces per superblock.

So this doesn't sound like a "subvolume problem" - it's a "how do we
sanely support multiple independent namespaces per superblock"
problem. AFAICT, this same problem exists with bind mounts and mount
namespaces - they are effectively multiple roots on a single
superblock, but it's done at the vfsmount level and so the
superblock knows nothing about them.

So this kinda feels like there's still an impedance mismatch between
btrfs subvolumes being mounted as subtrees on the underlying root
vfsmount rather than being created as truly independent vfs
namespaces that share a superblock. To put that as a question: why
aren't btrfs subvolumes vfsmounts in their own right, with the unique
subvolume information stored in (or obtained from) the vfsmount?

> In addition, it's currently impossible for a filesystem subvolume to
> have a different security context from it's parent. If we could allow
> for subvolumes to optionally specify their own security context, we
> could use them as containers directly instead of having to go through
> an overlay.

Again, XFS subvolumes don't have this problem. So really we need to
frame this discussion in terms of supporting multiple namespaces
within a superblock sanely, not subvolumes.

> I ran into this particular problem with respect to Btrfs some years
> ago and sent out a very naive set of patches which were (rightfully)
> not incorporated:
> 
> https://marc.info/?l=linux-btrfs=130074451403261=2
> https://marc.info/?l=linux-btrfs=130532890824992=2
> 
> During the discussion, one question did come up - why can't
> filesystems like Btrfs use a superblock per subvolume? There's a
> couple of problems with that:
> 
> - It's common for a single Btrfs filesystem to have thousands of
>   subvolumes. So keeping a superblock for each subvol in memory would
>   get prohibively expensive - imagine having 8000 copies of struct
>   super_block for a file system just because we wanted some separation
>   of say, s_dev.

That's no different to using individual overlay mounts for the
thousands of containers that are on the system. This doesn't seem to
be a major problem...

> - Writeback would also have to walk all of these superblocks -
>   again not very good for system performance.

Background writeback is backing device focussed, not superblock
focussed. It will only iterate the superblocks that have dirty
inodes on the bdi writeback lists, not all the superblocks on the
bdi. IOWs, this isn't a major problem except for sync() operations
that iterate superblocks.

> - Anyone wanting to lock down I/O on a filesystem would have to
> freeze all the superblocks. This goes for most things related to
> I/O really - we simply can't afford to have the kernel walking
> thousands of superblocks to sync a single fs.

Not with XFS subvolumes. Freezing the underlying parent filesystem
will effectively stop all IO from the mounted subvolumes by freezing
remapping calls before IO. Sure, those subvolumes aren't in a
consistent state, but we don't freeze userspace so none of the
application data is ever in a consistent state when filesystems are
frozen.

So, again, I'm not sure there's a /subvolume/ problem here. There's
definitely a "freeze hierarchy" problem, but that already exists and
it's something we talked about at LSFMM because we need to solve it
for reliable hibernation.

> It's far more efficient then to pull those fields we need for a
> subvolume namespace into their own structure.

I'm not convinced yet - it still feels like it's the wrong layer to
be solving the multiple namespace per superblock problem


Re: Symlink not persisted even after fsync

2018-04-16 Thread Dave Chinner
On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote:
> Thanks! As I mentioned before, this is useful. I have a follow-up
> question. Consider the following workload:
> 
>  creat foo
>  link (foo, A/bar)
>  fsync(foo)
>  crash
> 
> In this case, after the file system recovers, do we expect foo's link
> count to be 2 or 1? 

So, strictly ordered behaviour:

create foo:
- creates dirent in inode B and new inode A in an atomic
  transaction sequence #1

link foo -> A/bar
- creates dirent in inode C and bumps inode A link count in
  an atomic transaction sequence #2.

fsync foo
- looks at inode A, sees its "last modification" sequence
  counter as #2
- flushes all transactions up to and including #2 to the
  journal.

See the dependency chain? Both the inodes and dirents in the create
operation and the link operation are chained to the inode foo via
the atomic transactions. Hence when we flush foo, we also flush the
dependent changes because of the change atomicity requirements

> I would say 2,

Correct, for strict ordering. But

> but POSIX is silent on this,

Well, it's not silent, POSIX explicitly allows for fsync() to do
nothing and report success. Hence we can't really look to POSIX to
define how fsync() should behave.

> so
> thought I would confirm. The tricky part here is we are not calling
> fsync() on directory A.

Right. But directory A has a dependent change linked to foo. If we
fsync() foo, we are persisting the link count change in that file,
and hence all the other changes related to that link count change
must also be flushed. Similarly, all the changes related to the
creation of foo must be flushed, too.

> In this case, its not a symlink; its a hard link, so I would say the
> link count for foo should be 2.

Right - that's the "reference counted object dependency" I referred
to. i.e. it's a bi-directional atomic dependency - either we show both
the new dirent and the link count change, or we show neither of
them.  Hence fsync on one object implies that we are also persisting
the related changes in the other object, too.

> But btrfs and F2FS show link count of
> 1 after a crash.

That may be valid if the dirent A/bar does not exist after recovery,
but it also means fsync() hasn't actually guaranteed inode changes
made prior to the fsync to be persistent on disk. i.e. that's a
violation of ordered metadata semantics and probably a bug.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Symlink not persisted even after fsync

2018-04-14 Thread Dave Chinner
er it is even valid.

There is no way reliable ordering dependencies can be created for
indirect references, especially as symlinks can point to any type of
object (e.g. dir, blkdev, etc), it can point to something outside
the filesystem, and it can even point to something that doesn't
exist.

This also means that "fsync on a symlink" may, in fact, run a fsync
method of a completely different filesystem or subsystem. There is
no way this could possibly trigger a directory fsync of the symlink
parent, because the object being fsync()d may not even know what a
filesystem is...

If you want a symlink to have ordering behaviour like a dirent
pointing to a regular file, then use hard links

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Symlink not persisted even after fsync

2018-04-13 Thread Dave Chinner
On Fri, Apr 13, 2018 at 09:39:27AM -0500, Jayashree Mohan wrote:
> Hey Dave,
> 
> Thanks for clarifying the crash recovery semantics of strictly
> metadata ordered filesystems. We had a follow-up question in this
> case.
> 
> On Fri, Apr 13, 2018 at 8:16 AM, Amir Goldstein <amir7...@gmail.com> wrote:
> > On Fri, Apr 13, 2018 at 3:54 PM, Vijay Chidambaram <vi...@cs.utexas.edu> 
> > wrote:
> >> Hi Amir,
> >>
> >> Thanks for the reply!
> >>
> >> On Fri, Apr 13, 2018 at 12:52 AM, Amir Goldstein <amir7...@gmail.com> 
> >> wrote:
> >>>
> >>> Not a bug.
> >>>
> >>> From man 2 fsync:
> >>>
> >>> "Calling  fsync() does not necessarily ensure that the entry in the
> >>>  directory containing the file has also reached disk.  For that an
> >>>  explicit fsync() on a file descriptor for the directory is also needed."
> >>
> >>
> >> Are we understanding this right:
> >>
> >> ext4 and xfs fsync the parent directory if a sym link file is fsync-ed. But
> >> btrfs does not. Is this what we are seeing?
> >
> > Nope.
> >
> > You are seeing an unintentional fsync of parent, because both
> > parent update and symlink update are metadata updates that are
> > tracked by the same transaction.
> >
> > fsync of symlink forces the current transaction to the journal,
> > pulling in the parent update with it.
> >
> >
> >>
> >> I agree that fsync of a file does not mean fsync of its directory entry, 
> >> but
> >> it seems odd to do it for regular files and not for sym links. We do not 
> >> see
> >> this behavior if we use a regular file instead of a sym link file.
> >>
> >
> > fsync of regular file behaves differently than fsync of non regular file.
> > I suggest this read:
> > https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
> >
> >>>
> >>> There is a reason why this behavior is not being reproduces in
> >>> ext4/xfs, but you should be able to reproduce a similar issue
> >>> like this:
> >>>
> >>>
> >>> 1. symlink (foo, bar.tmp)
> >>> 2. open bar.tmp
> >>> 3. fsync bar.tmp
> >>> 4. rename(bar.tmp, bar)
> >>> 5. fsync bar
> >>> crash here
> >>
> 
> Going by your argument that all previous transactions that referenced
> the file being fsync-ed needs to be committed, should we expect xfs
> (and ext4) to persist file bar in this case?

No, that's not what I'm implying. I'm implying that there is
specific ordering dependencies that govern this behaviour, and
assuming that what the fsync man page says about files applies to
symlinks is not a valid assumption because files and symlinks are
not equivalent objects.

In these cases, you first have to ask "what are we actually running
fsync on?"

The fsync is being performed on the inode the symlink points to, not
the symlink. You can't directly open a symlink to fsync the symlink.
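
A small sketch of what that means in syscall terms (illustrative; "bar"
is a symlink pointing at "foo"):

#include <fcntl.h>
#include <unistd.h>

void fsync_via_symlink(void)
{
	int fd = open("bar", O_WRONLY);	/* path resolution follows the
					 * symlink to foo's inode */
	fsync(fd);			/* acts on foo; the symlink "bar"
					 * itself is not what is synced */
	close(fd);
}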

Then you have to ask "what is the dependency chain between the
parent directory, the symlink and the file it points to?"

the short answer is that symlinks have no direct relationship to the
object they point to. i.e. symlinks contain a path, not a reference
to a specific filesystem object.

IOWs, symlinks are really a directory construct, not a file.
However, there is no ordering dependency between a symlink and what
it points to. symlinks contain a path which needs to be resolved to
find out what it points to, and that may not even exist. Files have
no reference to symlinks that point at them, so there's no way we
can create an ordering dependency between file updates and any
symlink that points to them.

Directories, OTOH, contain a pointer to a reference counted object
(an inode) in their dirents. hence if you add/remove directory
dirents that point to an inode, you also have to modify the inode
link counts as it records how many directory entries point at it.
That's a bi-directional atomic modification ordering dependency
between directories and inodes they point at.
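
In userspace terms, that bi-directional relationship looks like this
(an illustrative sketch):

#include <sys/stat.h>
#include <unistd.h>

void dirent_nlink_sketch(void)
{
	struct stat st;

	link("foo", "baz");	/* new dirent pointing at foo's inode... */
	stat("foo", &st);	/* ...and st_nlink went up with it - the
				 * two changes are a single atomic unit */
	unlink("baz");		/* removing the dirent drops the count */
}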

So when we look at symlinks, the parent directory has an ordering
dependency with the symlink inode, not whatever is found by
resolving the path in the symlink data. IOWs, there is no ordering
relationship between the symlink's parent directory and whatever the
symlink points at. i.e. it's a one-way relationship, and so there is
no reverse ordering dependency that requires fsync() on the file to
force synchronisation of a symlink it knows nothing about.

i.e. the ordering dependency that exists with symlinks is between
the symlink and its parent directory, not whatever the symlink
points to. Hence fsyncing whatever the symlink points to does not
guarantee that the symlink is made stable because the symlink is not
part of the dependency chain of the object being fsync()d

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Symlink not persisted even after fsync

2018-04-13 Thread Dave Chinner
On Fri, Apr 13, 2018 at 08:52:19AM +0300, Amir Goldstein wrote:
> On Thu, Apr 12, 2018 at 8:51 PM, Jayashree Mohan
> <jayashree2...@gmail.com> wrote:
> > Hi,
> >
> > We came across what seems to be a crash consistency bug on btrfs and
> > f2fs. When we create a symlink for a file and fsync the symlink, on
> > recovery from crash, the fsync-ed file is missing.
> >
> > You can reproduce this behaviour using this minimal workload :
> >
> > 1. symlink (foo, bar)
> > 2. open bar
> > 3. fsync bar
> > crash here
> >
> > When we recover, we find that file bar is missing. This behaviour
> > seems unexpected as the fsynced file is lost on a crash. ext4 and xfs
> > correctly recovers file bar. This seems like a bug. If not, could you
> > explain why?
> >
> 
> Not a bug.

Actually, for a filesystem with strictly ordered metadata recovery
semantics, it is a bug.

> From man 2 fsync:
> 
> "Calling  fsync() does not necessarily ensure that the entry in the
>  directory containing the file has also reached disk.  For that an
>  explicit fsync() on a file descriptor for the directory is also needed."

We've been through this before, many times. This caveat does not
apply to strictly ordered metadata filesystems. If you fsync a file
on an ordered metadata filesystem, then all previous transactions
that are needed to reference the file are also committed.

The behaviour from ext4 and XFS is correct for strictly ordered
filesystems.  This is not a "fsync requirement", nor is it a general
linux filesystem requirement. It is a requirement of the desired
filesystem crash recovery mechanisms

BTRFS is advertised as having strictly ordered metadata
recovery semantics, so it should behave the same way as ext4 and
XFS in tests like these. If it doesn't, then there's filesystem bugs
that need fixing...
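
For reference, the reported workload at the syscall level (a sketch; it
assumes "foo" already exists):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	symlink("foo", "bar");			/* 1. bar -> foo */
	int fd = open("bar", O_WRONLY);		/* 2. resolves through bar */
	fsync(fd);				/* 3. fsync via the symlink */
	/* crash here: the report is that ext4 and XFS recover "bar",
	 * while btrfs and f2fs lose it */
	return 0;
}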

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 2/2] common/rc: raise mixed mode threshold to 1GB

2018-04-11 Thread Dave Chinner
On Wed, Apr 11, 2018 at 10:07:29PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osan...@fb.com>
> 
> generic/427 creates a 256 MB filesystem and then writes a 200 MB file,
> which fails on Btrfs if mixed mode is not enabled. Raise the threshold
> to 1GB, which is where we typically recommend mixed mode.
> 
> Signed-off-by: Omar Sandoval <osan...@fb.com>
> ---
>  common/rc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/common/rc b/common/rc
> index 5dbb8fe5..dc0062de 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -999,7 +999,7 @@ _scratch_mkfs_sized()
>   ;;
>  btrfs)
>   local mixed_opt=
> - (( fssize <= 100 * 1024 * 1024 )) && mixed_opt='--mixed'
> + (( fssize <= 1024 * 1024 * 1024 )) && mixed_opt='--mixed'
>   $MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV
>   ;;
>  jfs)

Makes sense.

Reviewed-by: Dave Chinner <dchin...@redhat.com>

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/2] generic/427: used mixed mode for Btrfs

2018-04-11 Thread Dave Chinner
On Wed, Apr 11, 2018 at 04:47:30PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osan...@fb.com>
> 
> This test creates a 256 MB filesystem and then writes a 200 MB file.
> With separate data and metadata, Btrfs will run out of data space since
> it needs to allocate some metadata space. Use mixed mode, which is the
> recommendation for smaller filesystems.
> 
> Signed-off-by: Omar Sandoval <osan...@fb.com>
> ---
>  tests/generic/427 | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/tests/generic/427 b/tests/generic/427
> index 9cde5f50..b2cb4526 100755
> --- a/tests/generic/427
> +++ b/tests/generic/427
> @@ -55,6 +55,8 @@ _require_test_program "feature"
>  _require_aiodio aio-dio-eof-race
>  
>  # limit the filesystem size, to save the time of filling filesystem
> +# Btrfs needs to use mixed mode for such a small filesystem
> +[ "$FSTYP" = "btrfs" ] && MKFS_OPTIONS="$MKFS_OPTIONS -M"


_scratch_mkfs_sized() should already be adding this for small btrfs
filesystems. Yup, it does:

btrfs)
	local mixed_opt=
	(( fssize <= 100 * 1024 * 1024 )) && mixed_opt='--mixed'
	$MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV
	;;

>  _scratch_mkfs_sized $((256 * 1024 * 1024)) >>$seqres.full 2>&1

But this uses a filesystem larger than the mixed mode threshold in
_scratch_mkfs_sized(). Please update the generic threshold rather
than special case this test.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: generic test for fsync after fallocate

2018-04-09 Thread Dave Chinner
On Mon, Apr 09, 2018 at 11:00:52AM +0100, Filipe Manana wrote:
> On Mon, Apr 9, 2018 at 10:51 AM, Dave Chinner <da...@fromorbit.com> wrote:
> > On Sun, Apr 08, 2018 at 10:07:54AM +0800, Eryu Guan wrote:
> >> On Thu, Apr 05, 2018 at 10:56:14PM +0100, fdman...@kernel.org wrote:
> > You also cannot assume that two separate preallocations beyond EOF
> > are going to be contiguous (i.e. it could be two separate extents.
> 
> I really don't care about that.
> Just want to check that allocated space is not lost, don't care if
> that space is covered by 1, 2 or more extents, or whether they are
> contiguous or not.

Sure, but that's not what the test does :/

> > What you should just be checking is that there are extents allocated
> > covering EOF to 3MB, not the exactly size, shape and type of extents
> > are allocated.
> 
> How to do such check, for a generic test, without using fiemap?

We already do such checks using fiemap, yes?

i.e. there's a bunch of _filter*fiemap() functions in common/punch
which should show you exactly how to do this: filter all the
unwritten/data extents to be the same name, then they coalesce into
single ranges

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: generic test for fsync after fallocate

2018-04-09 Thread Dave Chinner
On Sun, Apr 08, 2018 at 10:07:54AM +0800, Eryu Guan wrote:
> On Thu, Apr 05, 2018 at 10:56:14PM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana <fdman...@suse.com>
> > 
> > Test that fsync operations preserve extents allocated with fallocate(2)
> > that are placed beyond a file's size.
> > 
> > This test is motivated by a bug found in btrfs where unwritten extents
> > beyond the inode's i_size were not preserved after a fsync and power
> > failure. The btrfs bug is fixed by the following patch for the linux
> > kernel:
> > 
> >  "Btrfs: fix loss of prealloc extents past i_size after fsync log replay"
> > 
> > Signed-off-by: Filipe Manana <fdman...@suse.com>
> 
> Hmm, xfs fails this test, while ext4 passes.
> 
> # diff -u tests/generic/483.out 
> /root/workspace/xfstests/results//xfs_4k_crc/generic/483.out.bad
> --- tests/generic/483.out   2018-04-07 23:35:00.55511 +0800
> +++ /root/workspace/xfstests/results//xfs_4k_crc/generic/483.out.bad
> 2018-04-07 23:39:48.780659707 +0800
> @@ -6,5 +6,5 @@
>  0: [0..511]: data
>  1: [512..2559]: unwritten
>  File baz fiemap:
> -0: [0..511]: data
> -1: [512..6143]: unwritten
> +0: [0..895]: data
> +1: [896..6143]: unwritten

Perfectly valid result from the test.

> > +# A file which already has a prealloc extent beyond its size.
> > +# The fsync done on it is motivated by differences in the btrfs 
> > implementation
> > +# of fsync (first fsync has different logic from subsequent fsyncs).
> > +$XFS_IO_PROG -f -c "pwrite -S 0xf1 0 256K" \
> > +-c "falloc -k 256K 768K" \
> > +-c "fsync" \
> > +$SCRATCH_MNT/baz >/dev/null

Delayed allocation extending the file to 256k means we'll have
speculative prealloc of data beyond 256k - it should be somewhere
around 448k IIRC.

Now, falloc doesn't guarantee unwritten extents are allocated - it
just guarantees space is allocated. unwritten extents are an
optimisation to avoid needing to zero the extents pre-allocated
within EOF. i.e. we can quite safely allocate data extents beyond
EOF rather than unwritten extents, and this is not a bug.

In this case, the delalloc extent is completely allocated as written
data, but nothing is written beyond EOF @ 256k.  Hence it reports as
a data extent, not an unwritten extent, and this is a perfectly
valid thing to do.
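
The same sequence in plain syscalls, as a sketch (illustrative; "buf" is
assumed to point at 256k of data), showing why the post-EOF range may
legitimately come back as a data extent:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

void prealloc_sketch(int fd, const char *buf)
{
	pwrite(fd, buf, 256 * 1024, 0);		/* delalloc out to 256k, plus
						 * speculative prealloc
						 * beyond EOF */
	fallocate(fd, FALLOC_FL_KEEP_SIZE, 256 * 1024, 768 * 1024);
	fsync(fd);
	/* Space out to 1MB is guaranteed to be allocated, but it may be
	 * reported as data or unwritten extents, in one piece or several */
}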

> > +# Allocate another extent beyond the size of file baz.
> > +$XFS_IO_PROG -c "falloc -k 1M 2M"\
> > +-c "fsync" \
> > +$SCRATCH_MNT/baz

You also cannot assume that two separate preallocations beyond EOF
are going to be contiguous (i.e. it could be two separate extents).

What you should just be checking is that there are extents allocated
covering EOF to 3MB, not the exact size, shape and type of the extents
that are allocated.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 3/3] fstests: generic: Check the fs after each FUA writes

2018-03-28 Thread Dave Chinner
On Wed, Mar 28, 2018 at 12:40:23PM +0800, Qu Wenruo wrote:
> Basic test case which triggers fsstress with dm-log-writes, and then
> check the fs after each FUA writes.
> With needed infrastructure and special handlers for journal based fs.
> 
> Signed-off-by: Qu Wenruo <w...@suse.com>
.
> For xfs:
> _check_xfs_filesystem: filesystem on /dev/mapper/test-scratch1 is 
> inconsistent (r)
> *** xfs_repair -n output ***
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> - scan filesystem freespace and inode maps...
> - found root inode chunk
> Phase 3 - for each AG...
> - scan (but don't clear) agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> - agno = 1
> - agno = 2
> bad symlink header ino 8409190, file block 0, disk block 1051147
> problem with symbolic link in inode 8409190
> would have cleared inode 8409190
> - agno = 3
> - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - check for inodes claiming duplicate blocks...
> - agno = 0
> - agno = 1
> - agno = 3
> - agno = 2
> entry "lb" in shortform directory 8409188 references free inode 8409190
> would have junked entry "lb" in directory inode 8409188
> bad symlink header ino 8409190, file block 0, disk block 1051147
> problem with symbolic link in inode 8409190
> would have cleared inode 8409190
> No modify flag set, skipping phase 5
> Phase 6 - check inode connectivity...
> - traversing filesystem ...
> entry "lb" in shortform directory inode 8409188 points to free inode 8409190
> would junk entry
> - traversal finished ...
> - moving disconnected inodes to lost+found ...
> Phase 7 - verify link counts...
> Maximum metadata LSN (1:396) is ahead of log (1:63).

That warning implies a write ordering problem - there's a write with
an LSN on disk that does not yet exist in the log. i.e. the last FUA
write to the log had 1:63 in it, yet there's metadata on disk that
could only be *issued* after a REQ_FLUSH|REQ_FUA log write with
1:396 in it was *completed*. If we've only replayed up to the
FUA write with 1:63 in it, then no metadata writes should have been
*issued* with 1:396 in it as the LSN that is stamped into metadata
is only updated on log IO completion
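
Expressed as a sketch of the invariant being violated (hypothetical
variable names and report function; XFS_LSN_CMP is the usual LSN
comparison helper):

/* Metadata stamped with an LSN may only be *issued* after the log
 * write covering that LSN has completed with REQ_PREFLUSH|REQ_FUA. */
if (XFS_LSN_CMP(metadata_lsn, last_replayed_log_lsn) > 0)
	report_write_ordering_bug();	/* e.g. 1:396 on disk vs 1:63 log */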

On first glance, this implies a bug in the underlying device write
replay code

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 19/19] fs: handle inode->i_version more efficiently

2018-01-09 Thread Dave Chinner
On Tue, Jan 09, 2018 at 09:10:59AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> Since i_version is mostly treated as an opaque value, we can exploit that
> fact to avoid incrementing it when no one is watching. With that change,
> we can avoid incrementing the counter on writes, unless someone has
> queried for it since it was last incremented. If the a/c/mtime don't
> change, and the i_version hasn't changed, then there's no need to dirty
> the inode metadata on a write.
> 
> Convert the i_version counter to an atomic64_t, and use the lowest order
> bit to hold a flag that will tell whether anyone has queried the value
> since it was last incremented.
> 
> When we go to maybe increment it, we fetch the value and check the flag
> bit.  If it's clear then we don't need to do anything if the update
> isn't being forced.
> 
> If we do need to update, then we increment the counter by 2, and clear
> the flag bit, and then use a CAS op to swap it into place. If that
> works, we return true. If it doesn't then do it again with the value
> that we fetch from the CAS operation.
> 
> On the query side, if the flag is already set, then we just shift the
> value down by 1 bit and return it. Otherwise, we set the flag in our
> on-stack value and again use cmpxchg to swap it into place if it hasn't
> changed. If it has, then we use the value from the cmpxchg as the new
> "old" value and try again.
> 
> This method allows us to avoid incrementing the counter on writes (and
> dirtying the metadata) under typical workloads. We only need to increment
> if it has been queried since it was last changed.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> Reviewed-by: Jan Kara <j...@suse.cz>

Documentation helps a lot in understanding all this. Thanks for
adding it into the patch!

Acked-by: Dave Chinner <dchin...@redhat.com>
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing

2018-01-09 Thread Dave Chinner
On Tue, Jan 09, 2018 at 09:10:57AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> If XFS_ILOG_CORE is already set then go ahead and increment it.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> Acked-by: Darrick J. Wong <darrick.w...@oracle.com>

Acked-by: Dave Chinner <dchin...@redhat.com>

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 14/19] xfs: convert to new i_version API

2018-01-09 Thread Dave Chinner
On Tue, Jan 09, 2018 at 09:10:54AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> Acked-by: Darrick J. Wong <darrick.w...@oracle.com>

Looks ok, but I haven't tested it at all.

Acked-by: Dave Chinner <dchin...@redhat.com>

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2018-01-03 Thread Dave Chinner
On Wed, Jan 03, 2018 at 02:59:21PM +0100, Jan Kara wrote:
> On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> > I think we could probably block ->write_metadata if necessary via a
> > completion/wakeup style notification when a specific LSN is reached
> > by the log tail, but realistically if there's any amount of data
> > needing to be written it'll throttle data writes because the IO
> > pipeline is being kept full by background metadata writes
> 
> So the problem I'm concerned about is a corner case. Consider a situation
> when you have no dirty data, only dirty metadata but enough of them to
> trigger background writeback. How should metadata writeback behave for XFS
> in this case? Who should be responsible that wb_writeback() just does not
> loop invoking ->write_metadata() as fast as CPU allows until xfsaild makes
> enough progress?
>
> Thinking about this today, I think this looping prevention belongs to
> wb_writeback().

Well, background data writeback can block in two ways. One is during
IO submission when the request queue is full, the other is when all
dirty inodes have had some work done on them and have all been moved
to b_more_io - wb_writeback waits for the __I_SYNC bit to be cleared
on the last(?) inode on that list, hence backing off before
submitting more IO.

IOWs, there's a "during writeback" blocking mechanism as well as a
"between cycles" block mechanism.

> Sadly we don't have much info to decide how long to sleep
> before trying more writeback so we'd have to just sleep for
> <some time> if we found no writeback happened in the last writeback
> round before going through the whole writeback loop again.

Right - I don't think we can provide a generic "between cycles"
blocking mechanism for XFS, but I'm pretty sure we can emulate a
"during writeback" blocking mechanism to avoid busy looping inside
the XFS code.

e.g. if we get a writeback call that asks for 5% to be written,
and we already have a metadata writeback target of 5% in place,
that means we should block for a while. That would emulate request
queue blocking and prevent busy looping in this case
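
A minimal sketch of that idea (hypothetical helper names -
xfs_ail_push_target_percent() and xfs_ail_current_target() don't exist -
and it returns non-zero to indicate progress, per the discussion below):

static long xfs_write_metadata(struct super_block *sb,
			       struct writeback_control *wbc)
{
	struct xfs_mount	*mp = XFS_M(sb);
	xfs_lsn_t		target;

	/* hypothetical: turn the request into an AIL push target that
	 * covers roughly 5% of the dirty metadata */
	target = xfs_ail_push_target_percent(mp, 5);

	if (XFS_LSN_CMP(target, xfs_ail_current_target(mp)) <= 0) {
		/* a push at least this far is already in progress; block
		 * briefly to emulate request queue blocking instead of
		 * returning immediately and busy-looping */
		schedule_timeout_interruptible(HZ / 10);
	} else {
		xfs_ail_push(mp->m_ail, target);
	}
	return 1;	/* some progress, so wb_writeback() won't spin */
}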

> And
> ->write_metadata() for XFS would need to always return 0 (as in "no progress
> made") to make sure this busyloop avoidance logic in wb_writeback()
> triggers. ext4 and btrfs would return number of bytes written from
> ->write_metadata (or just 1 would be enough to indicate some progress in
> metadata writeback was made and busyloop avoidance is not needed).

Well, if we block for a little while, we can indicate that progress
has been made and this whole mess would go away, right?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2018-01-02 Thread Dave Chinner
On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote:
> On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> > On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > > IOWs, treating metadata like it's one great big data inode doesn't
> > > > > seem to me to be the right abstraction to use for this - in most
> > > > > fileystems it's a bunch of objects with a complex dependency tree
> > > > > and unknown write ordering, not an inode full of data that can be
> > > > > sequentially written.
> > > > > 
> > > > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > > > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > > > > sync based operations. That way different filesystems can ignore the
> > > > > parts they don't need simply by not implementing those operations,
> > > > > and the writeback code doesn't need to try to cater for all
> > > > > operations through the one op. The writeback code should be cleaner,
> > > > > the filesystem code should be cleaner, and we can tailor the work
> > > > > guidelines for each operation separately so there's less mismatch
> > > > > between what writeback is asking and how filesystems track dirty
> > > > > metadata...
> > > > 
> > > > I agree that writeback for memory cleaning and writeback for data 
> > > > integrity
> > > > are two very different things especially for metadata. In fact for data
> > > > integrity writeback we already have ->sync_fs operation so there the
> > > > functionality gets duplicated. What we could do is that in
> > > > writeback_sb_inodes() we'd call ->write_metadata only when
> > > > work->for_kupdate or work->for_background is set. That way 
> > > > ->write_metadata
> > > > would be called only for memory cleaning purposes.
> > > 
> > > That makes sense, but I still think we need a better indication of
> > > how much writeback we need to do than just "writeback this chunk of
> > > pages". That "writeback a chunk" interface is necessary to share
> > > writeback bandwidth across numerous data inodes so that we don't
> > > starve any one inode of writeback bandwidth. That's unnecessary for
> > > metadata writeback on a superblock - we don't need to share that
> > > bandwidth around hundreds or thousands of inodes. What we actually
> > > need to know is how much writeback we need to do as a total of all
> > > the dirty metadata on the superblock.
> > > 
> > > Sure, that's not ideal for btrfs and maybe ext4, but we can write a
> > > simple generic helper that converts "flush X percent of dirty
> > > metadata" to a page/byte chunk as the current code does. Doing it
> > > this way allows filesystems to completely internalise the accounting
> > > that needs to be done, rather than trying to hack around a
> > > writeback accounting interface with large impedance mismatches to
> > > how the filesystem accounts for dirty metadata and/or tracks
> > > writeback progress.
> > 
> > Let me think loud on how we could tie this into how memory cleaning
> > writeback currently works - the one with for_background == 1 which is
> > generally used to get amount of dirty pages in the system under control.
> > We have a queue of inodes to write, we iterate over this queue and ask each
> > inode to write some amount (e.g. 64 M - exact amount depends on measured

It's a maximum of 1024 pages per inode.

> > writeback bandwidth etc.). Some amount from that inode gets written and we
> > continue with the next inode in the queue (put this one at the end of the
> > queue if it still has dirty pages). We do this until:
> > 
> > a) the number of dirty pages in the system is below background dirty limit
> >and the number dirty pages for this device is below background dirty
> >limit for this device.
> > b) run out of dirty inodes on this device
> > c) someone queues different type of writeback
> > 
> > And we need to somehow incorporate metadata writeback into this loop. I see
> > two questions here:
> > 
> > 1) When / how often should we ask for metadata writeback?
> > 2) How much to ask to write in one go?
> > 
> > The second question is especially tricky in t

Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-19 Thread Dave Chinner
On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> > > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > This is just one of those things that's going to be slightly shitty.  
> > > It's the
> > > same for memory reclaim, all of those places use pages so we just take
> > > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.
> > 
> > Ok, so that isn't exactly easy to deal with, because all our
> > metadata writeback is based on log sequence number targets (i.e. how
> > far to push the tail of the log towards the current head). We've
> > actually got no idea how pages/bytes actually map to a LSN target
> > because while we might account a full buffer as dirty for memory
> > reclaim purposes (up to 64k in size), we might have only logged 128
> > bytes of it.
> > 
> > i.e. if we are asked to push 2MB of metadata and we treat that as
> > 2MB of log space (i.e. push target of tail LSN + 2MB) we could have
> > logged several tens of megabytes of dirty metadata in that LSN
> > range and have to flush it all. OTOH, if the buffers are fully
> > logged, then that same target might only flush 1.5MB of metadata
> > once all the log overhead is taken into account.
> > 
> > So there's a fairly large disconnect between the "flush N bytes of
> > metadata" API and the "push to a target LSN" that XFS uses for
> > flushing metadata in aged order. I'm betting that extN and otehr
> > filesystems might have similar mismatches with their journal
> > flushing...
> 
> Well, for ext4 it isn't as bad since we do full block logging only. So if
> we are asked to flush N pages, we can easily translate that to number of fs
> blocks and flush that many from the oldest transaction.
> 
> Couldn't XFS just track how much it has cleaned (from reclaim perspective)
> when pushing items from AIL (which is what I suppose XFS would do in
> response to metadata writeback request) and just stop pushing when it has
> cleaned as much as it was asked to?

If only it were that simple :/

To start with, flushing the dirty objects (such as inodes) to their
backing buffers does not mean the object is clean once the
writeback completes. XFS has decoupled in-memory objects with
logical object logging rather than logging physical buffers, and
so they can be modified and dirtied while the inode buffer
is being written back. Hence if we just count things like "buffer
size written" it's not actually a correct account of the amount of
dirty metadata we've cleaned. If we don't get that right, it'll
result in accounting errors and incorrect behaviour.

The bigger problem, however, is that we have no channel to return
flush information from the AIL pushing to whatever caller asked for
the push. Pushing metadata is completely decoupled from every other
subsystem. i.e. the caller asked the xfsaild to push to a specific
LSN (e.g. to free up a certain amount of log space for new
transactions), and *nothing* has any idea of how much metadata we'll
need to write to push the tail of the log to that LSN.

It's also completely asynchronous - there's no mechanism for waiting
on a push to a specific LSN. Anything that needs a specific amount
of log space to be available waits in ordered ticket queues on the
log tail moving forwards. The only interfaces that have access to
the log tail ticket waiting is the transaction reservation
subsystem, which cannot be used during metadata writeback because
that's a guaranteed deadlock vector

Saying "just account for bytes written" assumes directly connected,
synchronous dispatch metadata writeback infrastructure which we
simply don't have in XFS. "just clean this many bytes" doesn't
really fit at all because we have no way of referencing that to the
distance we need to push the tail of the log. An interface that
tells us "clean this percentage of dirty metadata" is much more
useful because we can map that easily to a log sequence number
based push target

> > IOWs, treating metadata like it's one great big data inode doesn't
> > seem to me to be the right abstraction to use for this - in most
> > fileystems it's a bunch of objects with a complex dependency tree
> > and unknown write ordering, not an inode full of data that can be
> > sequentially written.
> > 
> > Maybe we need multiple ops with well defined behaviours. e.g.
> > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > sync based operations. That way different filesystems can igno

Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-18 Thread Dave Chinner
On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> [PATCH] SQUASH: add memory barriers around i_version accesses

Why explicit memory barriers rather than annotating the operations
with the required semantics and getting the barriers the arch
requires automatically?  I suspect this should be using
atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
the atomic_cmpxchg needs to have release semantics to match the
acquire semantics needed for the load of the current value.

From include/linux/atomics.h:

 * For compound atomics performing both a load and a store, ACQUIRE
 * semantics apply only to the load and RELEASE semantics only to the
 * store portion of the operation. Note that a failed cmpxchg_acquire
 * does -not- imply any memory ordering constraints.

Memory barriers hurt my brain. :/

At minimum, shouldn't the atomic op specific barriers be used rather
than full memory barriers? i.e:

> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..39ec9aa9e08e 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -87,6 +87,25 @@ static inline void
>  inode_set_iversion_raw(struct inode *inode, const u64 val)
>  {
>   atomic64_set(&inode->i_version, val);
> + smp_wmb();

smp_mb__after_atomic();
.
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> + smp_rmb();

smp_mb__before_atomic();

> + return atomic64_read(&inode->i_version);
>  }

And, of course, these will require comments explaining what they
match with and what they are ordering.

> @@ -152,7 +171,7 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
>   u64 cur, old, new;
>  
> - cur = (u64)atomic64_read(&inode->i_version);
> + cur = inode_peek_iversion_raw(inode);
>   for (;;) {
>   /* If flag is clear then we needn't do anything */
>   if (!force && !(cur & I_VERSION_QUERIED))

cmpxchg in this loop needs a release barrier so everyone will
see the change?

> @@ -183,23 +202,6 @@ inode_inc_iversion(struct inode *inode)
>   inode_maybe_inc_iversion(inode, true);
>  }
>  
> -/**
> - * inode_peek_iversion_raw - grab a "raw" iversion value
> - * @inode: inode from which i_version should be read
> - *
> - * Grab a "raw" inode->i_version value and return it. The i_version is not
> - * flagged or converted in any way. This is mostly used to access a 
> self-managed
> - * i_version.
> - *
> - * With those filesystems, we want to treat the i_version as an entirely
> - * opaque value.
> - */
> -static inline u64
> -inode_peek_iversion_raw(const struct inode *inode)
> -{
> - return atomic64_read(&inode->i_version);
> -}
> -
>  /**
>   * inode_iversion_need_inc - is the i_version in need of being incremented?
>   * @inode: inode to check
> @@ -248,7 +250,7 @@ inode_query_iversion(struct inode *inode)
>  {
>   u64 cur, old, new;
>  
> - cur = atomic64_read(&inode->i_version);
> + cur = inode_peek_iversion_raw(inode);
>   for (;;) {
>   /* If flag is already set, then no need to swap */
>   if (cur & I_VERSION_QUERIED)

cmpxchg in this loop needs a release barrier so everyone will
see the change on load?
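
For comparison, a sketch of the acquire/release variant suggested above
(illustrative, not the patch under review; constant names as in the
proposed iversion API):

static inline u64
inode_peek_iversion_raw(const struct inode *inode)
{
	/* acquire pairs with the release in the increment path below */
	return atomic64_read_acquire(&inode->i_version);
}

static inline bool
inode_maybe_inc_iversion(struct inode *inode, bool force)
{
	u64 cur = inode_peek_iversion_raw(inode);

	for (;;) {
		u64 old, new;

		if (!force && !(cur & I_VERSION_QUERIED))
			return false;
		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
		/* release orders the new value after prior inode changes */
		old = atomic64_cmpxchg_release(&inode->i_version, cur, new);
		if (old == cur)
			return true;
		cur = old;
	}
}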

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 01/19] fs: new API for handling inode->i_version

2017-12-16 Thread Dave Chinner
On Sat, Dec 16, 2017 at 08:46:38AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> Add a documentation blob that explains what the i_version field is, how
> it is expected to work, and how it is currently implemented by various
> filesystems.
> 
> We already have inode_inc_iversion. Add several other functions for
> manipulating and accessing the i_version counter. For now, the
> implementation is trivial and basically works the way that all of the
> open-coded i_version accesses work today.
> 
> Future patches will convert existing users of i_version to use the new
> API, and then convert the backend implementation to do things more
> efficiently.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> ---
>  include/linux/fs.h | 200 
> ++---
>  1 file changed, 192 insertions(+), 8 deletions(-)

Just a random sunday morning coffee musing

I was just wondering if it would be better to split this stuff out
into its own header file now? include/linux/fs.h is already a
massive header file (~3500 lines) and changes cause tree-wide
rebuilds, so maybe it would be better to split relatively isolated
functionality like this out while it's being reworked and you're
already touching every file that uses it?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-12 Thread Dave Chinner
On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > From: Josef Bacik <jba...@fb.com>
> > > 
> > > Now that we have metadata counters in the VM, we need to provide a way to 
> > > kick
> > > writeback on dirty metadata.  Introduce super_operations->write_metadata. 
> > >  This
> > > allows file systems to deal with writing back any dirty metadata we need 
> > > based
> > > on the writeback needs of the system.  Since there is no inode to key off 
> > > of we
> > > need a list in the bdi for dirty super blocks to be added.  From there we 
> > > can
> > > find any dirty sb's on the bdi we are currently doing writeback on and 
> > > call into
> > > their ->write_metadata callback.
> > > 
> > > Signed-off-by: Josef Bacik <jba...@fb.com>
> > > Reviewed-by: Jan Kara <j...@suse.cz>
> > > Reviewed-by: Tejun Heo <t...@kernel.org>
> > > ---
> > >  fs/fs-writeback.c| 72 
> > > 
> > >  fs/super.c   |  6 
> > >  include/linux/backing-dev-defs.h |  2 ++
> > >  include/linux/fs.h   |  4 +++
> > >  mm/backing-dev.c |  2 ++
> > >  5 files changed, 80 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 987448ed7698..fba703dff678 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct 
> > > bdi_writeback *wb,
> > >   return pages;
> > >  }
> > >  
> > > +static long writeback_sb_metadata(struct super_block *sb,
> > > +   struct bdi_writeback *wb,
> > > +   struct wb_writeback_work *work)
> > > +{
> > > + struct writeback_control wbc = {
> > > + .sync_mode  = work->sync_mode,
> > > + .tagged_writepages  = work->tagged_writepages,
> > > + .for_kupdate= work->for_kupdate,
> > > + .for_background = work->for_background,
> > > + .for_sync   = work->for_sync,
> > > + .range_cyclic   = work->range_cyclic,
> > > + .range_start= 0,
> > > + .range_end  = LLONG_MAX,
> > > + };
> > > + long write_chunk;
> > > +
> > > + write_chunk = writeback_chunk_size(wb, work);
> > > + wbc.nr_to_write = write_chunk;
> > > + sb->s_op->write_metadata(sb, &wbc);
> > > + work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > +
> > > + return write_chunk - wbc.nr_to_write;
> > 
> > Ok, writeback_chunk_size() returns a page count. We've already gone
> > through the "metadata is not page sized" dance on the dirty
> > accounting side, so how are we supposed to use pages to account for
> > metadata writeback?
> > 
> 
> This is just one of those things that's going to be slightly shitty.  It's the
> same for memory reclaim, all of those places use pages so we just take
> METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.

Ok, so that isn't exactly easy to deal with, because all our
metadata writeback is based on log sequence number targets (i.e. how
far to push the tail of the log towards the current head). We've
actually got no idea how pages/bytes actually map to a LSN target
because while we might account a full buffer as dirty for memory
reclaim purposes (up to 64k in size), we might have only logged 128
bytes of it.

i.e. if we are asked to push 2MB of metadata and we treat that as
2MB of log space (i.e. push target of tail LSN + 2MB) we could have
logged several tens of megabytes of dirty metadata in that LSN
range and have to flush it all. OTOH, if the buffers are fully
logged, then that same target might only flush 1.5MB of metadata
once all the log overhead is taken into account.

So there's a fairly large disconnect between the "flush N bytes of
metadata" API and the "push to a target LSN" that XFS uses for
flushing metadata in aged order. I'm betting that extN and other
filesystems might have similar mismatches with their journal
flushing...

> > And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
> > work->tagged_writepages is set, this will basically tell us to f

Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-11 Thread Dave Chinner
On Mon, Dec 11, 2017 at 02:12:28PM -0800, Joe Perches wrote:
> On Tue, 2017-12-12 at 08:43 +1100, Dave Chinner wrote:
> > On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote:
> > > On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote:
> > > > 1. Using lockdep_set_novalidate_class() for anything other
> > > > than device->mutex will throw checkpatch warnings. Nice. (*)
> > > []
> > > > (*) checkpatch.pl is considered mostly harmful round here, too,
> > > > but that's another rant
> > > 
> > > How so?
> > 
> > Short story is that it barfs all over the slightly non-standard
> > coding style used in XFS.
> []
> > This sort of stuff is just lowest-common-denominator noise - great
> > for new code and/or inexperienced developers, but not for working
> > with large bodies of existing code with slightly non-standard
> > conventions.
> 
> Completely reasonable.  Thanks.
> 
> Do you get many checkpatch submitters for fs/xfs?

We used to. Not recently, though.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-11 Thread Dave Chinner
On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> From: Josef Bacik <jba...@fb.com>
> 
> Now that we have metadata counters in the VM, we need to provide a way to kick
> writeback on dirty metadata.  Introduce super_operations->write_metadata.  
> This
> allows file systems to deal with writing back any dirty metadata we need based
> on the writeback needs of the system.  Since there is no inode to key off of 
> we
> need a list in the bdi for dirty super blocks to be added.  From there we can
> find any dirty sb's on the bdi we are currently doing writeback on and call 
> into
> their ->write_metadata callback.
> 
> Signed-off-by: Josef Bacik <jba...@fb.com>
> Reviewed-by: Jan Kara <j...@suse.cz>
> Reviewed-by: Tejun Heo <t...@kernel.org>
> ---
>  fs/fs-writeback.c| 72 
> 
>  fs/super.c   |  6 
>  include/linux/backing-dev-defs.h |  2 ++
>  include/linux/fs.h   |  4 +++
>  mm/backing-dev.c |  2 ++
>  5 files changed, 80 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 987448ed7698..fba703dff678 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback 
> *wb,
>   return pages;
>  }
>  
> +static long writeback_sb_metadata(struct super_block *sb,
> +   struct bdi_writeback *wb,
> +   struct wb_writeback_work *work)
> +{
> + struct writeback_control wbc = {
> + .sync_mode  = work->sync_mode,
> + .tagged_writepages  = work->tagged_writepages,
> + .for_kupdate= work->for_kupdate,
> + .for_background = work->for_background,
> + .for_sync   = work->for_sync,
> + .range_cyclic   = work->range_cyclic,
> + .range_start= 0,
> + .range_end  = LLONG_MAX,
> + };
> + long write_chunk;
> +
> + write_chunk = writeback_chunk_size(wb, work);
> + wbc.nr_to_write = write_chunk;
> + sb->s_op->write_metadata(sb, &wbc);
> + work->nr_pages -= write_chunk - wbc.nr_to_write;
> +
> + return write_chunk - wbc.nr_to_write;

Ok, writeback_chunk_size() returns a page count. We've already gone
through the "metadata is not page sized" dance on the dirty
accounting side, so how are we supposed to use pages to account for
metadata writeback?

And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
work->tagged_writepages is set, this will basically tell us to flush
the entire dirty metadata cache because write_chunk will get set to
LONG_MAX.
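
The relevant logic in writeback_chunk_size(), roughly (a paraphrased
sketch, not a verbatim quote of fs/fs-writeback.c):

if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
	pages = LONG_MAX;	/* sync-type work: no chunking at all */
else {
	pages = min(wb->avg_write_bandwidth / 2,
		    global_wb_domain.dirty_limit / DIRTY_SCOPE);
	pages = min(pages, work->nr_pages);
	pages = round_down(pages + MIN_WRITEBACK_PAGES,
			   MIN_WRITEBACK_PAGES);
}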

IOWs, this would appear to me to change sync() behaviour quite
dramatically on filesystems where ->write_metadata is implemented.
That is, instead of leaving all the metadata dirty in memory and
just forcing the journal to stable storage, filesystems will be told
to also write back all their dirty metadata before sync() returns,
even though it is not necessary to provide correct sync()
semantics

Mind you, writeback invocation is so convoluted now I could easily
be misinterpreting this code, but it does seem to me like this
code is going to have some unintended behaviours

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-11 Thread Dave Chinner
On Sun, Dec 10, 2017 at 08:23:15PM -0800, Matthew Wilcox wrote:
> On Mon, Dec 11, 2017 at 10:57:45AM +1100, Dave Chinner wrote:
> > i.e.  the fact the cmpxchg failed may not have anything to do with a
> > race condtion - it failed because the slot wasn't empty like we
> > expected it to be. There can be any number of reasons the slot isn't
> > empty - the API should not "document" that the reason the insert
> > failed was a race condition. It should document the case that we
> > "couldn't insert because there was an existing entry in the slot".
> > Let the surrounding code document the reason why that might have
> > happened - it's not for the API to assume reasons for failure.
> > 
> > i.e. this API and potential internal implementation makes much
> > more sense:
> > 
> > int
> > xa_store_iff_empty(...)
> > {
> > curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> > if (!curr)
> > return 0;   /* success! */
> > if (!IS_ERR(curr))
> > return -EEXIST; /* failed - slot not empty */
> > return PTR_ERR(curr);   /* failed - XA internal issue */
> > }
> > 
> > as it replaces the existing preload and insert code in the XFS code
> > paths whilst letting us handle and document the "insert failed
> > because slot not empty" case however we want. It implies nothing
> > about *why* the slot wasn't empty, just that we couldn't do the
> > insert because it wasn't empty.
> 
> Yeah, OK.  So, over the top of the recent changes I'm looking at this:
> 
> I'm not in love with xa_store_empty() as a name.  I almost want
> xa_store_weak(), but after my MAP_FIXED_WEAK proposed name got shot
> down, I'm leery of it.  "empty" is at least a concept we already have
> in the API with the comment for xa_init() talking about an empty array
> and xa_empty().  I also considered xa_store_null and xa_overwrite_null
> and xa_replace_null().  Naming remains hard.
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 941f38bb94a4..586b43836905 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -451,7 +451,7 @@ xfs_iget_cache_miss(
>   int flags,
>   int lock_flags)
>  {
> - struct xfs_inode*ip, *curr;
> + struct xfs_inode*ip;
>   int error;
>   xfs_agino_t agino = XFS_INO_TO_AGINO(mp, ino);
>   int iflags;
> @@ -498,8 +498,7 @@ xfs_iget_cache_miss(
>   xfs_iflags_set(ip, iflags);
>  
>   /* insert the new inode */
> - curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> - error = __xa_race(curr, -EAGAIN);
> + error = xa_store_empty(&pag->pag_ici_xa, agino, ip, GFP_NOFS, -EAGAIN);
>   if (error)
>   goto out_unlock;

Can we avoid encoding the error to be returned into the function
call? No other generic/library API does this, so this seems like a
highly unusual special snowflake approach. I just don't see how
creating a whole new error specification convention is justified
for the case where it *might* save a line or two of error checking
code in a caller?

I'd much prefer that the API defines the error on failure, and let
the caller handle that specific error however they need to like
every other library interface we have...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-11 Thread Dave Chinner
On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote:
> On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote:
> > 1. Using lockdep_set_novalidate_class() for anything other
> > than device->mutex will throw checkpatch warnings. Nice. (*)
> []
> > (*) checkpatch.pl is considered mostly harmful round here, too,
> > but that's another rant
> 
> How so?

Short story is that it barfs all over the slightly non-standard
coding style used in XFS.  It generates enough noise on incidental
things we aren't important that it complicates simple things. e.g. I
just moved a block of defines from one header to another, and
checkpatch throws about 10 warnings on that because of whitespace.
I'm just moving code - I don't want to change it and it doesn't need
to be modified because it is neat and easy to read and is obviously
correct. A bunch of prototypes I added another parameter to gets
warnings because it uses "unsigned" for an existing parameter that
I'm not changing. And so on.

This sort of stuff is just lowest-common-denominator noise - great
for new code and/or inexperienced developers, but not for working
with large bodies of existing code with slightly non-standard
conventions. Churning *lots* of code we otherwise wouldn't touch
just to shut up checkpatch warnings takes time, risks regressions
and makes it harder to trace the history of the code when we are
trying to track down bugs.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-10 Thread Dave Chinner
On Fri, Dec 08, 2017 at 03:01:31PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 07, 2017 at 11:38:43AM +1100, Dave Chinner wrote:
> > > > cmpxchg is for replacing a known object in a store - it's not really
> > > > intended for doing initial inserts after a lookup tells us there is
> > > > nothing in the store.  The radix tree "insert only if empty" makes
> > > > sense here, because it naturally takes care of lookup/insert races
> > > > via the -EEXIST mechanism.
> > > > 
> > > > I think that providing xa_store_excl() (which would return -EEXIST
> > > > if the entry is not empty) would be a better interface here, because
> > > > it matches the semantics of lookup cache population used all over
> > > > the kernel
> > > 
> > > I'm not thrilled with xa_store_excl(), but I need to think about that
> > > a bit more.
> > 
> > Not fussed about the name - I just think we need a function that
> > matches the insert semantics of the code
> 
> I think I have something that works better for you than returning -EEXIST
> (because you don't actually want -EEXIST, you want -EAGAIN):
> 
> /* insert the new inode */
> -   spin_lock(&pag->pag_ici_lock);
> -   error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
> -   if (unlikely(error)) {
> -   WARN_ON(error != -EEXIST);
> -   XFS_STATS_INC(mp, xs_ig_dup);
> -   error = -EAGAIN;
> -   goto out_preload_end;
> -   }
> -   spin_unlock(&pag->pag_ici_lock);
> -   radix_tree_preload_end();
> +   curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> +   error = __xa_race(curr, -EAGAIN);
> +   if (error)
> +   goto out_unlock;
> 
> ...
> 
> -out_preload_end:
> -   spin_unlock(&pag->pag_ici_lock);
> -   radix_tree_preload_end();
> +out_unlock:
> +   if (error == -EAGAIN)
> +   XFS_STATS_INC(mp, xs_ig_dup);
> 
> I've changed the behaviour slightly in that returning an -ENOMEM used to
> hit a WARN_ON, and I don't think that's the right response -- GFP_NOFS
> returning -ENOMEM probably gets you a nice warning already from the
> mm code.

It's been a couple of days since I first looked at this, and my
initial reaction of "that's horrible!" hasn't changed.

What you are proposing here might be a perfectly reasonable
*internal implementation* of xa_store_excl(), but it makes for a
terrible external API because the semantics and behaviour are so
vague. e.g. what does "race" mean here with respect to an insert
failure?

i.e.  the fact the cmpxchg failed may not have anything to do with a
race condition - it failed because the slot wasn't empty like we
expected it to be. There can be any number of reasons the slot isn't
empty - the API should not "document" that the reason the insert
failed was a race condition. It should document the case that we
"couldn't insert because there was an existing entry in the slot".
Let the surrounding code document the reason why that might have
happened - it's not for the API to assume reasons for failure.

i.e. this API and potential internal implementation makes much
more sense:

int
xa_store_iff_empty(...)
{
	curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
	if (!curr)
		return 0;	/* success! */
	if (!IS_ERR(curr))
		return -EEXIST;	/* failed - slot not empty */
	return PTR_ERR(curr);	/* failed - XA internal issue */
}

as it replaces the existing preload and insert code in the XFS code
paths whilst letting us handle and document the "insert failed
because slot not empty" case however we want. It implies nothing
about *why* the slot wasn't empty, just that we couldn't do the
insert because it wasn't empty.
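
With a helper like that, the call site sketch from earlier in the
thread would read something like (illustrative):

	/* insert the new inode */
	error = xa_store_iff_empty(&pag->pag_ici_xa, agino, ip, GFP_NOFS);
	if (error == -EEXIST) {
		XFS_STATS_INC(mp, xs_ig_dup);
		error = -EAGAIN;
		goto out_unlock;
	}
	if (error)
		goto out_unlock;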

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Lockdep is less useful than it was

2017-12-08 Thread Dave Chinner
On Fri, Dec 08, 2017 at 10:14:38AM -0800, Matthew Wilcox wrote:
> At the moment, the radix tree actively disables the RCU checking that
> enabling lockdep would give us.  It has to, because it has no idea what
> lock protects any individual access to the radix tree.  The XArray can
> use the RCU checking because it knows that every reference is protected
> by either the spinlock or the RCU lock.
> 
> Dave was saying that he has a tree which has to be protected by a mutex
> because of where it is in the locking hierarchy, and I was vigorously
> declining his proposal of allowing him to skip taking the spinlock.

Oh, I wasn't suggesting that you remove the internal tree locking
because we need external locking.

I was trying to point out that the internal locking doesn't remove
the need for external locking,  and that there are cases where
smearing the internal lock outside the XA tree doesn't work, either.
i.e. internal locking doesn't replace all the cases where external
locking is required, and so it's less efficient than the existing
radix tree code.

What I was questioning was the value of replacing the radix tree
code with a less efficient structure just to add lockdep validation
to a tree that doesn't actually need any extra locking validation...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-08 Thread Dave Chinner
On Fri, Dec 08, 2017 at 12:35:07PM -0500, Alan Stern wrote:
> On Fri, 8 Dec 2017, Byungchul Park wrote:
> 
> > I'm sorry to hear that.. If I were you, I would also get
> > annoyed. And.. thanks for explanation.
> > 
> > But, I think assigning lock classes properly and checking
> > relationship of the classes to detect deadlocks is reasonable.
> > 
> > In my opinion about the common lockdep stuff, there are 2
> > problems on it.
> > 
> > 1) Firstly, it's hard to assign lock classes *properly*. By
> > default, it relies on the caller site of lockdep_init_map(),
> > but we need to assign another class manually, where ordering
> > rules are complicated so cannot rely on the caller site. That
> > *only* can be done by experts of the subsystem.

Sure, but that's not the issue here. The issue here is the lack of
communication with subsystem experts and that the annotation
complexity warnings given immediately by the subsystem experts were
completely ignored...

> > I think if they want to get benifit from lockdep, they have no
> > choice but to assign classes manually with the domain knowledge,
> > or use *lockdep_set_novalidate_class()* to invalidate locks
> > making the developers annoyed and not want to use the checking
> > for them.
> 
> Lockdep's no_validate class is used when the locking patterns are too
> complicated for lockdep to understand.  Basically, it tells lockdep to
> ignore those locks.

Let me just point out two things here:

1. Using lockdep_set_novalidate_class() for anything other
than device->mutex will throw checkpatch warnings. Nice. (*)

2. lockdep_set_novalidate_class() is completely undocumented
- it's the first I've ever heard of this functionality. i.e.
nobody has ever told us there is a mechanism to turn off
validation of an object; we've *always* been told to "change
your code and/or fix your annotations" when discussing
lockdep deficiencies. (**)
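
For reference, the device core usage quoted below boils down to a
single line in device_initialize() - roughly:

        /* drivers/base/core.c, device_initialize() */
        mutex_init(&dev->mutex);
        lockdep_set_novalidate_class(&dev->mutex);

One line of annotation, and zero documentation telling anyone else
when it is or isn't acceptable to do the same thing.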

> The device core uses that class.  The tree of struct devices, each with
> its own lock, gets used in many different and complicated ways.  
> Lockdep can't understand this -- it doesn't have the ability to
> represent an arbitrarily deep hierarchical tree of locks -- so we tell
^^

That largely describes the in-memory structure of XFS, except we
have a forest of lock trees, not just one

> it to ignore the device locks.
> 
> It sounds like XFS may need to do the same thing with its semaphores.

Who-ever adds semaphore checking to lockdep can add those
annotations. The externalisation of the development cost of new
lockdep functionality is one of the problems here.

-Dave.

(*) checkpatch.pl is considered mostly harmful round here, too,
but that's another rant

(**) the frequent occurrence of "core code/devs aren't held to the
same rules/standard as everyone else" is another rant I have stored
up for a rainy day.

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-07 Thread Dave Chinner
On Fri, Dec 08, 2017 at 01:45:52PM +0900, Byungchul Park wrote:
> On Fri, Dec 08, 2017 at 09:22:16AM +1100, Dave Chinner wrote:
> > On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote:
> > > On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> > > > > Unfortunately for you, I don't find arguments along the lines of
> > > > > "lockdep will save us" at all convincing.  lockdep already throws
> > > > > too many false positives to be useful as a tool that reliably and
> > > > > accurately points out rare, exciting, complex, intricate locking
> > > > > problems.
> > > > 
> > > > But it does reliably and accurately point out "dude, you forgot to take
> > > > the lock".  It's caught a number of real problems in my own testing that
> > > > you never got to see.
> > > 
> > > The problem is that if it has too many false positives --- and it's
> > > gotten *way* worse with the completion callback "feature", people will
> > > just stop using Lockdep as being too annoying and a waste of developer
> > > time when trying to figure what is a legitimate locking bug versus
> > > lockdep getting confused.
> > > 
> > > I can't even disable the new Lockdep feature which is throwing
> > > lots of new false positives --- it's just all or nothing.
> > > 
> > > Dave has just said he's already stopped using Lockdep, as a result.
> > 
> > This is completely OT, but FYI I stopped using lockdep a long time
> > ago.  We've spent orders of magnitude more time and effort to shut
> > up lockdep false positives in the XFS code than we ever have on
> > locking problems that lockdep has uncovered. And still lockdep
> > throws too many false positives on XFS workloads to be useful to me.
> > 
> > But it's more than that: I understand just how much lockdep *doesn't
> > check* and that means *I know I can't rely on lockdep* for potential
> > deadlock detection. e.g.  it doesn't cover semaphores, which means
> 
> Hello,
> 
> I'm careful in saying the following since you seem to feel not good at
> crossrelease and even lockdep. Now that cross-release has been
> introduced, semaphores can be covered as you might know. Actually, all
> general waiters can.

And all it will do is create a whole bunch more work for us XFS guys
to shut up all the false positive crap that falls out from it
because the locking model we have is far more complex than any of
the lockdep developers thought was necessary to support, just like
happened with the XFS inode annotations all those years ago.

e.g. nobody has ever bothered to ask us what is needed to describe
XFS's semaphore locking model.  If you did that, you'd know that we
nest *thousands* of locked semaphores in completely random lock
order during metadata buffer writeback. And that this lock order
does not reflect the actual locking order rules we have for locking
buffers during transactions.

Oh, and you'd also know that a semaphore's lock order and context
can change multiple times during the lifetime of the buffer.  Say
we free a block and then reallocate it as something else before it is
reclaimed - that buffer now might have a different lock order. Or
maybe we promote a buffer to be a root btree block as a result of a
join - it's now the first buffer in a lock run, rather than a child.
Or we split a tree, and the root is now a node and so no longer is
the first buffer in a lock run. Or that we walk sideways along the
leaf nodes' siblings during searches.  IOWs, there is no well defined
static lock ordering at all for buffers - and therefore semaphores -
in XFS at all.
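
To make that concrete, delwri buffer submission effectively does this
(an illustrative sketch, not the literal fs/xfs/xfs_buf.c code):

        /*
         * Buffers are locked in whatever order they sit on the delwri
         * list - which reflects modification order, not btree order -
         * and the semaphores are released from IO completion context
         * some time later.
         */
        list_for_each_entry_safe(bp, n, &buffer_list, b_list) {
                if (!xfs_buf_trylock(bp))       /* down_trylock() on b_sema */
                        continue;
                xfs_buf_submit(bp);             /* b_sema released at IO completion */
        }

There is simply no static ordering there for lockdep to validate
against.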

And knowing that, you wouldn't simply mention that lockdep can
support semaphores now as though that is necessary to "make it work"
for XFS.  It's going to be much simpler for us to just turn off
lockdep and ignore whatever crap it sends our way than it is to
spend unplanned weeks of our time to try to make lockdep sorta work
again. Sure, we might get there in the end, but it's likely to take
months, if not years like it did with the XFS inode annotations.

> > it has zero coverage of the entire XFS metadata buffer subsystem and
> > the complex locking orders we have for metadata updates.
> > 
> > Put simply: lockdep doesn't provide me with any benefit, so I don't
> > use it...
> 
> Sad..

I don't think you understand. I'll try to explain.

The lockdep infrastructure by itself doesn't make lockdep a useful
tool - it mostly generates false positives because it has no
concept of locking models that don't match its internal tracking
assumptions and/or limitations.

That means if we can't suppress the false positives, then l

Re: Lockdep is less useful than it was

2017-12-07 Thread Dave Chinner
On Thu, Dec 07, 2017 at 02:38:03PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote:
> > The problem is that if it has too many false positives --- and it's
> > gotten *way* worse with the completion callback "feature", people will
> > just stop using Lockdep as being too annoying and a waste of developer
> > time when trying to figure what is a legitimate locking bug versus
> > lockdep getting confused.
> > 
> > I can't even disable the new Lockdep feature which is throwing
> > lots of new false positives --- it's just all or nothing.
> 
> You *can* ... but it's way more hacking Kconfig than you ought to have
> to do (which is a separate rant ...)
> 
> You need to get LOCKDEP_CROSSRELEASE off.  I'd revert patches
> e26f34a407aec9c65bce2bc0c838fabe4f051fc6 and
> b483cf3bc249d7af706390efa63d6671e80d1c09
> 
> I think it was a mistake to force these on for everybody; they have a
> much higher false-positive rate than the rest of lockdep, so as you say
> forcing them on leads to fewer people using *any* of lockdep.
> 
> The bug you're hitting isn't Byungchul's fault; it's an annotation
> problem.  The same kind of annotation problem that we used to have with
> dozens of other places in the kernel which are now fixed.

That's one of the fundamental problems with lockdep - it throws the
difficulty of solving all these new false positives onto the
developers who know nothing about lockdep and don't follow its
development. And until they do solve them - especially in critical
subsystems that everyone uses like the storage stack - lockdep is
essentially worthless.

> If you didn't
> have to hack Kconfig to get rid of this problem, you'd be happier, right?

I'd be much happier if it wasn't turned on by default in the first
place.  We gave plenty of warnings that there were still unsolved
false positive problems with the new checks in the storage stack.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-07 Thread Dave Chinner
On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote:
> On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> > > Unfortunately for you, I don't find arguments along the lines of
> > > "lockdep will save us" at all convincing.  lockdep already throws
> > > too many false positives to be useful as a tool that reliably and
> > > accurately points out rare, exciting, complex, intricate locking
> > > problems.
> > 
> > But it does reliably and accurately point out "dude, you forgot to take
> > the lock".  It's caught a number of real problems in my own testing that
> > you never got to see.
> 
> The problem is that if it has too many false positives --- and it's
> gotten *way* worse with the completion callback "feature", people will
> just stop using Lockdep as being too annoying and a waste of developer
> time when trying to figure what is a legitimate locking bug versus
> lockdep getting confused.
> 
> I can't even disable the new Lockdep feature which is throwing
> lots of new false positives --- it's just all or nothing.
> 
> Dave has just said he's already stopped using Lockdep, as a result.

This is completely OT, but FYI I stopped using lockdep a long time
ago.  We've spent orders of magnitude more time and effort to shut
up lockdep false positives in the XFS code than we ever have on
locking problems that lockdep has uncovered. And still lockdep
throws too many false positives on XFS workloads to be useful to me.

But it's more than that: I understand just how much lockdep *doesn't
check* and that means *I know I can't rely on lockdep* for potential
deadlock detection. e.g.  it doesn't cover semaphores, which means
it has zero coverage of the entire XFS metadata buffer subsystem and
the complex locking orders we have for metadata updates.

Put simply: lockdep doesn't provide me with any benefit, so I don't
use it...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-06 Thread Dave Chinner
On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 07:44:04PM +1100, Dave Chinner wrote:
> > On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote:
> > > That said, using xa_cmpxchg() in the dquot code looked like the right
> > > thing to do?  Since we'd dropped the qi mutex and the ILOCK, it looks
> > > entirely reasonable for another thread to come in and set up the dquot.
> > > But I'm obviously quite ignorant of the XFS internals, so maybe there's
> > > something else going on that makes this essentially a "can't happen".
> > 
> > It's no different to the inode cache code, which drops the RCU
> > lock on lookup miss, instantiates the new inode (maybe reading it
> > off disk), then locks the tree and attempts to insert it. Both cases
> > use "insert if empty, otherwise retry lookup from start" semantics.
> 
> Ah.  I had my focus set a little narrow on the inode cache code and didn't
> recognise the pattern.
> 
> Why do you sleep for one jiffy after encountering a miss, then seeing
> someone else insert the inode for you?

The sleep is a backoff that allows whatever we raced with to
complete, be it a hit that raced with an inode being reclaimed and
removed, or a miss that raced with another insert. Ideally we'd
sleep on the XFS_INEW bit, similar to the vfs I_NEW flag, but it's
not quite that simple with the reclaim side of things...
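
The backoff itself is the trivial bit - from memory, xfs_iget() ends
with something like:

out_error_or_again:
        if (error == -EAGAIN) {
                delay(1);               /* one jiffy back off, then retry */
                goto again;
        }
        xfs_perag_put(pag);
        return error;

Replacing that delay(1) with a proper wait on the XFS_INEW bit is the
part that gets tangled up with the reclaim side.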

> > cmpxchg is for replacing a known object in a store - it's not really
> > intended for doing initial inserts after a lookup tells us there is
> > nothing in the store.  The radix tree "insert only if empty" makes
> > sense here, because it naturally takes care of lookup/insert races
> > via the -EEXIST mechanism.
> > 
> > I think that providing xa_store_excl() (which would return -EEXIST
> > if the entry is not empty) would be a better interface here, because
> > it matches the semantics of lookup cache population used all over
> > the kernel
> 
> I'm not thrilled with xa_store_excl(), but I need to think about that
> a bit more.

Not fussed about the name - I just think we need a function that
matches the insert semantics of the code

> > > I'm quite happy to have normal API variants that don't save/restore
> > > interrupts.  Just need to come up with good names ... I don't think
> > > xa_store_noirq() is a good name, but maybe you do?
> > 
> > I'd prefer not to have to deal with such things at all. :P
> > 
> > How many subsystems actually require irq safety in the XA locking
> > code? Make them use irqsafe versions, not make everyone else use
> > "noirq" versions, as is the convention for the rest of the kernel
> > code
> 
> Hard to say how many existing radix tree users require the irq safety.

The mapping tree requires it because it gets called from IO
completion contexts to clear page writeback state, but I don't know
about any of the others.
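
e.g. clearing writeback state does, roughly (current mainline, before
any XArray conversion):

        /* test_clear_page_writeback(), abridged */
        spin_lock_irqsave(&mapping->tree_lock, flags);
        ret = TestClearPageWriteback(page);
        if (ret)
                radix_tree_tag_clear(&mapping->page_tree, page_index(page),
                                     PAGECACHE_TAG_WRITEBACK);
        spin_unlock_irqrestore(&mapping->tree_lock, flags);

and that runs from the IO completion path, so the mapping tree lock
has to be irq safe. Nothing in the XFS inode, dquot or mru trees has
an equivalent completion-time user.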

> Also hard to say how many potential users (people currently using
> linked lists, people using resizable arrays, etc) need irq safety.
> My thinking was "make it safe by default and let people who know better
> have a way to opt out", but there's definitely something to be said for
> "make it fast by default and let people who need the unusual behaviour
> type those extra few letters".
> 
> So, you're arguing for providing xa_store(), xa_store_irq(), xa_store_bh()
> and xa_store_irqsafe()?  (at least on demand, as users come to light?)
> At least the read side doesn't require any variants; everybody can use
> RCU for read side protection.

That would follow the pattern of the rest of the kernel APIs, though
I think it might be cleaner to simply state the locking requirement
to xa_init() and keep all those details completely internal rather
than encoding them into API calls. After all, the "irqsafe-ness" of
the locking needs to be consistent across the entire XA instance

> ("safe", not "save" because I wouldn't make the caller provide the
> "flags" argument).
> 
> > > At least, not today.  One of the future plans is to allow xa_nodes to
> > > be allocated from ZONE_MOVABLE.  In order to do that, we have to be
> > > able to tell which lock protects any given node.  With the XArray,
> > > we can find that out (xa_node->root->xa_lock); with the radix tree,
> > > we don't even know what kind of lock protects the tree.
> > 
> > Yup, this is a prime example of why we shouldn't be creating
> > external dependencies by smearing the locking context outside the XA
> > structure itself. It's not a stretch to se

Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-06 Thread Dave Chinner
On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 02:14:56PM +1100, Dave Chinner wrote:
> > > The other conversions use the normal API instead of the advanced API, so
> > > all of this gets hidden away.  For example, the inode cache does this:
> > 
> > Ah, OK, that's not obvious from the code changes. :/
> 
> Yeah, it's a lot easier to understand (I think!) if you build the
> docs in that tree and look at
> file:///home/willy/kernel/xarray-3/Documentation/output/core-api/xarray.html
> (mutatis mutandi).  I've tried to tell a nice story about how to put
> all the pieces together from the normal to the advanced API.
> 
> > However, it's probably overkill for XFS. In all the cases, when we
> > insert there should be no entry in the tree - the
> > radix tree insert error handling code there was simply catching
> > "should never happen" cases and handling it without crashing.
> 
> I thought it was probably overkill to be using xa_cmpxchg() in the
> pag_ici patch.  I didn't want to take away your error handling as part
> of the conversion, but I think a rational person implementing it today
> would just call xa_store() and not even worry about the return value
> except to check it for IS_ERR().

*nod*

> That said, using xa_cmpxchg() in the dquot code looked like the right
> thing to do?  Since we'd dropped the qi mutex and the ILOCK, it looks
> entirely reasonable for another thread to come in and set up the dquot.
> But I'm obviously quite ignorant of the XFS internals, so maybe there's
> something else going on that makes this essentially a "can't happen".

It's no different to the inode cache code, which drops the RCU
lock on lookup miss, instantiates the new inode (maybe reading it
off disk), then locks the tree and attempts to insert it. Both cases
use "insert if empty, otherwise retry lookup from start" semantics.

cmpxchg is for replacing a known object in a store - it's not really
intended for doing initial inserts after a lookup tells us there is
nothing in the store.  The radix tree "insert only if empty" makes
sense here, because it naturally takes care of lookup/insert races
via the -EEXIST mechanism.

I think that providing xa_store_excl() (which would return -EEXIST
if the entry is not empty) would be a better interface here, because
it matches the semantics of lookup cache population used all over
the kernel
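
i.e. the pattern that code uses, schematically (helper names made up
purely for illustration):

again:
        obj = cache_lookup(key);                /* lockless/RCU lookup */
        if (obj)
                return obj;

        obj = cache_alloc_init(key);            /* may sleep, may do IO */

        spin_lock(&cache->lock);
        error = radix_tree_insert(&cache->tree, key, obj);
        spin_unlock(&cache->lock);
        if (error == -EEXIST) {
                cache_free(obj);                /* lost the race, use the winner's */
                goto again;
        }

An xa_store_excl() that fails with -EEXIST slots straight into that
pattern; cmpxchg return-the-old-pointer semantics don't.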

> > Now that I've looked at this, I have to say that having a return
> > value of NULL meaning "success" is quite counter-intuitive. That's
> > going to fire my "that looks so wrong" detector every time I look at
> > the code and notice it's erroring out on a non-null return value
> > that isn't a PTR_ERR case
> 
> It's the same convention as cmpxchg().  I think it's triggering your
> "looks so wrong" detector because it's fundamentally not the natural
> thing to write.

Most definitely the case, and this is why it's a really bad
interface for the semantics we have. This is how we end up with code
that makes it easy for programmers to screw up pointer checks in
error handling... :/

> I'm quite happy to have normal API variants that don't save/restore
> interrupts.  Just need to come up with good names ... I don't think
> xa_store_noirq() is a good name, but maybe you do?

I'd prefer not to have to deal with such things at all. :P

How many subsystems actually require irq safety in the XA locking
code? Make them use irqsafe versions, not make everyone else use
"noirq" versions, as is the convention for the rest of the kernel
code

> > > It's the design pattern I've always intended to use.  Naturally, the
> > > xfs radix trees weren't my initial target; it was the page cache, and
> > > the page cache does the same thing; uses the tree_lock to protect both
> > > the radix tree and several other fields in that same data structure.
> > > 
> > > I'm open to argument on this though ... particularly if you have a better
> > > design pattern in mind!
> > 
> > I don't mind structures having internal locking - I have a problem
> > with leaking them into contexts outside the structure they protect.
> > That way lies madness - you can't change the internal locking in
> > future because of external dependencies, and the moment you need
> > something different externally we've got to go back to an external
> > lock anyway.
> > 
> > This is demonstrated by the way you converted the XFS dquot tree -
> > you didn't replace the dquot tree lock with the internal xa_lock
> > because it's a mutex and we have to sleep holding it. IOWs, we've
> > added another layer of locking here, not simplif

Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-05 Thread Dave Chinner
On Tue, Dec 05, 2017 at 06:02:08PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 12:36:48PM +1100, Dave Chinner wrote:
> > > - if (radix_tree_preload(GFP_NOFS))
> > > - return -ENOMEM;
> > > -
> > >   INIT_LIST_HEAD(&elem->list_node);
> > >   elem->key = key;
> > >  
> > > - spin_lock(&mru->lock);
> > > - error = radix_tree_insert(&mru->store, key, elem);
> > > - radix_tree_preload_end();
> > > - if (!error)
> > > - _xfs_mru_cache_list_insert(mru, elem);
> > > - spin_unlock(&mru->lock);
> > > + do {
> > > + xas_lock(&xas);
> > > + xas_store(&xas, elem);
> > > + error = xas_error(&xas);
> > > + if (!error)
> > > + _xfs_mru_cache_list_insert(mru, elem);
> > > + xas_unlock(&xas);
> > > + } while (xas_nomem(&xas, GFP_NOFS));
> > 
> > Ok, so why does this have a retry loop on ENOMEM despite the
> > existing code handling that error? And why put such a loop in this
> > code and not any of the other XFS code that used
> > radix_tree_preload() and is arguably much more important to avoid
> > ENOMEM on insert (e.g. the inode cache)?
> 
> If we need more nodes in the tree, xas_store() will try to allocate them
> with GFP_NOWAIT | __GFP_NOWARN.  If that fails, it signals it in xas_error().
> xas_nomem() will notice that we're in an ENOMEM situation, and allocate
> a node using your preferred GFP flags (NOIO in your case).  Then we retry,
> guaranteeing forward progress. [1]
> 
> The other conversions use the normal API instead of the advanced API, so
> all of this gets hidden away.  For example, the inode cache does this:
> 
> +   curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> 
> and xa_cmpxchg internally does:
> 
> do {
> xa_lock_irqsave(xa, flags);
> curr = xas_create(&xas);
> if (curr == old)
> xas_store(&xas, entry);
> xa_unlock_irqrestore(xa, flags);
> } while (xas_nomem(&xas, gfp));

Ah, OK, that's not obvious from the code changes. :/

However, it's probably overkill for XFS. In all the cases, when we
insert there should be no entry in the tree - the
radix tree insert error handling code there was simply catching
"should never happen" cases and handling it without crashing.

Now that I've looked at this, I have to say that having a return
value of NULL meaning "success" is quite counter-intuitive. That's
going to fire my "that looks so wrong" detector every time I look at
the code and notice it's erroring out on a non-null return value
that isn't a PTR_ERR case

Also, there's no need for irqsave/restore() locking contexts here as
we never access these caches from interrupt contexts. That's just
going to be extra overhead, especially on workloads that run 10^6
inodes inodes through the cache every second. That's a problem
caused by driving the locks into the XA structure and then needing
to support callers that require irq safety
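
i.e. with the xa_cmpxchg() internals quoted above, every insert pays
for:

        xa_lock_irqsave(xa, flags);     /* spin_lock_irqsave() on xa->xa_lock */
        /* tree update */
        xa_unlock_irqrestore(xa, flags);

when all these XFS caches ever need is:

        spin_lock(&xa->xa_lock);
        /* tree update */
        spin_unlock(&xa->xa_lock);

The interrupt state save/restore is pure overhead for us.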

> > Also, I really don't like the pattern of using xa_lock()/xa_unlock()
> > to protect access to an external structure. i.e. the mru->lock
> > context is protecting multiple fields and operations in the MRU
> > structure, not just the radix tree operations. Turning that around
> > so that a larger XFS structure and algorithm is now protected by an
> > opaque internal lock from generic storage structure the forms part
> > of the larger structure seems like a bad design pattern to me...
> 
> It's the design pattern I've always intended to use.  Naturally, the
> xfs radix trees weren't my initial target; it was the page cache, and
> the page cache does the same thing; uses the tree_lock to protect both
> the radix tree and several other fields in that same data structure.
> 
> I'm open to argument on this though ... particularly if you have a better
> design pattern in mind!

I don't mind structures having internal locking - I have a problem
with leaking them into contexts outside the structure they protect.
That way lies madness - you can't change the internal locking in
future because of external dependencies, and the moment you need
something different externally we've got to go back to an external
lock anyway.

This is demonstrated by the way you converted the XFS dquot tree -
you didn't replace the dquot tree lock with the internal xa_lock
because it's a mutex and we have to sleep holding it. IOWs, we've
added another layer of locking here, not simplified the code.
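
To spell out what that layering looks like (sketch only - the XA
field name here is made up):

        /* dquot cache insert after the conversion */
        mutex_lock(&qi->qi_tree_lock);          /* external lock, still required */
        curr = xa_cmpxchg(&qi->qi_dquot_xa, id, NULL, dqp, GFP_NOFS);
                                                /* takes xa_lock internally as well */
        mutex_unlock(&qi->qi_tree_lock);

Two locks taken on every insert where the radix tree version needed
one.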

What I really see here is that  we have inconsistent locking
patterns w.r.t. XA stores inside XFS - some have an external mutex
to cover a wider scope, some use xa_

Re: [PATCH v4 00/73] XArray version 4

2017-12-05 Thread Dave Chinner
On Tue, Dec 05, 2017 at 06:05:15PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote:
> > On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> > > From: Matthew Wilcox <mawil...@microsoft.com>
> > > 
> > > I looked through some notes and decided this was version 4 of the XArray.
> > > Last posted two weeks ago, this version includes a *lot* of changes.
> > > I'd like to thank Dave Chinner for his feedback, encouragement and
> > > distracting ideas for improvement, which I'll get to once this is merged.
> > 
> > BTW, you need to fix the "To:" line on your patchbombs:
> > 
> > > To: unlisted-recipients: ;, no To-header on input 
> > > <@gmail-pop.l.google.com> 
> > 
> > This bad email address getting quoted to the cc line makes some MTAs
> > very unhappy.
> 
> I know :-(  I was unhappy when I realised what I'd done.
> 
> https://marc.info/?l=git&m=151252237912266&w=2
> 
> > I'll give this a quick burn this afternoon and see what catches fire...
> 
> All of the things ... 0day gave me a 90% chance of hanging in one
> configuration.  Need to drill down on it more and find out what stupid
> thing I've done wrong this time.

Yup, Bad Stuff happened on boot:

[   24.548039] INFO: rcu_preempt detected stalls on CPUs/tasks:
[   24.548978]  1-...!: (0 ticks this GP) idle=688/0/0 softirq=143/143 fqs=0
[   24.549926]  5-...!: (0 ticks this GP) idle=db8/0/0 softirq=120/120 fqs=0
[   24.550864]  6-...!: (0 ticks this GP) idle=d58/0/0 softirq=111/111 fqs=0
[   24.551802]  8-...!: (5 GPs behind) idle=514/0/0 softirq=189/189 fqs=0
[   24.552722]  10-...!: (84 GPs behind) idle=ac0/0/0 softirq=80/80 fqs=0
[   24.553617]  11-...!: (8 GPs behind) idle=cfc/0/0 softirq=95/95 fqs=0
[   24.554496]  13-...!: (8 GPs behind) idle=b0c/0/0 softirq=82/82 fqs=0
[   24.555382]  14-...!: (38 GPs behind) idle=a7c/0/0 softirq=93/93 fqs=0
[   24.556305]  15-...!: (4 GPs behind) idle=b18/0/0 softirq=88/88 fqs=0
[   24.557190]  (detected by 9, t=5252 jiffies, g=-178, c=-179, q=994)
[   24.558051] Sending NMI from CPU 9 to CPUs 1:
[   24.558703] NMI backtrace for cpu 1 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.559654] Sending NMI from CPU 9 to CPUs 5:
[   24.559675] NMI backtrace for cpu 5 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.560654] Sending NMI from CPU 9 to CPUs 6:
[   24.560689] NMI backtrace for cpu 6 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.561655] Sending NMI from CPU 9 to CPUs 8:
[   24.561701] NMI backtrace for cpu 8 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.562654] Sending NMI from CPU 9 to CPUs 10:
[   24.562675] NMI backtrace for cpu 10 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.563653] Sending NMI from CPU 9 to CPUs 11:
[   24.563669] NMI backtrace for cpu 11 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.564653] Sending NMI from CPU 9 to CPUs 13:
[   24.564670] NMI backtrace for cpu 13 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.565652] Sending NMI from CPU 9 to CPUs 14:
[   24.565674] NMI backtrace for cpu 14 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.566652] Sending NMI from CPU 9 to CPUs 15:
[   24.59] NMI backtrace for cpu 15 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.567653] rcu_preempt kthread starved for 5256 jiffies! 
g18446744073709551438 c18446744073709551437 f0x0 RCU_GP_WAIT_FQS(3) 
->state=0x402 ->7
[   24.567654] rcu_preempt I15128 9  2 0x8000
[   24.567660] Call Trace:
[   24.567679]  ? __schedule+0x289/0x880
[   24.567681]  schedule+0x2f/0x90
[   24.567682]  schedule_timeout+0x152/0x370
[   24.567686]  ? __next_timer_interrupt+0xc0/0xc0
[   24.567689]  rcu_gp_kthread+0x561/0x880
[   24.567691]  ? force_qs_rnp+0x1a0/0x1a0
[   24.567693]  kthread+0x111/0x130
[   24.567695]  ? __kthread_create_worker+0x120/0x120
[   24.567697]  ret_from_fork+0x24/0x30
[   44.064092] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kswapd0:854]
[   44.065920] CPU: 0 PID: 854 Comm: kswapd0 Not tainted 4.15.0-rc2-dgc #228
[   44.067769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[   44.070030] RIP: 0010:smp_call_function_single+0xce/0x100
[   44.071521] RSP: :c90001d2fb20 EFLAGS: 0202 ORIG_RAX: 
ff11
[   44.073592] RAX:  RBX: 88013ab515c8 RCX: c9000350bb20
[   44.075560] RDX: 0001 RSI: c90001d2fb20 RDI: c90001d2fb20
[   44.077531] RBP: c90001d2fb50 R08: 0007 R09: 0080
[   44.079483] R10: c90001d2fb78 R11: c90001d2fb30 R12: c90001d2fc10
[   44.081465] R13: ea000449fc78 R14: ea000449fc58 R15: 88013ba36c40
[   44.083434] FS:  () GS:88013fc0() 
knlGS:
[   44.085683] CS:  0010 DS:  ES:  CR0

Re: [PATCH v4 00/73] XArray version 4

2017-12-05 Thread Dave Chinner
On Wed, Dec 06, 2017 at 01:53:41AM +, Matthew Wilcox wrote:
> Huh, you've caught a couple of problems that 0day hasn't sent me yet.  Try 
> turning on DAX or TRANSPARENT_HUGEPAGE.  Thanks!

Dax is turned on, CONFIG_TRANSPARENT_HUGEPAGE is not.

Looks like nothing is setting CONFIG_RADIX_TREE_MULTIORDER, which is
what xas_set_order() is hidden under.

Ah, CONFIG_ZONE_DEVICE turns it on, not CONFIG_DAX/CONFIG_FS_DAX.

Hmmm.  That seems wrong if it's used in fs/dax.c...

$ grep DAX .config
CONFIG_DAX=y
CONFIG_FS_DAX=y
$ grep ZONE_DEVICE .config
CONFIG_ARCH_HAS_ZONE_DEVICE=y
$

So I have DAX enabled, but not ZONE_DEVICE? Shouldn't DAX be
selecting ZONE_DEVICE, not relying on a user to select both of them
so that stuff works properly? Hmmm - there's no menu option to turn
on zone device, so it's selected by something else?  Oh, HMM turns
on ZONE_DEVICE. But that is "default y", so should be turned on. But
it's not?  And there's no obvious HMM menu config option, either

What a godawful mess Kconfig has turned into.

I'm just going to enable TRANSPARENT_HUGEPAGE - madness awaits me if
I follow the other path down the rat hole

Ok, it builds this time.

-Dave.

> 
> > -Original Message-----
> > From: Dave Chinner [mailto:da...@fromorbit.com]
> > Sent: Tuesday, December 5, 2017 8:51 PM
> > To: Matthew Wilcox <wi...@infradead.org>
> > Cc: Matthew Wilcox <mawil...@microsoft.com>; Ross Zwisler
> > <ross.zwis...@linux.intel.com>; Jens Axboe <ax...@kernel.dk>; Rehas
> > Sachdeva <aquan...@gmail.com>; linux...@kvack.org; linux-
> > fsde...@vger.kernel.org; linux-f2fs-de...@lists.sourceforge.net; linux-
> > ni...@vger.kernel.org; linux-btrfs@vger.kernel.org; 
> > linux-...@vger.kernel.org;
> > linux-...@vger.kernel.org; linux-ker...@vger.kernel.org
> > Subject: Re: [PATCH v4 00/73] XArray version 4
> > 
> > On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote:
> > > On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> > > > From: Matthew Wilcox <mawil...@microsoft.com>
> > > >
> > > > I looked through some notes and decided this was version 4 of the 
> > > > XArray.
> > > > Last posted two weeks ago, this version includes a *lot* of changes.
> > > > I'd like to thank Dave Chinner for his feedback, encouragement and
> > > > distracting ideas for improvement, which I'll get to once this is 
> > > > merged.
> > >
> > > BTW, you need to fix the "To:" line on your patchbombs:
> > >
> > > > To: unlisted-recipients: ;, no To-header on input <@gmail-
> > pop.l.google.com>
> > >
> > > This bad email address getting quoted to the cc line makes some MTAs
> > > very unhappy.
> > >
> > > >
> > > > Highlights:
> > > >  - Over 2000 words of documentation in patch 8!  And lots more 
> > > > kernel-doc.
> > > >  - The page cache is now fully converted to the XArray.
> > > >  - Many more tests in the test-suite.
> > > >
> > > > This patch set is not for applying.  0day is still reporting problems,
> > > > and I'd feel bad for eating someone's data.  These patches apply on top
> > > > of a set of prepatory patches which just aren't interesting.  If you
> > > > want to see the patches applied to a tree, I suggest pulling my git 
> > > > tree:
> > > >
> > > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
> > > > I also left out the idr_preload removals.  They're still in the git 
> > > > tree,
> > > > but I'm not looking for feedback on them.
> > >
> > > I'll give this a quick burn this afternoon and see what catches fire...
> > 
> > Build warnings/errors:
> > 
> > .
> > lib/radix-tree.c:700:13: warning: 'radix_tree_free_nodes' defined but not used
> > [-Wunused-function]
> >  static void radix_tree_free_nodes(struct radix_tree_node *node)
> > .
> > lib/xarray.c: In function 'xas_max':
> > lib/xarray.c:291:16: warning: unused variable 'mask' [-Wunused-variable]
> >   unsigned long mask, max = xas->xa_index;
> >   ^~~~
> > ..
> > fs/dax.c: In function 'grab_mapping_entry':
> > fs/dax.c:305:2: error: implicit declaration of function 'xas_set_order'; did
> > you mean 'xas_set_err'?  [-Werror=implicit-function-declaration]
> >   xas_set_order(&xas, index, size_flag ? PMD_ORDER : 0);
> > ^
> > scripts/Makefile.build:310: recipe for target 'fs/dax.o' failed
> > make[1]: *** [fs/dax.o] Error 1
> > 
> > -Dave.
> > --
> > Dave Chinner
> > da...@fromorbit.com

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 00/73] XArray version 4

2017-12-05 Thread Dave Chinner
On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote:
> On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> > From: Matthew Wilcox <mawil...@microsoft.com>
> > 
> > I looked through some notes and decided this was version 4 of the XArray.
> > Last posted two weeks ago, this version includes a *lot* of changes.
> > I'd like to thank Dave Chinner for his feedback, encouragement and
> > distracting ideas for improvement, which I'll get to once this is merged.
> 
> BTW, you need to fix the "To:" line on your patchbombs:
> 
> > To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com> 
> 
> This bad email address getting quoted to the cc line makes some MTAs
> very unhappy.
> 
> > 
> > Highlights:
> >  - Over 2000 words of documentation in patch 8!  And lots more kernel-doc.
> >  - The page cache is now fully converted to the XArray.
> >  - Many more tests in the test-suite.
> > 
> > This patch set is not for applying.  0day is still reporting problems,
> > and I'd feel bad for eating someone's data.  These patches apply on top
> > of a set of prepatory patches which just aren't interesting.  If you
> > want to see the patches applied to a tree, I suggest pulling my git tree:
> > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
> > I also left out the idr_preload removals.  They're still in the git tree,
> > but I'm not looking for feedback on them.
> 
> I'll give this a quick burn this afternoon and see what catches fire...

Build warnings/errors:

.
lib/radix-tree.c:700:13: warning: 'radix_tree_free_nodes' defined but not used
[-Wunused-function]
 static void radix_tree_free_nodes(struct radix_tree_node *node)
.
lib/xarray.c: In function 'xas_max':
lib/xarray.c:291:16: warning: unused variable 'mask' [-Wunused-variable]
  unsigned long mask, max = xas->xa_index;
  ^~~~
..
fs/dax.c: In function 'grab_mapping_entry':
fs/dax.c:305:2: error: implicit declaration of function 'xas_set_order'; did
you mean 'xas_set_err'?  [-Werror=implicit-function-declaration]
  xas_set_order(&xas, index, size_flag ? PMD_ORDER : 0);
^
scripts/Makefile.build:310: recipe for target 'fs/dax.o' failed
make[1]: *** [fs/dax.o] Error 1

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 00/73] XArray version 4

2017-12-05 Thread Dave Chinner
On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> From: Matthew Wilcox <mawil...@microsoft.com>
> 
> I looked through some notes and decided this was version 4 of the XArray.
> Last posted two weeks ago, this version includes a *lot* of changes.
> I'd like to thank Dave Chinner for his feedback, encouragement and
> distracting ideas for improvement, which I'll get to once this is merged.

BTW, you need to fix the "To:" line on your patchbombs:

> To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com> 

This bad email address getting quoted to the cc line makes some MTAs
very unhappy.

> 
> Highlights:
>  - Over 2000 words of documentation in patch 8!  And lots more kernel-doc.
>  - The page cache is now fully converted to the XArray.
>  - Many more tests in the test-suite.
> 
> This patch set is not for applying.  0day is still reporting problems,
> and I'd feel bad for eating someone's data.  These patches apply on top
> of a set of prepatory patches which just aren't interesting.  If you
> want to see the patches applied to a tree, I suggest pulling my git tree:
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
> I also left out the idr_preload removals.  They're still in the git tree,
> but I'm not looking for feedback on them.

I'll give this a quick burn this afternoon and see what catches fire...

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-05 Thread Dave Chinner
On Tue, Dec 05, 2017 at 04:41:58PM -0800, Matthew Wilcox wrote:
> From: Matthew Wilcox <mawil...@microsoft.com>
> 
> This eliminates a call to radix_tree_preload().

.

>  void
> @@ -431,24 +424,24 @@ xfs_mru_cache_insert(
>   unsigned long   key,
>   struct xfs_mru_cache_elem *elem)
>  {
> + XA_STATE(xas, &mru->store, key);
>   int error;
>  
>   ASSERT(mru && mru->lists);
>   if (!mru || !mru->lists)
>   return -EINVAL;
>  
> - if (radix_tree_preload(GFP_NOFS))
> - return -ENOMEM;
> -
>   INIT_LIST_HEAD(&elem->list_node);
>   elem->key = key;
>  
> - spin_lock(&mru->lock);
> - error = radix_tree_insert(&mru->store, key, elem);
> - radix_tree_preload_end();
> - if (!error)
> - _xfs_mru_cache_list_insert(mru, elem);
> - spin_unlock(&mru->lock);
> + do {
> + xas_lock(&xas);
> + xas_store(&xas, elem);
> + error = xas_error(&xas);
> + if (!error)
> + _xfs_mru_cache_list_insert(mru, elem);
> + xas_unlock(&xas);
> + } while (xas_nomem(&xas, GFP_NOFS));

Ok, so why does this have a retry loop on ENOMEM despite the
existing code handling that error? And why put such a loop in this
code and not any of the other XFS code that used
radix_tree_preload() and is arguably much more important to avoid
ENOMEM on insert (e.g. the inode cache)?

Also, I really don't like the pattern of using xa_lock()/xa_unlock()
to protect access to an external structure. i.e. the mru->lock
context is protecting multiple fields and operations in the MRU
structure, not just the radix tree operations. Turning that around
so that a larger XFS structure and algorithm is now protected by an
opaque internal lock from generic storage structure the forms part
of the larger structure seems like a bad design pattern to me...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace

2017-11-21 Thread Dave Chinner
On Tue, Nov 21, 2017 at 04:52:53AM -0800, Matthew Wilcox wrote:
> On Tue, Nov 21, 2017 at 05:48:15PM +1100, Dave Chinner wrote:
> > On Mon, Nov 20, 2017 at 08:32:40PM -0800, Matthew Wilcox wrote:
> > > On Mon, Nov 20, 2017 at 05:37:53PM -0800, Darrick J. Wong wrote:
> > > > On Tue, Nov 21, 2017 at 09:27:49AM +1100, Dave Chinner wrote:
> > > > > First thing I noticed was that "xa" as a prefix is already quite
> > > > > widely used in XFS - it's shorthand for "XFS AIL". Indeed, xa_lock
> > >
> > > The X stands for 'eXpandable' or 'eXtending'.  I really don't want to
> > > use more than a two-letter acronym for whatever the XArray ends up being
> > > called.  One of the problems with the radix tree is that everything has
> > > that 11-character 'radix_tree_' prefix; just replacing that with 'xa_'
> > > makes a huge improvement to readability.
> > 
> > Yeah, understood. That's why
> > we have very little clear
> > prefix namespace left :/
> > 
> > [ somedays I write something that looks sorta like a haiku, and from
> > that point on everything just starts falling out of my brain that
> > way. I blame Eric for this today! :P ]
> 
> When the namespace is
> tight we must consider the
> competing users.
> 
> The earliest us'r
> has a claim to a prefix
> we are used to it.
> 
> Also a more wide-
> spread user has a claim to
> a shorter prefix.
> 
> Would you mind changing
> your prefix to one only
> one letter longer?

We can do something
like that, though Darrick has the
final say in it.

> ... ok, I give up ;-)

Top marks for effort :)

> All your current usage of the xa_ prefix looks somewhat like this:
> 
> fs/xfs/xfs_trans_ail.c: spin_lock(&ailp->xa_lock);
> 
> with honourable mentions to:
> fs/xfs/xfs_log.c: spin_lock(&mp->m_ail->xa_lock);
> 
> Would you mind if I bolt a patch on to the front of the series called
> something like "free up xa_ namespace" that renamed your xa_* to ail_*?
> There are no uses of the 'ail_' prefix in the kernel today.
> 
> I don't think that
>   spin_lock(&ailp->ail_lock);
> loses any readability.

Not sure that's going to work - there's an "xa_ail" member for the
AIL list head. That would now read in places:

if (list_empty(&ailp->ail_ail))

I'd be inclined to just drop the "xa_" prefix from the XFS
structure.  There is no information loss by removing the prefix in
the XFS code because the pointer name tells us what structure it is
pointing at.
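
i.e. just this (sketch):

struct xfs_ail {
        struct xfs_mount        *mount;
        struct task_struct      *task;
        struct list_head        ail;            /* reads as ailp->ail */
        xfs_lsn_t               target;
        xfs_lsn_t               target_prev;
        struct list_head        cursors;
        spinlock_t              lock;           /* reads as ailp->lock */
        xfs_lsn_t               last_pushed_lsn;
        int                     log_flush;
        struct list_head        buf_list;
        wait_queue_head_t       empty;
};

The "ailp->" in every user already carries the context that the
prefix was providing.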

> 
> By the way, what does AIL stand for?  It'd be nice if it were spelled out
> in at least one of the header files, maybe fs/xfs/xfs_trans_priv.h?

Active Item List. See the section "Tracking Changes" in
Documentation/filesystems/xfs-delayed-logging-design.txt for the
full rundown, but in short:

"The log item is already used to track the log items that have been
written to the log but not yet written to disk. Such log items are
considered "active" and as such are stored in the Active Item List
(AIL) which is a LSN-ordered double linked list. Items are inserted
into this list during log buffer IO completion, after which they are
unpinned and can be written to disk."

The lack of comments describing the AIL is historic - it's never
been documented in the code, nor has the whole relogging concept it
implements. I wrote the document above when introducing the CIL
(Committed Item List) because almost no-one actively working on XFS
understood how the whole journalling subsystem worked in any detail.

> > Zoetrope Array.
> > Labyrinth of illusion.
> > Structure never ends.
> 
> Thank you for making me look up zoetrope ;-)

My pleasure :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace

2017-11-20 Thread Dave Chinner
On Mon, Nov 20, 2017 at 08:32:40PM -0800, Matthew Wilcox wrote:
> On Mon, Nov 20, 2017 at 05:37:53PM -0800, Darrick J. Wong wrote:
> > On Tue, Nov 21, 2017 at 09:27:49AM +1100, Dave Chinner wrote:
> > > First thing I noticed was that "xa" as a prefix is already quite
> > > widely used in XFS - it's shorthand for "XFS AIL". Indeed, xa_lock
> > > already exists and is quite widely used, so having a generic
> > > interface using the same prefixes and lock names is going to be
> > > quite confusing in the XFS code. Especially considering there's
> > > fair bit of radix tree use in XFS (e.g. the internal inode and
> > > dquot caches).
> > > 
> > > FYI, from fs/xfs/xfs_trans_priv.h:
> > > 
> > > /*
> > >  * Private AIL structures.
> > >  *
> > >  * Eventually we need to drive the locking in here as well.
> > >  */
> > > struct xfs_ail {
> > >         struct xfs_mount        *xa_mount;
> > >         struct task_struct      *xa_task;
> > >         struct list_head        xa_ail;
> > >         xfs_lsn_t               xa_target;
> > >         xfs_lsn_t               xa_target_prev;
> > >         struct list_head        xa_cursors;
> > >         spinlock_t              xa_lock;
> > >         xfs_lsn_t               xa_last_pushed_lsn;
> > >         int                     xa_log_flush;
> > >         struct list_head        xa_buf_list;
> > >         wait_queue_head_t       xa_empty;
> > > };
> > > 
> > > FWIW, why is it named "XArray"?  "X" stands for what?  It still
> > > looks like a tree structure to me, but without a design doc I'm a
> > > bit lost to how it differs to the radix tree (apart from the API)
> > > and why it's considered an "array".
> > 
> > /me nominates 'xarr' for the prefix because pirates. :P
> 
> The X stands for 'eXpandable' or 'eXtending'.  I really don't want to
> use more than a two-letter acronym for whatever the XArray ends up being
> called.  One of the problems with the radix tree is that everything has
> that 11-character 'radix_tree_' prefix; just replacing that with 'xa_'
> makes a huge improvement to readability.

Yeah, understood. That's why
we have very little clear
prefix namespace left :/

[ somedays I write something that looks sorta like a haiku, and from
that point on everything just starts falling out of my brain that
way. I blame Eric for this today! :P ]

> I'm open to renaming it, but I want a good name.  I think it needs to
> be *something* array, so we're looking for a prefix used nowhere in the
> kernel.  Running 'git grep' in turn for '\<[a-z]a_' really only allows
> for ja, ya and za.  'ja_' and 'ya_' only have one user, while 'za_'
> has none.  So ... anyone got a good retcon for JArray, YArray or ZArray?

A Yellow Array.
Colour is irrelevant.
The bikeshed looks nice.

> It's considered an array because it behaves "as if" it's an infinite
> array.

Infinite Array.
Namespace prefix collision
this haiku can't solve.

> The fact that we chop it up into chunks of 64 entries, and then
> arrange those chunks into a tree is not something the average user needs
> to concern themselves with.

Jumbo Array. Many
pieces hidden behind walls.
Will anyone notice?

> We have a lot of places in the kernel that
> fuss around with "allocate N entries, whoops, we exceeded N, kmalloc some
> more, oh no kmalloc failed, vmalloc it instead".  So the idea is to give
> these places an interface that acts like an automatically-resizing array.

Zoetrope Array.
Labyrinth of illusion.
Structure never ends.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace

2017-11-20 Thread Dave Chinner
On Mon, Nov 20, 2017 at 01:51:00PM -0800, Matthew Wilcox wrote:
> On Tue, Nov 21, 2017 at 07:26:06AM +1100, Dave Chinner wrote:
> > On Mon, Nov 20, 2017 at 08:18:29AM -0800, Matthew Wilcox wrote:
> > > On Fri, Nov 17, 2017 at 11:39:25AM -0800, Darrick J. Wong wrote:
> > > > If two programs simultaneously try to write to the same part of a file
> > > > via direct IO and buffered IO, there's a chance that the post-diowrite
> > > > pagecache invalidation will fail on the dirty page.  When this happens,
> > > > the dio write succeeded, which means that the page cache is no longer
> > > > coherent with the disk!
> > > 
> > > This seems like a good opportunity to talk about what I've been working
> > > on for solving this problem.  The XArray is going to introduce a set
> > > of entries which can be stored to locations in the page cache that I'm
> > > calling 'wait entries'.
> > 
> > What's this XArray thing you speak of?
> 
> Ah, right, you were on sabbatical at LSFMM this year where I talked
> about it.  Briefly, it's a new API for the radix tree.  The data structure
> is essentially unchanged (minor enhancements), but I'm rationalising
> existing functionality and adding new abilities.  And getting rid of
> misfeatures like the preload API and implicit GFP flags.
> 
> My current working tree is here:
> 
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-11-20

First thing I noticed was that "xa" as a prefix is already quite
widely used in XFS - it's shorthand for "XFS AIL". Indeed, xa_lock
already exists and is quite widely used, so having a generic
interface using the same prefixes and lock names is going to be
quite confusing in the XFS code. Especially considering there's
fair bit of radix tree use in XFS (e.g. the internal inode and
dquot caches).

FYI, from fs/xfs/xfs_trans_priv.h:

/*
 * Private AIL structures.
 *
 * Eventually we need to drive the locking in here as well.
 */
struct xfs_ail {
        struct xfs_mount        *xa_mount;
        struct task_struct      *xa_task;
        struct list_head        xa_ail;
        xfs_lsn_t               xa_target;
        xfs_lsn_t               xa_target_prev;
        struct list_head        xa_cursors;
        spinlock_t              xa_lock;
        xfs_lsn_t               xa_last_pushed_lsn;
        int                     xa_log_flush;
        struct list_head        xa_buf_list;
        wait_queue_head_t       xa_empty;
};


> Ignoring the prep patches, the excitement is all to be found with the
> commits which start 'xarray:'

FWIW, why is it named "XArray"?  "X" stands for what?  It still
looks like a tree structure to me, but without a design doc I'm a
bit lost to how it differs to the radix tree (apart from the API)
and why it's considered an "array".

> If you want an example of it in use, I'm pretty happy with this patch
> that switches the brd driver entirely from the radix tree API to the
> xarray API:
> 
> http://git.infradead.org/users/willy/linux-dax.git/commitdiff/dbf96ae943e43563cbbaa26e21b656b6fe8f4b0f

Looks pretty neat, but I'll reserve judgement for when I see the
conversion of the XFS radix tree code

> I've been pretty liberal with the kernel-doc, but I haven't written out
> a good .rst file to give an overview of how to use it.

Let me know when you've written it :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace

2017-11-20 Thread Dave Chinner
On Mon, Nov 20, 2017 at 08:18:29AM -0800, Matthew Wilcox wrote:
> On Fri, Nov 17, 2017 at 11:39:25AM -0800, Darrick J. Wong wrote:
> > If two programs simultaneously try to write to the same part of a file
> > via direct IO and buffered IO, there's a chance that the post-diowrite
> > pagecache invalidation will fail on the dirty page.  When this happens,
> > the dio write succeeded, which means that the page cache is no longer
> > coherent with the disk!
> 
> This seems like a good opportunity to talk about what I've been working
> on for solving this problem.  The XArray is going to introduce a set
> of entries which can be stored to locations in the page cache that I'm
> calling 'wait entries'.

What's this XArray thing you speak of?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/6] writeback: allow for dirty metadata accounting

2017-11-09 Thread Dave Chinner
On Thu, Nov 09, 2017 at 02:30:57PM -0500, Josef Bacik wrote:
> From: Josef Bacik <jba...@fb.com>
> 
> Provide a mechanism for file systems to indicate how much dirty metadata they
> are holding.  This introduces a few things
> 
> 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> 2) WB stat for dirty metadata.  This way we know if we need to try and call 
> into
> the file system to write out metadata.  This could potentially be used in the
> future to make balancing of dirty pages smarter.

Ok, so when you have 64k page size and 4k metadata block size and
you're using kmalloc() to allocate the storage for the metadata,
how do we make use of all this page-based metadata accounting
stuff?

We could use the dirty metadata accounting infrastructure in
XFS so the mm/ subsystem can push on dirty metadata before we
get into reclaim situations, but I just can't see how this code
works when raw pages are not used to back the metadata cache.

That is, XFS can use various different sizes of metadata buffers in
the same filesystem. For example, we use sector sized buffers for
static AG metadata, filesystem blocks for per-AG metadata, some
multiple (1-16) of filesystem blocks for inode buffers, and some
multiple (1-128) of filesystem blocks for directory blocks.

This means we have a combination of buffers  we need to account for
that are made up of:
heap memory when buffer size < page size,
single pages when buffer size == page size, and
multiple pages when buffer size > page size.

The default filesystem config on a 4k page machine with 512 byte
sectors will create buffers in all three categories above which
tends to indicate we can't use this new generic infrastructure as
proposed.

Any thoughts about how we could efficiently support accounting for
variable sized, non-page based metadata with this generic
infrastructure?
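
For example, an interface that accounts dirty metadata in bytes
rather than pages would cover all three cases - something along these
lines (hypothetical, not existing code):

        /* hypothetical byte-granularity accounting hooks */
        void account_metadata_dirtied(struct backing_dev_info *bdi, long bytes);
        void account_metadata_cleaned(struct backing_dev_info *bdi, long bytes);

        /* XFS buffer dirtying could then do, regardless of buffer size: */
        account_metadata_dirtied(bp->b_target->bt_bdi, BBTOB(bp->b_length));

but that doesn't map back obviously onto the per-page zone and WB
stats this patchset is built around.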

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework

2017-10-09 Thread Dave Chinner
On Mon, Oct 09, 2017 at 09:00:51AM -0400, Josef Bacik wrote:
> On Mon, Oct 09, 2017 at 04:17:31PM +1100, Dave Chinner wrote:
> > On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote:
> > > > Integrating into fstests means it will be immediately available to
> > > > all fs developers, it'll run on everything that everyone already has
> > > > setup for filesystem testing, and it will have familiar mkfs/mount
> > > > option setup behaviour so there's no new hoops for everyone to jump
> > > > through to run it...
> > > > 
> > > 
> > > TBF I specifically made it as easy as possible because I know we all hate 
> > > trying
> > > to learn new shit.
> > 
> > Yeah, it's also hard to get people to change their workflows to add
> > a whole new test harness into them. It's easy if it's just a new
> > command to an existing workflow :P
> > 
> 
> Agreed, so if you probably won't run this outside of fstests then I'll add it 
> to
> xfstests.  I envision this tool as being run by maintainers to verify their 
> pull
> requests haven't regressed since the last set of patches, as well as by 
> anybody
> trying to fix performance problems.  So it's way more important to me that 
> you,
> Ted, and all the various btrfs maintainers will run it than anybody else.
> 
> > > I figured this was different enough to warrant a separate
> > > project, especially since I'm going to add block device jobs so Jens can 
> > > test
> > > block layer things.  If we all agree we'd rather see this in fstests then 
> > > I'm
> > > happy to do that too.  Thanks,
> > 
> > I'm not fussed either way - it's a good discussion to have, though.
> > 
> > If I want to add tests (e.g. my time-honoured fsmark tests), where
> > should I send patches?
> > 
> 
> I beat you to that!  I wanted to avoid adding fs_mark to the suite because it
> means parsing another different set of outputs, so I added a new ioengine to 
> fio
> for this
> 
> http://www.spinics.net/lists/fio/msg06367.html
> 
> and added a fio job to do 500k files
> 
> https://github.com/josefbacik/fsperf/blob/master/tests/500kemptyfiles.fio
> 
> The test is disabled by default for now because obviously the fio support 
> hasn't
> landed yet.

That seems misguided. fio is good, but it's not a universal
solution.

> I'd _like_ to expand fio for cases we come up with that aren't possible, as
> there's already a ton of measurements that are taken, especially around
> latencies.

To be properly useful it needs to support more than just fio to run
tests. Indeed, it's largely useless to me if that's all it can do or
it's a major pain to add support for different tools like fsmark.

e.g. my typical perf regression test - the one you see as the concurrent
fsmark create workload - actually does a lot more. It does:

fsmark to create 50m zero length files
umount,
run parallel xfs_repair (excellent mmap_sem/page fault punisher)
mount
run parallel find -ctime (readdir + lookup traversal)
unmount, mount
run parallel ls -R (readdir + dtype traversal)
unmount, mount
parallel rm -rf of 50m files

I have variants that use small 4k files or large files rather than
empty files, that use different fsync patterns to stress the
log, use grep -R to traverse the data as well as
the directory/inode structure instead of find, etc.
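
As a rough shell sketch - device, mount point, thread counts and the
fs_mark arguments are placeholders for whatever a given setup uses:

#!/bin/bash
dev=/dev/vdb
mnt=/mnt/scratch

mkfs.xfs -f $dev > /dev/null && mount $dev $mnt

# 1. concurrent create of ~50m zero length files
dirs=""
for i in $(seq 0 15); do mkdir -p $mnt/$i; dirs="$dirs -d $mnt/$i"; done
fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 $dirs

# 2. cold cache repair (mmap_sem/page fault punisher)
umount $mnt
xfs_repair $dev
mount $dev $mnt

# 3. readdir + lookup traversal
ls -d $mnt/* | xargs -P 16 -I{} find {} -ctime 1 > /dev/null
umount $mnt; mount $dev $mnt

# 4. readdir + dtype traversal
ls -d $mnt/* | xargs -P 16 -n 1 ls -R > /dev/null
umount $mnt; mount $dev $mnt

# 5. concurrent remove
ls -d $mnt/* | xargs -P 16 -n 1 rm -rf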

> That said I'm not opposed to throwing new stuff in there, it just
> means we have to add stuff to parse the output and store it in the database 
> in a
> consistent way, which seems like more of a pain than just making fio do what 
> we
> need it to.  Thanks,

fio is not going to be able to replace the sort of perf tests I run
from week to week. If that's all it's going to do then it's not
directly useful to me...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework

2017-10-08 Thread Dave Chinner
On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote:
> On Mon, Oct 09, 2017 at 11:51:37AM +1100, Dave Chinner wrote:
> > On Fri, Oct 06, 2017 at 05:09:57PM -0400, Josef Bacik wrote:
> > > Hello,
> > > 
> > > One thing that comes up a lot every LSF is the fact that we have no 
> > > general way
> > > that we do performance testing.  Every fs developer has a set of scripts 
> > > or
> > > things that they run with varying degrees of consistency, but nothing 
> > > central
> > > that we all use.  I for one am getting tired of finding regressions when 
> > > we are
> > > deploying new kernels internally, so I wired this thing up to try and 
> > > address
> > > this need.
> > > 
> > > We all hate convoluted setups, the more brain power we have to put in to 
> > > setting
> > > something up the less likely we are to use it, so I took the xfstests 
> > > approach
> > > of making it relatively simple to get running and relatively easy to add 
> > > new
> > > tests.  For right now the only thing this framework does is run fio 
> > > scripts.  I
> > > chose fio because it already gathers loads of performance data about it's 
> > > runs.
> > > We have everything we need there, latency, bandwidth, cpu time, and all 
> > > broken
> > > down by reads, writes, and trims.  I figure most of us are familiar 
> > > enough with
> > > fio and how it works to make it relatively easy to add new tests to the
> > > framework.
> > > 
> > > I've posted my code up on github, you can get it here
> > > 
> > > https://github.com/josefbacik/fsperf
> > > 
> > > All (well most) of the results from fio are stored in a local sqlite 
> > > database.
> > > Right now the comparison stuff is very crude, it simply checks against the
> > > previous run and it only checks a few of the keys by default.  You can 
> > > check
> > > latency if you want, but while writing this stuff up it seemed that 
> > > latency was
> > > too variable from run to run to be useful in a "did my thing regress or 
> > > improve"
> > > sort of way.
> > > 
> > > The configuration is brain dead simple, the README has examples.  All you 
> > > need
> > > to do is make your local.cfg, run ./setup and then run ./fsperf and you 
> > > are good
> > > to go.
> > 
> > Why re-invent the test infrastructure? Why not just make it a
> > tests/perf subdir in fstests?
> > 
> 
> Probably should have led with that shouldn't I have?  There's nothing keeping 
> me
> from doing it, but I didn't want to try and shoehorn in a python thing into
> fstests.  I need python to do the sqlite and the json parsing to dump into the
> sqlite database.
> 
> Now if you (and others) are not opposed to this being dropped into tests/perf
> then I'll work that up.  But it's definitely going to need to be done in 
> python.
> I know you yourself have said you aren't opposed to using python in the past, 
> so
> if that's still the case then I can definitely wire it all up.

I have no problems with people using python for stuff like this but,
OTOH, I'm not the fstests maintainer anymore :P

> > > The plan is to add lots of workloads as we discover regressions and such. 
> > >  We
> > > don't want anything that takes too long to run otherwise people won't run 
> > > this,
> > > so the existing tests don't take much longer than a few minutes each.  I 
> > > will be
> > > adding some more comparison options so you can compare against averages 
> > > of all
> > > previous runs and such.
> > 
> > Yup, that fits exactly into what fstests is for... :P
> > 
> > Integrating into fstests means it will be immediately available to
> > all fs developers, it'll run on everything that everyone already has
> > setup for filesystem testing, and it will have familiar mkfs/mount
> > option setup behaviour so there's no new hoops for everyone to jump
> > through to run it...
> > 
> 
> TBF I specifically made it as easy as possible because I know we all hate 
> trying
> to learn new shit.

Yeah, it's also hard to get people to change their workflows to add
a whole new test harness into them. It's easy if it's just a new
command to an existing workflow :P

> I figured this was different enough to warrant a separate
> project, especially since I'm going to add block device jobs so Jens can test
> block layer things.  If we all agree we'd rather see this in fstests then I'm
> happy to do that too.  Thanks,

I'm not fussed either way - it's a good discussion to have, though.

If I want to add tests (e.g. my time-honoured fsmark tests), where
should I send patches?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework

2017-10-08 Thread Dave Chinner
On Fri, Oct 06, 2017 at 05:09:57PM -0400, Josef Bacik wrote:
> Hello,
> 
> One thing that comes up a lot every LSF is the fact that we have no general 
> way
> that we do performance testing.  Every fs developer has a set of scripts or
> things that they run with varying degrees of consistency, but nothing central
> that we all use.  I for one am getting tired of finding regressions when we 
> are
> deploying new kernels internally, so I wired this thing up to try and address
> this need.
> 
> We all hate convoluted setups, the more brain power we have to put in to 
> setting
> something up the less likely we are to use it, so I took the xfstests approach
> of making it relatively simple to get running and relatively easy to add new
> tests.  For right now the only thing this framework does is run fio scripts.  
> I
> chose fio because it already gathers loads of performance data about it's 
> runs.
> We have everything we need there, latency, bandwidth, cpu time, and all broken
> down by reads, writes, and trims.  I figure most of us are familiar enough 
> with
> fio and how it works to make it relatively easy to add new tests to the
> framework.
> 
> I've posted my code up on github, you can get it here
> 
> https://github.com/josefbacik/fsperf
> 
> All (well most) of the results from fio are stored in a local sqlite database.
> Right now the comparison stuff is very crude, it simply checks against the
> previous run and it only checks a few of the keys by default.  You can check
> latency if you want, but while writing this stuff up it seemed that latency 
> was
> too variable from run to run to be useful in a "did my thing regress or 
> improve"
> sort of way.
> 
> The configuration is brain dead simple, the README has examples.  All you need
> to do is make your local.cfg, run ./setup and then run ./fsperf and you are 
> good
> to go.

Why re-invent the test infrastructure? Why not just make it a
tests/perf subdir in fstests?

> The plan is to add lots of workloads as we discover regressions and such.  We
> don't want anything that takes too long to run otherwise people won't run 
> this,
> so the existing tests don't take much longer than a few minutes each.  I will 
> be
> adding some more comparison options so you can compare against averages of all
> previous runs and such.

Yup, that fits exactly into what fstests is for... :P

Integrating into fstests means it will be immediately available to
all fs developers, it'll run on everything that everyone already has
setup for filesystem testing, and it will have familiar mkfs/mount
option setup behaviour so there's no new hoops for everyone to jump
through to run it...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Something like ZFS Channel Programs for BTRFS & probably XFS or even VFS?

2017-10-03 Thread Dave Chinner
On Tue, Oct 03, 2017 at 01:40:51PM -0700, Matthew Wilcox wrote:
> On Wed, Oct 04, 2017 at 07:10:35AM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 03:19:18PM +0200, Martin Steigerwald wrote:
> > > [repost. I didn´t notice autocompletion gave me wrong address for 
> > > fsdevel, 
> > > blacklisted now]
> > > 
> > > Hello.
> > > 
> > > What do you think of
> > > 
> > > http://open-zfs.org/wiki/Projects/ZFS_Channel_Programs
> > 
> > Domain not found.
> 
> Must be an Australian problem ...

Probably, I forgot to stand on my head so everything must have been
sent to the server upside down

Though it is a curious failure - it failed until I went to
"openzfs.org" and that redirected to "open-zfs.org" and now it all
works. Somewhat bizarre.

> A ZFS channel program (ZCP) is a small script written in a domain specific
> language that manipulate ZFS internals in a single, atomically-visible
> operation.  For instance, to delete all snapshots of a filesystem a ZCP
> could be written which 1) generates the list of snapshots, 2) traverses
> that list, and 3) destroys each snapshot unconditionally. Because
> each of these statements would be evaluated from within the kernel,
> ZCPs can guarantee safety from interference with other concurrent ZFS
> modifications. Executing from inside the kernel allows us to guarantee
> atomic visibility of these operations (correctness) and allows them to
> be performed in a single transaction group (performance).
>
> A successful implementation of ZCP will:
> 
> 1. Support equivalent functionality for all of the current ZFS commands
> with improved performance and correctness from the point of view of the
> user of ZFS.
> 
> 2. Facilitate the quick addition of new and useful commands as
> ZCP enables the implementation of more powerful operations which
> previously would have been unsafe to implement in user programs, or
> would require modifications to the kernel for correctness. Since the
> ZCP layer guarantees the atomicity of each ZCP, we only need to write
> new sync_tasks for individual simple operations, then can use ZCPs to
> chain those simple operations together into more complicated operations.
> 
> 3. Allow ZFS users to safely implement their own ZFS operations without
> performing operations they don’t have the privileges for.
> 
> 4. Improve the performance and correctness of existing applications
> built on ZFS operations.

/me goes and looks at the slides

Seems like they are trying to solve a problem of their own making,
in that admin operations are run by the kernel from a separate task
that is really, really slow. So this scripting is a method of aggregating
multiple "sync tasks" into a single operation so there aren't delays
between tasks.
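
For concreteness, the "destroy all snapshots" example above would look
roughly like this as a channel program (sketched from the OpenZFS
zfs-program documentation - treat the exact function names as
illustrative):

-- delete every snapshot of a dataset in one atomically-visible operation
args = ...
fs = args["argv"][1]

for snap in zfs.list.snapshots(fs) do
    local err = zfs.sync.destroy(snap)
    if err ~= 0 then
        return err      -- bail out on the first failure
    end
end
return 0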

/me chokes on slide 8/8

"Add a Lua interpreter to the kernel, implement ZFS intrinsics (...)
as extensions to the Lua language"

Somehow, I don't see that happening in Linux.

Yes, I can see us potentially adding some custom functionality in
filesystems with eBPF (e.g. custom allocation policies), but I think
admin operations need to be done from userspace through a clear,
stable interface that supports all the necessary primitives to
customise admin operations for different needs.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Something like ZFS Channel Programs for BTRFS & probably XFS or even VFS?

2017-10-03 Thread Dave Chinner
On Tue, Oct 03, 2017 at 03:19:18PM +0200, Martin Steigerwald wrote:
> [repost. I didn´t notice autocompletion gave me wrong address for fsdevel, 
> blacklisted now]
> 
> Hello.
> 
> What do you think of
> 
> http://open-zfs.org/wiki/Projects/ZFS_Channel_Programs

Domain not found.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: btrfs/150 regression test for reading compressed data

2017-09-20 Thread Dave Chinner
On Wed, Sep 20, 2017 at 05:52:43PM -0600, Liu Bo wrote:
> We had a bug in btrfs compression code which could end up with a
> kernel panic.
> 
> This is adding a regression test for the bug and I've also sent a
> kernel patch to fix the bug.
> 
> The patch is "Btrfs: fix kernel oops while reading compressed data".
> 
> Signed-off-by: Liu Bo <bo.li@oracle.com>
> ---
>  tests/btrfs/150 | 102 
> 
>  tests/btrfs/150.out |   3 ++
>  tests/btrfs/group   |   1 +
>  3 files changed, 106 insertions(+)
>  create mode 100755 tests/btrfs/150
>  create mode 100644 tests/btrfs/150.out
> 
> diff --git a/tests/btrfs/150 b/tests/btrfs/150
> new file mode 100755
> index 000..834be51
> --- /dev/null
> +++ b/tests/btrfs/150
> @@ -0,0 +1,102 @@
> +#! /bin/bash
> +# FS QA Test btrfs/150
> +#
> +# This is a regression test which ends up with a kernel oops in btrfs.

group += dangerous

> +# It occurs when btrfs's read repair happens while reading a compressed
> +# extent.
> +# The patch for this is 
> +# x

Incomplete?

> +#
> +#---
> +# Copyright (c) 2017 Liu Bo.  All Rights Reserved.

You're signing off this patch as an Oracle employee, but claiming
personal copyright. Please clarify who owns the copyright - if it's
your personal copyright then please sign off with a personal email
address, not your employer's...

Also, I note that these recently added tests from you:

tests/btrfs/140:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
tests/btrfs/141:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
tests/btrfs/142:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
tests/btrfs/143:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
tests/generic/406:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.

all have this same ambiguity - personal copyright with employer
signoff in the commit. This definitely needs clarification and
fixing if it is wrong


> +disable_io_failure()
> +{
> +echo 0 > $SYSFS_BDEV/make-it-fail
> +echo 0 > $DEBUGFS_MNT/fail_make_request/probability
> +echo 0 > $DEBUGFS_MNT/fail_make_request/times
> +}
> +
> +_scratch_pool_mkfs "-d raid1 -b 1G" >> $seqres.full 2>&1
> +
> +# It doesn't matter which compression algorithm we use.
> +_scratch_mount -ocompress
> +
> +# Create a file with all data being compressed
> +$XFS_IO_PROG -f -c "pwrite -W 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

needs an fsync to reach disk.

> +# Raid1 consists of two copies and btrfs decides which copy to read by 
> reader's
> +# %pid.  Now we inject errors to copy #1 and copy #0 is good.  We want to 
> read
> +# the bad copy to trigger read-repair.
> +while true; do
> + disable_io_failure
> + # invalidate the page cache
> + $XFS_IO_PROG -f -c "fadvise -d 0 128K" $SCRATCH_MNT/foobar | 
> _filter_xfs_io
> +
> + enable_io_failure
> + od -x $SCRATCH_MNT/foobar > /dev/null &

Why are you using od to read the data when the output is piped to
/dev/null? Why not just xfs_io -c "pread 0 8k"?
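
i.e. something like this (an untested sketch of both suggestions):

# Create a file with all data being compressed, and make sure it's on disk
$XFS_IO_PROG -f -c "pwrite -W 0 8K" -c fsync $SCRATCH_MNT/foobar | _filter_xfs_io

# ... error injection setup as in the original test ...

# read the bad copy back to trigger read-repair
$XFS_IO_PROG -c "pread 0 8K" $SCRATCH_MNT/foobar > /dev/null &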

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread Dave Chinner
On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote:
> On Sun 02-04-17 09:05:26, Dave Chinner wrote:
> > On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
> > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > > > Because if above is acceptable we could make reported i_version to be 
> > > > > a sum
> > > > > of "superblock crash counter" and "inode i_version". We increment
> > > > > "superblock crash counter" whenever we detect unclean filesystem 
> > > > > shutdown.
> > > > > That way after a crash we are guaranteed each inode will report new
> > > > > i_version (the sum would probably have to look like "superblock crash
> > > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > > > i_version numbers we gave away but did not write to disk but 
> > > > > still...).
> > > > > Thoughts?
> > > 
> > > How hard is this for filesystems to support?  Do they need an on-disk
> > > format change to keep track of the crash counter?
> > 
> > Yes. We'll need version counter in the superblock, and we'll need to
> > know what the increment semantics are. 
> > 
> > The big question is how do we know there was a crash? The only thing
> > a journalling filesystem knows at mount time is whether it is clean
> > or requires recovery. Filesystems can require recovery for many
> > reasons that don't involve a crash (e.g. root fs is never unmounted
> > cleanly, so always requires recovery). Further, some filesystems may
> > not even know there was a crash at mount time because their
> > architecture always leaves a consistent filesystem on disk (e.g. COW
> > filesystems)
> 
> What filesystems can or cannot easily do obviously differs. Ext4 has a
> recovery flag set in superblock on RW mount/remount and cleared on
> umount/RO remount.

Even this doesn't help. A recent bug that was reported to the XFS
list - turns out that systemd can't remount-ro the root
filesystem successfully on shutdown because there are open write fds
on the root filesystem when it attempts the remount. So it just
reboots without a remount-ro. This uncovered a bug in grub in
that it (still!) thinks sync(1) is sufficient to get all the
metadata that points to a kernel image onto disk in places it can
read. XFS, like ext4, leaves that metadata in the journal, so the
system then fails to boot: systemd didn't remount-ro the root fs,
hence the journal was never flushed before reboot, and so grub
can't find the kernel and everything fails...

> This flag being set on mount would imply incrementing the crash
> counter. It should be pretty easy for each filesystem to implement
> such flag and the counter but I agree it requires an on-disk
> format change.

Yup, anything we want that is persistent and consistent across
filesystems will need on-disk format changes. Hence we need a solid
specification first, not to mention tests to validate correct
behaviour across all filesystems in xfstests...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-01 Thread Dave Chinner
On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
> On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > Because if above is acceptable we could make reported i_version to be a 
> > > sum
> > > of "superblock crash counter" and "inode i_version". We increment
> > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > > That way after a crash we are guaranteed each inode will report new
> > > i_version (the sum would probably have to look like "superblock crash
> > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > i_version numbers we gave away but did not write to disk but still...).
> > > Thoughts?
> 
> How hard is this for filesystems to support?  Do they need an on-disk
> format change to keep track of the crash counter?

Yes. We'll need a version counter in the superblock, and we'll need to
know what the increment semantics are. 

The big question is how do we know there was a crash? The only thing
a journalling filesystem knows at mount time is whether it is clean
or requires recovery. Filesystems can require recovery for many
reasons that don't involve a crash (e.g. root fs is never unmounted
cleanly, so always requires recovery). Further, some filesystems may
not even know there was a crash at mount time because their
architecture always leaves a consistent filesystem on disk (e.g. COW
filesystems).
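
For illustration, the composition Jan suggests above amounts to
something like this - and the open question is precisely when, and by
whom, the crash counter gets bumped (field widths here are
hypothetical):

#include <stdint.h>

/* crash counter in the high bits, per-inode i_version in the low bits */
static uint64_t reported_i_version(uint64_t crash_counter, uint64_t i_version)
{
	return (crash_counter << 16) + i_version;	/* "counter * 65536 + i_version" */
}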

> I wonder if repeated crashes can lead to any odd corner cases.

Without defined, locked-down behaviour of the superblock counter,
corner cases will almost certainly exist...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-29 Thread Dave Chinner
On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote:
> On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > - It's durable; the above comparison still works if there were 
> > > > > > reboots
> > > > > >   between the two i_version checks.
> > > > > > - I don't know how realistic this is--we may need to figure out
> > > > > >   if there's a weaker guarantee that's still useful.  Do
> > > > > >   filesystems actually make ctime/mtime/i_version changes
> > > > > >   atomically with the changes that caused them?  What if a
> > > > > >   change attribute is exposed to an NFS client but doesn't make
> > > > > >   it to disk, and then that value is reused after reboot?
> > > > > > 
> > > > > 
> > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > > the inode dirty and I think that will end up with the new i_version at
> > > > > least being journalled before __mark_inode_dirty returns.
> > > > 
> > > > So you think the filesystem can provide the atomicity?  In more detail:
> > > > 
> > > 
> > > Sorry, I hit send too quickly. That should have read:
> > > 
> > > "Yeah, there could be atomicity issues there."
> > > 
> > > I think providing that level of atomicity may be difficult, though
> > > maybe there's some way to make the querying of i_version block until
> > > the inode update has been journalled?
> > 
> > Just to complement what Dave said from ext4 side - similarly as with XFS
> > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > Until that you can see arbitrary combination of data & i_version after the
> > crash. We do take care to keep data and metadata in sync only when there
> > are security implications to that (like exposing uninitialized disk blocks)
> > and if not, we are as lazy as we can to improve performance...
> > 
> > 
> 
> Yeah, I think what we'll have to do here is ensure that those
> filesystems do an fsync prior to reporting the i_version getattr
> codepath. It's not pretty, but I don't see a real alternative.

I think that's even more problematic. ->getattr currently runs
completely unlocked for performance reasons - it's racy w.r.t.
ongoing modifications to begin with, so /nothing/ that is returned
to userspace via stat/statx can be guaranteed to be "coherent".
Linus will be very unhappy if you make his git workload (which is
/very/ stat heavy) run slower by adding any sort of locking in this
hot path.

Even if we did put an fsync() into ->getattr() (and dealt with all
the locking issues that entails), by the time the statx syscall
returns to userspace the i_version value may not match the
data/metadata in the inode(*).  IOWs, by the time i_version gets
to userspace, it is out of date and any use of it for data
versioning from userspace is going to be prone to race conditions.

Cheers,

Dave.

(*) fiemap has exactly the same "stale the moment internal fs
locks are released" race conditions, which is why it cannot safely
be used for mapping holes when copying file data

-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread Dave Chinner
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - It's durable; the above comparison still works if there were reboots
> >   between the two i_version checks.
> > - I don't know how realistic this is--we may need to figure out
> >   if there's a weaker guarantee that's still useful.  Do
> >   filesystems actually make ctime/mtime/i_version changes
> >   atomically with the changes that caused them?  What if a
> >   change attribute is exposed to an NFS client but doesn't make
> >   it to disk, and then that value is reused after reboot?
> > 
> 
> Yeah, there could be atomicity there. If we bump i_version, we'll mark
> the inode dirty and I think that will end up with the new i_version at
> least being journalled before __mark_inode_dirty returns.

The change may be journalled, but it isn't guaranteed stable until
fsync is run on the inode.

NFS server operations commit the metadata changed by a modification
through ->commit_metadata or sync_inode_metadata() before the
response is sent back to the client, hence guaranteeing that
i_version changes through the NFS server are stable and durable.

This is not the case for normal operations done through the POSIX
API - the journalling is asynchronous and the only durability
guarantees are provided by fsync()

> That said, I suppose it is possible for us to bump the counter, hand
> that new counter value out to a NFS client and then the box crashes
> before it makes it to the journal.

Yup, this has always been a problem when you mix posix applications
running on the NFS server modifying the same files as the NFS
clients are accessing and requiring synchronisation.

> Not sure how big a problem that really is.

This coherency problem has always existed on the server side...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 5/8] nowait aio: return on congested block device

2017-03-16 Thread Dave Chinner
On Wed, Mar 15, 2017 at 04:51:04PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> A new flag BIO_NOWAIT is introduced to identify bio's
> orignating from iocb with IOCB_NOWAIT. This flag indicates
> to return immediately if a request cannot be made instead
> of retrying.

So this makes a congested block device run the bio IO completion
callback with an -EAGAIN error present? Are all the filesystem
direct IO submission and completion routines OK with that? i.e. does
such a congestion case cause filesystems to temporarily expose stale
data to unprivileged users when the IO is requeued in this way?

e.g. filesystem does allocation without blocking, submits bio,
device is congested, runs IO completion with error, so nothing
written to allocated blocks, write gets queued, so other read
comes in while the write is queued, reads data from uninitialised
blocks that were allocated during the write

Seems kinda problematic to me to have an undocumented design
constraint (i.e. a landmine) where we submit the AIO only to have it
error out and then expect the filesystem to do something special and
different /without blocking/ on EAGAIN.

Why isn't the congestion check at a higher layer like we do for page
cache readahead? i.e. using the bdi*congested() API at the time we
are doing all the other filesystem blocking checks.
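
i.e. something like this in the filesystem's nowait checks, before any
allocation or bio submission has happened (a sketch only, assuming the
bdi congestion helpers available at the time):

if (iocb->ki_flags & IOCB_NOWAIT) {
	struct backing_dev_info *bdi = inode_to_bdi(inode);

	/* back off like readahead does - nothing allocated, nothing exposed */
	if (bdi_write_congested(bdi))
		return -EAGAIN;
}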

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-02-06 Thread Dave Chinner
On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > > >   bp->b_addr = NULL;
> > > > >   } else {
> > > > >   int retried = 0;
> > > > > - unsigned noio_flag;
> > > > > + unsigned nofs_flag;
> > > > >  
> > > > >   /*
> > > > >* vm_map_ram() will allocate auxillary structures (e.g.
> > > > >* pagetables) with GFP_KERNEL, yet we are likely to be 
> > > > > under
> > > > >* GFP_NOFS context here. Hence we need to tell memory 
> > > > > reclaim
> > > > > -  * that we are in such a context via PF_MEMALLOC_NOIO 
> > > > > to prevent
> > > > > +  * that we are in such a context via PF_MEMALLOC_NOFS 
> > > > > to prevent
> > > > >* memory reclaim re-entering the filesystem here and
> > > > >* potentially deadlocking.
> > > > >*/
> > > > 
> > > > This comment feels out of date ... how about:
> > > 
> > > which part is out of date?
> > > 
> > > > 
> > > > /*
> > > >  * vm_map_ram will allocate auxiliary structures (eg 
> > > > page
> > > >  * tables) with GFP_KERNEL.  If that tries to reclaim 
> > > > memory
> > > >  * by calling back into this filesystem, we may 
> > > > deadlock.
> > > >  * Prevent that by setting the NOFS flag.
> > > >  */
> > > 
> > > dunno, the previous wording seems clear enough to me. Maybe little bit
> > > more chatty than yours but I am not sure this is worth changing.
> > 
> > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > wording of the old comment because it captures the uncertainty of
> > whether or not we actually are already under NOFS.  If someone actually
> > has audited this code well enough to know for sure then yes let's change
> > the comment, but I haven't gone that far.
> 
> I believe we can drop the memalloc_nofs_save then as well because either
> we are called from a potentially dangerous context and thus we are in
> the nofs scope we we do not need the protection at all.

No, absolutely not. "Belief" is not a sufficient justification for
removing low level deadlock avoidance infrastructure. This code
needs to remain in _xfs_buf_map_pages() until a full audit of the
caller paths is done and we're 100% certain that there are no
lurking deadlocks.

For example, I'm pretty sure we can call into _xfs_buf_map_pages()
outside of a transaction context but with an inode ILOCK held
exclusively. If we then recurse into memory reclaim and try to run a
transaction during reclaim, we have an inverted ILOCK vs transaction
locking order. i.e. we are not allowed to call xfs_trans_reserve()
with an ILOCK held as that can deadlock the log:  log full, locked
inode pins tail of log, inode cannot be flushed because ILOCK is
held by caller waiting for log space to become available

i.e. there are certain situations where holding an ILOCK is a
deadlock vector. See xfs_lock_inodes() for an example of the lengths
we go to avoid ILOCK based log deadlocks like this...
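
To be clear, the infrastructure in question is just the scoped-context
pattern around the vmap call (a sketch of the _xfs_buf_map_pages()
usage with the new API):

/*
 * Everything between save and restore - including the allocations
 * vm_map_ram() does internally - is implicitly GFP_NOFS, so memory
 * reclaim cannot recurse back into the filesystem from here.
 */
unsigned nofs_flag = memalloc_nofs_save();

bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, -1, PAGE_KERNEL);

memalloc_nofs_restore(nofs_flag);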

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/9] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

2016-12-20 Thread Dave Chinner
On Mon, Dec 19, 2016 at 02:06:19PM -0800, Darrick J. Wong wrote:
> On Tue, Dec 20, 2016 at 08:24:13AM +1100, Dave Chinner wrote:
> > On Thu, Dec 15, 2016 at 03:07:08PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mho...@suse.com>
> > > 
> > > Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> > > KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> > > also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> > > KM_NOFS tags to keep lockdep happy") and use the new flag for them
> > > instead. There is really no reason to make these allocations contexts
> > > weaker just because of the lockdep which even might not be enabled
> > > in most cases.
> > > 
> > > Signed-off-by: Michal Hocko <mho...@suse.com>
> > 
> > I'd suggest that it might be better to drop this patch for now -
> > it's not necessary for the context flag changeover but does
> > introduce a risk of regressions if the conversion is wrong.
> 
> I was just about to write in that while I didn't see anything obviously
> wrong with the NOFS removals, I also don't know for sure that we can't
> end up recursively in those code paths (specifically the directory
> traversal thing).

The issue is with code paths that can be called from both inside and
outside transaction context - lockdep complains when it sees an
allocation path that is used with both GFP_NOFS and GFP_KERNEL
context, as it doesn't know that the GFP_KERNEL usage is safe or
not.

So things like the directory buffer path, which can be called from
readdir without a transaction context, have various KM_NOFS flags
scattered through it so that lockdep doesn't get all upset every
time readdir is called...

There are other cases like this - btree manipulation via bunmapi()
can be called without transaction context to remove delayed alloc
extents, and that puts all of the btree cursor and incore extent
list handling in the same boat (all those allocations are KM_NOFS),
etc.

So it's not really recursion that is the problem here - it's
different allocation contexts that lockdep can't know about unless
it's told about them. We've done that with KM_NOFS in the past; in
future we should use this KM_NOLOCKDEP flag, though I'd prefer a
better name for it. e.g. KM_NOTRANS to indicate that the allocation
can occur both inside and outside of transaction context
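
i.e. for such a path the two annotations differ only in what they tell
reclaim and lockdep (sketch; kmem_zalloc() is the XFS wrapper):

/* today: weaken the allocation to NOFS purely to keep lockdep quiet */
ptr = kmem_zalloc(size, KM_NOFS);

/* proposed: keep full GFP_KERNEL reclaim, just tell lockdep not to track it */
ptr = kmem_zalloc(size, KM_NOLOCKDEP);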

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/9] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

2016-12-19 Thread Dave Chinner
On Thu, Dec 15, 2016 at 03:07:08PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mho...@suse.com>
> 
> Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> KM_NOFS tags to keep lockdep happy") and use the new flag for them
> instead. There is really no reason to make these allocations contexts
> weaker just because of the lockdep which even might not be enabled
> in most cases.
> 
> Signed-off-by: Michal Hocko <mho...@suse.com>

I'd suggest that it might be better to drop this patch for now -
it's not necessary for the context flag changeover but does
introduce a risk of regressions if the conversion is wrong.

Hence I think this is better as a completely separate series
which audits and changes all the unnecessary KM_NOFS allocations
in one go. I've never liked whack-a-mole style changes like this -
do it once, do it properly

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/6] Btrfs: add mount option for dax

2016-12-09 Thread Dave Chinner
On Fri, Dec 09, 2016 at 10:41:34AM -0800, Liu Bo wrote:
> On Fri, Dec 09, 2016 at 03:47:20PM +1100, Dave Chinner wrote:
> > On Wed, Dec 07, 2016 at 01:45:05PM -0800, Liu Bo wrote:
> > > Signed-off-by: Liu Bo <bo.li@oracle.com>
> > > ---
> > >  fs/btrfs/ctree.h |  1 +
> > >  fs/btrfs/super.c | 40 +++-
> > >  2 files changed, 40 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > > index 0b8ce2b..e54c6e6 100644
> > > --- a/fs/btrfs/ctree.h
> > > +++ b/fs/btrfs/ctree.h
> > > @@ -1317,6 +1317,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
> > > btrfs_root *root)
> > >  #define BTRFS_MOUNT_FRAGMENT_METADATA(1 << 25)
> > >  #define BTRFS_MOUNT_FREE_SPACE_TREE  (1 << 26)
> > >  #define BTRFS_MOUNT_NOLOGREPLAY  (1 << 27)
> > > +#define BTRFS_MOUNT_DAX  (1 << 28)
> > >  
> > >  #define BTRFS_DEFAULT_COMMIT_INTERVAL(30)
> > >  #define BTRFS_DEFAULT_MAX_INLINE (2048)
> > > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > > index 74ed5aa..9b18f3d 100644
> > > --- a/fs/btrfs/super.c
> > > +++ b/fs/btrfs/super.c
> > > @@ -323,7 +323,7 @@ enum {
> > >   Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard,
> > >   Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow,
> > >   Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_usebackuproot,
> > > - Opt_nologreplay, Opt_norecovery,
> > > + Opt_nologreplay, Opt_norecovery, Opt_dax,
> > 
> > Can we please not create more filesystems with a DAX mount option?
> > This was only ever an enabler, and not meant to be a permanent
> > thing. The permanent functionality for DAX is supposed to be
> > per-inode inheritable DAX flags - not mount options - so that
> > applications can choose on a per file basis to enable/disable DAX
> > access as they see fit.
> > 
> > This also enables the filesystem to reject the attempt to turn on
> > DAX if the set of contexts for the file are not DAX compatible
> 
> Sounds good, I'll try to update it to use inode DAX flag directly.

xfs_io already has chattr/lsattr support (+/-x) for the FS_XFLAG_DAX
flag in FS_IOC_FS{GS}ETXATTR, and you can have a look at the XFS
code in xfs_ioctl.c for the operations that are needed to
dynamically change the S_DAX flag on an inode (e.g.
xfs_ioctl_setattr_dax_invalidate()).
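
From userspace it's just (the path is a placeholder):

# turn DAX on for one file; set it on a directory and new files inherit it
$ xfs_io -c "chattr +x" /mnt/pmem/testfile

# check the flag
$ xfs_io -c "lsattr" /mnt/pmem/testfile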

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 4/6] Btrfs: add DAX support for nocow btrfs

2016-12-08 Thread Dave Chinner
On Wed, Dec 07, 2016 at 01:45:08PM -0800, Liu Bo wrote:
> Since I haven't figure out how to map multiple devices to userspace without
> pagecache, this DAX support is only for single-device, and I don't think
> DAX(Direct Access) can work with cow, this is limited to nocow case.  I made
> this by setting nodatacow in dax mount option.

DAX can be made to work with COW quite easily - it's already been
done, in fact. Go look up Nova for how it works with DAX:

https://github.com/Andiry/nova

Essentially, it has a set of "temporary pages" it links to the inode
where writes are done directly, and when a synchronisation event
occurs it pulls them from the per-inode list, does whatever
transformations are needed (e.g. CRC calculation, mirroring, etc)
and marks them as current in the inode extent list.

When a new overwrite comes along, it allocates a new block in the
temporary page list, copies the existing data into it, and then uses
that block for DAX until the next synchronisation event occurs.

For XFS, CoW for DAX through read/write isn't really any different
to the direct IO path we currently already have. And for page write
faults on shared extents, instead of zeroing the newly allocated
block we simply copy the original data into the new block before the
allocation returns. It does mean, however, that XFS does not have
the capability for data transformations in the IO path. This limits
us to atomic write devices (software raid 0 or hardware redundancy
such as DIMM mirroring), but we can still do out-of-band online data
transformations and movement (e.g. dedupe, defrag) with DAX.

Yes, I know these methods are very different to how btrfs uses COW.
However, my point is that DAX and CoW and/or multiple devices are
not incompatible if the architecture is correctly structured. i.e.
DAX should be able to work even with most of btrfs's special magic
still enabled.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/6] Btrfs: add mount option for dax

2016-12-08 Thread Dave Chinner
On Wed, Dec 07, 2016 at 01:45:05PM -0800, Liu Bo wrote:
> Signed-off-by: Liu Bo <bo.li@oracle.com>
> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/super.c | 40 +++-
>  2 files changed, 40 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0b8ce2b..e54c6e6 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1317,6 +1317,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
> btrfs_root *root)
>  #define BTRFS_MOUNT_FRAGMENT_METADATA(1 << 25)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE  (1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY  (1 << 27)
> +#define BTRFS_MOUNT_DAX  (1 << 28)
>  
>  #define BTRFS_DEFAULT_COMMIT_INTERVAL(30)
>  #define BTRFS_DEFAULT_MAX_INLINE (2048)
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 74ed5aa..9b18f3d 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -323,7 +323,7 @@ enum {
>   Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard,
>   Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow,
>   Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_usebackuproot,
> - Opt_nologreplay, Opt_norecovery,
> + Opt_nologreplay, Opt_norecovery, Opt_dax,

Can we please not create more filesystems with a DAX mount option?
This was only ever an enabler, and not meant to be a permanent
thing. The permanent functionality for DAX is supposed to be
per-inode inheritable DAX flags - not mount options - so that
applications can choose on a per file basis to enable/disable DAX
access as they see fit.

This also enables the filesystem to reject the attempt to turn on
DAX if the set of contexts for the file are not DAX compatible

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 01/10] fstests: common: Introduce function to check qgroup correctness

2016-11-29 Thread Dave Chinner
On Wed, Nov 30, 2016 at 08:56:03AM +0800, Qu Wenruo wrote:
> 
> 
> At 11/30/2016 05:01 AM, Dave Chinner wrote:
> >On Tue, Nov 29, 2016 at 03:32:54PM +0800, Qu Wenruo wrote:
> >>Old btrfs qgroup test cases uses fix golden output numbers, which limits
> >>the coverage since they can't handle mount options like compress or
> >>inode_map, and cause false alert.
> >>
> >>Introduce _btrfs_check_scratch_qgroup() function to check qgroup
> >>correctness using "btrfs check --qgroup-report" function, which will
> >>follow the way kernel handle qgroup and are proved very reliable.
> >>
> >>Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
> >>---
> >> common/rc | 19 +++
> >> 1 file changed, 19 insertions(+)
> >>
> >>diff --git a/common/rc b/common/rc
> >>index 8c99306..35d2d56 100644
> >>--- a/common/rc
> >>+++ b/common/rc
> >>@@ -3018,6 +3018,25 @@ _require_deletable_scratch_dev_pool()
> >>done
> >> }
> >>
> >>+# We check if "btrfs check" support to check qgroup correctness
> >>+# Old fixed golden output can cover case like compress and inode_map
> >>+# mount options, which limits the coverage
> >>+_require_btrfs_check_qgroup()
> >>+{
> >>+   _require_command "$BTRFS_UTIL_PROG" btrfs
> >>+   output=$($BTRFS_UTIL_PROG check --help | grep "qgroup-report")
> >>+   if [ -z "$output" ]; then
> >>+   _notrun "$BTRFS_UTIL_PROG too old (must support 'check 
> >>--qgroup-report')"
> >>+   fi
> >>+}
> >
> >Why wouldn't this just set a global variable that you then
> >check in _check_scratch_fs and run the _btrfs_check_scratch_qgroup()
> >call then?
> 
> The problem is, "btrfs check --qgroup-report" will do force report,
> even for case like qgroup rescan still running.
>
> Some test, like btrfs/114 which tests rescan, false report will
> cause problem.

So for those specific tests, you aren't going to be running "btrfs
check --qgroup-report", right?

In which case, those tests should not call
_require_btrfs_check_qgroup(), and then _check_scratch_fs() will not
run the quota check. i.e. there will be no difference to the current
behaviour.

> So here I choose the manually checking other than always do it at
> _check_scratch_fs().

I don't see what the problem you are avoiding is.  Either it is safe
to run the quota check or it isn't, and triggering it to run in
_check_scratch_fs() via a _requires rule makes no difference to that.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 01/10] fstests: common: Introduce function to check qgroup correctness

2016-11-29 Thread Dave Chinner
On Tue, Nov 29, 2016 at 03:32:54PM +0800, Qu Wenruo wrote:
> Old btrfs qgroup test cases uses fix golden output numbers, which limits
> the coverage since they can't handle mount options like compress or
> inode_map, and cause false alert.
> 
> Introduce _btrfs_check_scratch_qgroup() function to check qgroup
> correctness using "btrfs check --qgroup-report" function, which will
> follow the way kernel handle qgroup and are proved very reliable.
> 
> Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
> ---
>  common/rc | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/common/rc b/common/rc
> index 8c99306..35d2d56 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -3018,6 +3018,25 @@ _require_deletable_scratch_dev_pool()
>   done
>  }
>  
> +# We check if "btrfs check" support to check qgroup correctness
> +# Old fixed golden output can cover case like compress and inode_map
> +# mount options, which limits the coverage
> +_require_btrfs_check_qgroup()
> +{
> + _require_command "$BTRFS_UTIL_PROG" btrfs
> + output=$($BTRFS_UTIL_PROG check --help | grep "qgroup-report")
> + if [ -z "$output" ]; then
> + _notrun "$BTRFS_UTIL_PROG too old (must support 'check 
> --qgroup-report')"
> + fi
> +}

Why wouldn't this just set a global variable that you then
check in _check_scratch_fs and run the _btrfs_check_scratch_qgroup()
call then?

What about all the tests that currently run without this
functionality being present? They will now notrun rather than use
the golden output match - this seems like a regression to me,
especially for distro QE testing older kernel/progs combinations...

> +
> +_btrfs_check_scratch_qgroup()
> +{
> + _require_btrfs_check_qgroup

This needs to go in the test itself before the test is run,
not get hidden in a function call at the end of the test.
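
i.e. something along these lines (an untested sketch of the structure
being suggested, not a finished patch):

# in the test, before anything runs:
_require_btrfs_check_qgroup

# in common/rc:
_require_btrfs_check_qgroup()
{
	_require_command "$BTRFS_UTIL_PROG" btrfs
	$BTRFS_UTIL_PROG check --help | grep -q "qgroup-report" || \
		_notrun "$BTRFS_UTIL_PROG too old (must support 'check --qgroup-report')"
	BTRFS_CHECK_QGROUP=yes
}

# and _check_scratch_fs() runs the report only for tests that asked for it:
if [ "$BTRFS_CHECK_QGROUP" = "yes" ]; then
	_btrfs_check_scratch_qgroup
fi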

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe

2016-11-23 Thread Dave Chinner
On Wed, Nov 23, 2016 at 06:14:47PM -0500, Zygo Blaxell wrote:
> On Thu, Nov 24, 2016 at 09:13:28AM +1100, Dave Chinner wrote:
> > On Wed, Nov 23, 2016 at 08:55:59AM -0500, Zygo Blaxell wrote:
> > > On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote:
> > > > On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> > > > > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > > > > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> > > > > >implicit EOF gets around this in the existing XFS 
> > > > > > implementation. I
> > > > > >copied this for the Btrfs implementation.
> > > > > 
> > > > > Somewhat tangential to this patch, but on the dedup topic:  Can we 
> > > > > raise
> > > > > or drop that 16MB limit?
> > > > > 
> > > > > The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> > > > > behavior for a 128MB extent is to generate 8x16MB shared extent 
> > > > > references
> > > > > with different extent offsets to a single 128MB physical extent.
> > > > > These references no longer look like the original 128MB extent to a
> > > > > userspace dedup tool.  That raises the difficulty level substantially
> > > > > for a userspace dedup tool when it tries to figure out which extents 
> > > > > to
> > > > > keep and which to discard or rewrite.
> > > > 
> > > > That, IMO, is a btrfs design/implementation problem, not a problem
> > > > with the API. Applications are always going to end up doing things
> > > > that aren't perfectly aligned to extent boundaries or sizes
> > > > regardless of the size limit that is placed on the dedupe ranges.
> > > 
> > > Given that XFS doesn't have all the problems btrfs does, why does XFS
> > > have the same aribitrary size limit?  Especially since XFS demonstrably
> > > doesn't need it?
> > 
> > Creating a new-but-slightly-incompatible API just for XFS makes no
> > sense - we have multiple filesystems that support this functionality
> > and so they all should use the same APIs and present (as far as is
> > possible) the same behaviour to userspace.
> 
> OK.  Let's just remove the limit on all the filesystems then.
> XFS doesn't need it, and btrfs can be fixed.

Yet applications still have to support kernel versions where btrfs
has a limit. IOWs, we can remove the limit for future improvement,
but that doesn't mean userspace is free from having to know about
the existing limit constraints.

That is, once a behaviour has been exposed to userspace through an
API, we can't just change it and act like it was always that way -
apps still have to support kernels that expose the old behaviour.
i.e. the old behaviour is there forever, and this is why designing
userspace APIs is /hard/. It's also why it's better to use an
existing, slightly less than ideal API than invent a new one that
will simply have different problems exposed in future...

> > IOWs it's more important to use existing APIs than to invent a new
> > one that does almost the same thing. This way userspace applications
> > don't need to be changed to support new XFS functionality and we
> > make life easier for everyone. 
> 
> Except removing the limit doesn't work that way.  An application that
> didn't impose an undocumented limit on itself wouldn't break when moved
> to a filesystem that imposed no such limit, i.e. if XFS had no limit,
> an application that moved from btrfs to XFS would just work.

It goes /both ways/ though. Write an app on XFS that does not care
about limits and it won't work on btrfs because it gets unexpected
errors.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe

2016-11-23 Thread Dave Chinner
On Wed, Nov 23, 2016 at 08:55:59AM -0500, Zygo Blaxell wrote:
> On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote:
> > On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> > > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> > > >implicit EOF gets around this in the existing XFS implementation. I
> > > >copied this for the Btrfs implementation.
> > > 
> > > Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> > > or drop that 16MB limit?
> > > 
> > > The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> > > behavior for a 128MB extent is to generate 8x16MB shared extent references
> > > with different extent offsets to a single 128MB physical extent.
> > > These references no longer look like the original 128MB extent to a
> > > userspace dedup tool.  That raises the difficulty level substantially
> > > for a userspace dedup tool when it tries to figure out which extents to
> > > keep and which to discard or rewrite.
> > 
> > That, IMO, is a btrfs design/implementation problem, not a problem
> > with the API. Applications are always going to end up doing things
> > that aren't perfectly aligned to extent boundaries or sizes
> > regardless of the size limit that is placed on the dedupe ranges.
> 
> Given that XFS doesn't have all the problems btrfs does, why does XFS
> have the same arbitrary size limit?  Especially since XFS demonstrably
> doesn't need it?

Creating a new-but-slightly-incompatible API just for XFS makes no
sense - we have multiple filesystems that support this functionality
and so they all should use the same APIs and present (as far as is
possible) the same behaviour to userspace.

IOWs it's more important to use existing APIs than to invent a new
one that does almost the same thing. This way userspace applications
don't need to be changed to support new XFS functionality and we
make life easier for everyone. A shiny new API without warts would
be nice, but we've already got to support the existing one forever,
it does the job we need and so it's less burden on everyone if we
just use it as is.
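
As a concrete illustration of living with the existing API, here is a
rough sketch (not from the thread) of how a userspace dedupe tool can
work within the per-call length cap by splitting a large identical
region into 16MiB pieces. It assumes xfs_io's
"dedupe <src_file> <src_off> <dst_off> <len>" command, which drives the
same dedupe ioctl being discussed; the file names and sizes are made up:

	src=/mnt/a
	dst=/mnt/b
	total=$((128 * 1024 * 1024))	# size of the identical region
	chunk=$((16 * 1024 * 1024))	# the per-call cap in question

	off=0
	while [ $off -lt $total ]; do
		len=$chunk
		[ $((off + len)) -gt $total ] && len=$((total - off))
		# dedupe $len bytes of $src at $off against the same range in $dst
		xfs_io -c "dedupe $src $off $off $len" $dst
		off=$((off + chunk))
	done

With the implicit-EOF behaviour a single call can cover a whole file,
but a loop like this is what a portable tool ends up doing while the
cap exists.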

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe

2016-11-22 Thread Dave Chinner
On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> >implicit EOF gets around this in the existing XFS implementation. I
> >copied this for the Btrfs implementation.
> 
> Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> or drop that 16MB limit?
> 
> The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> behavior for a 128MB extent is to generate 8x16MB shared extent references
> with different extent offsets to a single 128MB physical extent.
> These references no longer look like the original 128MB extent to a
> userspace dedup tool.  That raises the difficulty level substantially
> for a userspace dedup tool when it tries to figure out which extents to
> keep and which to discard or rewrite.

That, IMO, is a btrfs design/implementation problem, not a problem
with the API. Applications are always going to end up doing things
that aren't perfectly aligned to extent boundaries or sizes
regardless of the size limit that is placed on the dedupe ranges.

> XFS may not have this problem--I haven't checked.

It doesn't - it tracks shared blocks exactly and merges adjacent
extent records whenever possible.

> Even if we want to keep the 16MB limit, there's also no way to query the
> kernel from userspace to find out what the limit is, other than by trial
> and error.  It's not even in a header file, userspace just has to *know*.

So add a define to the API to make it visible to applications and
document it in the man page.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: Block btrfs from test case generic/372

2016-11-16 Thread Dave Chinner
(Did you forget to cc fste...@vger.kernel.org?)

On Tue, Nov 15, 2016 at 04:13:32PM +0800, Qu Wenruo wrote:
> Since btrfs always returns the whole extent even if part of it is shared
> with other files, the hole/extent counts differ for "file1" in this
> test case.
> 
> For example:
> 
>       /--------- File 1 Extent 0 ---------\
>      /                                     \
>     |<-------------- Extent A ------------->|
>      \        /                    \       /
>       \File 2/                      \File 2/
>      Ext 0~4K                    Ext 64k~68K
> 
> In that case, fiemap on File 1 will only return 1 large extent A with
> SHARED flag.
> While XFS will split it into 3 extents,  first and last 4K with SHARED
> flag while the rest without SHARED flag.

fiemap should behave the same across all filesystems if at all
possible. This test failure indicates btrfs doesn't report an
accurate representation of shared extents which, IMO, is a btrfs
issue that needs fixing, not a test problem
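
(For what it's worth, the difference is easy to see from userspace; this
is only an illustrative check, not part of the test, and the file names
are assumed.  xfs_io's "fiemap -v" prints one record per extent with the
FIEMAP flags - including the shared flag - in the last column, though
the exact formatting varies between xfs_io versions:

	$XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/file1
	$XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/file2

On XFS the partially shared extent shows up as separate shared and
unshared records; on btrfs the whole extent is reported with the shared
flag set.)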

Regardless of this

> This makes the test case meaningless as btrfs doesn't follow such an
> assumption.
> So blacklist btrfs for this test case to avoid a false alert.

...  we are not going to add ad-hoc filesystem blacklists for
random tests.

Adding "blacklists" without any explanation of why something has
been blacklisted is simply a bad practice. We use _require rules
to specifically document what functionality is required for the
test and check that it is provided.  i.e. this:

_require_explicit_shared_extents()
{
	if [ $FSTYP == "btrfs" ]; then
		_notrun "btrfs can't report accurate shared extent ranges in fiemap"
	fi
}

documents /exactly/ why this test is not run on btrfs.

And, quite frankly, while this is /better/ it still ignores the
fact we have functions like _within_tolerance for allowing a range
of result values to be considered valid rather than just a fixed
value. IOWs, changing the check of the extent count of file 1 post
reflink to use a _within_tolerance range would mean the test would
validate file1 on all reflink supporting filesystems and we don't
need to exclude btrfs at all...
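
A rough sketch of what that check could look like in the test (the
expected value and tolerances here are placeholders, and the helper's
exact argument order should be confirmed against common/rc before use):

	# count file1's data extents after the reflink and accept a range
	# of values rather than demanding one exact number
	extents=$($XFS_IO_PROG -c "fiemap" $testdir/file1 | \
			grep -cv -e hole -e ':$')
	_within_tolerance "file1 extent count" $extents 3 0 2 -v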

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: btrfs: Check false ENOSPC bug caused by incorrect metadata reserve

2016-11-09 Thread Dave Chinner
On Thu, Nov 10, 2016 at 11:24:34AM +0800, Eryu Guan wrote:
> On Thu, Nov 10, 2016 at 10:42:36AM +0800, Qu Wenruo wrote:
> > 
> > 
> > At 11/10/2016 10:19 AM, Dave Chinner wrote:
> > > On Thu, Nov 10, 2016 at 08:34:20AM +0800, Qu Wenruo wrote:
> > > > At 11/09/2016 05:43 PM, Eryu Guan wrote:
> > > > > On Wed, Nov 09, 2016 at 08:24:38AM +0800, Qu Wenruo wrote:
> > > > > > At 11/08/2016 06:58 PM, Eryu Guan wrote:
> > > > > > > On Tue, Nov 08, 2016 at 05:15:15PM +0800, Qu Wenruo wrote:
> > > > > We should use helpers and not open-code when helpers are available. So
> > > > > we should really use _scratch_mkfs_sized here.
> > > > > 
> > > > > And "-n 64k" is only to make bug easier to reproduce, but I don't 
> > > > > think
> > > > > it's necessary. In my testings, 50%-75% runs hit the ENOSPC failure on
> > > > > my x86_64 test vm, even I removed the "-n 64k" mkfs option from the
> > > > > test.
> > > > > 
> > > > > So I'd say remove "-n 64k" and test whatever mkfs options user
> > > > > specified.
> > > > 
> > > > I really don't like the idea of allowing the user to override any mount
> > > > option to reduce the possibility.
> > > 
> > > That's not the point. Overriding mount options reduces test coverage
> > > because it limits the test to only the exact configuration that
> > > reproduced the bug that was seen. If a user is testing a specific
> > > configuration, then we really want to run the test on that config.
> > > 
> > > > And furthermore, this testcase is not a generic test, but a
> > > > regression/pinpoint test to expose one specific bug.
> > > 
> > > If you want to make sure that the bug doesn't return, then you need
> > > to run the /entire test suite/ with the configuration that exposes
> > > the problem. You wouldn't be suggesting this specific set of mount
> > > options if users weren't using that configuration. Hence you really
> > > need to run the entire test suite with that configuration to make
> > > sure you haven't broken those users' systems
> > > 
> > > And, yes, I test lots of different XFS configurations in their
> > > entirety every day on every change I make or review, so I'm not
> > > suggesting that you should do anything I don't already do.
> > 
> > OK, most of your points makes sense.
> > I'll update the case.
> > 
> > And I want to make it clear: does it mean a newly submitted test case
> > shouldn't specify any mkfs/mount option?
> 
> My understanding is that if a test needs an extremely rare configuration to
> hit a bug, then it should specify mkfs/mount options, e.g. xfs/297 tests
> a bug in XFS only in very specific configuration:
> 
> _scratch_mkfs_xfs -d agcount=16,su=256k,sw=12 -l su=256k,size=5120b 
> >/dev/null 2>&1

Yes, but there's a key reason we can do this for /XFS/. I point this
out every time this comes up:

We can do this with /XFS/ because the *implementation of
_scratch_mkfs_xfs()* has a fallback strategy on failure.

That is, _scratch_mkfs_xfs will first try to make the filesystems
with $MKFS_OPTIONS + $TEST_SUPPLIED_OPTIONS. If that fails, it then
runs mkfs with just $TEST_SUPPLIED_OPTIONS.

IOWs, XFS always tries to use the global options, and only removes
them when there is a problem combining them with
$TEST_SUPPLIED_OPTIONS.

This is how custom test options should be applied for all
filesystems, not just XFS. It gets rid of the need for any test to
care about MKFS_OPTIONS.
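
In rough outline, the fallback looks something like this (a sketch only;
the real _scratch_mkfs_xfs has more to it, and the helper name and use
of $MKFS_PROG/$tmp here are illustrative):

	_scratch_mkfs_with_fallback()
	{
		local test_opts="$*"

		# first attempt: the user's global MKFS_OPTIONS plus the
		# options this specific test asked for
		$MKFS_PROG -t $FSTYP $MKFS_OPTIONS $test_opts $SCRATCH_DEV \
			>$tmp.mkfs 2>&1 && return 0

		# the combination failed, so retry with only the
		# test-supplied options
		$MKFS_PROG -t $FSTYP $test_opts $SCRATCH_DEV >$tmp.mkfs 2>&1
	}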

What we are also missing is that we need to apply this process to
scratch mount options as we don't do it for any filesystem.
That will get rid of the need for tests to care about what is in
MOUNT_OPTIONS, too...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: btrfs: Check false ENOSPC bug caused by incorrect metadata reserve

2016-11-09 Thread Dave Chinner
On Thu, Nov 10, 2016 at 08:34:20AM +0800, Qu Wenruo wrote:
> At 11/09/2016 05:43 PM, Eryu Guan wrote:
> >On Wed, Nov 09, 2016 at 08:24:38AM +0800, Qu Wenruo wrote:
> >>At 11/08/2016 06:58 PM, Eryu Guan wrote:
> >>>On Tue, Nov 08, 2016 at 05:15:15PM +0800, Qu Wenruo wrote:
> >We should use helpers and not open-code when helpers are available. So
> >we should really use _scratch_mkfs_sized here.
> >
> >And "-n 64k" is only to make bug easier to reproduce, but I don't think
> >it's necessary. In my testings, 50%-75% runs hit the ENOSPC failure on
> >my x86_64 test vm, even I removed the "-n 64k" mkfs option from the
> >test.
> >
> >So I'd say remove "-n 64k" and test whatever mkfs options user
> >specified.
> 
> I really don't like the idea of allowing the user to override any mount
> option to reduce the possibility.

That's not the point. Overriding mount options reduces test coverage
because it limits the test to only the exact configuration that
reproduced the bug that was seen. If a user is testing a specific
configuration, then we really want to run the test on that config.

> And furthermore, this testcase is not a generic test, but a
> regression/pinpoint test to expose one specific bug.

If you want to make sure that the bug doesn't return, then you need
to run the /entire test suite/ with the configuration that exposes
the problem. You wouldn't be suggesting this specific set of mount
options if users weren't using that configuration. Hence you really
need to run the entire test suite with that configuration to make
sure you haven't broken those users' systems

And, yes, I test lots of different XFS configurations in their
entirety every day on every change I make or review, so I'm not
suggesting that you should do anything I don't already do.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] generic: create and delete files repeatedly to exercise ENOSPC behaviour

2016-11-01 Thread Dave Chinner
On Tue, Nov 01, 2016 at 07:19:30PM +0800, Wang Xiaoguang wrote:
> In btrfs, sometimes even though the created files' consumed disk space
> is not larger than the fs's free space, we can still get an ENOSPC error; it
> may be that btrfs does not try hard enough to reclaim disk space (I have sent a
> kernel patch to resolve this kind of enospc error. Note, this false enospc error
> will not always happen, even in a kernel without my fixing patch).
> 
> Currently I get this ENOSPC error only in btrfs; xfs and ext4 work well.
.
> +RUN_TIME=$((600 * $TIME_FACTOR))
> +fs_size=$((15 * 1024 * 1024 * 1024))
> +_scratch_mkfs_sized $fs_size > $seqres.full 2>&1
> +_scratch_mount > $seqres.full 2>&1
> +
> +testfile1=$SCRATCH_MNT/testfile1
> +testfile2=$SCRATCH_MNT/testfile2
> +filesize1=$(((fs_size * 80) / 100))
> +filesize2=$(((fs_size * 5) / 100))
> +
> +do_test()
> +{
> + while [ -f $SCRATCH_MNT/run ]; do
> + $XFS_IO_PROG -fc "pwrite 0 $filesize1" $testfile1 > /dev/null
> + $XFS_IO_PROG -fc "pwrite 0 $filesize2" $testfile2 > /dev/null
> + rm -f $testfile1 $testfile2 
> + done
> +}

What are you trying to test here that other ENOSPC tests
don't exercise? Why can't you just use preallocation to trigger
ENOSPC repeatedly instead of writing data? That would allow multiple
iterations per second, not one every few minutes...
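
For example, something along these lines (a sketch of the suggestion,
reusing the variable names from the proposed test; whether preallocation
exercises the same btrfs reservation paths is exactly the question the
test author needs to answer):

	do_test()
	{
		while [ -f $SCRATCH_MNT/run ]; do
			# preallocate instead of writing ~12GB of data on
			# every pass - ENOSPC is hit in a fraction of the time
			$XFS_IO_PROG -fc "falloc 0 $filesize1" $testfile1 > /dev/null 2>&1
			$XFS_IO_PROG -fc "falloc 0 $filesize2" $testfile2 > /dev/null 2>&1
			rm -f $testfile1 $testfile2
		done
	}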

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4] generic: make 17[1-4] work well when btrfs compression is enabled

2016-11-01 Thread Dave Chinner
On Tue, Nov 01, 2016 at 04:49:34PM +0800, Wang Xiaoguang wrote:
> hi Darrick,
> 
> Common/populate needs xfs_io to support falloc and fpunch,
> so I didn't put _fill_fs() in common/populate.

Tests will include common/rc first, and so pick up the functionality
_fill_fs requires before it's included from common/populate. So
there is no reason to put it in common/rc.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list

2016-10-27 Thread Dave Chinner
On Thu, Oct 27, 2016 at 09:13:44AM -0400, Josef Bacik wrote:
> On 10/26/2016 08:30 PM, Dave Chinner wrote:
> >On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote:
> >>On 10/25/2016 06:01 PM, Dave Chinner wrote:
> >>>On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> >>>>With anything that populates the inode/dentry cache with a lot of one 
> >>>>time use
> >>>>inodes we can really put a lot of pressure on the system for things we 
> >>>>don't
> >>>>need to keep in cache.  It takes two runs through the LRU to evict these 
> >>>>one use
> >>>>entries, and if you have a lot of memory you can end up with 10's of 
> >>>>millions of
> >>>>entries in the dcache or icache that have never actually been touched 
> >>>>since they
> >>>>were first instantiated, and it will take a lot of CPU and a lot of 
> >>>>pressure to
> >>>>evict all of them.
> >>>>
> >>>>So instead do what we do with pagecache, only set the *REFERENCED flags 
> >>>>if we
> >>>>are being used after we've been put onto the LRU.  This makes a 
> >>>>significant
> >>>>difference in the system's ability to evict these useless cache entries.  
> >>>>With a
> >>>>fs_mark workload that creates 40 million files we get the following 
> >>>>results (all
> >>>>in files/sec)
> >>>
> >>>What's the workload, storage, etc?
> >>
> >>Oops sorry I thought I said it.  It's fs_mark creating 20 million
> >>empty files on a single NVME drive.
> >
> >How big is the drive/filesystem (e.g. has impact on XFS allocation
> >concurrency)?  And multiple btrfs subvolumes, too, by the sound of
> >it. How did you set those up?  What about concurrency, directory
> >sizes, etc?  Can you post the fsmark command line as these details
> >do actually matter...
> >
> >Getting the benchmark configuration to reproduce posted results
> >should not require playing 20 questions!
> 
> This is the disk
> 
> Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> 
> This is the script
> 
> https://paste.fedoraproject.org/461874/73910147/1

Link no workee. This does, though:

https://paste.fedoraproject.org/461874/73910147

> 
> It's on a 1 socket 8 core cpu with 16gib of ram.

Thanks!

> >Is there an order difference in reclaim as a result of earlier
> >reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs
> >rather than contending on spinning locks? Is it having to wait for
> >transaction commits or some other metadata IO because reclaim is
> >happening earlier? i.e. The question I'm asking is what, exactly,
> >leads to such a marked performance improvement in steady state
> >behaviour?
> 
> I would have seen this in my traces.  There's tons of places to
> improve btrfs's performance and behavior here no doubt.  But simply
> moving from pagecache to a slab shrinker shouldn't have drastically
> changed how we perform in this test.

Not sure what you mean by this. Do you mean that because of the
preceding patches that change the page cache accounting for btrfs
metadata there is now not enough pressure being placed on the btrfs
inode cache shrinker?

FWIW, this zero length file fsmark workload produces zero page
cache pressure on XFS. If the case is that btrfs is no longer using the page cache...

> I feel like the shrinkers need
> to be used differently, but I completely destroyed vmscan.c trying
> different approaches and none of them made a difference like this
> patch made.  From what I was seeing in my trace we were simply
> reclaiming less per kswapd scan iteration with the old approach vs.
> the new approach.

Yup, that's what default_seeks is supposed to be used to tune - the
relative pressure that should be placed on the slab compared to
everything else.

> >I want to know because if there's behavioural changes in LRU reclaim
> >order having a significant effect on affecting btrfs, then there is
> >going to be some effects on other filesystems, too. Maybe not in
> >this benchmark, but we can't anticipate potential problems if we
> >don't understand exactly what is going on here.
> 
> So I'll just describe to you what I was seeing and maybe we can work
> out where we think the problem is.
> 
> 1) We go at X speed until we fill up all of the memory with the various 
> caches.
> 2) We lost about 15% when kswapd kicks in and it never recovered.

That's expected behaviour (i.e. that's exactly what I see when XFS
fills memory and...

Re: bio linked list corruption.

2016-10-26 Thread Dave Chinner
On Tue, Oct 25, 2016 at 08:27:52PM -0400, Dave Jones wrote:
> DaveC: Do these look like real problems, or is this more "looks like
> random memory corruption" ?  It's been a while since I did some stress
> testing on XFS, so these might not be new..
> 
> XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 
> 2938
> [ cut here ]
> kernel BUG at fs/xfs/xfs_message.c:113!
> invalid opcode:  [#1] PREEMPT SMP
> CPU: 1 PID: 6227 Comm: trinity-c9 Not tainted 4.9.0-rc1-think+ #6 
> task: 8804f4658040 task.stack: 88050568c000
> RIP: 0010:[] 
>   [] assfail+0x1b/0x20 [xfs]
> RSP: :88050568f9e8  EFLAGS: 00010282
> RAX: ffea RBX: 0046 RCX: 0001
> RDX: ffc0 RSI: 000a RDI: a02fe34d
> RBP: 88050568f9e8 R08:  R09: 
> R10: 000a R11: f000 R12: 88050568fb44
> R13: 00f3 R14: 8804f292bf88 R15: 000e0046
> FS:  7fe2ddfdfb40() GS:88050a00() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7fe2dbabd000 CR3: 0004f461f000 CR4: 001406e0
> Stack:
>  88050568fa88  a027ccee  fff9  8804f16fd8b0
>  3ffa  0032  8804f292bf40  4976
>  000e0008  04fd  8804  5107
> Call Trace:
>  [] xfs_bmap_add_extent_hole_delay+0x54e/0x620 [xfs]
>  [] xfs_bmapi_reserve_delalloc+0x2b4/0x400 [xfs]
>  [] xfs_file_iomap_begin_delay.isra.12+0x247/0x3c0 [xfs]
>  [] xfs_file_iomap_begin+0x181/0x270 [xfs]
>  [] iomap_apply+0x53/0x100
>  [] iomap_file_buffered_write+0x6b/0x90

All this iomap code is new, so it's entirely possible that this is a
new bug. The assert failure is indicative that the delalloc extent's
metadata reservation grew when we expected it to stay the same or
shrink.

> XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: 
> fs/xfs/xfs_trans.c, line: 309
...
>  [] xfs_trans_mod_sb+0x241/0x280 [xfs]
>  [] xfs_ag_resv_alloc_extent+0x4f/0xc0 [xfs]
>  [] xfs_alloc_ag_vextent+0x23d/0x300 [xfs]
>  [] xfs_alloc_vextent+0x5fb/0x6d0 [xfs]
>  [] xfs_bmap_btalloc+0x304/0x8e0 [xfs]
>  [] ? xfs_iext_bno_to_ext+0xee/0x170 [xfs]
>  [] xfs_bmap_alloc+0x2b/0x40 [xfs]
>  [] xfs_bmapi_write+0x640/0x1210 [xfs]
>  [] xfs_iomap_write_allocate+0x166/0x350 [xfs]
>  [] xfs_map_blocks+0x1b0/0x260 [xfs]
>  [] xfs_do_writepage+0x23b/0x730 [xfs]

And that's indicative of a delalloc metadata reservation being
too small and so we're allocating unreserved blocks.

Different symptoms, same underlying cause, I think.

I see the latter assert from time to time in my testing, but it's
not common (maybe once a month) and I've never been able to track it
down.  However, it doesn't affect production systems unless they hit
ENOSPC hard enough that it causes the critical reserve pool to be
exhausted and so the allocation fails. That's extremely rare -
usually it takes several hundred processes all trying to write as hard
as they can concurrently, all slipping through the ENOSPC
detection without the correct metadata reservations, and all requiring
multiple metadata blocks to be allocated during writeback...

If you've got a way to trigger it quickly and reliably, that would
be helpful...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list

2016-10-26 Thread Dave Chinner
On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote:
> On 10/25/2016 06:01 PM, Dave Chinner wrote:
> >On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> >>With anything that populates the inode/dentry cache with a lot of one time 
> >>use
> >>inodes we can really put a lot of pressure on the system for things we don't
> >>need to keep in cache.  It takes two runs through the LRU to evict these 
> >>one use
> >>entries, and if you have a lot of memory you can end up with 10's of 
> >>millions of
> >>entries in the dcache or icache that have never actually been touched since 
> >>they
> >>were first instantiated, and it will take a lot of CPU and a lot of 
> >>pressure to
> >>evict all of them.
> >>
> >>So instead do what we do with pagecache, only set the *REFERENCED flags if 
> >>we
> >>are being used after we've been put onto the LRU.  This makes a significant
> >>difference in the system's ability to evict these useless cache entries.  
> >>With a
> >>fs_mark workload that creates 40 million files we get the following results 
> >>(all
> >>in files/sec)
> >
> >What's the workload, storage, etc?
> 
> Oops sorry I thought I said it.  It's fs_mark creating 20 million
> empty files on a single NVME drive.

How big is the drive/filesystem (e.g. has impact on XFS allocation
concurrency)?  And multiple btrfs subvolumes, too, by the sound of
it. How did you set those up?  What about concurrency, directory
sizes, etc?  Can you post the fsmark command line as these details
do actually matter...

Getting the benchmark configuration to reproduce posted results
should not require playing 20 questions!

> >>The reason Btrfs has a much larger improvement is because it holds a lot 
> >>more
> >>things in memory so benefits more from faster slab reclaim, but across the 
> >>board
> >>is an improvement for each of the file systems.
> >
> >Less than 1% for XFS and ~1.5% for ext4 is well within the
> >run-to-run variation of fsmark. It looks like it might be slightly
> >faster, but it's not a cut-and-dried win for anything other than
> >btrfs.
> >
> 
> Sure the win in this benchmark is clearly benefiting btrfs the most,
> but I think the overall approach is sound and likely to help
> everybody in theory.

Yup, but without an explanation of why it makes such a big change to
btrfs, we can't really say what effect it's really going to have.
Why does cycling the inode a second time through the LRU make any
difference? Once we're in steady state on this workload, one or two
cycles through the LRU should make no difference at all to
performance because all the inodes are instantiated in identical
states (including the referenced bit) and so scanning treats every
inode identically. i.e. the reclaim rate (i.e. how often
evict_inode() is called) should be exactly the same and the only
difference is the length of time between the inode being put on the
LRU and when it is evicted.

Is there an order difference in reclaim as a result of earlier
reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs
rather than contending on spinning locks? Is it having to wait for
transaction commits or some other metadata IO because reclaim is
happening earlier? i.e. The question I'm asking is what, exactly,
leads to such a marked performance improvement in steady state
behaviour?

I want to know because if there's behavioural changes in LRU reclaim
order having a significant effect on affecting btrfs, then there is
going to be some effects on other filesystems, too. Maybe not in
this benchmark, but we can't anticipate potential problems if we
don't understand exactly what is going on here.

> Inside FB we definitely have had problems
> where the memory pressure induced by some idi^H^H^Hprocess goes
> along and runs find / which causes us to evict real things that are
> being used rather than these one use inodes.

That's one of the problems the IO throttling in the XFS shrinker
tends to avoid. i.e. This is one of the specific cases we expect to see
on all production systems - backup applications are the common cause
of regular full filesystem traversals.

FWIW, there's an element of deja vu in this thread: that XFS inode
cache shrinker IO throttling is exactly what Chris proposed we gut
last week to solve some other FB memory reclaim problem that had no
explanation of the root cause.

(http://www.spinics.net/lists/linux-xfs/msg01541.html)

> This sort of behavior
> could possibly be mitigated by this patch, but I haven't sat down to
> figure out a reliable way to mirror this workload to test that
> theory.  Thanks

I use fsmark to create filesystems with tens of millions of small
files...

Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list

2016-10-26 Thread Dave Chinner
On Wed, Oct 26, 2016 at 04:03:54PM -0400, Josef Bacik wrote:
> On 10/25/2016 07:36 PM, Dave Chinner wrote:
> >So, 2-way has not improved. If changing referenced behaviour was an
> >obvious win for btrfs, we'd expect to see that here as well.
> >however, because 4-way improved by 20%, I think all we're seeing is
> >a slight change in lock contention levels inside btrfs. Indeed,
> >looking at the profiles I see that lock contention time was reduced
> >to around 32% of the total CPU used (down by about 20%):
> >
> >  20.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
> >   3.85%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
> >   3.68%  [kernel]  [k] _raw_spin_lock
> >   3.40%  [kernel]  [k] queued_write_lock_slowpath
> >   .
> >
> >IOWs, the performance increase comes from the fact that changing
> >inode cache eviction patterns causes slightly less lock contention
> >in btrfs inode reclaim. IOWs, the problem that needs fixing is the
> >btrfs lock contention, not the VFS cache LRU algorithms.
> >
> >Root cause analysis needs to be done properly before behavioural
> >changes like this are proposed, people!
> >
> 
> We are doing different things.

Well, no, we're doing the same thing. Except...

> Yes when you do it all into the same
> subvol the lock contention is obviously much worse and the
> bottleneck, but that's not what I'm doing, I'm creating a subvol for
> each thread to reduce the lock contention on the root nodes.

.. you have a non-default filesystem config that you didn't mention
even in response to a /direct question/. Seriously, this is a major
configuration detail that is necessary for anyone who wants to
replicate your results!

The difference is that I'm using the /default/ btrfs filesystem
configuration and you're using a carefully contrived structure that
is designed to work around fundamental scalability issues the
benchmark normally exposes.

This raises another question: Does this subvol setup reflect
production configurations, or is it simply a means to get the
benchmark to put enough load on the inode cache to demonstrate the
effect of changing the reference bits?

> If you
> retest by doing that then you will see the differences I was seeing.
> Are you suggesting that I just made these numbers up? 

No, I'm suggesting that you haven't explained the problem or how to
reproduce it in sufficient detail.

You haven't explained why reducing scanning (which in and of itself
has minimal overhead and is not visible to btrfs) has such a marked
effect on the behaviour of btrfs. You haven't given us details on
the workload or storage config so we can reproduce the numbers (and
still haven't!). You're arguing that numbers far below variability
and accuracy measurement limits are significant (i.e. the patched
numbers are better for XFS and ext4). There's lots of information
you've chosen to omit that we need to know.

To fill in the gaps, I've done some analysis, provided perf numbers
and CPU profiles, and put them into an explanation that explains why
there is a difference in performance for a default btrfs config. I
note that you've not actually addressed that analysis and the
problems it uncovers, but instead just said "I'm using a different
configuration".

What about all the btrfs users that aren't using your configuration
that will see these problems? How does the scanning change actually
benefit them more than fixing the locking issues that exist? Memory
reclaim and LRU algorithms affect everyone, not just a benchmark
configuration. We need to know a lot more about the issue at hand
to be able to evaluate whether this is the /right solution for the
problem./

> Instead of
> assuming I'm an idiot and don't know how to root cause issues please
> instead ask what is different from your run versus my run.  Thanks,

I'm assuming you're a competent engineer, Josef, who knows that
details about benchmarks and performance measurement are extremely
important, and that someone reviewing code needs to understand why
the behaviour change had the impact it did to be able to evaluate it
properly. I'm assuming that you also know that if you understand the
root cause of a problem, you put it in the commit message so that
everyone else knows it, too.

So if you know what the root cause of the btrfs problem that
reducing reclaim scanning avoids, then please explain it.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list

2016-10-25 Thread Dave Chinner
On Wed, Oct 26, 2016 at 09:01:13AM +1100, Dave Chinner wrote:
> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> > With anything that populates the inode/dentry cache with a lot of one time 
> > use
> > inodes we can really put a lot of pressure on the system for things we don't
> > need to keep in cache.  It takes two runs through the LRU to evict these 
> > one use
> > entries, and if you have a lot of memory you can end up with 10's of 
> > millions of
> > entries in the dcache or icache that have never actually been touched since 
> > they
> > were first instantiated, and it will take a lot of CPU and a lot of 
> > pressure to
> > evict all of them.
> > 
> > So instead do what we do with pagecache, only set the *REFERENCED flags if 
> > we
> > are being used after we've been put onto the LRU.  This makes a significant
> > difference in the system's ability to evict these useless cache entries.  
> > With a
> > fs_mark workload that creates 40 million files we get the following results 
> > (all
> > in files/sec)
> 
> What's the workload, storage, etc?
> 
> > Btrfs   Patched Unpatched
> > Average Files/sec:  72209.3 63254.2
> > p50 Files/sec:  70850   57560
> > p90 Files/sec:  68757   53085
> > p99 Files/sec:  68757   53085
> 
> So how much of this is from changing the dentry referenced
> behaviour, and how much from the inode? Can you separate out the two
> changes so we know which one is actually affecting reclaim
> performance?

FWIW, I just went to run my usual zero-length file creation fsmark
test (16-way create, large sparse FS image on SSDs). XFS (with debug
enabled) takes 4m10s to run at an average of ~230k files/s, with a
std deviation of +/-1.7k files/s.

Unfortunately, btrfs turns that into more heat than it ever has done
before. It's only managing 35k files/s and the profile looks like
this:

  58.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   5.61%  [kernel]  [k] queued_write_lock_slowpath
   1.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   1.38%  [kernel]  [k] reschedule_interrupt
   1.08%  [kernel]  [k] _raw_spin_lock
   0.92%  [kernel]  [k] __radix_tree_lookup
   0.86%  [kernel]  [k] _raw_spin_lock_irqsave
   0.83%  [kernel]  [k] btrfs_set_lock_blocking_rw

I killed it because this would take too long to run.

I reduced the concurrency down to 4-way, spinlock contention went
down to about 40% of the CPU time.  I reduced the concurrency down
to 2 and saw about 16% of CPU time being spent in lock contention.

Throughput results:
                btrfs throughput
            2-way               4-way
unpatched   46938.1+/-2.8e+03   40273.4+/-3.1e+03
patched     45697.2+/-2.4e+03   49287.1+/-3e+03


So, 2-way has not improved. If changing referenced behaviour was an
obvious win for btrfs, we'd expect to see that here as well.
However, because 4-way improved by 20%, I think all we're seeing is
a slight change in lock contention levels inside btrfs. Indeed,
looking at the profiles I see that lock contention time was reduced
to around 32% of the total CPU used (down by about 20%):

  20.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   3.85%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   3.68%  [kernel]  [k] _raw_spin_lock
   3.40%  [kernel]  [k] queued_write_lock_slowpath
   .

IOWs, the performance increase comes from the fact that changing
inode cache eviction patterns causes slightly less lock contention
in btrfs inode reclaim. IOWs, the problem that needs fixing is the
btrfs lock contention, not the VFS cache LRU algorithms.

Root cause analysis needs to be done properly before behavioural
changes like this are proposed, people!

-Dave.

PS: I caught this profile on unmount when the 8 million inodes
cached inodes were being reclaimed. evict_inodes() ignores the
referenced bit, so this gives a lot of insight into the work being
done by inode reclaim in a filesystem:

  18.54%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   9.43%  [kernel]  [k] rb_erase
   8.03%  [kernel]  [k] __btrfs_release_delayed_node
   7.23%  [kernel]  [k] _raw_spin_lock
   6.93%  [kernel]  [k] __list_del_entry
   4.35%  [kernel]  [k] __slab_free
   3.93%  [kernel]  [k] __mutex_lock_slowpath
   2.77%  [kernel]  [k] bit_waitqueue
   2.58%  [kernel]  [k] kmem_cache_alloc
   2.50%  [kernel]  [k] __radix_tree_lookup
   2.44%  [kernel]  [k] _raw_spin_lock_irq
   2.18%  [kernel]  [k] kmem_cache_free
   2.17%  [kernel]  [k] evict   <<<<<<<<<<<
   2.13%  [kernel]  [k] fsnotify_destroy_marks
   1.68%  [kernel]  [k] btrfs_remove_delayed_node
   1.61%  [kernel]  [k] __call_rcu.constprop.70
   1.50%  [kernel

Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list

2016-10-25 Thread Dave Chinner
On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> With anything that populates the inode/dentry cache with a lot of one time use
> inodes we can really put a lot of pressure on the system for things we don't
> need to keep in cache.  It takes two runs through the LRU to evict these one 
> use
> entries, and if you have a lot of memory you can end up with 10's of millions 
> of
> entries in the dcache or icache that have never actually been touched since 
> they
> were first instantiated, and it will take a lot of CPU and a lot of pressure 
> to
> evict all of them.
> 
> So instead do what we do with pagecache, only set the *REFERENCED flags if we
> are being used after we've been put onto the LRU.  This makes a significant
> difference in the system's ability to evict these useless cache entries.  
> With a
> fs_mark workload that creates 40 million files we get the following results 
> (all
> in files/sec)

What's the workload, storage, etc?

> Btrfs                Patched     Unpatched
> Average Files/sec:   72209.3     63254.2
> p50 Files/sec:       70850       57560
> p90 Files/sec:       68757       53085
> p99 Files/sec:       68757       53085

So how much of this is from changing the dentry referenced
behaviour, and how much from the inode? Can you separate out the two
changes so we know which one is actually affecting reclaim
performance?

Indeed, I wonder if just changing the superblock shrinker
default_seeks for btrfs would have exactly the same impact because
that can be used to exactly double the reclaim scan rate for the same
memory pressure.  If that doesn't change performance by a similar
amount (changing defaults seeks is the normal way of changing
shrinker balance), then more digging is required here to explain why
the referenced bits make such an impact to steady state
performance...

> XFS                  Patched     Unpatched
> Average Files/sec:   61025.5     60719.5
> p50 Files/sec:       60107       59465
> p90 Files/sec:       59300       57966
> p99 Files/sec:       59227       57528

You made XFS never use I_REFERENCED at all (hint: not all
filesystems use find_inode/find_inode_fast()), so it's not clear
that the extra scanning (less than 1% difference in average
throughput) is actually the cause of things being slower in btrfs.

> The reason Btrfs has a much larger improvement is because it holds a lot more
> things in memory so benefits more from faster slab reclaim, but across the 
> board
> is an improvement for each of the file systems.

Less than 1% for XFS and ~1.5% for ext4 is well within the
run-to-run variation of fsmark. It looks like it might be slightly
faster, but it's not a cut-and-dried win for anything other than
btrfs.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] generic: make 17[1-4] work well when btrfs compression is enabled

2016-10-10 Thread Dave Chinner
On Mon, Oct 10, 2016 at 04:06:17PM +0800, Wang Xiaoguang wrote:
> When enabling btrfs compression, the original code can not fill the fs
> correctly; fix this.
> 
> Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com>
> ---
> V2: In common/, I didn't find an existing function suitable for
> these 4 test cases to fill the fs, so I still use _pwrite_byte() with
> a big enough file length to fill the fs. Note, for btrfs, metadata space
> still is not full, only data space is full, but it's OK for these
> 4 test cases.
> 
> All these 4 cases pass on xfs and btrfs (without compression); if
> btrfs has compression enabled, these 4 cases will fail with a false
> enospc error. I have sent kernel patches to fix this bug.
> ---
>  tests/generic/171 | 7 ---
>  tests/generic/172 | 6 --
>  tests/generic/173 | 7 ---
>  tests/generic/174 | 7 ---
>  4 files changed, 16 insertions(+), 11 deletions(-)
> 
> diff --git a/tests/generic/171 b/tests/generic/171
> index f391685..2d7f683 100755
> --- a/tests/generic/171
> +++ b/tests/generic/171
> @@ -75,9 +75,10 @@ _cp_reflink $testdir/bigfile $testdir/clonefile
>  sync
>  
>  echo "Allocate the rest of the space"
> -nr_free=$(stat -f -c '%f' $testdir)
> -touch $testdir/file0 $testdir/file1
> -_pwrite_byte 0x61 0 $((blksz * nr_free)) $testdir/eat_my_space >> 
> $seqres.full 2>&1
> +# here we try to write big enough data in case some fs support
> +# compression, such as btrfs.
> +len=$((2 ** 36))
> +_pwrite_byte 0x61 0 $len $testdir/eat_my_space >> $seqres.full 2>&1

NACK. This is no better than the last proposal. 2^36 = 64GB, which
will not fill filesystems of any significant size, especially as the
data will compress down to nearly nothing.

Trying to hack around compression artifacts by inflating the size of
the file just doesn't work reliably.  The way to fix this is to
either use one of the "fill filesystem" functions that isn't
affected by compression, or to add the ability to xfs_io to write
incompressible data patterns and otherwise leave the test alone.
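
One possible shape for a compression-proof fill, along the lines of the
first option (this is only a sketch and an assumption about what such a
helper might look like, not an existing fstests function; $blksz is the
block size the test already computes): preallocated space can't be
squashed by compression, so eat the free space with falloc rather than
with pattern writes.

	_fill_fs_nocompress()
	{
		local file=$1
		local chunk=$((16 * 1024 * 1024))
		local off=0

		# keep preallocating fixed-size chunks until the fs is full
		while $XFS_IO_PROG -fc "falloc $off $chunk" $file >/dev/null 2>&1; do
			off=$((off + chunk))
		done
		# mop up the remainder with filesystem-block sized chunks
		while $XFS_IO_PROG -fc "falloc $off $blksz" $file >/dev/null 2>&1; do
			off=$((off + blksz))
		done
	}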

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] generic/175: disable inline data feature for btrfs

2016-10-10 Thread Dave Chinner
On Mon, Oct 10, 2016 at 01:06:47PM +0800, Wang Xiaoguang wrote:
> For btrfs, if compression is enabled, it may generate inline data for a
> blocksize data range, this inline data is stored in fs tree, will not have
> a individual extent, try to reflink this data range at a not-zero offset
> will return EOPNOTSUPP, so here we disable inline data feature for btrfs.
> 
> Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com>
> ---
>  tests/generic/175 | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/tests/generic/175 b/tests/generic/175
> index 964580c..b3f90dc 100755
> --- a/tests/generic/175
> +++ b/tests/generic/175
> @@ -50,6 +50,13 @@ rm -f "$seqres.full"
>  
>  echo "Format and mount"
>  _scratch_mkfs > "$seqres.full" 2>&1
> +# For btrfs, if compression is enabled, it may generate inline data for a
> +# blocksize data range, this inline data is stored in fs tree, will not have
> +# a individual extent, try to reflink this data range at a not-zero offset
> +# will return EOPNOTSUPP, so here we disable inline data feature for btrfs.
> +if [ "$FSTYP" = "btrfs" ]; then
> + export MOUNT_OPTIONS="-o max_inline=0 $MOUNT_OPTIONS"
> +fi

Can we /please stop/ putting special case code like this in tests?

This is an unsustainable and unmaintainable practice - it's making a
mess of the test code. If there are specific mount options that
needs to be avoided, then add an option to filter them out. e.g.
something like this:

_scratch_options_filter btrfs compress

so that it removes any compression option from the btrfs mount/mkfs
that is run for that test.
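
The helper name and calling convention above are the suggestion; the
body below is only my sketch of how it might strip an option out of
MOUNT_OPTIONS (a real version would also want to cover MKFS_OPTIONS and
be rather more careful):

	_scratch_options_filter()
	{
		local fstyp=$1
		local unwanted=$2
		local newopts=""
		local group

		[ "$FSTYP" = "$fstyp" ] || return 0

		set -- $MOUNT_OPTIONS
		while [ $# -gt 0 ]; do
			if [ "$1" = "-o" -a $# -ge 2 ]; then
				# drop just the unwanted option from this -o group
				group=$(echo "$2" | tr ',' '\n' | \
					grep -Ev "^${unwanted}(=|$)" | paste -sd, -)
				[ -n "$group" ] && newopts="$newopts -o $group"
				shift 2
			else
				newopts="$newopts $1"
				shift
			fi
		done
		MOUNT_OPTIONS="$newopts"
	}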

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Is is possible to submit binary image as fstest test case?

2016-10-09 Thread Dave Chinner
On Fri, Oct 07, 2016 at 06:05:51PM +0200, David Sterba wrote:
> On Fri, Oct 07, 2016 at 08:18:38PM +1100, Dave Chinner wrote:
> > On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote:
> > > So I'm wondering if I can just upload a zipped raw image as part of
> > > the test case?
> > 
> > Preferably not. We've managed to avoid pre-built images in xfstests
> > for 15 years, so there'd have to be a really good reason to start
> > doing this, especially as once we open that floodgate we'll end up
> > with everyone wanting to do this and it will blow out the size of
> > the repository in no time.
> > 
> > If the issue is just timing or being unable to trigger an error
> > at the right time, this is what error injection frameworks or
> > debug-only sysfs hooks are for. The XFS kernel code has both,
> > xfstests use both, and they pretty much do away with the need for
> > custom binary filesystem images for such testing...
> 
> An error injection framework may not always be available, e.g. in kernels
> built for production. Yet I'd like to make the test possible.

Why would you ever enable error injection on a production kernel?
We simply don't run the error injection tests when the
infrastructure is not there, just like we do with all other tests
that are dependent on optional kernel/fs features

> Also, I find it useful to keep the exact images that are attached to a
> report and not necessarily try to recreate it to trigger a bug. If the
> images are small, hosting them with the sources makes testing easy.
> Big images would probably go to own repository and be linked.
> 
> I don't really expect fstests to store crafted images and would advise
> Qu to not even try to propose that, because the answer was quite
> predictable.

You say that like it's a bad thing?  What's the guarantee that a
specific corrupt image will always be sufficient to trigger the
problem the test is supposed to check? i.e. we could change a
different part of the FS code and that could mask the bug the image
used to trigger. The test then passes, but we haven't actually fixed
the bug that the test used to exercise. i.e. we've got a false "we
fixed the problem" report, when all we did is prevent a specific
vector from tripping over it.

Error injection and sysfs hooks into debug code are much more
reliable ways of testing that the root cause of a problem is fixed
and stays fixed.  The reproducing trigger cannot be masked by
changes in other code layers, so we know that if the error
injection/sysfs hook test handles the problem correctly, we have
actually fixed the underlying bug and not just masked it...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Making statfs return qgroup info (for container use case etc.)

2016-10-09 Thread Dave Chinner
On Fri, Oct 07, 2016 at 06:58:47PM +0200, David Sterba wrote:
> On Fri, Oct 07, 2016 at 09:40:11AM +1100, Dave Chinner wrote:
> > XFS does this with directory tree quotas. It was implemented 10 years
> > ago or so, IIRC...
> 
> Sometimes, the line between a historical remark and trolling is thin

All I've done is quickly point out that we've already got an
implementation of the exact functionality being asked for and given
a rough indication of when the commits hit the tree. i.e. so anyone
wanting to implement this in btrfs can search for it easily and work
out how to use the existing VFS infrastructure to report this
information.

You can take that as trolling if you want, but if you think I have
time for playing stupid, petty, idiotic games like that then you
need to ask yourself why you have had such a defensive and
paranoid reaction...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] generic: make 17[1-4] work well when btrfs compression is enabled

2016-10-07 Thread Dave Chinner
On Fri, Oct 07, 2016 at 03:00:42PM +0800, Wang Xiaoguang wrote:
> When enabling btrfs compression, the original code can not fill the fs
> correctly; fix this.
> 
> Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com>
> ---
>  tests/generic/171 | 4 +---
>  tests/generic/172 | 2 +-
>  tests/generic/173 | 4 +---
>  tests/generic/174 | 4 +---
>  4 files changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/tests/generic/171 b/tests/generic/171
> index f391685..d2ae8e4 100755
> --- a/tests/generic/171
> +++ b/tests/generic/171
> @@ -75,9 +75,7 @@ _cp_reflink $testdir/bigfile $testdir/clonefile
>  sync
>  
>  echo "Allocate the rest of the space"
> -nr_free=$(stat -f -c '%f' $testdir)
> -touch $testdir/file0 $testdir/file1
> -_pwrite_byte 0x61 0 $((blksz * nr_free)) $testdir/eat_my_space >> 
> $seqres.full 2>&1
> +dd if=/dev/zero of=$testdir/eat_my_space >> $seqres.full 2>&1

Please don't replace xfs_io writes using a specific data pattern
with dd calls that write zeros. Indeed, we don't use dd for new
tests anymore - xfs_io should be used.

Write a function that fills all the remaining free space (one may
already exist) and call it in all these places.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Is is possible to submit binary image as fstest test case?

2016-10-07 Thread Dave Chinner
On Fri, Oct 07, 2016 at 05:26:27PM +0800, Qu Wenruo wrote:
> 
> 
> At 10/07/2016 05:18 PM, Dave Chinner wrote:
> >On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote:
> >>Hi,
> >>
> >>Just as the title says, for some cases (OK, btrfs again) we need to
> >>catch a file system in special timing.
> >>
> >>In this specific case, we need to grab a btrfs image undergoing
> >>balancing, just before the balance finished.
> >>
> >>Although we can use flakey to drop all writes, we still don't have a
> >>method to catch the timing of that transaction.
> >>
> >>
> >>On the other hand, we can tweak our local kernel, adding
> >>msleep()/message and dump the disk during the sleep.
> >>And the image I dumped can easily trigger btrfs kernel and user-space bug.
> >>
> >>So I'm wondering if I can just upload a zipped raw image as part of
> >>the test case?
> >
> >Preferably not. We've managed to avoid pre-built images in xfstests
> >for 15 years, so there'd have to be a really good reason to start
> >doing this, especially as once we open that floodgate we'll end up
> >with everyone wanting to do this and it will blow out the size of
> >the repository in no time.
> 
> Makes sense.
> For btrfs-progs, which includes test images, it already takes about
> 77M, even though we have tried our best to reduce the image size.
> 
> >
> >If the issue is just timing or being unable to trigger an error
> >at the right time, this is what error injection frameworks or
> >debug-only sysfs hooks are for. The XFS kernel code has both,
> >xfstests use both, and they pretty much do away with the need for
> >custom binary filesystem images for such testing...
> 
> So again, btrfs is lacking infrastructure for debug.
> It seems that we can only rely on images out of xfstest tree,

That's the /wrong answer/. Go and implement debug infrastructure
that btrfs needs - if you wait for someone else to do it, it will
never get done and btrfs will never stabilise

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Is is possible to submit binary image as fstest test case?

2016-10-07 Thread Dave Chinner
On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote:
> Hi,
> 
> Just as the title says, for some cases (OK, btrfs again) we need to
> catch a file system in special timing.
> 
> In this specific case, we need to grab a btrfs image undergoing
> balancing, just before the balance finished.
> 
> Although we can use flakey to drop all writes, we still don't have a
> method to catch the timing of that transaction.
> 
> 
> On the other hand, we can tweak our local kernel, adding
> msleep()/message and dump the disk during the sleep.
> And the image I dumped can easily trigger btrfs kernel and user-space bug.
> 
> So I'm wondering if I can just upload a zipped raw image as part of
> the test case?

Preferably not. We've managed to avoid pre-built images in xfstests
for 15 years, so there'd have to be a really good reason to start
doing this, especially as once we open that floodgate we'll end up
with everyone wanting to do this and it will blow out the size of
the repository in no time.

If the issue is just timing or being unable to trigger an error
at the right time, this is what error injection frameworks or
debug-only sysfs hooks are for. The XFS kernel code has both,
xfstests use both, and they pretty much do away with the need for
custom binary filesystem images for such testing...
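
As a very rough sketch of the idea (not tied to the balance case above;
the error tag name is purely illustrative, and the exact injection
mechanism - xfs_io's "inject" command vs the newer sysfs errortag knobs
- depends on the kernel and fstests version in use):

	_scratch_mkfs >> $seqres.full 2>&1
	_scratch_mount

	# write a file, arm an error tag, then free the extents so the
	# injected failure fires at the interesting point
	$XFS_IO_PROG -f -c "pwrite 0 1m" -c fsync $SCRATCH_MNT/victim >> $seqres.full
	$XFS_IO_PROG -x -c "inject free_extent" -c "truncate 0" $SCRATCH_MNT/victim

The trigger lives in the test itself, so it keeps working no matter how
the on-disk layout that a canned image would have captured changes.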

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Making statfs return qgroup info (for container use case etc.)

2016-10-06 Thread Dave Chinner
On Thu, Oct 06, 2016 at 02:51:58PM +0100, Tim Small wrote:
> Hello,
> 
> I use btrfs for containers (e.g. full OS style containers using lxc/lxd
> etc. rather than application containers).  btrfs de-duplication and
> send/receive are very useful features for this use-case.
> 
> Each container runs in its own subvolume, and I use quotas to stop one
> container exhausting the disk space of the others.
> 
> df shows the total space + free space for the entire filesystem - which
> is confusing for users within the containers.  Worse - they don't have
> any way of knowing how much quota they actually have left (other than du
> -xs / vs. whatever I've told them).
> 
> It'd be really nice if running e.g. statfs("/home", ...) within a
> container could be made to return the qgroup usage + limit for the path
> (e.g. by passing in an option at mount time - such as qgroup level
> maybe?) , instead of the global filesystem data in f_bfree f_blocks etc.

XFS does this with directory tree quotas. It was implemented 10 years
ago or so, IIRC...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC] Preliminary BTRFS Encryption

2016-09-15 Thread Dave Chinner
On Tue, Sep 13, 2016 at 09:39:46PM +0800, Anand Jain wrote:
> 
> This patchset adds btrfs encryption support.
> 
> The main objective of this series is to have bugs fixed and stability.
> I have verified with fstests to confirm that there is no regression.
> 
> A design write-up is coming next, however here below is the quick example
> on the cli usage. Please try out, let me know if I have missed something.

Yup - namely, that best practices say "do not roll your own encryption
infrastructure".

This is just my 2c worth - take it or leave it, don't bother flaming.
Keep in mind that I'm not picking on btrfs here - I asked similar
hard questions about the proposed f2fs encryption implementation.
That was a "copy and snowflake" version of the ext4 encryption code -
they made changes and now we have generic code and common
functionality between ext4 and f2fs.

> Also would like to mention that a review from the security experts is due,
> which is important and I believe those review comments can be accommodated
> without major changes from here.

That's a fairly significant red flag to me - security reviews need
to be done at the design phase against specific threat models -
security review is not a code/implementation review...

The ext4 developers got this right by publishing threat models and
design docs, which got quite a lot of review and feedback before
code was published for review.

https://docs.google.com/document/d/1ft26lUQyuSpiu6VleP70_npaWdRfXFoNnB8JYnykNTg/edit#heading=h.qmnirp22ipew

[small reorder of comments]

> As of now this patch set supports encryption per subvolume, as
> managing properties per subvolume is a kind of core to btrfs, which is
> easier for data center solution-ing, seamlessly persistent and easy to
> manage.

We've got dmcrypt for this sort of transparent "device level"
encryption. Do we really need another btrfs layer that re-implements
generic, robust, widely deployed, stable functionality?

What concerns me the most here is that it seems like nothing
has been learnt from the btrfs RAID5/6 debacle. i.e. the btrfs
reimplementation of existing robust, stable, widely deployed
infrastructure was fatally flawed and despite regular corruption
reports they were ignored for, what, 2 years? And then a /user/
spent the time to isolate the problem, and now several months later
it still hasn't been fixed. I haven't seen any developer interest in
fixing it, either.

This meets the definition of unmaintained software, and it sets a
poor example for how complex new btrfs features might be maintained
in the long term. Encryption simply cannot be treated like this - it
has to be right, and it has to be well maintained.

So what is being done differently to the RAID5/6 review process
this time that will make the new btrfs-specific encryption
implementation solid and have minimal risk of zero-day fatal flaws?
And how are you going to guarantee that it will be adequately
maintained several years down the track?

> Also yes, thanks for the emails, I hear, per file encryption and inline
> with vfs layer is also important, which is wip among other things in the
> list.

The generic file encryption code is solid, reviewed, tested and
already widely deployed via two separate filesystems. There is a
much wider pool of developers who will maintain it, review changes
and know all the traps that a new implementation might fall into.
There's a much bigger safety net here, which significantly lowers
the risk of zero-day fatal flaws in a new implementation and of
flaws in future modifications and enhancements.

Hence, IMO, the first thing to do is implement and make the generic
file encryption support solid and robust, not tack it on as an
afterthought for the magic btrfs encryption pixies to take care of.

Indeed, with the generic file encryption, btrfs may not even need
the special subvolume encryption pixies. i.e. you can effectively
implement subvolume encryption via configuration of a multi-user
encryption key for each subvolume and apply it to the subvolume tree
root at creation time. Then only users with permission to unlock the
subvolume key can access it.
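
To sketch what that could look like on top of the generic (v1) fscrypt
uapi - this assumes btrfs wires up the generic code, and the struct and
constant names below have moved between linux/fs.h and linux/fscrypt.h
across kernel versions, so treat it as illustrative only:

#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FS_IOC_SET_ENCRYPTION_POLICY, struct fscrypt_policy */

/*
 * Apply an encryption policy to a freshly created, still-empty subvolume
 * root.  Only users who can load the matching master key into the kernel
 * keyring can then create or read files below this point.
 */
static int encrypt_subvol_root(const char *subvol_path,
			       const unsigned char key_desc[FS_KEY_DESCRIPTOR_SIZE])
{
	struct fscrypt_policy policy;
	int fd, ret;

	memset(&policy, 0, sizeof(policy));
	policy.version = 0;	/* v1 policy */
	policy.contents_encryption_mode = FS_ENCRYPTION_MODE_AES_256_XTS;
	policy.filenames_encryption_mode = FS_ENCRYPTION_MODE_AES_256_CTS;
	policy.flags = FS_POLICY_FLAGS_PAD_16;
	memcpy(policy.master_key_descriptor, key_desc, FS_KEY_DESCRIPTOR_SIZE);

	fd = open(subvol_path, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;
	ret = ioctl(fd, FS_IOC_SET_ENCRYPTION_POLICY, &policy);
	close(fd);
	return ret;
}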

Once the generic file encryption is solid and fulfils the needs of
most users, then you can look to solving the less common threat
models that neither dmcrypt nor per-file encryption addresses. Only if
the generic code cannot be expanded to address specific threat
models should you then implement something that is unique to
btrfs

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v6 1/6] fstests: common: Introduce _post_mount_hook for btrfs

2016-09-14 Thread Dave Chinner
On Wed, Sep 14, 2016 at 09:55:22AM +0800, Qu Wenruo wrote:
> Introduce _post_mount_hook(), which will be executed after mounting
> scratch/test.
> 
> It's quite useful for fs(OK, only btrfs yet, again) which needs to
> use ioctl other than mount option to enable some of its feature.

Just implement a _btrfs_mount() function (similar to
_overlay_mount()) to do btrfs specific things at mount time.
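
i.e. something like the following sketch, following the existing helper
conventions in common/rc (the post-mount feature handling is just a
placeholder):

_btrfs_mount()
{
	local dev=$1
	local mnt=$2
	shift 2

	# extra arguments are mount options, as with the other *_mount helpers
	_mount -t btrfs $* $dev $mnt || return 1

	# btrfs-specific post-mount setup goes here, e.g. enabling features
	# that can only be switched on by ioctl after the mount
}
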
> Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
> ---
>  common/rc | 23 +++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/common/rc b/common/rc
> index 23c007a..631397f 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -321,6 +321,27 @@ _overlay_scratch_unmount()
>   $UMOUNT_PROG $SCRATCH_MNT
>  }
>  
> +_run_btrfs_post_mount_hook()
> +{
> + mnt_point=$1
> + for n in $ALWAYS_ENABLE_BTRFS_FEATURE; do

What's this magic, undefined, undocumented variable?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/3] writeback: allow for dirty metadata accounting

2016-09-12 Thread Dave Chinner
On Mon, Sep 12, 2016 at 10:56:04AM -0400, Josef Bacik wrote:
> I think that looping through all the sb's in the system would be
> kinda shitty for this tho, we want the "get number of dirty pages"
> part to be relatively fast.  What if I do something like the
> shrinker_control only for dirty objects. So the fs registers some
> dirty_objects_control, we call into each of those and get the counts
> from that.  Does that sound less crappy?  Thanks,

Hmmm - just an off-the-wall thought on this

If you're going to do that, then why wouldn't you simply use a
"shrinker" to do the metadata writeback rather than having a hook to
count dirty objects to pass to some other writeback code that calls
a hook to write the metadata?

That way filesystems can also implement dirty accounting and
"writers" for each cache of objects they currently implement
shrinkers for. i.e. just expanding shrinkers to be able to "count
dirty objects" and "write dirty objects" so that we can tell
filesystems to write back all their different metadata caches
proportionally to the size of the page cache and its dirty state.
The existing file data and inode writeback could then just be new
generic "superblock shrinker" operations, and the fs could have it's
own private metadata writeback similar to the private sb shrinker
callout we currently have...
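
To sketch the direction (illustrative only - count_objects/scan_objects
mirror the existing shrinker callbacks, while the two dirty-object
callbacks are hypothetical additions that don't exist in any kernel):

struct shrink_control;

struct shrinker_sketch {
	/* existing shrinker callbacks */
	unsigned long (*count_objects)(struct shrinker_sketch *s,
				       struct shrink_control *sc);
	unsigned long (*scan_objects)(struct shrinker_sketch *s,
				      struct shrink_control *sc);

	/* hypothetical: how many objects in this cache are dirty? */
	unsigned long (*count_dirty_objects)(struct shrinker_sketch *s,
					     struct shrink_control *sc);

	/* hypothetical: write back up to sc->nr_to_scan dirty objects,
	 * returning the number actually cleaned */
	unsigned long (*write_dirty_objects)(struct shrinker_sketch *s,
					     struct shrink_control *sc);
};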

And, in doing so, we might be able to completely hide memcg from the
writeback implementations similar to the way memcg is completely
hidden from the shrinker reclaim implementations...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/3] writeback: allow for dirty metadata accounting

2016-09-12 Thread Dave Chinner
On Mon, Sep 12, 2016 at 10:24:12AM -0400, Josef Bacik wrote:
> Dave your reply got eaten somewhere along the way for me, so all i
> have is this email.  I'm going to respond to your stuff here.

No worries, I'll do a 2-in-1 reply :P

> On 09/12/2016 03:34 AM, Jan Kara wrote:
> >On Mon 12-09-16 10:46:56, Dave Chinner wrote:
> >>On Fri, Sep 09, 2016 at 10:17:43AM +0200, Jan Kara wrote:
> >>>On Mon 22-08-16 13:35:01, Josef Bacik wrote:
> >>>>Provide a mechanism for file systems to indicate how much dirty
> >>>>metadata they are holding.  This introduces a few things
> >>>>
> >>>>1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> >>>>2) WB stat for dirty metadata.  This way we know if we need to try and
> >>>>call into the file system to write out metadata.  This could potentially
> >>>>be used in the future to make balancing of dirty pages smarter.
> >>>
> >>>So I'm curious about one thing: In the previous posting you have mentioned
> >>>that the main motivation for this work is to have a simple support for
> >>>sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
> >>>do the dirty accounting at page granularity. What are your plans to handle
> >>>this mismatch?
> >>>
> >>>The thing is you actually shouldn't miscount by too much as that could
> >>>upset some checks in mm checking how much dirty pages a node has directing
> >>>how reclaim should be done... But it's a question whether NR_METADATA_DIRTY
> >>>should be actually used in the checks in node_limits_ok() or in
> >>>node_pagecache_reclaimable() at all because once you start accounting dirty
> >>>slab objects, you are really on a thin ice...
> >>
> >>The other thing I'm concerned about is that it's a btrfs-only thing,
> >>which means having dirty btrfs metadata on a system with different
> >>filesystems (e.g. btrfs root/home, XFS data) is going to affect how
> >>memory balance and throttling is run on other filesystems. i.e. it's
> >>going to make a filesystem specific issue into a problem that
> >>affects global behaviour.
> >
> >Umm, I don't think it will be different than currently. Currently btrfs
> >dirty metadata is accounted as dirty page cache because they have this
> >virtual fs inode which owns all metadata pages.

Yes, so effectively they are treated the same as file data pages
w.r.t. throttling, writeback and reclaim

> >It is pretty similar to
> >e.g. ext2 where you have bdev inode which effectively owns all metadata
> >pages and these dirty pages account towards the dirty limits. For ext4
> >things are more complicated due to journaling and thus ext4 hides the fact
> >that a metadata page is dirty until the corresponding transaction is
> >committed.  But from that moment on dirty metadata is again just a dirty
> >pagecache page in the bdev inode.

Yeah, though those filesystems don't suffer from the uncontrolled
explosion of metadata that btrfs is suffering from, so simply
treating them as another dirty inode that needs flushing works just
fine.

> >So current Josef's patch just splits the counter in which btrfs metadata
> >pages would be accounted but effectively there should be no change in the
> >behavior.

Yup, I missed the addition to the node_pagecache_reclaimable() that
ensures reclaim sees the same number of dirty pages...

> >It is just a question whether this approach is workable in the
> >future when they'd like to track different objects than just pages in the
> >counter.

I don't think it can. Because the counters directly influence the
page LRU reclaim scanning algorithms, they can only be used to
account for pages that are in the LRUs. Other objects like slab
objects need to be accounted for and reclaimed by the shrinker
infrastructure.

Accounting for metadata writeback is a different issue - it could
track slab objects if we wanted to, but the problem is that it is
often difficult to determine the amount of IO needed to clean them,
so generic balancing is hard (e.g. the effect of inode write
clustering).

> +1 to what Jan said.  Btrfs's dirty metadata is always going to
> affect any other file systems in the system, no matter how we deal
> with it.  In fact it's worse with our btree_inode approach as the
> dirtied_when thing will likely screw somebody and make us skip
> writing out dirty metadata when we want to.

XFS takes care of metadata flushing with a periodic background work
controlled by /proc/sys/fs/xfs/xfssyncd_centisecs. We t
