a new kind of "No space left on device" error

2018-10-29 Thread Dave
This is one I have not seen before.

When running a simple, well-tested and well-used script that makes
backups using btrfs send | receive, I got these two errors:

At subvol snapshot
ERROR: rename o131621-1091-0 ->
usr/lib/node_modules/node-gyp/gyp/pylib/gyp/MSVSVersion.py failed: No
space left on device

At subvol snapshot
ERROR: rename o259-1095-0 -> myser/.bash_profile failed: No space left on device

I have run this script many, many times and never seen errors like
this. There is plenty of room on the device:

# btrfs fi df /mnt/
Data, single: total=18.01GiB, used=16.53GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=145.12MiB
GlobalReserve, single: total=24.53MiB, used=0.00B

# df -h /mnt/
Filesystem  Size  Used Avail Use% Mounted on
/dev/sdc2    54G   17G   36G  33% /mnt

The send | receive appears to have mostly succeeded because the final
expected size is about 17G, as shown above. That will use only about
1/3 of the available disk space, when completed. I don't see any
reason for "No space left on device" errors, but maybe somebody here
can spot a problem I am missing.


btrfs-qgroup-rescan using 100% CPU

2018-10-27 Thread Dave
I'm using btrfs and snapper on a system with an SSD. On this system
when I run `snapper -c root ls` (where `root` is the snapper config
for /), the process takes a very long time and top shows the following
process using 100% of the CPU:

kworker/u8:6+btrfs-qgroup-rescan

I have multiple computers (also with SSDs) set up the same way with
snapper and btrfs. On the other computers, `snapper -c root ls`
completes almost instantly, even on systems with many more snapshots.
This system has 20 total snapshots on `/`.

System info:

4.18.16-arch1-1-ARCH (Arch Linux)
btrfs-progs v4.17.1
scrub started at Sat Oct 27 18:37:21 2018 and finished after 00:04:02
total bytes scrubbed: 75.97GiB with 0 errors

Filesystem   Size  Used Avail Use% Mounted on
/dev/mapper/cryptdv  116G   77G   38G  67% /

Data, single: total=72.01GiB, used=71.38GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=3.50GiB, used=2.22GiB

What other info would be helpful? What troubleshooting steps should I try?


Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-09-20 Thread Dave Chinner
On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
> On Mon, Sep 10, 2018 at 07:06:46PM +1000, Dave Chinner wrote:
> > On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> > > On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > > For future development I've abandoned the entire dedupe_file_range
> > > > > approach.  I need to be able to read and dedupe the data blocks of
> > > > > the filesystem directly without having to deal with details like which
> > > > > files those blocks belong to, especially on filesystems with lots of
> > > > > existing deduped blocks and snapshots. 
> > > > 
> > > > IOWs, your desired OOB dedupe algorithm is:
> > > > 
> > > > a) ask the filesystem where all its file data is
> > > 
> > > Actually, it's "ask the filesystem where all the *new* file data is"
> > > since we don't want to read any unique data twice on subsequent runs.
> > 
> > Sorry, how do you read "unique data" twice? By definition, unique
> > data only occurs once
> 
> ...but once it has been read, we don't want to read it again.  Ever.
> Even better would be to read unique data less than 1.0 times on average.
> 
> > Oh, and you still need to hash the old data so you can find
> > collisions with the new data that got written. Unless, of course,
> > you are keeping your hash tree in a persistent database 
> 
> I do that.

OK, so you're slowly re-inventing the typical HSM implementation
that keeps a userspace database to track what is happening in
the filesystem. They do this to make better decisions when moving
less frequently accessed data out to slower tiers of storage to
keep space in the primary storage available as it fills up. You're
approaching dedupe in essentially the same way, for very similar
reasons.

> One big difference I am noticing in our approaches is latency.  ZFS (and
> in-kernel btrfs dedupe) provides minimal dedupe latency (duplicate
> data occupies disk space for zero time as it is never written to disk
> at all) but it requires more RAM for a given dedupe hit rate than any
> other dedupe implementation I've seen.  What you've written tells me
> XFS saves RAM by partitioning the data and relying on an existing but
> very large source of iops (sharing scrub reads with dedupe), but then
> the dedupe latency is the same as the scrub interval (the worst so far).

That's just the background maintenance implementation. xfs_fsr can
run this way, but it can also be run as a complete immediate scan,
or it can be pointed at a defined set of files to run on. We can
(and probably will) do exactly the same thing with dedupe.

> bees aims to have latency of a few minutes (ideally scanning data while
> it's still dirty in cache, but there's no good userspace API for that)
> though it's obviously not there yet.

Yup, this is pretty much the "I need immediate notification that a
file changed" problem that HSMs face. That's one of the things DMAPI
was used for - XFS has (now unused) dmapi event notification fields in
its inode structure that were used by DMAPI to configure the
notifications the filesystem was supposed to send userspace.

With no DMAPI in the future, people with custom HSM-like interfaces
based on dmapi are starting to turn to fanotify and friends to
provide them with the change notifications they require.

> > e.g. a soft requirement is that we need to scan the entire fs at
> > least once a month. 
> 
> I have to scan and dedupe multiple times per hour.  OK, the first-ever
> scan of a non-empty filesystem is allowed to take much longer, but after
> that, if you have enough spare iops for continuous autodefrag you should
> also have spare iops for continuous dedupe.

Yup, but using notifications avoids the need for even these scans - you'd
know exactly what data has changed, when it changed, and exactly
what you need to read to calculate the new hashes.

> > A simple piece-wise per-AG scanning algorithm (like we use in
> > xfs_repair) could easily work within a 3GB RAM per AG constraint and
> > would scale very well. We'd only need to scan 30-40 AGs in the hour,
> > and a single AG at 1GB/s will only take 2 minutes to scan. We can
> > then do the processing while the next AG gets scanned. If we've got
> > 10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
> > then we can scan 5-10AGs at once to keep th

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-09-10 Thread Dave Chinner
On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > For future development I've abandoned the entire dedupe_file_range
> > > approach.  I need to be able to read and dedupe the data blocks of
> > > the filesystem directly without having to deal with details like which
> > > files those blocks belong to, especially on filesystems with lots of
> > > existing deduped blocks and snapshots. 
> > 
> > IOWs, your desired OOB dedupe algorithm is:
> > 
> > a) ask the filesystem where all its file data is
> 
> Actually, it's "ask the filesystem where all the *new* file data is"
> since we don't want to read any unique data twice on subsequent runs.

Sorry, how do you read "unique data" twice? By definition, unique
data only occurs once

Oh, and you still need to hash the old data so you can find
collisions with the new data that got written. Unless, of course,
you are keeping your hash tree in a persistent database and can work
out how to prune stale entries out of it efficiently

And, well, see my comments about media scrubbing below.

> > b) read that used space to build a data hash index
> > c) on all the data hash collisions find the owners of the
> >colliding blocks
> > d) if the block data is the same dedupe it
> > 
> > I agree - that's a simple and effective algorithm. It's also the
> > obvious solution to an expert in the field.
> 
> It's obvious, but if you literally implement that, you end up with a
> filesystem shattered into billions of fragments.

Well, yes, that's obvious. I thought that "details omitted for
reasons of brevity" would be understood, not require omitted details
to be explained to me.

> At step c) I need to figure out _which_ colliding block I want to keep.
> That's where the algorithm needs to inspect the higher-level file
> structure to determine the physical and logical situation for each block.
> That leads to one of:
> 
>   a) store the hashes and block locations in the hash table
>   b) relocate the blocks (defrag)
>   c) replace some blocks with a reference to other blocks (dedupe)
> 
> This sounds a bit like what xfs_fsr does (or will do?).

xfs_fsr currently does the defrag bit w/ the swapext ioctl(*). It'll
get extended to also do dedupe scans eventually, as defrag and
dedupe obviously need to be integrated to prevent them from
duelling.

(*) swapext - in case you might not know about it - atomically
swaps a range of extents between two files without copying data.  So
you remember those dedupe selection problems you were talking about
(details omitted for brevity):

inode 1:ABCDEFG|H|I|JK|L|M|N|O|P
inode 2:a|b|c|d|e|f|g|hijklmop

The XFS solution will be to first defrag the master file (inode 1)
with the larger extents from inode 2:

swapext(H-P, h-p)

inode 1:ABCDEFG|hijklmop
inode 2:a|b|c|d|e|f|g|H|I|JK|L|M|N|O|P

And then we dedupe to the defragged file:

dedupe(inode 1, inode 2)

inode 1:ABCDEFG|hijklmop
inode 2:ABCDEFG|hijklmop

> Bees also operates under a constant-RAM constraint, so it doesn't operate
> in two distinct "collect data" and "act on data collected" passes,
> and cannot spend memory to store data about more than a few extents at
> any time.

I suspect that I'm thinking at a completely different scale to you.
I don't really care for highly constrained or optimal dedupe
algorithms  because those last few dedupe percentages really don't
matter that much to me. I care much more about using all the
resources we can and running as fast as we possibly can, then
providing the admin with means to throttle performance back to what
they need.

i.e. I'm concerned about how to effectively scan and dedupe PBs of
data, where scan rates may need to be measured in tens of GB/s.  In
these cases the limiting factor is either suboptimal IO patterns
(low IO throughput), or not having enough CPU and memory bandwidth
for hashing the data the IO engine is streaming through memory.
Once we get past a certain point, we'll gain far more by
parallelising the dedupe scan over lots of CPU than we ever will by
optimising an algorithm that cannot be parallelised. Hence we need
to start with a parallelisable algorithm, not an artificially
constrained one.

e.g. a soft requirement is that we need to scan the entire fs at
least once a month. That will need to be broken up into nightly
work, with a full 1PB filesystem needing 33+TB a night to be
scanned. We typically get 1

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-09-06 Thread Dave Chinner
On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > > >   - is documenting rejection on request alignment grounds
> > > > > (i.e. EINVAL) in the man page sufficient for app
> > > > > developers to understand what is going on here?
> > > > 
> > > > I think so.  The manpage says: "The filesystem does not support
> > > > reflinking the ranges of the given files", which (to my mind) covers
> > > > this case of not supporting dedupe of EOF blocks.
> > > 
> > > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> > > this; however, on btrfs, not supporting dedupe of EOF blocks means small
> > > files (one extent) cannot be deduped at all, because the EOF block holds
> > > a reference to the entire dst extent.  If a dedupe app doesn't go all the
> > > way to EOF on btrfs, then it should not attempt to dedupe any part of the
> > > last extent of the file as the benefit would be zero or slightly negative.
> > 
> > That's a filesystem implementation issue, not an API or application
> > issue.
> 
> The API and application issue remains even if btrfs is not considered.
> btrfs is just the worst case outcome.  Other filesystems still have
> fragmentation issues, and applications have efficiency-vs-capability
> tradeoffs to make if they can't rely on dedupe-to-EOF being available.
> 
> Tools like 'cp --reflink=auto' work by trying the best case, then falling
> back to a second choice if the first choice returns an error.

Well, yes. That's necessary for the "cp" tool to behave according to
user expectations.  That's not a kernel API issue - that's just an
implementation of an *application requirement*.  Indeed, this is
identical to the behaviour of rename() in "mv" - if rename fails
with -EXDEV, mv needs to fall back to a manual copy because the user
expects the file to be moved.

IOWS, these application level requirements you talk about are just
not relevant to the kernel API for dedupe/clone operations.

[snip]

> It is something that naive dedupe apps will do.  "naive" here meaning
> "does not dive deeply into the filesystem's physical structure (or survey
> the entire filesystem with FIEMAP) to determine that the thousand-refs
> situation does not exist at dst prior to invoking the dedupe() call."

/me sighs and points at FS_IOC_GETFSMAP

$ man ioctl_getfsmap

DESCRIPTION
   This ioctl(2) operation retrieves physical extent mappings
   for a filesystem.  This information can be used to discover
   which files are mapped to a physical block, examine free
   space, or find known bad blocks, among other things.
.

I don't really care about "enabling" naive, inefficient
applications. I care about applications that scale to huge
filesystems and can get the job done quickly and efficiently.
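
As an illustration of the "ask the filesystem where the data is" step,
here is a minimal sketch that walks the physical extent map with
FS_IOC_GETFSMAP, as described in the man page excerpt above. It assumes
a kernel and filesystem that implement the ioctl (e.g. XFS); the mount
point argument and batch size are arbitrary and error handling is
reduced to the bare minimum:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fsmap.h>

#define NR_EXTENTS      128     /* records fetched per ioctl call */

int main(int argc, char **argv)
{
        struct fsmap_head *head;
        int fd;

        fd = open(argc > 1 ? argv[1] : "/mnt", O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        head = calloc(1, sizeof(*head) + NR_EXTENTS * sizeof(struct fsmap));
        head->fmh_count = NR_EXTENTS;

        /* Low key is all zeroes (start of the fs); high key is "everything". */
        head->fmh_keys[1].fmr_device = UINT32_MAX;
        head->fmh_keys[1].fmr_flags = UINT32_MAX;
        head->fmh_keys[1].fmr_physical = UINT64_MAX;
        head->fmh_keys[1].fmr_owner = UINT64_MAX;
        head->fmh_keys[1].fmr_offset = UINT64_MAX;

        for (;;) {
                unsigned int i;

                if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0) {
                        perror("FS_IOC_GETFSMAP");
                        return 1;
                }
                if (!head->fmh_entries)
                        break;

                for (i = 0; i < head->fmh_entries; i++) {
                        struct fsmap *rec = &head->fmh_recs[i];

                        /* Skip free space, metadata and other special owners;
                         * what's left is the file data a dedupe scan would
                         * read and hash. */
                        if (rec->fmr_flags & FMR_OF_SPECIAL_OWNER)
                                continue;
                        printf("owner %llu offset %llu -> physical %llu, %llu bytes\n",
                               (unsigned long long)rec->fmr_owner,
                               (unsigned long long)rec->fmr_offset,
                               (unsigned long long)rec->fmr_physical,
                               (unsigned long long)rec->fmr_length);
                }
                /* Restart the next query after the last record returned. */
                head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
        }
        free(head);
        return 0;
}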

> > XFS doesn't have partial overlaps, we don't have nodatacow hacks,
> > and the subvol snapshot stuff I'm working on just uses shared data
> > extents so it's 100% compatible with dedupe.
> 
> If you allow this sequence of operations, you get partial overlaps:
> 
>   dedupe(fd1, 0, fd2, 0, 1048576);
> 
>   dedupe(fd2, 16384, fd3, 0, 65536);

Oh, I misunderstood - I thought you were referring to sharing partial
filesystem blocks (like at EOF) because that's what this discussion
was originally about. XFS supports the above just fine.

[snip]

tl;dr we don't need a new clone or dedupe API

> For future development I've abandoned the entire dedupe_file_range
> approach.  I need to be able to read and dedupe the data blocks of
> the filesystem directly without having to deal with details like which
> files those blocks belong to, especially on filesystems with lots of
> existing deduped blocks and snapshots. 

IOWs, your desired OOB dedupe algorithm is:

a) ask the filesystem where all its file data is
b) read that used space to build a data hash index
c) on all the data hash collisions find the owners of the
   colliding blocks
d) if the block data is the same dedupe it

I agree - that's a simple and effective algorithm. It's also the
obvious solution to an expert in the field.

> The file structure is frankly
> just noise for dedupe on large filesystems.

We learnt that lesson back in the late 1990s. xfsdump, xfs_fsr, all
the SGI^WHPE HSM scanning tools, etc all avoid the directory
structure

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-30 Thread Dave Chinner
On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > >   - is documenting rejection on request alignment grounds
> > > (i.e. EINVAL) in the man page sufficient for app
> > > developers to understand what is going on here?
> > 
> > I think so.  The manpage says: "The filesystem does not support
> > reflinking the ranges of the given files", which (to my mind) covers
> > this case of not supporting dedupe of EOF blocks.
> 
> Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> this; however, on btrfs, not supporting dedupe of EOF blocks means small
> files (one extent) cannot be deduped at all, because the EOF block holds
> a reference to the entire dst extent.  If a dedupe app doesn't go all the
> way to EOF on btrfs, then it should not attempt to dedupe any part of the
> last extent of the file as the benefit would be zero or slightly negative.

That's a filesystem implementation issue, not an API or application
issue.

> The app developer would need to be aware that such a restriction could
> exist on some filesystems, and be able to distinguish this from other
> cases that could lead to EINVAL.  Portable code would have to try a dedupe
> up to EOF, then if that failed, round down and retry, and if that failed
> too, the app would have to figure out which filesystem it's running on
> to know what to do next.  Performance demands the app know what the FS
> will do in advance, and avoid a whole class of behavior.

Nobody writes "portable" applications like that. They read the man
page first, and work out what the common subset of functionality is
and then code from that. Man page says:

"Disk filesystems generally require the offset and length arguments
to be aligned to the fundamental block size."

IOWs, code that wants to be compatible starts with supporting the general case.
i.e. a range rounded to filesystem block boundaries (it's already
run fstat() on the files it wants to dedupe to find their size,
yes?), hence ignoring the partial EOF block. Will just work on
everything.

Code that then wants to optimise for btrfs/xfs/ocfs quirks runs
fstatvfs to determine what fs it's operating on and applies the
necessary quirks. For btrfs it can extend the range to include the
partial EOF block, and hence will handle the implementation quirks
btrfs has with single extent dedupe.

Simple, reliable, and doesn't require any sort of flailing
about with offsets and lengths to avoid unexpected EINVAL errors.
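
To make that concrete, here is a minimal sketch of the general case:
stat the source, round the length down to a block boundary so the
partial EOF block is ignored, and issue FIDEDUPERANGE. The file names
are illustrative, statfs f_bsize is assumed to reflect the fundamental
block size, and error handling is minimal:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/vfs.h>
#include <linux/fs.h>

int main(void)
{
        struct file_dedupe_range *req;
        struct statfs sfs;
        struct stat st;
        int src, dst;

        src = open("src-file", O_RDONLY);
        dst = open("dst-file", O_RDWR);
        if (src < 0 || dst < 0 || fstat(src, &st) < 0 || fstatfs(src, &sfs) < 0) {
                perror("setup");
                return 1;
        }

        req = calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
        req->src_offset = 0;
        /* Round down to a whole number of blocks, ignoring the EOF block. */
        req->src_length = st.st_size - (st.st_size % sfs.f_bsize);
        req->dest_count = 1;
        req->info[0].dest_fd = dst;
        req->info[0].dest_offset = 0;

        if (req->src_length == 0) {
                printf("file smaller than one block, nothing to dedupe\n");
                return 0;
        }

        /* The ioctl is issued on the source fd; results come back per-dest. */
        if (ioctl(src, FIDEDUPERANGE, req) < 0) {
                perror("FIDEDUPERANGE");
                return 1;
        }
        if (req->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
                printf("data differs, nothing deduped\n");
        else if (req->info[0].status < 0)
                printf("dedupe failed: %s\n", strerror(-req->info[0].status));
        else
                printf("deduped %llu bytes\n",
                       (unsigned long long)req->info[0].bytes_deduped);
        return 0;
}

Note that the filesystem is free to dedupe fewer bytes than requested,
so a caller should look at bytes_deduped and status rather than assume
the whole range was shared.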

> btrfs dedupe reports success if the src extent is inline and the same
> size as the dst extent (i.e. file is smaller than one page).  No dedupe
> can occur in such cases--a clone results in a simple copy, so the best
> a dedupe could do would be a no-op.  Returning EINVAL there would break
> a few popular tools like "cp --reflink".  Returning OK but doing nothing
> seems to be the best option in that case.

Again, those are a filesystem implementation issues, not problems
with the API itself.

> > >   - should we just round down the EOF dedupe request to the
> > > block before EOF so dedupe still succeeds?
> > 
> > I've often wondered if the interface should (have) be(en) that we start
> > at src_off/dst_off and share as many common blocks as possible until we
> > find a mismatch, then tell userspace where we stopped... instead of like
> > now where we compare the entire extent and fail if any part of it
> > doesn't match.
> 
> The usefulness or harmfulness of that approach depends a lot on what
> the application expects the filesystem to do.
> 
> In btrfs, the dedupe operation acts on references to data, not the
> underlying data blocks.  If there are 1000 slightly overlapping references
> to a single contiguous range of data blocks in dst on disk, each dedupe
> operation acts on only one of those, leaving the other 999 untouched.
> If the app then submits 999 other dedupe requests, no references to the
> dst blocks remain and the underlying data blocks can be deleted.

Assuming your strawman is valid, if you have a thousand separate
references across the same set of data blocks on disk, then that data is
already heavily deduplicated.  Trying to optimise that further
seems misguided, way down the curve of diminishing returns.

> In a parallel universe (or a better filesystem, or a userspace emulation
  ^^^
> built out of dedupe and other ioctls), dedupe could work at the extent
> data (physical) level.  The app points at src and dst extent references
> (inode/offset/length tuples), and the filesystem figures out which
> physical blocks t

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-20 Thread Dave Chinner
On Mon, Aug 20, 2018 at 08:17:18PM -0500, Eric Sandeen wrote:
> 
> 
> On 8/20/18 7:49 PM, Dave Chinner wrote:
> > Upon successful completion of this ioctl, the number of
> > bytes successfully deduplicated is returned in bytes_deduped
> > and a status code for the deduplication operation is
> > returned in status.  If even a single byte in the range does
> > not match, the deduplication request will be ignored  and
> > status set  to FILE_DEDUPE_RANGE_DIFFERS.
> > 
> > This implies we can dedupe less than the entire range as long as the
> > entire range matches. If the entire range does not match, we have
> > to return FILE_DEDUPE_RANGE_DIFFERS, but in this case it does match
> > so we can pick and choose how much we deduplicate. How much we
> > dedupe is then returned as a byte count. In this case, it will be a
> > few bytes short of the entire length requested because we aligned
> > the dedupe inwards
> > 
> > Does that sound reasonable?
> 
> I had hoped that dedupe was advisory as Darrick wished for, but TBH my
> reading of that is no, if you ask for a range to be deduped and any of
> it differs, "even a single byte," you fail it all.

Yes, if the data differs, then it fails.  But that's not what I'm
questioning nor what I care about in this case. I'm asking whether
the filesystem gets to choose how much of the range of same data is
deduped when the file data is apparently the same.

> Why else would that last part be present, if the interface is free to
> ignore later parts that don't match and truncate the range to the
> matching portion?

I think there is a clear differentiation in the man page text
between "all the bytes are the same" and "how much of that range the
filesystem deduped". i.e. the man page does not say that the
filesystem *must* dedupe the entire range if all the data is the
same - it says the filesystem will return /how many bytes it
successfully deduplicated/. IOWs, "do nothing and return zero" is a
valid ->dedupe_file_range implementation.

AFAIC, it's the fact that we do data comparison from the page cache
before calling into the filesystem to check on-disk formats that is
the problem here. Having identical data in the page cache is not the
same thing as "the blocks on disk containing the user data are
identical". i.e. Filesystems deduplicate disk blocks, but they often
perform transformations on the data as it passes between the page
cache and disk blocks or have constraints on the format of the data
on disk. For example:

- encrypted files. Unencrypted data in the page cache is the
  same, therefore we can't return FILE_DEDUPE_RANGE_DIFFERS
  because the user will be able to see that they are the
  same. But they will differ on disk after encryption
  (e.g. because of a different nonce/seed or completely different
  keys) and so the filesystem should not dedupe them.

- the two files differ in on-disk format e.g. compressed
  vs encrypted.

- data might be inline with metadata

- tail packing

- dedupe request might be block aligned at file offsets, but
  file offsets are not aligned to disk blocks due to, say,
  multi-block data compression.

- on disk extent tracking granularity might be larger than
  filesystem block size (e.g. ext4's bigalloc mode) so can't
  be deduped on filesystem block size alignment.

- the source or destination inode is marked "NODATACOW" so
  can't contain shared extents

I'm sure there are others, but I think this is enough to demonstrate
that "user visible file data is the same" does not mean the
filesystem can dedupe it. The wording in the man page appears to
understand these limitations and hence it allows us to silently
ignore the partial EOF block when it comes to dedupe.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-20 Thread Dave Chinner
On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:

> > So why was this dedupe request even accepted by the kernel? Well,
> > I think there's a bug in the check just above this:
> > 
> > /* If we're linking to EOF, continue to the block boundary. */
> > if (pos_in + *len == isize)
> > blen = ALIGN(isize, bs) - pos_in;
> > else
> > blen = *len;
> > 
> > basically, when the "in" file dedupe/reflink range is to EOF, it
> > expands the range to include /all/ of the block that contains the
> > EOF byte. IOWs, it now requests us to dedupe /undefined data beyond
> > EOF/. But when we go to compare the data in these ranges, it
> > truncates the comparison to the length that the user asked for
> > (i.e. *len) and not the extended block length.
> > 
> > IOWs, it doesn't compare the bytes beyond EOF in the source block to
> > the data in the destination block it would replace, and so doesn't
> > fail the compare like it should.
..
> > i.e. the dedupe request should fail as we cannot dedupe the EOF
> > block.
> > 
> > The patch below does this for the VFS dedupe code. it's not a final
> > patch, it's just a demonstration of how this should never have got
> > past the range checks. Open questions:
> 
> (Here's my five minutes of XFS that I'm allowed per day... :/)
> 
> > - is documenting rejection on request alignment grounds
> >   (i.e. EINVAL) in the man page sufficient for app
> >   developers to understand what is going on here?
> 
> I think so.  The manpage says: "The filesystem does not support
> reflinking the ranges of the given files", which (to my mind) covers
> this case of not supporting dedupe of EOF blocks.

Ok.

> > - should we just round down the EOF dedupe request to the
> >   block before EOF so dedupe still succeeds?
> 
> I've often wondered if the interface should (have) be(en) that we start
> at src_off/dst_off and share as many common blocks as possible until we
> find a mismatch, then tell userspace where we stopped... instead of like
> now where we compare the entire extent and fail if any part of it
> doesn't match.

I think that the ioctl_fideduperange() man page description gives us
the flexibility to do this. That is, this text:

Upon successful completion of this ioctl, the number of
bytes successfully deduplicated is returned in bytes_deduped
and a status code for the deduplication operation is
returned in status.  If even a single byte in the range does
not match, the deduplication request will be ignored  and
status set  to FILE_DEDUPE_RANGE_DIFFERS.

This implies we can dedupe less than the entire range as long as the
entire range matches. If the entire range does not match, we have
to return FILE_DEDUPE_RANGE_DIFFERS, but in this case it does match
so we can pick and choose how much we deduplicate. How much we
dedupe is then returned as a byte count. In this case, it will be a
few bytes short of the entire length requested because we aligned
the dedupe inwards

Does that sound reasonable?

> > - should file cloning (i.e. reflink) allow the
> >   EOF block to be shared somewhere inside EOF in the
> >   destination file?
> 
> I don't think we should.
> 
> > That's stale data exposure, yes?
> 
> Haven't tested that, but seems likely.

Yeah, I haven't tested it either, but I can't see how the current
behaviour ends well.

> > - should clone only allow sharing of the EOF block if it's a
> >   either a full file clone or a matching range clone and
> >   isize is the same for both src and dest files?
> 
> The above sound reasonable for clone, though I would also allow clone to
> share the EOF block if we extend isize of the dest file.

Hmmm. That covers the only valid rule for sharing the EOF block,
right? i.e. That the source EOF lands exactly at or beyond the
destination EOF, so it remains the EOF block in both files?

FWIW, I just noticed that the ioctl_ficlonerange man page doesn't
document the return value on success of a clone.

> The above also nearly sound reasonable for dedupe too.  If we encounter
> EOF at the same places in the src range and the dest range, should we
> allow sharing there?  The post-eof bytes are undefined by definition;
> does undefined == undefined?

It's undefined, especially as different fs implementations may be
doing different things with partial blocks (e.g. tail packing).

From an XFS perspective, I don't think we should physically share
partial EOF blocks on dedupe because it means ex

[patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-19 Thread Dave Chinner
[cc linux-fsdevel now, too]

On Mon, Aug 20, 2018 at 09:11:26AM +1000, Dave Chinner wrote:
> [cc linux-...@vger.kernel.org]
> 
> On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> > 
> > Test that deduplication of an entire file that has a size that is not
> > aligned to the filesystem's block size into a different file does not
> > corrupt the destination's file data.

Ok, I've looked at this now. My first question is where did all the
magic offsets in this test come from? i.e. how was this bug
found and who is it affecting?

> > This test is motivated by a bug found in Btrfs which is fixed by the
> > following patch for the linux kernel:
> > 
> >   "Btrfs: fix data corruption when deduplicating between different files"
> > 
> > XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
> > with the same corruption as in Btrfs - some bytes of a block get replaced
> > with zeroes after the deduplication.
> 
Filipe, in future can you please report XFS bugs you find to the XFS
list the moment you find them. We shouldn't ever find out about a
data corruption bug we need to fix via an "oh, by the way" comment in
a commit message for a regression test.

This becomes much more relevant because of what I've just found

.

> > +# The first byte with a value of 0xae starts at an offset (2518890) which is not
> > +# a multiple of the block size.
> > +$XFS_IO_PROG -f \
> > +   -c "pwrite -S 0x6b 0 2518890" \
> > +   -c "pwrite -S 0xae 2518890 102398" \
> > +   $SCRATCH_MNT/foo | _filter_xfs_io
> > +
> > +# Create a second file with a length not aligned to the block size, whose bytes
> > +# all have the value 0x6b, so that its extent(s) can be deduplicated with the
> > +# first file.
> > +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | _filter_xfs_io
> > +
> > +# The file is filled with bytes having the value 0x6b from offset 0 to offset
> > +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
> > +echo "File content before deduplication:"
> > +od -t x1 $SCRATCH_MNT/foo

Please use "od -Ad -t x1 " so the file offsets reported by od
match the offsets used in the test (i.e. in decimal, not octal).

> > +
> > +# Now deduplicate the entire second file into a range of the first file that
> > +# also has all bytes with the value 0x6b. The destination range's end offset
> > +# must not be aligned to the block size and must be less than the offset of
> > +# the first byte with the value 0xae (byte at offset 2518890).
> > +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo \
> > +   | _filter_xfs_io

Ok, now it gets fun. dedupe to non-block-aligned ranges is supposed
to be rejected by the kernel in vfs_clone_file_prep_inodes(), i.e.
this check:

/* Only reflink if we're aligned to block boundaries */
if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
!IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
return -EINVAL;

And it's pretty clear that a length of 557771 is not block aligned
(being an odd number).

So why was this dedupe request even accepted by the kernel? Well,
I think there's a bug in the check just above this:

/* If we're linking to EOF, continue to the block boundary. */
if (pos_in + *len == isize)
blen = ALIGN(isize, bs) - pos_in;
else
blen = *len;

basically, when the "in" file dedupe/reflink range is to EOF, it
expands the range to include /all/ of the block that contains the
EOF byte. IOWs, it now requests us to dedupe /undefined data beyond
EOF/. But when we go to compare the data in these ranges, it
truncates the comparison to the length that the user asked for
(i.e. *len) and not the extended block length.

IOWs, it doesn't compare the bytes beyond EOF in the source block to
the data in the destination block it would replace, and so doesn't
fail the compare like it should.

And, well, btrfs has the same bug. extent_same_check_offsets()
extends the range for alignment so it passes alignment checks, but
then /explicitly/ uses the original length for the data compare
and dedupe. i.e:

   /* pass original length for comparison so we stay within i_size */
ret = btrfs_cmp_data(olen, cmp);
if (ret == 0)
ret = btrfs_clone(src, dst, loff, olen, len, dst_loff, 1);

This is what we should see if someone tried to dedupe the EOF block
of a file:

generic/505 - output mismatch (see /home/dave/src/xfstests-dev/resul

Re: [PATCH] generic: test for deduplication between different files

2018-08-19 Thread Dave Chinner
[cc linux-...@vger.kernel.org]

On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Test that deduplication of an entire file that has a size that is not
> aligned to the filesystem's block size into a different file does not
> corrupt the destination's file data.
> 
> This test is motivated by a bug found in Btrfs which is fixed by the
> following patch for the linux kernel:
> 
>   "Btrfs: fix data corruption when deduplicating between different files"
> 
> XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
> with the same corruption as in Btrfs - some bytes of a block get replaced
> with zeroes after the deduplication.

Filipe, in future can you please report XFS bugs you find to the XFS
list the moment you find them. We shouldn't ever find out about a
data corruption bug we need to fix via an "oh, by the way" comment in
a commit message for a regression test.

Cheers,

Dave.

> Signed-off-by: Filipe Manana 
> ---
>  tests/generic/505 | 84 +++
>  tests/generic/505.out | 33 
>  tests/generic/group   |  1 +
>  3 files changed, 118 insertions(+)
>  create mode 100755 tests/generic/505
>  create mode 100644 tests/generic/505.out
> 
> diff --git a/tests/generic/505 b/tests/generic/505
> new file mode 100755
> index ..5ee232a2
> --- /dev/null
> +++ b/tests/generic/505
> @@ -0,0 +1,84 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# FS QA Test No. 505
> +#
> +# Test that deduplication of an entire file that has a size that is not aligned
> +# to the filesystem's block size into a different file does not corrupt the
> +# destination's file data.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/reflink
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch_dedupe
> +
> +rm -f $seqres.full
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_scratch_mount
> +
> +# The first byte with a value of 0xae starts at an offset (2518890) which is not
> +# a multiple of the block size.
> +$XFS_IO_PROG -f \
> + -c "pwrite -S 0x6b 0 2518890" \
> + -c "pwrite -S 0xae 2518890 102398" \
> + $SCRATCH_MNT/foo | _filter_xfs_io
> +
> +# Create a second file with a length not aligned to the block size, whose bytes
> +# all have the value 0x6b, so that its extent(s) can be deduplicated with the
> +# first file.
> +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | _filter_xfs_io
> +
> +# The file is filled with bytes having the value 0x6b from offset 0 to offset
> +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
> +echo "File content before deduplication:"
> +od -t x1 $SCRATCH_MNT/foo
> +
> +# Now deduplicate the entire second file into a range of the first file that
> +# also has all bytes with the value 0x6b. The destination range's end offset
> +# must not be aligned to the block size and must be less than the offset of
> +# the first byte with the value 0xae (byte at offset 2518890).
> +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo \
> + | _filter_xfs_io
> +
> +# The bytes in the range starting at offset 2515659 (end of the deduplication
> +# range) and ending at offset 2519040 (start offset rounded up to the block
> +# size) must all have the value 0xae (and not replaced with 0x00 values).
> +# In other words, we should have exactly the same data we had before we asked
> +# for deduplication.
> +echo "File content after deduplication and before unmounting:"
> +od -t x1 $SCRATCH_MNT/foo
> +
> +# Unmount the filesystem and mount it again. This guarantees any file data in
> +# the page cache is dropped.
> +_scratch_cycle_mount
> +
> +# The bytes in the range starting at offset 2515659 (end of the deduplication
> +# range) and ending at offset 2519040 (start offset rounded up to the block
> +# size) must all have the value 0xae (and not replaced with 0x00 values).
> +# In other words, we should have exactly the same data we had before we asked
> +# for deduplication.
> +echo "File content after unmounting:"

Re: [PATCH V4] test online label ioctl

2018-05-17 Thread Dave Chinner
On Thu, May 17, 2018 at 10:28:26AM -0500, Eric Sandeen wrote:
> This tests the online label ioctl that btrfs has, which has been
> recently proposed for XFS.
> 
> To run, it requires an updated xfs_io with the label command and a
> filesystem that supports it
> 
> A slight change here to _require_xfs_io_command as well, so that tests
> which simply fail with "Inappropriate ioctl" can be caught in the
> common case.
> 
> Signed-off-by: Eric Sandeen <sand...@redhat.com>
> ---
> 
> Now with new and improved sequential V4 versioning!

Looks good now.

Reviewed-by: Dave Chinner <dchin...@redhat.com>
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH V3] test online label ioctl

2018-05-15 Thread Dave Chinner
On Tue, May 15, 2018 at 10:22:37AM -0500, Eric Sandeen wrote:
> This tests the online label ioctl that btrfs has, which has been
> recently proposed for XFS.
> 
> To run, it requires an updated xfs_io with the label command and a
> filesystem that supports it
> 
> A slight change here to _require_xfs_io_command as well, so that tests
> which simply fail with "Inappropriate ioctl" can be caught in the
> common case.
> 
> Signed-off-by: Eric Sandeen <sand...@redhat.com>
> ---
> 
> (urgh send as proper new thread, sorry)
> 
> This passes on btrfs, _notruns on xfs/ext4 of yore, and passes
> on xfs w/ my online label patchset (as long as xfs_io has the new
> capability)
> 
> V2: Add a max label length helper
> Set the proper btrfs max label length o_O oops
> Filter trailing whitespace from blkid output
> 
> V3: lowercase local vars, simplify max label len function

Looks good now, but I wondered about one thing the test doesn't
cover: can you clear the label by setting it to a null string?
i.e. you check max length bounds, but don't check empty string
behaviour...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH V2] test online label ioctl

2018-05-14 Thread Dave Chinner
On Mon, May 14, 2018 at 06:26:07PM -0500, Eric Sandeen wrote:
> On 5/14/18 6:11 PM, Dave Chinner wrote:
> > On Mon, May 14, 2018 at 12:09:16PM -0500, Eric Sandeen wrote:
> >> This tests the online label ioctl that btrfs has, which has been
> >> recently proposed for XFS.
> >>
> >> To run, it requires an updated xfs_io with the label command and a
> >> filesystem that supports it
> >>
> >> A slight change here to _require_xfs_io_command as well, so that tests
> >> which simply fail with "Inappropriate ioctl" can be caught in the
> >> common case.
> >>
> >> Signed-off-by: Eric Sandeen <sand...@redhat.com>
> >> ---
> 
> 
>  
> >> +# The maximum filesystem label length, not including terminating NULL
> >> +_label_get_max()
> >> +{
> >> +  case $FSTYP in
> >> +  xfs)
> >> +  MAXLEN=12
> >> +  ;;
> >> +  btrfs)
> >> +  MAXLEN=255
> >> +  ;;
> >> +  *)
> >> +  MAXLEN=0
> > 
> > Why not just _notrun here?
> 
> do we want to go through the other steps only to get here and notrun
> halfway through?
> 
> >> +  ;;
> >> +  esac
> >> +
> >> +  echo $MAXLEN
> >> +}
> >> +
> >> +_require_label_get_max()
> >> +{
> >> +  if [ $(_label_get_max) -eq 0 ]; then
> >> +  _notrun "$FSTYP does not define maximum label length"
> >> +  fi
> > 
> > And this check can go away?
> 
> We'd like to know if all the validations in this can complete before we
> get started, right?  That's why I did it this way.

Sure, just trying to be a bit defensive as people often forget
_requires directives when writing new tests

> > Also, shouldn't it be _require_online_label_change() ? And then
> > maybe you can move the xfs_io label command check inside it?
> 
> Well, we want to know a lot of things:
> 
> 1) can the kernel code for this filesystem support online label
> 2) does xfs_io support the label command
> 3) does this test know the maximum label length to test for this fs
> 
> the above stuff is for 3)

What I was suggesting was doing all of these in one function similar
to _require_xfs_sparse_inodes, _require_meta_uuid, etc:

_require_online_label_change()
{
# need xfs_io support
_require_xfs_io_command "label"

# need fstests knowledge of label size
[ $(_label_get_max) -eq 0 ] && _notrun "$FSTYP does not define maximum label length"

# need kernel support
$XFS_IO_PROG -c label $TEST_DIR > /dev/null 2>&1
[ $? -ne 0 ] && _notrun "Kernel does not support FS_IOC_GETLABEL"
}

Which also means that the label_f command in xfs_io needs to set the
exitcode to non-zero when the ioctl fails so that xfs_io returns
non-zero on failure...

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH V2] test online label ioctl

2018-05-14 Thread Dave Chinner
On Mon, May 14, 2018 at 12:09:16PM -0500, Eric Sandeen wrote:
> This tests the online label ioctl that btrfs has, which has been
> recently proposed for XFS.
> 
> To run, it requires an updated xfs_io with the label command and a
> filesystem that supports it
> 
> A slight change here to _require_xfs_io_command as well, so that tests
> which simply fail with "Inappropriate ioctl" can be caught in the
> common case.
> 
> Signed-off-by: Eric Sandeen <sand...@redhat.com>
> ---
> 
> (urgh send as proper new thread, sorry)
> 
> This passes on btrfs, _notruns on xfs/ext4 of yore, and passes
> on xfs w/ my online label patchset (as long as xfs_io has the new
> capability)
> 
> V2: Add a max label length helper
> Set the proper btrfs max label length o_O oops
> Filter trailing whitespace from blkid output
> 
> 
> diff --git a/common/rc b/common/rc
> index 9ffab7fd..88a99cff 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -2158,6 +2158,9 @@ _require_xfs_io_command()
>   echo $testio | grep -q "Inappropriate ioctl" && \
>   _notrun "xfs_io $command support is missing"
>   ;;
> + "label")
> + testio=`$XFS_IO_PROG -c "label" $TEST_DIR 2>&1`
> + ;;
>   "open")
>   # -c "open $f" is broken in xfs_io <= 4.8. Along with the fix,
>   # a new -C flag was introduced to execute one shot commands.
> @@ -2196,7 +2199,7 @@ _require_xfs_io_command()
>   rm -f $testfile 2>&1 > /dev/null
>   echo $testio | grep -q "not found" && \
>   _notrun "xfs_io $command support is missing"
> - echo $testio | grep -q "Operation not supported" && \
> + echo $testio | grep -q "Operation not supported\|Inappropriate ioctl" && \
>   _notrun "xfs_io $command failed (old kernel/wrong fs?)"
>   echo $testio | grep -q "Invalid" && \
>   _notrun "xfs_io $command failed (old kernel/wrong fs/bad args?)"
> @@ -3802,6 +3805,31 @@ _require_scratch_feature()
>   esac
>  }
>  
> +# The maximum filesystem label length, not including terminating NULL
> +_label_get_max()
> +{
> + case $FSTYP in
> + xfs)
> + MAXLEN=12
> + ;;
> + btrfs)
> + MAXLEN=255
> + ;;
> + *)
> + MAXLEN=0

Why not just _notrun here?

> + ;;
> + esac
> +
> + echo $MAXLEN
> +}
> +
> +_require_label_get_max()
> +{
> + if [ $(_label_get_max) -eq 0 ]; then
> + _notrun "$FSTYP does not define maximum label length"
> + fi

And this check can go away?

Also, shouldn't it be _require_online_label_change() ? And then
maybe you can move the xfs_io label command check inside it?

> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch
> +_require_xfs_io_command "label"
> +_require_label_get_max
> +
> +_scratch_mkfs > $seqres.full 2>&1
> +_scratch_mount
> +
> +# Make sure we can set & clear the label
> +$XFS_IO_PROG -c "label label.$seq" $SCRATCH_MNT
> +$XFS_IO_PROG -c "label" $SCRATCH_MNT
> +
> +# And that userspace can see it now, while mounted
> +# NB: some blkid has trailing whitespace, filter it out here
> +blkid -s LABEL $SCRATCH_DEV | _filter_scratch | sed -e "s/ $//g"
> +
> +# And that the it is still there when it's unmounted
> +_scratch_unmount
> +blkid -s LABEL $SCRATCH_DEV | _filter_scratch | sed -e "s/ $//g"

Ok, so "LABEL" here is a special blkid match token....

> +# And that it persists after a remount
> +_scratch_mount
> +$XFS_IO_PROG -c "label" $SCRATCH_MNT
> +
> +# And that a too-long label is rejected, beyond the interface max:
> +LABEL=$(perl -e "print 'l' x 257;")

And now you use it as a variable. Nasty and confusing. Using lower
case for local variables is the norm, right? I thought we were only
supposed to use upper case for global test harness variables...

But even making it "label" is problematic:

> +$XFS_IO_PROG -c "label $LABEL" $SCRATCH_MNT

because "label" is an xfs_io command. Perhaps call it "fs_label"?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root

2018-05-09 Thread Dave Chinner
On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote:
> On 5/8/18 7:38 PM, Dave Chinner wrote:
> > On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
> >> Hi,
> >>
> >> The VFS's super_block covers a variety of filesystem functionality. In
> >> particular we have a single structure representing both I/O and
> >> namespace domains.
> >>
> >> There are requirements to de-couple this functionality. For example,
> >> filesystems with more than one root (such as btrfs subvolumes) can
> >> have multiple inode namespaces. This starts to confuse userspace when
> >> it notices multiple inodes with the same inode/device tuple on a
> >> filesystem.
> > 
> > Devil's Advocate - I'm not looking at the code, I'm commenting on
> > architectural issues I see here.
> > 
> > The XFS subvolume work I've been doing explicitly uses a superblock
> > per subvolume. That's because subvolumes are designed to be
> > completely independent of the backing storage - they know nothing
> > about the underlying storage except to share a BDI for writeback
> > purposes and write to whatever block device the remapping layer
> > gives them at IO time.  Hence XFS subvolumes have (at this point)
> > their own unique s_dev, on-disk format configuration, journal, space
> > accounting, etc. i.e. They are fully independent filesystems in
> > their own right, and as such we do not have multiple inode
> > namespaces per superblock.
> 
> That's a fundamental difference between how your XFS subvolumes work and
> how btrfs subvolumes do.

Yup, you've just proved my point: this is not a "subvolume problem"
but rather a "multiple namespace per root" problem.

> There is no independence among btrfs
> subvolumes.  When a snapshot is created, it has a few new blocks but
> otherwise shares the metadata of the source subvolume.  The metadata
> trees are shared across all of the subvolumes and there are several
> internal trees used to manage all of it.

I don't need btrfs 101 stuff explained to me. :/

> a single transaction engine.  There are housekeeping and maintenance
> tasks that operate across the entire file system internally.  I
> understand that there are several problems you need to solve at the VFS
> layer to get your version of subvolumes up and running, but trying to
> shoehorn one into the other is bound to fail.

Actually, the VFS has provided everything I need for XFS subvolumes
so far without requiring any sort of modifications. That's the
perspective I'm approaching this from - if the VFS can do what we
need for XFS subvolumes, as well as overlay (which are effectively
VFS-based COW subvolumes), then lets see if we can make that work
for btrfs too.

> > So this doesn't sound like a "subvolume problem" - it's a "how do we
> > sanely support multiple independent namespaces per superblock"
> > problem. AFAICT, this same problem exists with bind mounts and mount
> > namespaces - they are effectively multiple roots on a single
> > superblock, but it's done at the vfsmount level and so the
> > superblock knows nothing about them.
> 
> In this case, you're talking about the user-visible file system
> hierarchy namespace that has no bearing on the underlying file system
> outside of per-mount flags.

Except that it tracks and provides infrastructure that allows user
visible  "multiple namespace per root" constructs. Subvolumes - as a
user visible namespace construct - are little different to bind
mounts in behaviour and functionality. 

How the underlying filesystem implements subvolumes is really up to
the filesystem, but we should be trying to implement a clean model
for "multiple namespaces on a single root" at the VFS so we have
consistent behaviour across all filesystems that implement similar
functionality.

FWIW, bind mounts and overlay also have similar inode number
namespace problems to what Mark describes for btrfs subvolumes.
e.g. overlay recently introduced the "xino" mount option to separate
the user presented inode number namespace for overlay inode from the
underlying parent filesystem inodes. How is that different to btrfs
subvolumes needing to present different inode number namespaces from
the underlying parent?

This sort of "namespace shifting" is needed for several different
pieces of information the kernel reports to userspace. The VFS
replacement for shiftfs is an example of this. So is inode number
remapping. I'm sure there's more.

My point is that if we are talking about infrastructure to remap
what userspace sees from different mountpoint views into a
filesystem, then it should be done above the filesystem layers in
the VFS so all filesystems behave th

Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root

2018-05-08 Thread Dave Chinner
On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
> Hi,
> 
> The VFS's super_block covers a variety of filesystem functionality. In
> particular we have a single structure representing both I/O and
> namespace domains.
> 
> There are requirements to de-couple this functionality. For example,
> filesystems with more than one root (such as btrfs subvolumes) can
> have multiple inode namespaces. This starts to confuse userspace when
> it notices multiple inodes with the same inode/device tuple on a
> filesystem.

Devil's Advocate - I'm not looking at the code, I'm commenting on
architectural issues I see here.

The XFS subvolume work I've been doing explicitly uses a superblock
per subvolume. That's because subvolumes are designed to be
completely independent of the backing storage - they know nothing
about the underlying storage except to share a BDI for writeback
purposes and write to whatever block device the remapping layer
gives them at IO time.  Hence XFS subvolumes have (at this point)
their own unique s_dev, on-disk format configuration, journal, space
accounting, etc. i.e. They are fully independent filesystems in
their own right, and as such we do not have multiple inode
namespaces per superblock.

So this doesn't sound like a "subvolume problem" - it's a "how do we
sanely support multiple independent namespaces per superblock"
problem. AFAICT, this same problem exists with bind mounts and mount
namespaces - they are effectively multiple roots on a single
superblock, but it's done at the vfsmount level and so the
superblock knows nothing about them.

So this kinda feels like there's still an impedance mismatch between
btrfs subvolumes being mounted as subtrees on the underlying root
vfsmount rather than being created as truly independent vfs
namespaces that share a superblock. To put that as a question: why
aren't btrfs subvolumes vfsmounts in their own right, and the unique
information subvolume information get stored in (or obtained from)
the vfsmount?

> In addition, it's currently impossible for a filesystem subvolume to
> have a different security context from it's parent. If we could allow
> for subvolumes to optionally specify their own security context, we
> could use them as containers directly instead of having to go through
> an overlay.

Again, XFS subvolumes don't have this problem. So really we need to
frame this discussion in terms of supporting multiple namespaces
within a superblock sanely, not subvolumes.

> I ran into this particular problem with respect to Btrfs some years
> ago and sent out a very naive set of patches which were (rightfully)
> not incorporated:
> 
> https://marc.info/?l=linux-btrfs=130074451403261=2
> https://marc.info/?l=linux-btrfs=130532890824992=2
> 
> During the discussion, one question did come up - why can't
> filesystems like Btrfs use a superblock per subvolume? There's a
> couple of problems with that:
> 
> - It's common for a single Btrfs filesystem to have thousands of
>   subvolumes. So keeping a superblock for each subvol in memory would
>   get prohibively expensive - imagine having 8000 copies of struct
>   super_block for a file system just because we wanted some separation
>   of say, s_dev.

That's no different to using individual overlay mounts for the
thousands of containers that are on the system. This doesn't seem to
be a major problem...

> - Writeback would also have to walk all of these superblocks -
>   again not very good for system performance.

Background writeback is backing device focussed, not superblock
focussed. It will only iterate the superblocks that have dirty
inodes on the bdi writeback lists, not all the superblocks on the
bdi. IOWs, this isn't a major problem except for sync() operations
that iterate superblocks.

> - Anyone wanting to lock down I/O on a filesystem would have to
> freeze all the superblocks. This goes for most things related to
> I/O really - we simply can't afford to have the kernel walking
> thousands of superblocks to sync a single fs.

Not with XFS subvolumes. Freezing the underlying parent filesystem
will effectively stop all IO from the mounted subvolumes by freezing
remapping calls before IO. Sure, those subvolumes aren't in a
consistent state, but we don't freeze userspace so none of the
application data is ever in a consistent state when filesystems are
frozen.

So, again, I'm not sure there's a /subvolume/ problem here. There's
definitely a "freeze hierarchy" problem, but that already exists and
it's something we talked about at LSFMM because we need to solve it
for reliable hibernation.

> It's far more efficient then to pull those fields we need for a
> subvolume namespace into their own structure.

I'm not convinced yet - it still feels like it's the wrong layer to
be solving the multiple namespace per superblock problem


Re: Symlink not persisted even after fsync

2018-04-16 Thread Dave Chinner
On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote:
> Thanks! As I mentioned before, this is useful. I have a follow-up
> question. Consider the following workload:
> 
>  creat foo
>  link (foo, A/bar)
>  fsync(foo)
>  crash
> 
> In this case, after the file system recovers, do we expect foo's link
> count to be 2 or 1? 

So, strictly ordered behaviour:

create foo:
- creates dirent in inode B and new inode A in an atomic
  transaction sequence #1

link foo -> A/bar
- creates dirent in inode C and bumps inode A link count in
  an atomic transaction sequence #2.

fsync foo
- looks at inode A, sees it's "last modification" sequence
  counter as #2
- flushes all transactions up to and including #2 to the
  journal.

See the dependency chain? Both the inodes and dirents in the create
operation and the link operation are chained to the inode foo via
the atomic transactions. Hence when we flush foo, we also flush the
dependent changes because of the change atomicity requirements

> I would say 2,

Correct, for strict ordering. But

> but POSIX is silent on this,

Well, it's not silent, POSIX explicitly allows for fsync() to do
nothing and report success. Hence we can't really look to POSIX to
define how fsync() should behave.

> so
> thought I would confirm. The tricky part here is we are not calling
> fsync() on directory A.

Right. But directory A has a dependent change linked to foo. If we
fsync() foo, we are persisting the link count change in that file,
and hence all the other changes related to that link count change
must also be flushed. Similarly, all the changes related to the
creation of foo must be flushed, too.

> In this case, its not a symlink; its a hard link, so I would say the
> link count for foo should be 2.

Right - that's the "reference counted object dependency" I referred
to. i.e. it's a bi-directional atomic dependency - either we show both
the new dirent and the link count change, or we show neither of
them.  Hence fsync on one object implies that we are also persisting
the related changes in the other object, too.
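
To make that ordering concrete, here is the workload above as a minimal
C sketch (directory A is assumed to already exist, and the crash itself
has to be injected externally, e.g. with dm-log-writes or a power cut -
this only illustrates the syscall sequence being discussed, it is not a
complete crash test):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* create foo: new inode + dirent in the cwd (transaction #1) */
        int fd = open("foo", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
                perror("open foo");
                return 1;
        }

        /* link foo -> A/bar: dirent in A + link count bump (transaction #2) */
        if (link("foo", "A/bar") < 0) {
                perror("link");
                return 1;
        }

        /*
         * fsync(foo): flushes all transactions up to and including #2, so
         * on a strictly ordered filesystem a post-crash link count of 2 is
         * the expected result.
         */
        if (fsync(fd) < 0) {
                perror("fsync");
                return 1;
        }
        close(fd);
        return 0;
}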

> But btrfs and F2FS show link count of
> 1 after a crash.

That may be valid if the dirent A/bar does not exist after recovery,
but it also means fsync() hasn't actually guaranteed inode changes
made prior to the fsync to be persistent on disk. i.e. that's a
violation of ordered metadata semantics and probably a bug.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Symlink not persisted even after fsync

2018-04-14 Thread Dave Chinner
On Fri, Apr 13, 2018 at 10:27:56PM -0500, Vijay Chidambaram wrote:
> Hi Dave,
> 
> Thanks for the reply.
> 
> I feel like we are not talking about the same thing here.
> 
> What we are asking is: if you perform
> 
> fsync(symlink)
> crash
> 
> can we expect it to see the symlink file in the parent directory after
> a crash given we didn't fsync the parent directory? Amir argues we
> can't expect it. Your first email seemed to argue we should expect it.

My first email comments on Amir's quoting of behaviours for files vs
directories on fsync, and then applying those caveats to symlinks.
It probably wasn't that clear I was mainly trying to point out that
symlinks are not files, so they have different ordering
requirements. i.e. that you have to look at ordering requirements of
the filesystems, not the fsync() specification to determine what the
fsync behaviour is supposed to be.

My second email clarifies the ordering behaviour that is expected
with symlinks and the reason why you'll see different behaviour to
files w.r.t. fsync and parent directories.

> ext4 and xfs have this behavior, which Amir argues is an
> implementation side-effect, and not intended.
> 
> >> >>> 1. symlink (foo, bar.tmp)
> >> >>> 2. open bar.tmp
> >> >>> 3. fsync bar.tmp
> >> >>> 4. rename(bar.tmp, bar)
> >> >>> 5. fsync bar
> >> >>> crash here
> 
> The second workload that Amir constructed just moves the symlink
> creation into a different transaction. In both workloads, we are
> creating or renaming new symlinks and calling fsync on them. In both
> cases we are not explicitly calling fsync on the parent directory.

Yes, I decided not to write all this "symlink behaviour is dependent
on initial conditions" stuff because, AFAIC, it is a pretty obvious
conclusion to draw from the ordering dependencies I described
between the symlink and the object it points at.

Script that demonstrates this is simple:

$ cat t.sh
#!/bin/bash

dev=/dev/vdb
mnt=/mnt/scratch
test_file=$mnt/foo

# 1. symlink (foo, bar.tmp)
# 2. open bar.tmp
# 3. fsync bar.tmp
# 4. rename(bar.tmp, bar)
# 5. fsync bar

umount $mnt
mount $dev $mnt

cd $mnt
rm -f foo bar.tmp bar
sync

# Don't fsync creation of foo, will see foo and bar.tmp after shutdown
touch foo
ln -s foo bar.tmp
xfs_io -c fsync bar.tmp
mv bar.tmp bar
xfs_io -c fsync bar
xfs_io -xc "shutdown" $mnt

cd ~
umount $mnt
mount $dev $mnt
cd $mnt
ls -l $mnt
rm -f foo bar.tmp bar
sync

# don't fsync foo or bar.tmp, will see foo and bar after shutdown
touch foo
xfs_io -c fsync foo

touch foo
ln -s foo bar.tmp
mv bar.tmp bar
xfs_io -c fsync bar
xfs_io -xc "shutdown" $mnt


cd ~
umount $mnt
mount $dev $mnt
cd $mnt
ls -l $mnt
rm -f foo bar.tmp bar
sync

# fsync creation of foo, will see only foo after shutdown
touch foo
xfs_io -c fsync foo

ln -s foo bar.tmp
xfs_io -c fsync bar.tmp
mv bar.tmp bar
xfs_io -c fsync bar
xfs_io -xc "shutdown" $mnt

cd ~
umount $mnt
mount $dev $mnt
cd $mnt
ls -l $mnt
$

And the output is:

$ sudo umount /mnt/scratch ; sudo mount /dev/vdb /mnt/scratch ; sudo ./t.sh ;
total 0
lrwxrwxrwx. 1 root root 3 Apr 14 09:52 bar.tmp -> foo
-rw-r--r--. 1 root root 0 Apr 14 09:52 foo
total 0
lrwxrwxrwx. 1 root root 3 Apr 14 09:52 bar -> foo
-rw-r--r--. 1 root root 0 Apr 14 09:52 foo
total 0
-rw-r--r--. 1 root root 0 Apr 14 09:52 foo
$

i.e. it depends on the state of the original file as to what is
captured by the fsync of that file through the symlink. i.e.
symlinks has no ordering dependency with the object resolved from
the path in the symlink.


> Note that we are not saying if we call fsync on symlink file, it
> should call fsync on the original file. We agree that should not be
> done as the symlink file and the original link are two distinct
> entities.

"symlink file" - there's no such thing. It's either a symlink or a
regular file and it can't be both.

And, well, you can't fsync a symlink *inode*, anyway, because you
can't open it directly for IO operations.
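
A quick user-space illustration of that point (it assumes a symlink
bar -> foo already exists; O_PATH is Linux specific, and the exact
errno from fsync() on such a descriptor is not something to rely on -
the point is simply that there is no I/O path behind a symlink):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* open("bar") resolves the symlink: fd refers to foo's inode */
        int fd = open("bar", O_RDONLY);
        if (fd >= 0) {
                fsync(fd);      /* persists foo, not the "bar" dirent */
                close(fd);
        }

        /*
         * The closest thing to a handle on the symlink itself is an
         * O_PATH|O_NOFOLLOW descriptor, and fsync() on that is rejected
         * because such a descriptor cannot be used for I/O.
         */
        int lfd = open("bar", O_PATH | O_NOFOLLOW);
        if (lfd >= 0 && fsync(lfd) < 0)
                printf("fsync on symlink fd: %s\n", strerror(errno));
        return 0;
}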

> I believe in most journaling/copy-on-write file systems today, if you
> call fsync on a new file, the fsync will persist the directory entry
> of the new file in the parent directory (even though POSIX doesn't
> really require this).

Yes, that's the strict ordering dependency thing I talked about, and
it was something that btrfs got wrong for an awful long time.

> It seems reasonable to extend this persistence
> courtesy to symlinks (considering them just as normal files).

And no, that's not reasonable, because symlinks only contain a path
instead of a direct reference to any filesystem object. i.e. it's an
indirect reference, and that can be clearly seen by the fact that
symlinks are created and removed without referencing the object they
point to or caring whether it even exists.

Re: Symlink not persisted even after fsync

2018-04-13 Thread Dave Chinner
On Fri, Apr 13, 2018 at 09:39:27AM -0500, Jayashree Mohan wrote:
> Hey Dave,
> 
> Thanks for clarifying the crash recovery semantics of strictly
> metadata ordered filesystems. We had a follow-up question in this
> case.
> 
> On Fri, Apr 13, 2018 at 8:16 AM, Amir Goldstein <amir7...@gmail.com> wrote:
> > On Fri, Apr 13, 2018 at 3:54 PM, Vijay Chidambaram <vi...@cs.utexas.edu> 
> > wrote:
> >> Hi Amir,
> >>
> >> Thanks for the reply!
> >>
> >> On Fri, Apr 13, 2018 at 12:52 AM, Amir Goldstein <amir7...@gmail.com> 
> >> wrote:
> >>>
> >>> Not a bug.
> >>>
> >>> From man 2 fsync:
> >>>
> >>> "Calling  fsync() does not necessarily ensure that the entry in the
> >>>  directory containing the file has also reached disk.  For that an
> >>>  explicit fsync() on a file descriptor for the directory is also needed."
> >>
> >>
> >> Are we understanding this right:
> >>
> >> ext4 and xfs fsync the parent directory if a sym link file is fsync-ed. But
> >> btrfs does not. Is this what we are seeing?
> >
> > Nope.
> >
> > You are seeing an unintentional fsync of parent, because both
> > parent update and symlink update are metadata updates that are
> > tracked by the same transaction.
> >
> > fsync of symlink forces the current transaction to the journal,
> > pulling in the parent update with it.
> >
> >
> >>
> >> I agree that fsync of a file does not mean fsync of its directory entry, 
> >> but
> >> it seems odd to do it for regular files and not for sym links. We do not 
> >> see
> >> this behavior if we use a regular file instead of a sym link file.
> >>
> >
> > fsync of regular file behaves differently than fsync of non regular file.
> > I suggest this read:
> > https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
> >
> >>>
> >>> There is a reason why this behavior is not being reproduces in
> >>> ext4/xfs, but you should be able to reproduce a similar issue
> >>> like this:
> >>>
> >>>
> >>> 1. symlink (foo, bar.tmp)
> >>> 2. open bar.tmp
> >>> 3. fsync bar.tmp
> >>> 4. rename(bar.tmp, bar)
> >>> 5. fsync bar
> >>> crash here
> >>
> 
> Going by your argument that all previous transactions that referenced
> the file being fsync-ed needs to be committed, should we expect xfs
> (and ext4) to persist file bar in this case?

No, that's not what I'm implying. I'm implying that there are
specific ordering dependencies that govern this behaviour, and
assuming that what the fsync man page says about files applies to
symlinks is not a valid assumption because files and symlinks are
not equivalent objects.

In these cases, you first have to ask "what are we actually running
fsync on?"

The fsync is being performed on the inode the symlink points to, not
the symlink. You can't directly open a symlink to fsync the symlink.

Then you have to ask "what is the dependency chain between the
parent directory, the symlink and the file it points to?"

the short answer is that symlinks have no direct relationship to the
object they point to. i.e. symlinks contain a path, not a reference
to a specific filesystem object.

IOWs, symlinks are really a directory construct, not a file.
However, there is no ordering dependency between a symlink and what
it points to. symlinks contain a path which needs to be resolved to
find out what it points to, and that may not even exist. Files have
no reference to symlinks that point at them, so there's no way we
can create an ordering dependency between file updates and any
symlink that points to them.

Directories, OTOH, contain a pointer to a reference counted object
(an inode) in their dirents. hence if you add/remove directory
dirents that point to an inode, you also have to modify the inode
link counts as it records how many directory entries point at it.
That's a bi-directional atomic modification ordering dependency
between directories and inodes they point at.

So when we look at symlinks, the parent directory has a ordering
dependency with the symlink inode, not whatever is found by
resolving the path in the symlink data. IOWs, there is no ordering
relationship between the symlink's parent directory and whatever the
symlink points at. i.e. it's a one-way relationship, and so there is
no reverse ordering dependency that requires fsync() on the file to
force synchronisation of a symlink it knows nothing about.

i.e. the ordering dependency that exists with symlinks is between
the symlink and its parent directory, not whatever the symlink
points to. Hence fsyncing whatever the symlink points to does not
guarantee that the symlink is made stable because the symlink is not
part of the dependency chain of the object being fsync()d

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Symlink not persisted even after fsync

2018-04-13 Thread Dave Chinner
On Fri, Apr 13, 2018 at 08:52:19AM +0300, Amir Goldstein wrote:
> On Thu, Apr 12, 2018 at 8:51 PM, Jayashree Mohan
> <jayashree2...@gmail.com> wrote:
> > Hi,
> >
> > We came across what seems to be a crash consistency bug on btrfs and
> > f2fs. When we create a symlink for a file and fsync the symlink, on
> > recovery from crash, the fsync-ed file is missing.
> >
> > You can reproduce this behaviour using this minimal workload :
> >
> > 1. symlink (foo, bar)
> > 2. open bar
> > 3. fsync bar
> > crash here
> >
> > When we recover, we find that file bar is missing. This behaviour
> > seems unexpected as the fsynced file is lost on a crash. ext4 and xfs
> > correctly recovers file bar. This seems like a bug. If not, could you
> > explain why?
> >
> 
> Not a bug.

Actually, for a filesystem with strictly ordered metadata recovery
semantics, it is a bug.

> From man 2 fsync:
> 
> "Calling  fsync() does not necessarily ensure that the entry in the
>  directory containing the file has also reached disk.  For that an
>  explicit fsync() on a file descriptor for the directory is also needed."

We've been through this before, many times. This caveat does not
apply to strictly ordered metadata filesystems. If you fsync a file
on an ordered metadata filesystem, then all previous transactions
that are needed to reference the file are also committed.

The behaviour from ext4 and XFS is correct for strictly ordered
filesystems.  This is not a "fsync requirement", nor is it a general
linux filesystem requirement. It is a requirement of the desired
filesystem crash recovery mechanisms

BTRFS is advertised as having strictly ordered metadata
recovery semantics, so it should behave the same way as ext4 and
XFS in tests like these. If it doesn't, then there's filesystem bugs
that need fixing...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 2/2] common/rc: raise mixed mode threshold to 1GB

2018-04-11 Thread Dave Chinner
On Wed, Apr 11, 2018 at 10:07:29PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osan...@fb.com>
> 
> generic/427 creates a 256 MB filesystem and then writes a 200 MB file,
> which fails on Btrfs if mixed mode is not enabled. Raise the threshold
> to 1GB, which is where we typically recommend mixed mode.
> 
> Signed-off-by: Omar Sandoval <osan...@fb.com>
> ---
>  common/rc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/common/rc b/common/rc
> index 5dbb8fe5..dc0062de 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -999,7 +999,7 @@ _scratch_mkfs_sized()
>   ;;
>  btrfs)
>   local mixed_opt=
> - (( fssize <= 100 * 1024 * 1024 )) && mixed_opt='--mixed'
> + (( fssize <= 1024 * 1024 * 1024 )) && mixed_opt='--mixed'
>   $MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV
>   ;;
>  jfs)

Makes sense.

Reviewed-by: Dave Chinner <dchin...@redhat.com>

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/2] generic/427: used mixed mode for Btrfs

2018-04-11 Thread Dave Chinner
On Wed, Apr 11, 2018 at 04:47:30PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osan...@fb.com>
> 
> This test creates a 256 MB filesystem and then writes a 200 MB file.
> With separate data and metadata, Btrfs will run out of data space since
> it needs to allocate some metadata space. Use mixed mode, which is the
> recommendation for smaller filesystems.
> 
> Signed-off-by: Omar Sandoval <osan...@fb.com>
> ---
>  tests/generic/427 | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/tests/generic/427 b/tests/generic/427
> index 9cde5f50..b2cb4526 100755
> --- a/tests/generic/427
> +++ b/tests/generic/427
> @@ -55,6 +55,8 @@ _require_test_program "feature"
>  _require_aiodio aio-dio-eof-race
>  
>  # limit the filesystem size, to save the time of filling filesystem
> +# Btrfs needs to use mixed mode for such a small filesystem
> +[ "$FSTYP" = "btrfs" ] && MKFS_OPTIONS="$MKFS_OPTIONS -M"


_scratch_mkfs_sized() should already be adding this for small btrfs
filesystems. Yup, it does:

btrfs)
        local mixed_opt=
        (( fssize <= 100 * 1024 * 1024 )) && mixed_opt='--mixed'
        $MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV
        ;;

>  _scratch_mkfs_sized $((256 * 1024 * 1024)) >>$seqres.full 2>&1

But this uses a filesystem larger than the mixed mode threshold in
_scratch_mkfs_sized(). Please update the generic threshold rather
than special case this test.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: generic test for fsync after fallocate

2018-04-09 Thread Dave Chinner
On Mon, Apr 09, 2018 at 11:00:52AM +0100, Filipe Manana wrote:
> On Mon, Apr 9, 2018 at 10:51 AM, Dave Chinner <da...@fromorbit.com> wrote:
> > On Sun, Apr 08, 2018 at 10:07:54AM +0800, Eryu Guan wrote:
> >> On Thu, Apr 05, 2018 at 10:56:14PM +0100, fdman...@kernel.org wrote:
> > You also cannot assume that two separate preallocations beyond EOF
> > are going to be contiguous (i.e. it could be two separate extents.
> 
> I really don't care about that.
> Just want to check that allocated space is not lost, don't care if
> that space is covered by 1, 2 or more extents, or whether they are
> contiguous or not.

Sure, but that's not what the test does :/

> > What you should just be checking is that there are extents allocated
> > covering EOF to 3MB, not the exactly size, shape and type of extents
> > are allocated.
> 
> How to do such check, for a generic test, without using fiemap?

We already do such checks using fiemap, yes?

i.e. there's a bunch of _filter*fiemap() functions in common/punch
which should show you exactly how to do this, i.e. filter all the
unwritten/data extents to be the same name so that they coalesce into
single ranges.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: generic test for fsync after fallocate

2018-04-09 Thread Dave Chinner
On Sun, Apr 08, 2018 at 10:07:54AM +0800, Eryu Guan wrote:
> On Thu, Apr 05, 2018 at 10:56:14PM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana <fdman...@suse.com>
> > 
> > Test that fsync operations preserve extents allocated with fallocate(2)
> > that are placed beyond a file's size.
> > 
> > This test is motivated by a bug found in btrfs where unwritten extents
> > beyond the inode's i_size were not preserved after a fsync and power
> > failure. The btrfs bug is fixed by the following patch for the linux
> > kernel:
> > 
> >  "Btrfs: fix loss of prealloc extents past i_size after fsync log replay"
> > 
> > Signed-off-by: Filipe Manana <fdman...@suse.com>
> 
> Hmm, xfs fails this test, while ext4 passes.
> 
> # diff -u tests/generic/483.out 
> /root/workspace/xfstests/results//xfs_4k_crc/generic/483.out.bad
> --- tests/generic/483.out   2018-04-07 23:35:00.55511 +0800
> +++ /root/workspace/xfstests/results//xfs_4k_crc/generic/483.out.bad
> 2018-04-07 23:39:48.780659707 +0800
> @@ -6,5 +6,5 @@
>  0: [0..511]: data
>  1: [512..2559]: unwritten
>  File baz fiemap:
> -0: [0..511]: data
> -1: [512..6143]: unwritten
> +0: [0..895]: data
> +1: [896..6143]: unwritten

Perfectly valid result from the test.

> > +# A file which already has a prealloc extent beyond its size.
> > +# The fsync done on it is motivated by differences in the btrfs 
> > implementation
> > +# of fsync (first fsync has different logic from subsequent fsyncs).
> > +$XFS_IO_PROG -f -c "pwrite -S 0xf1 0 256K" \
> > +-c "falloc -k 256K 768K" \
> > +-c "fsync" \
> > +$SCRATCH_MNT/baz >/dev/null

Delayed allocation extending the file to 256k means we'll have
speculative prealloc of data beyond 256k - it should be somewhere
around 448k IIRC.

Now, falloc doesn't guarantee unwritten extents are allocated - it
just guarantees space is allocated. unwritten extents are an
optimisation to avoid needing to zero the extents pre-allocated
within EOF. i.e. we can quite safely allocate data extents beyond
EOF rather than unwritten extents, and this is not a bug.

In this case, the delalloc extent is completely allocated as written
data, but nothing is written beyond EOF @ 256k.  Hence it reports as
a data extent, not an unwritten extent, and this is a perfectly
valid thing to do.

> > +# Allocate another extent beyond the size of file baz.
> > +$XFS_IO_PROG -c "falloc -k 1M 2M"\
> > +-c "fsync" \
> > +$SCRATCH_MNT/baz

You also cannot assume that two separate preallocations beyond EOF
are going to be contiguous (i.e. they could be two separate extents).

What you should just be checking is that there are extents allocated
covering EOF to 3MB, not the exact size, shape and type of the
extents that are allocated.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 3/3] fstests: generic: Check the fs after each FUA writes

2018-03-28 Thread Dave Chinner
On Wed, Mar 28, 2018 at 12:40:23PM +0800, Qu Wenruo wrote:
> Basic test case which triggers fsstress with dm-log-writes, and then
> check the fs after each FUA writes.
> With needed infrastructure and special handlers for journal based fs.
> 
> Signed-off-by: Qu Wenruo <w...@suse.com>
.
> For xfs:
> _check_xfs_filesystem: filesystem on /dev/mapper/test-scratch1 is 
> inconsistent (r)
> *** xfs_repair -n output ***
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> - scan filesystem freespace and inode maps...
> - found root inode chunk
> Phase 3 - for each AG...
> - scan (but don't clear) agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> - agno = 1
> - agno = 2
> bad symlink header ino 8409190, file block 0, disk block 1051147
> problem with symbolic link in inode 8409190
> would have cleared inode 8409190
> - agno = 3
> - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - check for inodes claiming duplicate blocks...
> - agno = 0
> - agno = 1
> - agno = 3
> - agno = 2
> entry "lb" in shortform directory 8409188 references free inode 8409190
> would have junked entry "lb" in directory inode 8409188
> bad symlink header ino 8409190, file block 0, disk block 1051147
> problem with symbolic link in inode 8409190
> would have cleared inode 8409190
> No modify flag set, skipping phase 5
> Phase 6 - check inode connectivity...
> - traversing filesystem ...
> entry "lb" in shortform directory inode 8409188 points to free inode 8409190
> would junk entry
> - traversal finished ...
> - moving disconnected inodes to lost+found ...
> Phase 7 - verify link counts...
> Maximum metadata LSN (1:396) is ahead of log (1:63).

That warning implies a write ordering problem - there's a write with
an LSN on disk that does not yet exist in the log. i.e. the last FUA
write to the log had 1:63 in it, yet there's metadata on disk that
could only be *issued* after a REQ_FLUSH|REQ_FUA log write with
1:396 in it was *completed*. If we've only replayed up to the
FUA write with 1:63 in it, then no metadata writes should have been
*issued* with 1:396 in it as the LSN that is stamped into metadata
is only updated on log IO completion

On first glance, this implies a bug in the underlying device write
replay code

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


btrfs send | receive ERROR rename file exists

2018-02-14 Thread Dave
Can anyone give me any ideas why this error would happen? The receive
directory started empty. Snapshot 3 exists at both source and target.

# mkdir /.snapshots/bw538/
# btrfs send -p /mnt/backup/root/laptop/3/snapshot/
/mnt/backup/root/laptop/4/snapshot/ | btrfs receive /.snapshots/bw538/
At subvol /mnt/backup/root/laptop/4/snapshot/
At snapshot snapshot
ERROR: rename o439638-244-0 -> var/cache/man/pl/index.db failed: File exists


how to make a cache directory nodatacow while also excluded from snapshots?

2018-01-15 Thread Dave
I want to exclude my ~/.cache directory from snapshots. The obvious
way to do this is to mount a btrfs subvolume at that location.

However, I also want the ~/.cache directory to be nodatacow. Since the
parent volume is COW, I believe it isn't possible to mount the
subvolume with different mount options.

What's the solution for achieving both of these goals?

I tried this without success:

chattr +C ~/.cache

Since ~/.cache is a btrfs subvolume, apparently that doesn't work.

lsattr ~/.cache

returns nothing.


Re: [PATCH v5 19/19] fs: handle inode->i_version more efficiently

2018-01-09 Thread Dave Chinner
On Tue, Jan 09, 2018 at 09:10:59AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> Since i_version is mostly treated as an opaque value, we can exploit that
> fact to avoid incrementing it when no one is watching. With that change,
> we can avoid incrementing the counter on writes, unless someone has
> queried for it since it was last incremented. If the a/c/mtime don't
> change, and the i_version hasn't changed, then there's no need to dirty
> the inode metadata on a write.
> 
> Convert the i_version counter to an atomic64_t, and use the lowest order
> bit to hold a flag that will tell whether anyone has queried the value
> since it was last incremented.
> 
> When we go to maybe increment it, we fetch the value and check the flag
> bit.  If it's clear then we don't need to do anything if the update
> isn't being forced.
> 
> If we do need to update, then we increment the counter by 2, and clear
> the flag bit, and then use a CAS op to swap it into place. If that
> works, we return true. If it doesn't then do it again with the value
> that we fetch from the CAS operation.
> 
> On the query side, if the flag is already set, then we just shift the
> value down by 1 bit and return it. Otherwise, we set the flag in our
> on-stack value and again use cmpxchg to swap it into place if it hasn't
> changed. If it has, then we use the value from the cmpxchg as the new
> "old" value and try again.
> 
> This method allows us to avoid incrementing the counter on writes (and
> dirtying the metadata) under typical workloads. We only need to increment
> if it has been queried since it was last changed.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> Reviewed-by: Jan Kara <j...@suse.cz>

Documentation helps a lot in understanding all this. Thanks for
adding it into the patch!
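
For anyone skimming the archive, the scheme described in the changelog
boils down to something like the following user-space sketch (plain C11
atomics and made-up names - an illustration of the algorithm, not the
kernel code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define QUERIED_FLAG 1ULL       /* lowest bit: "someone has queried this" */

/* Bump the counter only if it has been queried since the last bump. */
static bool maybe_inc_version(_Atomic uint64_t *v, bool force)
{
        uint64_t cur = atomic_load(v);

        for (;;) {
                if (!force && !(cur & QUERIED_FLAG))
                        return false;   /* nobody looked: skip the bump */

                /* step by 2 to skip the flag bit, and clear the flag */
                uint64_t new = (cur & ~QUERIED_FLAG) + 2;

                if (atomic_compare_exchange_weak(v, &cur, new))
                        return true;
                /* cur was refreshed with the current value: retry */
        }
}

/* Return the counter and mark it queried so the next change bumps it. */
static uint64_t query_version(_Atomic uint64_t *v)
{
        uint64_t cur = atomic_load(v);

        while (!(cur & QUERIED_FLAG) &&
               !atomic_compare_exchange_weak(v, &cur, cur | QUERIED_FLAG))
                ;       /* lost a race: cur was refreshed, try again */

        return cur >> 1;
}

int main(void)
{
        _Atomic uint64_t v = 0;

        maybe_inc_version(&v, false);   /* no-op: never been queried */
        printf("%llu\n", (unsigned long long)query_version(&v));   /* 0 */
        maybe_inc_version(&v, false);   /* queried above, so this bumps */
        printf("%llu\n", (unsigned long long)query_version(&v));   /* 1 */
        return 0;
}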

Acked-by: Dave Chinner <dchin...@redhat.com>
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing

2018-01-09 Thread Dave Chinner
On Tue, Jan 09, 2018 at 09:10:57AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> If XFS_ILOG_CORE is already set then go ahead and increment it.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> Acked-by: Darrick J. Wong <darrick.w...@oracle.com>

Acked-by: Dave Chinner <dchin...@redhat.com>

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 14/19] xfs: convert to new i_version API

2018-01-09 Thread Dave Chinner
On Tue, Jan 09, 2018 at 09:10:54AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> Acked-by: Darrick J. Wong <darrick.w...@oracle.com>

Looks ok, but I haven't tested it at all.

Acked-by: Dave Chinner <dchin...@redhat.com>

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2018-01-03 Thread Dave Chinner
On Wed, Jan 03, 2018 at 02:59:21PM +0100, Jan Kara wrote:
> On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> > I think we could probably block ->write_metadata if necessary via a
> > completion/wakeup style notification when a specific LSN is reached
> > by the log tail, but realistically if there's any amount of data
> > needing to be written it'll throttle data writes because the IO
> > pipeline is being kept full by background metadata writes
> 
> So the problem I'm concerned about is a corner case. Consider a situation
> when you have no dirty data, only dirty metadata but enough of them to
> trigger background writeback. How should metadata writeback behave for XFS
> in this case? Who should be responsible that wb_writeback() just does not
> loop invoking ->write_metadata() as fast as CPU allows until xfsaild makes
> enough progress?
>
> Thinking about this today, I think this looping prevention belongs to
> wb_writeback().

Well, background data writeback can block in two ways. One is during
IO submission when the request queue is full, the other is when all
dirty inodes have had some work done on them and have all been moved
to b_more_io - wb_writeback waits for the __I_SYNC bit to be cleared
on the last(?) inode on that list, hence backing off before
submitting more IO.

IOWs, there's a "during writeback" blocking mechanism as well as a
"between cycles" blocking mechanism.

> Sadly we don't have much info to decide how long to sleep
> before trying more writeback so we'd have to just sleep for
> some time if we found no writeback happened in the last writeback
> round before going through the whole writeback loop again.

Right - I don't think we can provide a generic "between cycles"
blocking mechanism for XFS, but I'm pretty sure we can emulate a
"during writeback" blocking mechanism to avoid busy looping inside
the XFS code.

e.g. if we get a writeback call that asks for 5% to be written,
and we already have a metadata writeback target of 5% in place,
that means we should block for a while. That would emulate request
queue blocking and prevent busy looping in this case
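
In sb->s_op->write_metadata() terms, that emulation could look something
like the sketch below. To be clear, this is hypothetical: the three
helpers xfs_wbc_to_push_target(), xfs_current_push_target() and
xfs_start_metadata_push() do not exist, they just stand in for however
the real implementation would map a writeback request onto an AIL push
target and track the current one.

static long xfs_write_metadata(struct super_block *sb,
                               struct writeback_control *wbc)
{
        struct xfs_mount        *mp = XFS_M(sb);
        xfs_lsn_t               target = xfs_wbc_to_push_target(mp, wbc);

        if (XFS_LSN_CMP(target, xfs_current_push_target(mp)) <= 0) {
                /*
                 * We're already pushing at least this far: block briefly
                 * instead of returning straight away so wb_writeback()
                 * can't busy loop calling back into us.
                 */
                schedule_timeout_interruptible(HZ / 50);
                return 0;
        }
        xfs_start_metadata_push(mp, target);
        return wbc->nr_to_write;
}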

> And
> ->write_metadata() for XFS would need to always return 0 (as in "no progress
> made") to make sure this busyloop avoidance logic in wb_writeback()
> triggers. ext4 and btrfs would return number of bytes written from
> ->write_metadata (or just 1 would be enough to indicate some progress in
> metadata writeback was made and busyloop avoidance is not needed).

Well, if we block for a little while, we can indicate that progress
has been made and this whole mess would go away, right?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2018-01-02 Thread Dave Chinner
On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote:
> On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> > On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > > IOWs, treating metadata like it's one great big data inode doesn't
> > > > > seem to me to be the right abstraction to use for this - in most
> > > > > fileystems it's a bunch of objects with a complex dependency tree
> > > > > and unknown write ordering, not an inode full of data that can be
> > > > > sequentially written.
> > > > > 
> > > > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > > > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > > > > sync based operations. That way different filesystems can ignore the
> > > > > parts they don't need simply by not implementing those operations,
> > > > > and the writeback code doesn't need to try to cater for all
> > > > > operations through the one op. The writeback code should be cleaner,
> > > > > the filesystem code should be cleaner, and we can tailor the work
> > > > > guidelines for each operation separately so there's less mismatch
> > > > > between what writeback is asking and how filesystems track dirty
> > > > > metadata...
> > > > 
> > > > I agree that writeback for memory cleaning and writeback for data 
> > > > integrity
> > > > are two very different things especially for metadata. In fact for data
> > > > integrity writeback we already have ->sync_fs operation so there the
> > > > functionality gets duplicated. What we could do is that in
> > > > writeback_sb_inodes() we'd call ->write_metadata only when
> > > > work->for_kupdate or work->for_background is set. That way 
> > > > ->write_metadata
> > > > would be called only for memory cleaning purposes.
> > > 
> > > That makes sense, but I still think we need a better indication of
> > > how much writeback we need to do than just "writeback this chunk of
> > > pages". That "writeback a chunk" interface is necessary to share
> > > writeback bandwidth across numerous data inodes so that we don't
> > > starve any one inode of writeback bandwidth. That's unnecessary for
> > > metadata writeback on a superblock - we don't need to share that
> > > bandwidth around hundreds or thousands of inodes. What we actually
> > > need to know is how much writeback we need to do as a total of all
> > > the dirty metadata on the superblock.
> > > 
> > > Sure, that's not ideal for btrfs and mayext4, but we can write a
> > > simple generic helper that converts "flush X percent of dirty
> > > metadata" to a page/byte chunk as the current code does. DOing it
> > > this way allows filesystems to completely internalise the accounting
> > > that needs to be done, rather than trying to hack around a
> > > writeback accounting interface with large impedance mismatches to
> > > how the filesystem accounts for dirty metadata and/or tracks
> > > writeback progress.
> > 
> > Let me think loud on how we could tie this into how memory cleaning
> > writeback currently works - the one with for_background == 1 which is
> > generally used to get amount of dirty pages in the system under control.
> > We have a queue of inodes to write, we iterate over this queue and ask each
> > inode to write some amount (e.g. 64 M - exact amount depends on measured

It's a maximum of 1024 pages per inode.

> > writeback bandwidth etc.). Some amount from that inode gets written and we
> > continue with the next inode in the queue (put this one at the end of the
> > queue if it still has dirty pages). We do this until:
> > 
> > a) the number of dirty pages in the system is below background dirty limit
> >and the number dirty pages for this device is below background dirty
> >limit for this device.
> > b) run out of dirty inodes on this device
> > c) someone queues different type of writeback
> > 
> > And we need to somehow incorporate metadata writeback into this loop. I see
> > two questions here:
> > 
> > 1) When / how often should we ask for metadata writeback?
> > 2) How much to ask to write in one go?
> > 
> > The second question is especially tricky in t

Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-19 Thread Dave Chinner
On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> > > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > This is just one of those things that's going to be slightly shitty.  
> > > It's the
> > > same for memory reclaim, all of those places use pages so we just take
> > > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.
> > 
> > Ok, so that isn't exactly easy to deal with, because all our
> > metadata writeback is based on log sequence number targets (i.e. how
> > far to push the tail of the log towards the current head). We've
> > actually got no idea how pages/bytes actually map to a LSN target
> > because while we might account a full buffer as dirty for memory
> > reclaim purposes (up to 64k in size), we might have only logged 128
> > bytes of it.
> > 
> > i.e. if we are asked to push 2MB of metadata and we treat that as
> > 2MB of log space (i.e. push target of tail LSN + 2MB) we could have
> > logged several tens of megabytes of dirty metadata in that LSN
> > range and have to flush it all. OTOH, if the buffers are fully
> > logged, then that same target might only flush 1.5MB of metadata
> > once all the log overhead is taken into account.
> > 
> > So there's a fairly large disconnect between the "flush N bytes of
> > metadata" API and the "push to a target LSN" that XFS uses for
> > flushing metadata in aged order. I'm betting that extN and otehr
> > filesystems might have similar mismatches with their journal
> > flushing...
> 
> Well, for ext4 it isn't as bad since we do full block logging only. So if
> we are asked to flush N pages, we can easily translate that to number of fs
> blocks and flush that many from the oldest transaction.
> 
> Couldn't XFS just track how much it has cleaned (from reclaim perspective)
> when pushing items from AIL (which is what I suppose XFS would do in
> response to metadata writeback request) and just stop pushing when it has
> cleaned as much as it was asked to?

If only it were that simple :/

To start with, flushing the dirty objects (such as inodes) to their
backing buffers does not mean that the object is clean once the
writeback completes. XFS has decoupled in-memory objects with
logical object logging rather than logging physical buffers, and
so objects can be modified and dirtied while the inode buffer
is being written back. Hence if we just count things like "buffer
size written" it's not actually a correct account of the amount of
dirty metadata we've cleaned. If we don't get that right, it'll
result in accounting errors and incorrect behaviour.

The bigger problem, however, is that we have no channel to return
flush information from the AIL pushing to whatever caller asked for
the push. Pushing metadata is completely decoupled from every other
subsystem. i.e. the caller asked the xfsaild to push to a specific
LSN (e.g. to free up a certain amount of log space for new
transactions), and *nothing* has any idea of how much metadata we'll
need to write to push the tail of the log to that LSN.

It's also completely asynchronous - there's no mechanism for waiting
on a push to a specific LSN. Anything that needs a specific amount
of log space to be available waits in ordered ticket queues on the
log tail moving forwards. The only interfaces that have access to
the log tail ticket waiting is the transaction reservation
subsystem, which cannot be used during metadata writeback because
that's a guaranteed deadlock vector

Saying "just account for bytes written" assumes directly connected,
synchronous dispatch metadata writeback infrastructure which we
simply don't have in XFS. "just clean this many bytes" doesn't
really fit at all because we have no way of referencing that to the
distance we need to push the tail of the log. An interface that
tells us "clean this percentage of dirty metadata" is much more
useful because we can map that easily to a log sequence number
based push target
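
As a rough illustration of that mapping (hypothetical code - none of the
example_*() helpers exist, the point is only that a relative "clean this
percentage" request translates directly into the LSN push targets that
XFS metadata writeback already works in, while a byte count does not):

static xfs_lsn_t
example_metadata_push_target(struct xfs_mount *mp, unsigned int pct)
{
        xfs_lsn_t       tail = example_ail_tail_lsn(mp);   /* oldest dirty item */
        xfs_lsn_t       head = example_log_head_lsn(mp);   /* newest dirty item */
        uint64_t        span = example_lsn_distance(head, tail);

        /* push the AIL pct% of the way from the tail towards the head */
        return example_lsn_advance(tail, span * pct / 100);
}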

> > IOWs, treating metadata like it's one great big data inode doesn't
> > seem to me to be the right abstraction to use for this - in most
> > fileystems it's a bunch of objects with a complex dependency tree
> > and unknown write ordering, not an inode full of data that can be
> > sequentially written.
> > 
> > Maybe we need multiple ops with well defined behaviours. e.g.
> > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > sync based operations. That way different filesystems can igno

Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-18 Thread Dave Chinner
On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> [PATCH] SQUASH: add memory barriers around i_version accesses

Why explicit memory barriers rather than annotating the operations
with the required semantics and getting the barriers the arch
requires automatically?  I suspect this should be using
atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
the atomic_cmpxchg needs to have release semantics to match the
acquire semantics needed for the load of the current value.

From include/linux/atomic.h:

 * For compound atomics performing both a load and a store, ACQUIRE
 * semantics apply only to the load and RELEASE semantics only to the
 * store portion of the operation. Note that a failed cmpxchg_acquire
 * does -not- imply any memory ordering constraints.

Memory barriers hurt my brain. :/

At minimum, shouldn't the atomic op specific barriers be used rather
than full memory barriers? i.e:

> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..39ec9aa9e08e 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -87,6 +87,25 @@ static inline void
>  inode_set_iversion_raw(struct inode *inode, const u64 val)
>  {
>   atomic64_set(&inode->i_version, val);
> + smp_wmb();

smp_mb__after_atomic();
.
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> + smp_rmb();

smp_mb__before_atomic();

> + return atomic64_read(&inode->i_version);
>  }

And, of course, these will require comments explaining what they
match with and what they are ordering.

> @@ -152,7 +171,7 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
>   u64 cur, old, new;
>  
> - cur = (u64)atomic64_read(&inode->i_version);
> + cur = inode_peek_iversion_raw(inode);
>   for (;;) {
>   /* If flag is clear then we needn't do anything */
>   if (!force && !(cur & I_VERSION_QUERIED))

cmpxchg in this loop needs a release barrier so everyone will
see the change?

> @@ -183,23 +202,6 @@ inode_inc_iversion(struct inode *inode)
>   inode_maybe_inc_iversion(inode, true);
>  }
>  
> -/**
> - * inode_peek_iversion_raw - grab a "raw" iversion value
> - * @inode: inode from which i_version should be read
> - *
> - * Grab a "raw" inode->i_version value and return it. The i_version is not
> - * flagged or converted in any way. This is mostly used to access a 
> self-managed
> - * i_version.
> - *
> - * With those filesystems, we want to treat the i_version as an entirely
> - * opaque value.
> - */
> -static inline u64
> -inode_peek_iversion_raw(const struct inode *inode)
> -{
> - return atomic64_read(&inode->i_version);
> -}
> -
>  /**
>   * inode_iversion_need_inc - is the i_version in need of being incremented?
>   * @inode: inode to check
> @@ -248,7 +250,7 @@ inode_query_iversion(struct inode *inode)
>  {
>   u64 cur, old, new;
>  
> - cur = atomic64_read(&inode->i_version);
> + cur = inode_peek_iversion_raw(inode);
>   for (;;) {
>   /* If flag is already set, then no need to swap */
>   if (cur & I_VERSION_QUERIED)

cmpxchg in this loop needs a release barrier so everyone will
see the change on load?
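
For reference, an annotated-atomics version of the two raw accessors
quoted earlier in this patch would look something like the sketch below
(illustrative only, using the generic atomic64_*_acquire/_release
variants from linux/atomic.h rather than open-coded barriers; the
cmpxchg in the increment and query loops would become
atomic64_cmpxchg_release() in the same spirit):

static inline void
inode_set_iversion_raw(struct inode *inode, const u64 val)
{
        /* release: order the i_version store after the changes it covers */
        atomic64_set_release(&inode->i_version, val);
}

static inline u64
inode_peek_iversion_raw(const struct inode *inode)
{
        /* acquire: pairs with the release ordering of the updates */
        return atomic64_read_acquire(&inode->i_version);
}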

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 01/19] fs: new API for handling inode->i_version

2017-12-16 Thread Dave Chinner
On Sat, Dec 16, 2017 at 08:46:38AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> Add a documentation blob that explains what the i_version field is, how
> it is expected to work, and how it is currently implemented by various
> filesystems.
> 
> We already have inode_inc_iversion. Add several other functions for
> manipulating and accessing the i_version counter. For now, the
> implementation is trivial and basically works the way that all of the
> open-coded i_version accesses work today.
> 
> Future patches will convert existing users of i_version to use the new
> API, and then convert the backend implementation to do things more
> efficiently.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> ---
>  include/linux/fs.h | 200 
> ++---
>  1 file changed, 192 insertions(+), 8 deletions(-)

Just a random Sunday morning coffee musing...

I was just wondering if it would be better to split this stuff out
into its own header file now? include/linux/fs.h is already a
massive header file (~3500 lines) and changes cause tree-wide
rebuilds, so maybe it would be better to split relatively isolated
functionality like this out while it's being reworked and you're
already touching every file that uses it?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-12 Thread Dave Chinner
On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > From: Josef Bacik <jba...@fb.com>
> > > 
> > > Now that we have metadata counters in the VM, we need to provide a way to 
> > > kick
> > > writeback on dirty metadata.  Introduce super_operations->write_metadata. 
> > >  This
> > > allows file systems to deal with writing back any dirty metadata we need 
> > > based
> > > on the writeback needs of the system.  Since there is no inode to key off 
> > > of we
> > > need a list in the bdi for dirty super blocks to be added.  From there we 
> > > can
> > > find any dirty sb's on the bdi we are currently doing writeback on and 
> > > call into
> > > their ->write_metadata callback.
> > > 
> > > Signed-off-by: Josef Bacik <jba...@fb.com>
> > > Reviewed-by: Jan Kara <j...@suse.cz>
> > > Reviewed-by: Tejun Heo <t...@kernel.org>
> > > ---
> > >  fs/fs-writeback.c| 72 
> > > 
> > >  fs/super.c   |  6 
> > >  include/linux/backing-dev-defs.h |  2 ++
> > >  include/linux/fs.h   |  4 +++
> > >  mm/backing-dev.c |  2 ++
> > >  5 files changed, 80 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 987448ed7698..fba703dff678 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct 
> > > bdi_writeback *wb,
> > >   return pages;
> > >  }
> > >  
> > > +static long writeback_sb_metadata(struct super_block *sb,
> > > +   struct bdi_writeback *wb,
> > > +   struct wb_writeback_work *work)
> > > +{
> > > + struct writeback_control wbc = {
> > > + .sync_mode  = work->sync_mode,
> > > + .tagged_writepages  = work->tagged_writepages,
> > > + .for_kupdate= work->for_kupdate,
> > > + .for_background = work->for_background,
> > > + .for_sync   = work->for_sync,
> > > + .range_cyclic   = work->range_cyclic,
> > > + .range_start= 0,
> > > + .range_end  = LLONG_MAX,
> > > + };
> > > + long write_chunk;
> > > +
> > > + write_chunk = writeback_chunk_size(wb, work);
> > > + wbc.nr_to_write = write_chunk;
> > > + sb->s_op->write_metadata(sb, &wbc);
> > > + work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > +
> > > + return write_chunk - wbc.nr_to_write;
> > 
> > Ok, writeback_chunk_size() returns a page count. We've already gone
> > through the "metadata is not page sized" dance on the dirty
> > accounting side, so how are we supposed to use pages to account for
> > metadata writeback?
> > 
> 
> This is just one of those things that's going to be slightly shitty.  It's the
> same for memory reclaim, all of those places use pages so we just take
> METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.

Ok, so that isn't exactly easy to deal with, because all our
metadata writeback is based on log sequence number targets (i.e. how
far to push the tail of the log towards the current head). We've
actually got no idea how pages/bytes actually map to a LSN target
because while we might account a full buffer as dirty for memory
reclaim purposes (up to 64k in size), we might have only logged 128
bytes of it.

i.e. if we are asked to push 2MB of metadata and we treat that as
2MB of log space (i.e. push target of tail LSN + 2MB) we could have
logged several tens of megabytes of dirty metadata in that LSN
range and have to flush it all. OTOH, if the buffers are fully
logged, then that same target might only flush 1.5MB of metadata
once all the log overhead is taken into account.

So there's a fairly large disconnect between the "flush N bytes of
metadata" API and the "push to a target LSN" that XFS uses for
flushing metadata in aged order. I'm betting that extN and other
filesystems might have similar mismatches with their journal
flushing...

> > And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
> > work->tagged_writepages is set, this will basically tell us to f

Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-11 Thread Dave Chinner
On Mon, Dec 11, 2017 at 02:12:28PM -0800, Joe Perches wrote:
> On Tue, 2017-12-12 at 08:43 +1100, Dave Chinner wrote:
> > On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote:
> > > On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote:
> > > > 1. Using lockdep_set_novalidate_class() for anything other
> > > > than device->mutex will throw checkpatch warnings. Nice. (*)
> > > []
> > > > (*) checkpatch.pl is considered mostly harmful round here, too,
> > > > but that's another rant
> > > 
> > > How so?
> > 
> > Short story is that it barfs all over the slightly non-standard
> > coding style used in XFS.
> []
> > This sort of stuff is just lowest-common-denominator noise - great
> > for new code and/or inexperienced developers, but not for working
> > with large bodies of existing code with slightly non-standard
> > conventions.
> 
> Completely reasonable.  Thanks.
> 
> Do you get many checkpatch submitters for fs/xfs?

We used to. Not recently, though.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-11 Thread Dave Chinner
On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> From: Josef Bacik <jba...@fb.com>
> 
> Now that we have metadata counters in the VM, we need to provide a way to kick
> writeback on dirty metadata.  Introduce super_operations->write_metadata.  
> This
> allows file systems to deal with writing back any dirty metadata we need based
> on the writeback needs of the system.  Since there is no inode to key off of 
> we
> need a list in the bdi for dirty super blocks to be added.  From there we can
> find any dirty sb's on the bdi we are currently doing writeback on and call 
> into
> their ->write_metadata callback.
> 
> Signed-off-by: Josef Bacik <jba...@fb.com>
> Reviewed-by: Jan Kara <j...@suse.cz>
> Reviewed-by: Tejun Heo <t...@kernel.org>
> ---
>  fs/fs-writeback.c| 72 
> 
>  fs/super.c   |  6 
>  include/linux/backing-dev-defs.h |  2 ++
>  include/linux/fs.h   |  4 +++
>  mm/backing-dev.c |  2 ++
>  5 files changed, 80 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 987448ed7698..fba703dff678 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback 
> *wb,
>   return pages;
>  }
>  
> +static long writeback_sb_metadata(struct super_block *sb,
> +   struct bdi_writeback *wb,
> +   struct wb_writeback_work *work)
> +{
> + struct writeback_control wbc = {
> + .sync_mode  = work->sync_mode,
> + .tagged_writepages  = work->tagged_writepages,
> + .for_kupdate= work->for_kupdate,
> + .for_background = work->for_background,
> + .for_sync   = work->for_sync,
> + .range_cyclic   = work->range_cyclic,
> + .range_start= 0,
> + .range_end  = LLONG_MAX,
> + };
> + long write_chunk;
> +
> + write_chunk = writeback_chunk_size(wb, work);
> + wbc.nr_to_write = write_chunk;
> + sb->s_op->write_metadata(sb, &wbc);
> + work->nr_pages -= write_chunk - wbc.nr_to_write;
> +
> + return write_chunk - wbc.nr_to_write;

Ok, writeback_chunk_size() returns a page count. We've already gone
through the "metadata is not page sized" dance on the dirty
accounting side, so how are we supposed to use pages to account for
metadata writeback?

And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
work->tagged_writepages is set, this will basically tell us to flush
the entire dirty metadata cache because write_chunk will get set to
LONG_MAX.

IOWs, this would appear to me to change sync() behaviour quite
dramatically on filesystems where ->write_metadata is implemented.
That is, instead of leaving all the metadata dirty in memory and
just forcing the journal to stable storage, filesystems will be told
to also write back all their dirty metadata before sync() returns,
even though it is not necessary to provide correct sync()
semantics

Mind you, writeback invocation is so convoluted now I could easily
be misinterpreting this code, but it does seem to me like this
code is going to have some unintended behaviours

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-11 Thread Dave Chinner
On Sun, Dec 10, 2017 at 08:23:15PM -0800, Matthew Wilcox wrote:
> On Mon, Dec 11, 2017 at 10:57:45AM +1100, Dave Chinner wrote:
> > i.e.  the fact the cmpxchg failed may not have anything to do with a
> > race condtion - it failed because the slot wasn't empty like we
> > expected it to be. There can be any number of reasons the slot isn't
> > empty - the API should not "document" that the reason the insert
> > failed was a race condition. It should document the case that we
> > "couldn't insert because there was an existing entry in the slot".
> > Let the surrounding code document the reason why that might have
> > happened - it's not for the API to assume reasons for failure.
> > 
> > i.e. this API and potential internal implementation makes much
> > more sense:
> > 
> > int
> > xa_store_iff_empty(...)
> > {
> > curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> > if (!curr)
> > return 0;   /* success! */
> > if (!IS_ERR(curr))
> > return -EEXIST; /* failed - slot not empty */
> > return PTR_ERR(curr);   /* failed - XA internal issue */
> > }
> > 
> > as it replaces the existing preload and insert code in the XFS code
> > paths whilst letting us handle and document the "insert failed
> > because slot not empty" case however we want. It implies nothing
> > about *why* the slot wasn't empty, just that we couldn't do the
> > insert because it wasn't empty.
> 
> Yeah, OK.  So, over the top of the recent changes I'm looking at this:
> 
> I'm not in love with xa_store_empty() as a name.  I almost want
> xa_store_weak(), but after my MAP_FIXED_WEAK proposed name got shot
> down, I'm leery of it.  "empty" is at least a concept we already have
> in the API with the comment for xa_init() talking about an empty array
> and xa_empty().  I also considered xa_store_null and xa_overwrite_null
> and xa_replace_null().  Naming remains hard.
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 941f38bb94a4..586b43836905 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -451,7 +451,7 @@ xfs_iget_cache_miss(
>   int flags,
>   int lock_flags)
>  {
> - struct xfs_inode	*ip, *curr;
> + struct xfs_inode	*ip;
>   int error;
>   xfs_agino_t agino = XFS_INO_TO_AGINO(mp, ino);
>   int iflags;
> @@ -498,8 +498,7 @@ xfs_iget_cache_miss(
>   xfs_iflags_set(ip, iflags);
>  
>   /* insert the new inode */
> - curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> - error = __xa_race(curr, -EAGAIN);
> + error = xa_store_empty(&pag->pag_ici_xa, agino, ip, GFP_NOFS, -EAGAIN);
>   if (error)
>   goto out_unlock;

Can we avoid encoding the error to be returned into the function
call? No other generic/library API does this, so this seems like a
highly unusual special snowflake approach. I just don't see how
creating a whole new error specification convention is justified
for the case where it *might* save a line or two of error checking
code in a caller?

I'd much prefer that the API defines the error on failure, and let
the caller handle that specific error however they need to like
every other library interface we have...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-11 Thread Dave Chinner
On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote:
> On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote:
> > 1. Using lockdep_set_novalidate_class() for anything other
> > than device->mutex will throw checkpatch warnings. Nice. (*)
> []
> > (*) checkpatch.pl is considered mostly harmful round here, too,
> > but that's another rant
> 
> How so?

Short story is that it barfs all over the slightly non-standard
coding style used in XFS.  It generates enough noise on incidental
things that aren't important that it complicates simple things. e.g. I
just moved a block of defines from one header to another, and
checkpatch throws about 10 warnings on that because of whitespace.
I'm just moving code - I don't want to change it and it doesn't need
to be modified because it is neat and easy to read and is obviously
correct. A bunch of prototypes I added another parameter to gets
warnings because it uses "unsigned" for an existing parameter that
I'm not changing. And so on.

This sort of stuff is just lowest-common-denominator noise - great
for new code and/or inexperienced developers, but not for working
with large bodies of existing code with slightly non-standard
conventions. Churning *lots* of code we otherwise wouldn't touch
just to shut up checkpatch warnings takes time, risks regressions
and makes it harder to trace the history of the code when we are
trying to track down bugs.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: defragmenting best practice?

2017-12-10 Thread Dave
On Tue, Oct 31, 2017 someone wrote:
>
>
> > 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted
> > nocow -- it will NOT be snapshotted

I did exactly this. It serves the purpose of avoiding snapshots.
However, today I saw the following at
https://wiki.archlinux.org/index.php/Btrfs

Note: From Btrfs Wiki Mount options: within a single file system, it
is not possible to mount some subvolumes with nodatacow and others
with datacow. The mount option of the first mounted subvolume applies
to any other subvolumes.

That makes me think my nodatacow mount option on $HOME/.cache is not
effective. True?

(My subjective performance results have not been as good as hoped for
with the tweaks I have tried so far.)


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-10 Thread Dave Chinner
On Fri, Dec 08, 2017 at 03:01:31PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 07, 2017 at 11:38:43AM +1100, Dave Chinner wrote:
> > > > cmpxchg is for replacing a known object in a store - it's not really
> > > > intended for doing initial inserts after a lookup tells us there is
> > > > nothing in the store.  The radix tree "insert only if empty" makes
> > > > sense here, because it naturally takes care of lookup/insert races
> > > > via the -EEXIST mechanism.
> > > > 
> > > > I think that providing xa_store_excl() (which would return -EEXIST
> > > > if the entry is not empty) would be a better interface here, because
> > > > it matches the semantics of lookup cache population used all over
> > > > the kernel
> > > 
> > > I'm not thrilled with xa_store_excl(), but I need to think about that
> > > a bit more.
> > 
> > Not fussed about the name - I just think we need a function that
> > matches the insert semantics of the code
> 
> I think I have something that works better for you than returning -EEXIST
> (because you don't actually want -EEXIST, you want -EAGAIN):
> 
> /* insert the new inode */
> -   spin_lock(&pag->pag_ici_lock);
> -   error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
> -   if (unlikely(error)) {
> -   WARN_ON(error != -EEXIST);
> -   XFS_STATS_INC(mp, xs_ig_dup);
> -   error = -EAGAIN;
> -   goto out_preload_end;
> -   }
> -   spin_unlock(&pag->pag_ici_lock);
> -   radix_tree_preload_end();
> +   curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> +   error = __xa_race(curr, -EAGAIN);
> +   if (error)
> +   goto out_unlock;
> 
> ...
> 
> -out_preload_end:
> -   spin_unlock(&pag->pag_ici_lock);
> -   radix_tree_preload_end();
> +out_unlock:
> +   if (error == -EAGAIN)
> +   XFS_STATS_INC(mp, xs_ig_dup);
> 
> I've changed the behaviour slightly in that returning an -ENOMEM used to
> hit a WARN_ON, and I don't think that's the right response -- GFP_NOFS
> returning -ENOMEM probably gets you a nice warning already from the
> mm code.

It's been a couple of days since I first looked at this, and my
initial reaction of "that's horrible!" hasn't changed.

What you are proposing here might be a perfectly reasonable
*internal implemention* of xa_store_excl(), but it makes for a
terrible external API because the sematics and behaviour are so
vague. e.g. what does "race" mean here with respect to an insert
failure?

i.e.  the fact the cmpxchg failed may not have anything to do with a
race condtion - it failed because the slot wasn't empty like we
expected it to be. There can be any number of reasons the slot isn't
empty - the API should not "document" that the reason the insert
failed was a race condition. It should document the case that we
"couldn't insert because there was an existing entry in the slot".
Let the surrounding code document the reason why that might have
happened - it's not for the API to assume reasons for failure.

i.e. this API and potential internal implementation makes much
more sense:

int
xa_store_iff_empty(...)
{
curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
if (!curr)
return 0;   /* success! */
if (!IS_ERR(curr))
return -EEXIST; /* failed - slot not empty */
return PTR_ERR(curr);   /* failed - XA internal issue */
}

as it replaces the existing preload and insert code in the XFS code
paths whilst letting us handle and document the "insert failed
because slot not empty" case however we want. It implies nothing
about *why* the slot wasn't empty, just that we couldn't do the
insert because it wasn't empty.
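
The caller side would then read something like this (sketch only;
xa_store_iff_empty() is the interface proposed above, not an existing
API):

	/* insert the new inode */
	error = xa_store_iff_empty(&pag->pag_ici_xa, agino, ip, GFP_NOFS);
	if (error == -EEXIST) {
		/* lost the race with another lookup/insert: back off, retry */
		XFS_STATS_INC(mp, xs_ig_dup);
		error = -EAGAIN;
	}
	if (error)
		goto out_unlock;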

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Lockdep is less useful than it was

2017-12-08 Thread Dave Chinner
On Fri, Dec 08, 2017 at 10:14:38AM -0800, Matthew Wilcox wrote:
> At the moment, the radix tree actively disables the RCU checking that
> enabling lockdep would give us.  It has to, because it has no idea what
> lock protects any individual access to the radix tree.  The XArray can
> use the RCU checking because it knows that every reference is protected
> by either the spinlock or the RCU lock.
> 
> Dave was saying that he has a tree which has to be protected by a mutex
> because of where it is in the locking hierarchy, and I was vigorously
> declining his proposal of allowing him to skip taking the spinlock.

Oh, I wasn't suggesting that you remove the internal tree locking
because we need external locking.

I was trying to point out that the internal locking doesn't remove
the need for external locking,  and that there are cases where
smearing the internal lock outside the XA tree doesn't work, either.
i.e. internal locking doesn't replace all the cases where external
locking is required, and so it's less efficient than the existing
radix tree code.

What I was questioning was the value of replacing the radix tree
code with a less efficient structure just to add lockdep validation
to a tree that doesn't actually need any extra locking validation...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-08 Thread Dave Chinner
On Fri, Dec 08, 2017 at 12:35:07PM -0500, Alan Stern wrote:
> On Fri, 8 Dec 2017, Byungchul Park wrote:
> 
> > I'm sorry to hear that.. If I were you, I would also get
> > annoyed. And.. thanks for explanation.
> > 
> > But, I think assigning lock classes properly and checking
> > relationship of the classes to detect deadlocks is reasonable.
> > 
> > In my opinion about the common lockdep stuff, there are 2
> > problems on it.
> > 
> > 1) Firstly, it's hard to assign lock classes *properly*. By
> > default, it relies on the caller site of lockdep_init_map(),
> > but we need to assign another class manually, where ordering
> > rules are complicated so cannot rely on the caller site. That
> > *only* can be done by experts of the subsystem.

Sure, but that's not the issue here. The issue here is the lack of
communication with subsystem experts and that the annotation
complexity warnings given immediately by the subsystem experts were
completely ignored...

> > I think if they want to get benifit from lockdep, they have no
> > choice but to assign classes manually with the domain knowledge,
> > or use *lockdep_set_novalidate_class()* to invalidate locks
> > making the developers annoyed and not want to use the checking
> > for them.
> 
> Lockdep's no_validate class is used when the locking patterns are too
> complicated for lockdep to understand.  Basically, it tells lockdep to
> ignore those locks.

Let me just point out two things here:

1. Using lockdep_set_novalidate_class() for anything other
than device->mutex will throw checkpatch warnings. Nice. (*)

2. lockdep_set_novalidate_class() is completely undocumented
- it's the first I've ever heard of this functionality. i.e.
nobody has ever told us there is a mechanism to turn off
validation of an object; we've *always* been told to "change
your code and/or fix your annotations" when discussing
lockdep deficiencies. (**)

> The device core uses that class.  The tree of struct devices, each with
> its own lock, gets used in many different and complicated ways.  
> Lockdep can't understand this -- it doesn't have the ability to
> represent an arbitrarily deep hierarchical tree of locks -- so we tell
^^

That largely describes the in-memory structure of XFS, except we
have a forest of lock trees, not just one

> it to ignore the device locks.
> 
> It sounds like XFS may need to do the same thing with its semaphores.

Whoever adds semaphore checking to lockdep can add those
annotations. The externalisation of the development cost of new
lockdep functionality is one of the problems here.
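
For the record, the device-core pattern being referred to above is
roughly this (a sketch from memory of device_initialize(), so treat the
details as illustrative):

static void example_device_init(struct device *dev)
{
	mutex_init(&dev->mutex);
	/*
	 * The device tree's lock ordering is too complex for lockdep to
	 * model, so validation of this mutex is switched off entirely
	 * rather than annotated.
	 */
	lockdep_set_novalidate_class(&dev->mutex);
}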

-Dave.

(*) checkpatch.pl is considered mostly harmful round here, too,
but that's another rant

(**) the frequent occurrence of "core code/devs aren't held to the
same rules/standard as everyone else" is another rant I have stored
up for a rainy day.

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-07 Thread Dave Chinner
On Fri, Dec 08, 2017 at 01:45:52PM +0900, Byungchul Park wrote:
> On Fri, Dec 08, 2017 at 09:22:16AM +1100, Dave Chinner wrote:
> > On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote:
> > > On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> > > > > Unfortunately for you, I don't find arguments along the lines of
> > > > > "lockdep will save us" at all convincing.  lockdep already throws
> > > > > too many false positives to be useful as a tool that reliably and
> > > > > accurately points out rare, exciting, complex, intricate locking
> > > > > problems.
> > > > 
> > > > But it does reliably and accurately point out "dude, you forgot to take
> > > > the lock".  It's caught a number of real problems in my own testing that
> > > > you never got to see.
> > > 
> > > The problem is that if it has too many false positives --- and it's
> > > gotten *way* worse with the completion callback "feature", people will
> > > just stop using Lockdep as being too annoying and a waste of developer
> > > time when trying to figure what is a legitimate locking bug versus
> > > lockdep getting confused.
> > > 
> > > I can't even disable the new Lockdep feature which is throwing
> > > lots of new false positives --- it's just all or nothing.
> > > 
> > > Dave has just said he's already stopped using Lockdep, as a result.
> > 
> > This is completely OT, but FYI I stopped using lockdep a long time
> > ago.  We've spent orders of magnitude more time and effort to shut
> > up lockdep false positives in the XFS code than we ever have on
> > locking problems that lockdep has uncovered. And still lockdep
> > throws too many false positives on XFS workloads to be useful to me.
> > 
> > But it's more than that: I understand just how much lockdep *doesn't
> > check* and that means *I know I can't rely on lockdep* for potential
> > deadlock detection. e.g.  it doesn't cover semaphores, which means
> 
> Hello,
> 
> I'm careful in saying the following since you don't seem to feel good
> about crossrelease or even lockdep. Now that crossrelease has been
> introduced, semaphores can be covered, as you might know. Actually, all
> general waiters can.

And all it will do is create a whole bunch more work for us XFS guys
to shut up all the the false positive crap that falls out from it
because the locking model we have is far more complex than any of
the lockdep developers thought was necessary to support, just like
happened with the XFS inode annotations all those years ago.

e.g. nobody has ever bothered to ask us what is needed to describe
XFS's semaphore locking model.  If you did that, you'd know that we
nest *thousands* of locked semaphores in completely random lock
order during metadata buffer writeback. And that this lock order
does not reflect the actual locking order rules we have for locking
buffers during transactions.

Oh, and you'd also know that a semaphore's lock order and context
can change multiple times during the lifetime of the buffer.  Say
we free a block and then reallocate it as something else before it is
reclaimed - that buffer now might have a different lock order. Or
maybe we promote a buffer to be a root btree block as a result of a
join - it's now the first buffer in a lock run, rather than a child.
Or we split a tree, and the root is now a node and so no longer is
the first buffer in a lock run. Or that we walk sideways along the
leaf nodes siblings during searches.  IOWs, there is no well defined
static lock ordering at all for buffers - and therefore semaphores -
in XFS at all.

And knowing that, you wouldn't simply mention that lockdep can
support semaphores now as though that is necessary to "make it work"
for XFS.  It's going to be much simpler for us to just turn off
lockdep and ignore whatever crap it sends our way than it is to
spend unplanned weeks of our time to try to make lockdep sorta work
again. Sure, we might get there in the end, but it's likely to take
months, if not years like it did with the XFS inode annotations.

> > it has zero coverage of the entire XFS metadata buffer subsystem and
> > the complex locking orders we have for metadata updates.
> > 
> > Put simply: lockdep doesn't provide me with any benefit, so I don't
> > use it...
> 
> Sad..

I don't think you understand. I'll try to explain.

The lockdep infrastructure by itself doesn't make lockdep a useful
tool - it mostly generates false positives because it has no
concept of locking models that don't match its internal tracking
assumptions and/or limitations.

That means if we can't suppress the false positives, then l

Re: Lockdep is less useful than it was

2017-12-07 Thread Dave Chinner
On Thu, Dec 07, 2017 at 02:38:03PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote:
> > The problem is that if it has too many false positives --- and it's
> > gotten *way* worse with the completion callback "feature", people will
> > just stop using Lockdep as being too annoying and a waste of developer
> > time when trying to figure what is a legitimate locking bug versus
> > lockdep getting confused.
> > 
> > I can't even disable the new Lockdep feature which is throwing
> > lots of new false positives --- it's just all or nothing.
> 
> You *can* ... but it's way more hacking Kconfig than you ought to have
> to do (which is a separate rant ...)
> 
> You need to get LOCKDEP_CROSSRELEASE off.  I'd revert patches
> e26f34a407aec9c65bce2bc0c838fabe4f051fc6 and
> b483cf3bc249d7af706390efa63d6671e80d1c09
> 
> I think it was a mistake to force these on for everybody; they have a
> much higher false-positive rate than the rest of lockdep, so as you say
> forcing them on leads to fewer people using *any* of lockdep.
> 
> The bug you're hitting isn't Byungchul's fault; it's an annotation
> problem.  The same kind of annotation problem that we used to have with
> dozens of other places in the kernel which are now fixed.

That's one of the fundamental problem with lockdep - it throws the
difficulty of solving all these new false positives onto the
developers who know nothing about lockdep and don't follow its
development. And until they do solve them - especially in critical
subsystems that everyone uses like the storage stack - lockdep is
essentially worthless.

> If you didn't
> have to hack Kconfig to get rid of this problem, you'd be happier, right?

I'd be much happier if it wasn't turned on by default in the first
place.  We gave plenty of warnings that there were still unsolved
false positive problems with the new checks in the storage stack.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-07 Thread Dave Chinner
On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote:
> On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> > > Unfortunately for you, I don't find arguments along the lines of
> > > "lockdep will save us" at all convincing.  lockdep already throws
> > > too many false positives to be useful as a tool that reliably and
> > > accurately points out rare, exciting, complex, intricate locking
> > > problems.
> > 
> > But it does reliably and accurately point out "dude, you forgot to take
> > the lock".  It's caught a number of real problems in my own testing that
> > you never got to see.
> 
> The problem is that if it has too many false positives --- and it's
> gotten *way* worse with the completion callback "feature", people will
> just stop using Lockdep as being too annoying and a waste of developer
> time when trying to figure what is a legitimate locking bug versus
> lockdep getting confused.
> 
> I can't even disable the new Lockdep feature which is throwing
> lots of new false positives --- it's just all or nothing.
> 
> Dave has just said he's already stopped using Lockdep, as a result.

This is completely OT, but FYI I stopped using lockdep a long time
ago.  We've spent orders of magnitude more time and effort to shut
up lockdep false positives in the XFS code than we ever have on
locking problems that lockdep has uncovered. And still lockdep
throws too many false positives on XFS workloads to be useful to me.

But it's more than that: I understand just how much lockdep *doesn't
check* and that means *I know I can't rely on lockdep* for potential
deadlock detection. e.g.  it doesn't cover semaphores, which means
it has zero coverage of the entire XFS metadata buffer subsystem and
the complex locking orders we have for metadata updates.

Put simply: lockdep doesn't provide me with any benefit, so I don't
use it...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-06 Thread Dave Chinner
On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 07:44:04PM +1100, Dave Chinner wrote:
> > On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote:
> > > That said, using xa_cmpxchg() in the dquot code looked like the right
> > > thing to do?  Since we'd dropped the qi mutex and the ILOCK, it looks
> > > entirely reasonable for another thread to come in and set up the dquot.
> > > But I'm obviously quite ignorant of the XFS internals, so maybe there's
> > > something else going on that makes this essentially a "can't happen".
> > 
> > It's no different to the inode cache code, which drops the RCU
> > lock on lookup miss, instantiates the new inode (maybe reading it
> > off disk), then locks the tree and attempts to insert it. Both cases
> > use "insert if empty, otherwise retry lookup from start" semantics.
> 
> Ah.  I had my focus set a little narrow on the inode cache code and didn't
> recognise the pattern.
> 
> Why do you sleep for one jiffy after encountering a miss, then seeing
> someone else insert the inode for you?

The sleep is a backoff that allows whatever we raced with to
complete, be it a hit that raced with an inode being reclaimed and
removed, or a miss that raced with another insert. Ideally we'd
sleep on the XFS_INEW bit, similar to the vfs I_NEW flag, but it's
not quite that simple with the reclaim side of things...
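
The shape of that retry is roughly the following (simplified from
xfs_iget(); a sketch, not the verbatim code):

again:
	rcu_read_lock();
	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
	if (ip) {
		/* the cache-hit helper drops the RCU read lock itself */
		error = xfs_iget_cache_hit(pag, ip, ino, flags, lock_flags);
	} else {
		rcu_read_unlock();
		error = xfs_iget_cache_miss(mp, pag, tp, ino, &ip,
					    flags, lock_flags);
	}
	if (error == -EAGAIN) {
		/*
		 * Raced with a concurrent insert, or with reclaim tearing an
		 * inode down: back off for a tick and retry from the top.
		 */
		delay(1);
		goto again;
	}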

> > cmpxchg is for replacing a known object in a store - it's not really
> > intended for doing initial inserts after a lookup tells us there is
> > nothing in the store.  The radix tree "insert only if empty" makes
> > sense here, because it naturally takes care of lookup/insert races
> > via the -EEXIST mechanism.
> > 
> > I think that providing xa_store_excl() (which would return -EEXIST
> > if the entry is not empty) would be a better interface here, because
> > it matches the semantics of lookup cache population used all over
> > the kernel
> 
> I'm not thrilled with xa_store_excl(), but I need to think about that
> a bit more.

Not fussed about the name - I just think we need a function that
matches the insert semantics of the code

> > > I'm quite happy to have normal API variants that don't save/restore
> > > interrupts.  Just need to come up with good names ... I don't think
> > > xa_store_noirq() is a good name, but maybe you do?
> > 
> > I'd prefer not to have to deal with such things at all. :P
> > 
> > How many subsystems actually require irq safety in the XA locking
> > code? Make them use irqsafe versions, not make everyone else use
> > "noirq" versions, as is the convention for the rest of the kernel
> > code
> 
> Hard to say how many existing radix tree users require the irq safety.

The mapping tree requires it because it gets called from IO
completion contexts to clear page writeback state, but I don't know
about any of the others.

> Also hard to say how many potential users (people currently using
> linked lists, people using resizable arrays, etc) need irq safety.
> My thinking was "make it safe by default and let people who know better
> have a way to opt out", but there's definitely something to be said for
> "make it fast by default and let people who need the unusual behaviour
> type those extra few letters".
> 
> So, you're arguing for providing xa_store(), xa_store_irq(), xa_store_bh()
> and xa_store_irqsafe()?  (at least on demand, as users come to light?)
> At least the read side doesn't require any variants; everybody can use
> RCU for read side protection.

That would follow the pattern of the rest of the kernel APIs, though
I think it might be cleaner to simply state the locking requirement
to xa_init() and keep all those details completely internal rather
than encoding them into API calls. After all, the "irqsafe-ness" of
the locking needs to be consistent across the entire XA instance

> ("safe", not "save" because I wouldn't make the caller provide the
> "flags" argument).
> 
> > > At least, not today.  One of the future plans is to allow xa_nodes to
> > > be allocated from ZONE_MOVABLE.  In order to do that, we have to be
> > > able to tell which lock protects any given node.  With the XArray,
> > > we can find that out (xa_node->root->xa_lock); with the radix tree,
> > > we don't even know what kind of lock protects the tree.
> > 
> > Yup, this is a prime example of why we shouldn't be creating
> > external dependencies by smearing the locking context outside the XA
> > structure itself. It's not a stretch to se

Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-06 Thread Dave Chinner
On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 02:14:56PM +1100, Dave Chinner wrote:
> > > The other conversions use the normal API instead of the advanced API, so
> > > all of this gets hidden away.  For example, the inode cache does this:
> > 
> > Ah, OK, that's not obvious from the code changes. :/
> 
> Yeah, it's a lot easier to understand (I think!) if you build the
> docs in that tree and look at
> file:///home/willy/kernel/xarray-3/Documentation/output/core-api/xarray.html
> (mutatis mutandi).  I've tried to tell a nice story about how to put
> all the pieces together from the normal to the advanced API.
> 
> > However, it's probably overkill for XFS. In all the cases, when we
> > insert there should be no entry in the tree - the
> > radix tree insert error handling code there was simply catching
> > "should never happen" cases and handling it without crashing.
> 
> I thought it was probably overkill to be using xa_cmpxchg() in the
> pag_ici patch.  I didn't want to take away your error handling as part
> of the conversion, but I think a rational person implementing it today
> would just call xa_store() and not even worry about the return value
> except to check it for IS_ERR().

*nod*

> That said, using xa_cmpxchg() in the dquot code looked like the right
> thing to do?  Since we'd dropped the qi mutex and the ILOCK, it looks
> entirely reasonable for another thread to come in and set up the dquot.
> But I'm obviously quite ignorant of the XFS internals, so maybe there's
> something else going on that makes this essentially a "can't happen".

It's no different to the inode cache code, which drops the RCU
lock on lookup miss, instantiates the new inode (maybe reading it
off disk), then locks the tree and attempts to insert it. Both cases
use "insert if empty, otherwise retry lookup from start" semantics.

cmpxchg is for replacing a known object in a store - it's not really
intended for doing initial inserts after a lookup tells us there is
nothing in the store.  The radix tree "insert only if empty" makes
sense here, because it naturally takes care of lookup/insert races
via the -EEXIST mechanism.

I think that providing xa_store_excl() (which would return -EEXIST
if the entry is not empty) would be a better interface here, because
it matches the semantics of lookup cache population used all over
the kernel

> > Now that I've looked at this, I have to say that having a return
> > value of NULL meaning "success" is quite counter-intuitive. That's
> > going to fire my "that looks so wrong" detector every time I look at
> > the code and notice it's erroring out on a non-null return value
> > that isn't a PTR_ERR case
> 
> It's the same convention as cmpxchg().  I think it's triggering your
> "looks so wrong" detector because it's fundamentally not the natural
> thing to write.

Most definitely the case, and this is why it's a really bad
interface for the semantics we have. This is how we end up with code
that makes it easy for programmers to screw up pointer checks in
error handling... :/
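
e.g. the sort of check this convention invites (illustrative only):

	curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
	if (curr) {
		/*
		 * Easy to misread: a non-NULL return is not an error as such.
		 * It may be an errno hidden behind an error pointer, or a
		 * perfectly valid pre-existing entry.  NULL, counter-
		 * intuitively, is the success case for this insert.
		 */
		if (IS_ERR(curr))
			return PTR_ERR(curr);
		return -EEXIST;
	}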

> I'm quite happy to have normal API variants that don't save/restore
> interrupts.  Just need to come up with good names ... I don't think
> xa_store_noirq() is a good name, but maybe you do?

I'd prefer not to have to deal with such things at all. :P

How many subsystems actually require irq safety in the XA locking
code? Make them use irqsafe versions, not make everyone else use
"noirq" versions, as is the convention for the rest of the kernel
code

> > > It's the design pattern I've always intended to use.  Naturally, the
> > > xfs radix trees weren't my initial target; it was the page cache, and
> > > the page cache does the same thing; uses the tree_lock to protect both
> > > the radix tree and several other fields in that same data structure.
> > > 
> > > I'm open to argument on this though ... particularly if you have a better
> > > design pattern in mind!
> > 
> > I don't mind structures having internal locking - I have a problem
> > with leaking them into contexts outside the structure they protect.
> > That way lies madness - you can't change the internal locking in
> > future because of external dependencies, and the moment you need
> > something different externally we've got to go back to an external
> > lock anyway.
> > 
> > This is demonstrated by the way you converted the XFS dquot tree -
> > you didn't replace the dquot tree lock with the internal xa_lock
> > because it's a mutex and we have to sleep holding it. IOWs, we've
> > added another layer of locking here, not simplif

Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-05 Thread Dave Chinner
On Tue, Dec 05, 2017 at 06:02:08PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 12:36:48PM +1100, Dave Chinner wrote:
> > > - if (radix_tree_preload(GFP_NOFS))
> > > - return -ENOMEM;
> > > -
> > >   INIT_LIST_HEAD(&elem->list_node);
> > >   elem->key = key;
> > >  
> > > - spin_lock(&mru->lock);
> > > - error = radix_tree_insert(&mru->store, key, elem);
> > > - radix_tree_preload_end();
> > > - if (!error)
> > > - _xfs_mru_cache_list_insert(mru, elem);
> > > - spin_unlock(&mru->lock);
> > > + do {
> > > + xas_lock(&xas);
> > > + xas_store(&xas, elem);
> > > + error = xas_error(&xas);
> > > + if (!error)
> > > + _xfs_mru_cache_list_insert(mru, elem);
> > > + xas_unlock(&xas);
> > > + } while (xas_nomem(&xas, GFP_NOFS));
> > 
> > Ok, so why does this have a retry loop on ENOMEM despite the
> > existing code handling that error? And why put such a loop in this
> > code and not any of the other XFS code that used
> > radix_tree_preload() and is arguably much more important to avoid
> > ENOMEM on insert (e.g. the inode cache)?
> 
> If we need more nodes in the tree, xas_store() will try to allocate them
> with GFP_NOWAIT | __GFP_NOWARN.  If that fails, it signals it in xas_error().
> xas_nomem() will notice that we're in an ENOMEM situation, and allocate
> a node using your preferred GFP flags (NOIO in your case).  Then we retry,
> guaranteeing forward progress. [1]
> 
> The other conversions use the normal API instead of the advanced API, so
> all of this gets hidden away.  For example, the inode cache does this:
> 
> +   curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> 
> and xa_cmpxchg internally does:
> 
> do {
> xa_lock_irqsave(xa, flags);
> curr = xas_create(&xas);
> if (curr == old)
> xas_store(&xas, entry);
> xa_unlock_irqrestore(xa, flags);
> } while (xas_nomem(&xas, gfp));

Ah, OK, that's not obvious from the code changes. :/

However, it's probably overkill for XFS. In all the cases, when we
insert there should be no entry in the tree - the
radix tree insert error handling code there was simply catching
"should never happen" cases and handling it without crashing.

Now that I've looked at this, I have to say that having a return
value of NULL meaning "success" is quite counter-intuitive. That's
going to fire my "that looks so wrong" detector every time I look at
the code and notice it's erroring out on a non-null return value
that isn't a PTR_ERR case

Also, there's no need for irqsave/restore() locking contexts here as
we never access these caches from interrupt contexts. That's just
going to be extra overhead, especially on workloads that run 10^6
inodes through the cache every second. That's a problem
caused by driving the locks into the XA structure and then needing
to support callers that require irq safety

> > Also, I really don't like the pattern of using xa_lock()/xa_unlock()
> > to protect access to an external structure. i.e. the mru->lock
> > context is protecting multiple fields and operations in the MRU
> > structure, not just the radix tree operations. Turning that around
> > so that a larger XFS structure and algorithm is now protected by an
> > opaque internal lock from generic storage structure the forms part
> > of the larger structure seems like a bad design pattern to me...
> 
> It's the design pattern I've always intended to use.  Naturally, the
> xfs radix trees weren't my initial target; it was the page cache, and
> the page cache does the same thing; uses the tree_lock to protect both
> the radix tree and several other fields in that same data structure.
> 
> I'm open to argument on this though ... particularly if you have a better
> design pattern in mind!

I don't mind structures having internal locking - I have a problem
with leaking them into contexts outside the structure they protect.
That way lies madness - you can't change the internal locking in
future because of external dependencies, and the moment you need
something different externally we've got to go back to an external
lock anyway.

This is demonstrated by the way you converted the XFS dquot tree -
you didn't replace the dquot tree lock with the internal xa_lock
because it's a mutex and we have to sleep holding it. IOWs, we've
added another layer of locking here, not simplified the code.
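
i.e. the result looks something like this (an illustrative sketch with
placeholder names, not the actual dquot conversion):

struct example_qinf {
	struct mutex	qi_tree_lock;	/* external lock - we sleep under it */
	struct xarray	qi_dquot_xa;	/* carries its own internal xa_lock */
};

static int example_dquot_insert(struct example_qinf *qi, unsigned long id,
				void *dqp)
{
	void *curr;

	mutex_lock(&qi->qi_tree_lock);
	/* xa_store() takes the internal spinlock as well: two layers now */
	curr = xa_store(&qi->qi_dquot_xa, id, dqp, GFP_NOFS);
	mutex_unlock(&qi->qi_tree_lock);

	return IS_ERR(curr) ? PTR_ERR(curr) : 0;
}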

What I really see here is that  we have inconsistent locking
patterns w.r.t. XA stores inside XFS - some have an external mutex
to cover a wider scope, some use xa_

Re: [PATCH v4 00/73] XArray version 4

2017-12-05 Thread Dave Chinner
On Tue, Dec 05, 2017 at 06:05:15PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote:
> > On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> > > From: Matthew Wilcox <mawil...@microsoft.com>
> > > 
> > > I looked through some notes and decided this was version 4 of the XArray.
> > > Last posted two weeks ago, this version includes a *lot* of changes.
> > > I'd like to thank Dave Chinner for his feedback, encouragement and
> > > distracting ideas for improvement, which I'll get to once this is merged.
> > 
> > BTW, you need to fix the "To:" line on your patchbombs:
> > 
> > > To: unlisted-recipients: ;, no To-header on input 
> > > <@gmail-pop.l.google.com> 
> > 
> > This bad email address getting quoted to the cc line makes some MTAs
> > very unhappy.
> 
> I know :-(  I was unhappy when I realised what I'd done.
> 
> https://marc.info/?l=git=151252237912266=2
> 
> > I'll give this a quick burn this afternoon and see what catches fire...
> 
> All of the things ... 0day gave me a 90% chance of hanging in one
> configuration.  Need to drill down on it more and find out what stupid
> thing I've done wrong this time.

Yup, Bad Stuff happened on boot:

[   24.548039] INFO: rcu_preempt detected stalls on CPUs/tasks:
[   24.548978]  1-...!: (0 ticks this GP) idle=688/0/0 softirq=143/143 fqs=0
[   24.549926]  5-...!: (0 ticks this GP) idle=db8/0/0 softirq=120/120 fqs=0
[   24.550864]  6-...!: (0 ticks this GP) idle=d58/0/0 softirq=111/111 fqs=0
[   24.551802]  8-...!: (5 GPs behind) idle=514/0/0 softirq=189/189 fqs=0
[   24.552722]  10-...!: (84 GPs behind) idle=ac0/0/0 softirq=80/80 fqs=0
[   24.553617]  11-...!: (8 GPs behind) idle=cfc/0/0 softirq=95/95 fqs=0
[   24.554496]  13-...!: (8 GPs behind) idle=b0c/0/0 softirq=82/82 fqs=0
[   24.555382]  14-...!: (38 GPs behind) idle=a7c/0/0 softirq=93/93 fqs=0
[   24.556305]  15-...!: (4 GPs behind) idle=b18/0/0 softirq=88/88 fqs=0
[   24.557190]  (detected by 9, t=5252 jiffies, g=-178, c=-179, q=994)
[   24.558051] Sending NMI from CPU 9 to CPUs 1:
[   24.558703] NMI backtrace for cpu 1 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.559654] Sending NMI from CPU 9 to CPUs 5:
[   24.559675] NMI backtrace for cpu 5 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.560654] Sending NMI from CPU 9 to CPUs 6:
[   24.560689] NMI backtrace for cpu 6 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.561655] Sending NMI from CPU 9 to CPUs 8:
[   24.561701] NMI backtrace for cpu 8 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.562654] Sending NMI from CPU 9 to CPUs 10:
[   24.562675] NMI backtrace for cpu 10 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.563653] Sending NMI from CPU 9 to CPUs 11:
[   24.563669] NMI backtrace for cpu 11 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.564653] Sending NMI from CPU 9 to CPUs 13:
[   24.564670] NMI backtrace for cpu 13 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.565652] Sending NMI from CPU 9 to CPUs 14:
[   24.565674] NMI backtrace for cpu 14 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.566652] Sending NMI from CPU 9 to CPUs 15:
[   24.59] NMI backtrace for cpu 15 skipped: idling at 
native_safe_halt+0x2/0x10
[   24.567653] rcu_preempt kthread starved for 5256 jiffies! 
g18446744073709551438 c18446744073709551437 f0x0 RCU_GP_WAIT_FQS(3) 
->state=0x402 ->7
[   24.567654] rcu_preempt I15128 9  2 0x8000
[   24.567660] Call Trace:
[   24.567679]  ? __schedule+0x289/0x880
[   24.567681]  schedule+0x2f/0x90
[   24.567682]  schedule_timeout+0x152/0x370
[   24.567686]  ? __next_timer_interrupt+0xc0/0xc0
[   24.567689]  rcu_gp_kthread+0x561/0x880
[   24.567691]  ? force_qs_rnp+0x1a0/0x1a0
[   24.567693]  kthread+0x111/0x130
[   24.567695]  ? __kthread_create_worker+0x120/0x120
[   24.567697]  ret_from_fork+0x24/0x30
[   44.064092] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kswapd0:854]
[   44.065920] CPU: 0 PID: 854 Comm: kswapd0 Not tainted 4.15.0-rc2-dgc #228
[   44.067769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[   44.070030] RIP: 0010:smp_call_function_single+0xce/0x100
[   44.071521] RSP: :c90001d2fb20 EFLAGS: 0202 ORIG_RAX: 
ff11
[   44.073592] RAX:  RBX: 88013ab515c8 RCX: c9000350bb20
[   44.075560] RDX: 0001 RSI: c90001d2fb20 RDI: c90001d2fb20
[   44.077531] RBP: c90001d2fb50 R08: 0007 R09: 0080
[   44.079483] R10: c90001d2fb78 R11: c90001d2fb30 R12: c90001d2fc10
[   44.081465] R13: ea000449fc78 R14: ea000449fc58 R15: 88013ba36c40
[   44.083434] FS:  () GS:88013fc0() 
knlGS:
[   44.085683] CS:  0010 DS:  ES:  CR0

Re: [PATCH v4 00/73] XArray version 4

2017-12-05 Thread Dave Chinner
On Wed, Dec 06, 2017 at 01:53:41AM +, Matthew Wilcox wrote:
> Huh, you've caught a couple of problems that 0day hasn't sent me yet.  Try 
> turning on DAX or TRANSPARENT_HUGEPAGE.  Thanks!

Dax is turned on, CONFIG_TRANSPARENT_HUGEPAGE is not.

Looks like nothing is setting CONFIG_RADIX_TREE_MULTIORDER, which is
what xas_set_order() is hidden under.

Ah, CONFIG_ZONE_DEVICE turns it on, not CONFIG_DAX/CONFIG_FS_DAX.

H.  That seems wrong if it's used in fs/dax.c...

$ grep DAX .config
CONFIG_DAX=y
CONFIG_FS_DAX=y
$ grep ZONE_DEVICE .config
CONFIG_ARCH_HAS_ZONE_DEVICE=y
$

So I have DAX enabled, but not ZONE_DEVICE? Shouldn't DAX be
selecting ZONE_DEVICE, not relying on a user to select both of them
so that stuff works properly? Hmmm - there's no menu option to turn
on zone device, so it's selected by something else?  Oh, HMM turns
on ZONE device. But that is "default y", so should be turned on. But
it's not?  And there's no obvious HMM menu config option, either

What a godawful mess Kconfig has turned into.

I'm just going to enable TRANSPARENT_HUGEPAGE - madness awaits me if
I follow the other path down the rat hole

Ok, it build this time.

-Dave.

> 
> > -Original Message-
> > From: Dave Chinner [mailto:da...@fromorbit.com]
> > Sent: Tuesday, December 5, 2017 8:51 PM
> > To: Matthew Wilcox <wi...@infradead.org>
> > Cc: Matthew Wilcox <mawil...@microsoft.com>; Ross Zwisler
> > <ross.zwis...@linux.intel.com>; Jens Axboe <ax...@kernel.dk>; Rehas
> > Sachdeva <aquan...@gmail.com>; linux...@kvack.org; linux-
> > fsde...@vger.kernel.org; linux-f2fs-de...@lists.sourceforge.net; linux-
> > ni...@vger.kernel.org; linux-btrfs@vger.kernel.org; 
> > linux-...@vger.kernel.org;
> > linux-...@vger.kernel.org; linux-ker...@vger.kernel.org
> > Subject: Re: [PATCH v4 00/73] XArray version 4
> > 
> > On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote:
> > > On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> > > > From: Matthew Wilcox <mawil...@microsoft.com>
> > > >
> > > > I looked through some notes and decided this was version 4 of the 
> > > > XArray.
> > > > Last posted two weeks ago, this version includes a *lot* of changes.
> > > > I'd like to thank Dave Chinner for his feedback, encouragement and
> > > > distracting ideas for improvement, which I'll get to once this is 
> > > > merged.
> > >
> > > BTW, you need to fix the "To:" line on your patchbombs:
> > >
> > > > To: unlisted-recipients: ;, no To-header on input <@gmail-
> > pop.l.google.com>
> > >
> > > This bad email address getting quoted to the cc line makes some MTAs
> > > very unhappy.
> > >
> > > >
> > > > Highlights:
> > > >  - Over 2000 words of documentation in patch 8!  And lots more 
> > > > kernel-doc.
> > > >  - The page cache is now fully converted to the XArray.
> > > >  - Many more tests in the test-suite.
> > > >
> > > > This patch set is not for applying.  0day is still reporting problems,
> > > > and I'd feel bad for eating someone's data.  These patches apply on top
> > > > of a set of prepatory patches which just aren't interesting.  If you
> > > > want to see the patches applied to a tree, I suggest pulling my git 
> > > > tree:
> > > >
> > > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
> > > > I also left out the idr_preload removals.  They're still in the git 
> > > > tree,
> > > > but I'm not looking for feedback on them.
> > >
> > > I'll give this a quick burn this afternoon and see what catches fire...
> > 
> > Build warnings/errors:
> > 
> > .
> > lib/radix-tree.c:700:13: warning: 'radix_tree_free_nodes' defined but not used
> > [-Wunused-function]
> >  static void radix_tree_free_nodes(struct radix_tree_node *node)
> > .
> > lib/xarray.c: In function 'xas_max':
> > lib/xarray.c:291:16: warning: unused variable 'mask'
> > [-Wunused-variable]
> >   unsigned long mask, max = xas->xa_index;
> >   ^~~~
> > ..
> > fs/dax.c: In function 'grab_mapping_entry':
> > fs/dax.c:305:2: error: implicit declaration of function 'xas_set_order';
> > did you mean 'xas_set_err'?  [-Werror=implicit-function-declaration]
> >   xas_set_order(&xas, index, size_flag ? PMD_ORDER : 0);
> > ^
> > scripts/Makefile.build:310: recipe for target 'fs/dax.o' failed
> > make[1]: *** [fs/dax.o] Error 1
> > 
> > -Dave.
> > --
> > Dave Chinner
> > da...@fromorbit.com

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 00/73] XArray version 4

2017-12-05 Thread Dave Chinner
On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote:
> On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> > From: Matthew Wilcox <mawil...@microsoft.com>
> > 
> > I looked through some notes and decided this was version 4 of the XArray.
> > Last posted two weeks ago, this version includes a *lot* of changes.
> > I'd like to thank Dave Chinner for his feedback, encouragement and
> > distracting ideas for improvement, which I'll get to once this is merged.
> 
> BTW, you need to fix the "To:" line on your patchbombs:
> 
> > To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com> 
> 
> This bad email address getting quoted to the cc line makes some MTAs
> very unhappy.
> 
> > 
> > Highlights:
> >  - Over 2000 words of documentation in patch 8!  And lots more kernel-doc.
> >  - The page cache is now fully converted to the XArray.
> >  - Many more tests in the test-suite.
> > 
> > This patch set is not for applying.  0day is still reporting problems,
> > and I'd feel bad for eating someone's data.  These patches apply on top
> > of a set of prepatory patches which just aren't interesting.  If you
> > want to see the patches applied to a tree, I suggest pulling my git tree:
> > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
> > I also left out the idr_preload removals.  They're still in the git tree,
> > but I'm not looking for feedback on them.
> 
> I'll give this a quick burn this afternoon and see what catches fire...

Build warnings/errors:

.
lib/radix-tree.c:700:13: warning: 'radix_tree_free_nodes' defined but not used
[-Wunused-function]
 static void radix_tree_free_nodes(struct radix_tree_node *node)
.
lib/xarray.c: In function 'xas_max':
lib/xarray.c:291:16: warning: unused variable 'mask'
[-Wunused-variable]
  unsigned long mask, max = xas->xa_index;
  ^~~~
..
fs/dax.c: In function 'grab_mapping_entry':
fs/dax.c:305:2: error: implicit declaration of function 'xas_set_order'; did
you mean 'xas_set_err'?  [-Werror=implicit-function-declaration]
  xas_set_order(&xas, index, size_flag ? PMD_ORDER : 0);
^
scripts/Makefile.build:310: recipe for target 'fs/dax.o' failed
make[1]: *** [fs/dax.o] Error 1

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 00/73] XArray version 4

2017-12-05 Thread Dave Chinner
On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> From: Matthew Wilcox <mawil...@microsoft.com>
> 
> I looked through some notes and decided this was version 4 of the XArray.
> Last posted two weeks ago, this version includes a *lot* of changes.
> I'd like to thank Dave Chinner for his feedback, encouragement and
> distracting ideas for improvement, which I'll get to once this is merged.

BTW, you need to fix the "To:" line on your patchbombs:

> To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com> 

This bad email address getting quoted to the cc line makes some MTAs
very unhappy.

> 
> Highlights:
>  - Over 2000 words of documentation in patch 8!  And lots more kernel-doc.
>  - The page cache is now fully converted to the XArray.
>  - Many more tests in the test-suite.
> 
> This patch set is not for applying.  0day is still reporting problems,
> and I'd feel bad for eating someone's data.  These patches apply on top
> of a set of prepatory patches which just aren't interesting.  If you
> want to see the patches applied to a tree, I suggest pulling my git tree:
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
> I also left out the idr_preload removals.  They're still in the git tree,
> but I'm not looking for feedback on them.

I'll give this a quick burn this afternoon and see what catches fire...

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-05 Thread Dave Chinner
On Tue, Dec 05, 2017 at 04:41:58PM -0800, Matthew Wilcox wrote:
> From: Matthew Wilcox <mawil...@microsoft.com>
> 
> This eliminates a call to radix_tree_preload().

.

>  void
> @@ -431,24 +424,24 @@ xfs_mru_cache_insert(
>   unsigned long   key,
>   struct xfs_mru_cache_elem *elem)
>  {
> + XA_STATE(xas, &mru->store, key);
>   int error;
>  
>   ASSERT(mru && mru->lists);
>   if (!mru || !mru->lists)
>   return -EINVAL;
>  
> - if (radix_tree_preload(GFP_NOFS))
> - return -ENOMEM;
> -
>   INIT_LIST_HEAD(&elem->list_node);
>   elem->key = key;
>  
> - spin_lock(&mru->lock);
> - error = radix_tree_insert(&mru->store, key, elem);
> - radix_tree_preload_end();
> - if (!error)
> - _xfs_mru_cache_list_insert(mru, elem);
> - spin_unlock(&mru->lock);
> + do {
> + xas_lock(&xas);
> + xas_store(&xas, elem);
> + error = xas_error(&xas);
> + if (!error)
> + _xfs_mru_cache_list_insert(mru, elem);
> + xas_unlock(&xas);
> + } while (xas_nomem(&xas, GFP_NOFS));

Ok, so why does this have a retry loop on ENOMEM despite the
existing code handling that error? And why put such a loop in this
code and not any of the other XFS code that used
radix_tree_preload() and is arguably much more important to avoid
ENOMEM on insert (e.g. the inode cache)?

Also, I really don't like the pattern of using xa_lock()/xa_unlock()
to protect access to an external structure. i.e. the mru->lock
context is protecting multiple fields and operations in the MRU
structure, not just the radix tree operations. Turning that around
so that a larger XFS structure and algorithm is now protected by an
opaque internal lock from generic storage structure the forms part
of the larger structure seems like a bad design pattern to me...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace

2017-11-21 Thread Dave Chinner
On Tue, Nov 21, 2017 at 04:52:53AM -0800, Matthew Wilcox wrote:
> On Tue, Nov 21, 2017 at 05:48:15PM +1100, Dave Chinner wrote:
> > On Mon, Nov 20, 2017 at 08:32:40PM -0800, Matthew Wilcox wrote:
> > > On Mon, Nov 20, 2017 at 05:37:53PM -0800, Darrick J. Wong wrote:
> > > > On Tue, Nov 21, 2017 at 09:27:49AM +1100, Dave Chinner wrote:
> > > > > First thing I noticed was that "xa" as a prefix is already quite
> > > > > widely used in XFS - it's shorthand for "XFS AIL". Indeed, xa_lock
> > >
> > > The X stands for 'eXpandable' or 'eXtending'.  I really don't want to
> > > use more than a two-letter acronym for whatever the XArray ends up being
> > > called.  One of the problems with the radix tree is that everything has
> > > that 11-character 'radix_tree_' prefix; just replacing that with 'xa_'
> > > makes a huge improvement to readability.
> > 
> > Yeah, understood. That's why
> > we have very little clear
> > prefix namespace left :/
> > 
> > [ somedays I write something that looks sorta like a haiku, and from
> > that point on everything just starts falling out of my brain that
> > way. I blame Eric for this today! :P ]
> 
> When the namespace is
> tight we must consider the
> competing users.
> 
> The earliest us'r
> has a claim to a prefix
> we are used to it.
> 
> Also a more wide-
> spread user has a claim to
> a shorter prefix.
> 
> Would you mind changing
> your prefix to one only
> one letter longer?

We can do something
like that, though Darrick has the
final say in it.

> ... ok, I give up ;-)

Top marks for effort :)

> All your current usage of the xa_ prefix looks somewhat like this:
> 
fs/xfs/xfs_trans_ail.c: spin_lock(&ailp->xa_lock);
> 
> with honourable mentions to:
fs/xfs/xfs_log.c: spin_lock(&mp->m_ail->xa_lock);
> 
> Would you mind if I bolt a patch on to the front of the series called
> something like "free up xa_ namespace" that renamed your xa_* to ail_*?
> There are no uses of the 'ail_' prefix in the kernel today.
> 
> I don't think that
>   spin_lock(&ailp->ail_lock);
> loses any readability.

Not sure that's going to work - there's an "xa_ail" member for the
AIL list head. That would now read in places:

if (list_empty(&ailp->ail_ail))

I'd be inclined to just drop the "xa_" prefix from the XFS
structure.  There is no information loss by removing the prefix in
the XFS code because the pointer name tells us what structure it is
pointing at.
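
i.e. something like this (a sketch of the renaming only, nothing else
changed):

struct xfs_ail {
	struct xfs_mount	*mount;
	struct task_struct	*task;
	struct list_head	ail;
	xfs_lsn_t		target;
	xfs_lsn_t		target_prev;
	struct list_head	cursors;
	spinlock_t		lock;
	xfs_lsn_t		last_pushed_lsn;
	int			log_flush;
	struct list_head	buf_list;
	wait_queue_head_t	empty;
};

so the call sites end up as spin_lock(&ailp->lock) and
list_empty(&ailp->ail) without losing any information.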

> 
> By the way, what does AIL stand for?  It'd be nice if it were spelled out
> in at least one of the header files, maybe fs/xfs/xfs_trans_priv.h?

Active Item List. See the section "Tracking Changes" in
Documentation/filesystems/xfs-delayed-logging-design.txt for the
full rundown, but in short:

"The log item is already used to track the log items that have been
written to the log but not yet written to disk. Such log items are
considered "active" and as such are stored in the Active Item List
(AIL) which is a LSN-ordered double linked list. Items are inserted
into this list during log buffer IO completion, after which they are
unpinned and can be written to disk."

The lack of comments describing the AIL is historic - it's never
been documented in the code, nor has the whole relogging concept it
implements. I wrote the document above when introducing the CIL
(Committed Item List) because almost no-one actively working on XFS
understood how the whole journalling subsystem worked in any detail.

> > Zoetrope Array.
> > Labyrinth of illusion.
> > Structure never ends.
> 
> Thank you for making me look up zoetrope ;-)

My pleasure :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace

2017-11-20 Thread Dave Chinner
On Mon, Nov 20, 2017 at 08:32:40PM -0800, Matthew Wilcox wrote:
> On Mon, Nov 20, 2017 at 05:37:53PM -0800, Darrick J. Wong wrote:
> > On Tue, Nov 21, 2017 at 09:27:49AM +1100, Dave Chinner wrote:
> > > First thing I noticed was that "xa" as a prefix is already quite
> > > widely used in XFS - it's shorthand for "XFS AIL". Indeed, xa_lock
> > > already exists and is quite widely used, so having a generic
> > > interface using the same prefixes and lock names is going to be
> > > quite confusing in the XFS code. Especially considering there's
> > > fair bit of radix tree use in XFS (e.g. the internal inode and
> > > dquot caches).
> > > 
> > > FYI, from fs/xfs/xfs_trans_priv.h:
> > > 
> > > /*
> > >  * Private AIL structures.
> > >  *
> > >  * Eventually we need to drive the locking in here as well.
> > >  */
> > > struct xfs_ail {
> > > struct xfs_mount*xa_mount;
> > > struct task_struct  *xa_task;
> > > struct list_headxa_ail;
> > > xfs_lsn_t   xa_target;
> > > xfs_lsn_t   xa_target_prev;
> > > struct list_headxa_cursors;
> > > spinlock_t  xa_lock;
> > > xfs_lsn_t   xa_last_pushed_lsn;
> > > int xa_log_flush;
> > > struct list_headxa_buf_list;
> > > wait_queue_head_t   xa_empty;
> > > };
> > > 
> > > FWIW, why is it named "XArray"?  "X" stands for what?  It still
> > > looks like a tree structure to me, but without a design doc I'm a
> > > bit lost to how it differs to the radix tree (apart from the API)
> > > and why it's considered an "array".
> > 
> > /me nominates 'xarr' for the prefix because pirates. :P
> 
> The X stands for 'eXpandable' or 'eXtending'.  I really don't want to
> use more than a two-letter acronym for whatever the XArray ends up being
> called.  One of the problems with the radix tree is that everything has
> that 11-character 'radix_tree_' prefix; just replacing that with 'xa_'
> makes a huge improvement to readability.

Yeah, understood. That's why
we have very little clear
prefix namespace left :/

[ somedays I write something that looks sorta like a haiku, and from
that point on everything just starts falling out of my brain that
way. I blame Eric for this today! :P ]

> I'm open to renaming it, but I want a good name.  I think it needs to
> be *something* array, so we're looking for a prefix used nowhere in the
> kernel.  Running 'git grep' in turn for '\<[a-z]a_' really only allows
> for ja, ya and za.  'ja_' and 'ya_' only have one user, while 'za_'
> has none.  So ... anyone got a good retcon for JArray, YArray or ZArray?

A Yellow Array.
Colour is irrelevant.
The bikeshed looks nice.

> It's considered an array because it behaves "as if" it's an infinite
> array.

Infinite Array.
Namespace prefix collision
this haiku can't solve.

> The fact that we chop it up into chunks of 64 entries, and then
> arrange those chunks into a tree is not something the average user needs
> to concern themselves with.

Jumbo Array. Many
pieces hidden behind walls.
Will anyone notice?

> We have a lot of places in the kernel that
> fuss around with "allocate N entries, whoops, we exceeded N, kmalloc some
> more, oh no kmalloc failed, vmalloc it instead".  So the idea is to give
> these places an interface that acts like an automatically-resizing array.

Zoetrope Array.
Labyrinth of illusion.
Structure never ends.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace

2017-11-20 Thread Dave Chinner
On Mon, Nov 20, 2017 at 01:51:00PM -0800, Matthew Wilcox wrote:
> On Tue, Nov 21, 2017 at 07:26:06AM +1100, Dave Chinner wrote:
> > On Mon, Nov 20, 2017 at 08:18:29AM -0800, Matthew Wilcox wrote:
> > > On Fri, Nov 17, 2017 at 11:39:25AM -0800, Darrick J. Wong wrote:
> > > > If two programs simultaneously try to write to the same part of a file
> > > > via direct IO and buffered IO, there's a chance that the post-diowrite
> > > > pagecache invalidation will fail on the dirty page.  When this happens,
> > > > the dio write succeeded, which means that the page cache is no longer
> > > > coherent with the disk!
> > > 
> > > This seems like a good opportunity to talk about what I've been working
> > > on for solving this problem.  The XArray is going to introduce a set
> > > of entries which can be stored to locations in the page cache that I'm
> > > calling 'wait entries'.
> > 
> > What's this XArray thing you speak of?
> 
> Ah, right, you were on sabbatical at LSFMM this year where I talked
> about it.  Briefly, it's a new API for the radix tree.  The data structure
> is essentially unchanged (minor enhancements), but I'm rationalising
> existing functionality and adding new abilities.  And getting rid of
> misfeatures like the preload API and implicit GFP flags.
> 
> My current working tree is here:
> 
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-11-20

First thing I noticed was that "xa" as a prefix is already quite
widely used in XFS - it's shorthand for "XFS AIL". Indeed, xa_lock
already exists and is quite widely used, so having a generic
interface using the same prefixes and lock names is going to be
quite confusing in the XFS code. Especially considering there's
fair bit of radix tree use in XFS (e.g. the internal inode and
dquot caches).

FYI, from fs/xfs/xfs_trans_priv.h:

/*
 * Private AIL structures.
 *
 * Eventually we need to drive the locking in here as well.
 */
struct xfs_ail {
struct xfs_mount*xa_mount;
struct task_struct  *xa_task;
struct list_headxa_ail;
xfs_lsn_t   xa_target;
xfs_lsn_t   xa_target_prev;
struct list_headxa_cursors;
spinlock_t  xa_lock;
xfs_lsn_t   xa_last_pushed_lsn;
int xa_log_flush;
struct list_headxa_buf_list;
wait_queue_head_t   xa_empty;
};


> Ignoring the prep patches, the excitement is all to be found with the
> commits which start 'xarray:'

FWIW, why is it named "XArray"?  "X" stands for what?  It still
looks like a tree structure to me, but without a design doc I'm a
bit lost to how it differs to the radix tree (apart from the API)
and why it's considered an "array".

> If you want an example of it in use, I'm pretty happy with this patch
> that switches the brd driver entirely from the radix tree API to the
> xarray API:
> 
> http://git.infradead.org/users/willy/linux-dax.git/commitdiff/dbf96ae943e43563cbbaa26e21b656b6fe8f4b0f

Looks pretty neat, but I'll reserve judgement for when I see the
conversion of the XFS radix tree code

> I've been pretty liberal with the kernel-doc, but I haven't written out
> a good .rst file to give an overview of how to use it.

Let me know when you've written it :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace

2017-11-20 Thread Dave Chinner
On Mon, Nov 20, 2017 at 08:18:29AM -0800, Matthew Wilcox wrote:
> On Fri, Nov 17, 2017 at 11:39:25AM -0800, Darrick J. Wong wrote:
> > If two programs simultaneously try to write to the same part of a file
> > via direct IO and buffered IO, there's a chance that the post-diowrite
> > pagecache invalidation will fail on the dirty page.  When this happens,
> > the dio write succeeded, which means that the page cache is no longer
> > coherent with the disk!
> 
> This seems like a good opportunity to talk about what I've been working
> on for solving this problem.  The XArray is going to introduce a set
> of entries which can be stored to locations in the page cache that I'm
> calling 'wait entries'.

What's this XArray thing you speak of?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-11-14 Thread Dave
On Tue, Nov 14, 2017 at 3:50 AM, Roman Mamedov <r...@romanrm.net> wrote:
>
> On Mon, 13 Nov 2017 22:39:44 -0500
> Dave <davestechs...@gmail.com> wrote:
>
> > I have my live system on one block device and a backup snapshot of it
> > on another block device. I am keeping them in sync with hourly rsync
> > transfers.
> >
> > Here's how this system works in a little more detail:
> >
> > 1. I establish the baseline by sending a full snapshot to the backup
> > block device using btrfs send-receive.
> > 2. Next, on the backup device I immediately create a rw copy of that
> > baseline snapshot.
> > 3. I delete the source snapshot to keep the live filesystem free of
> > all snapshots (so it can be optimally defragmented, etc.)
> > 4. hourly, I take a snapshot of the live system, rsync all changes to
> > the backup block device, and then delete the source snapshot. This
> > hourly process takes less than a minute currently. (My test system has
> > only moderate usage.)
> > 5. hourly, following the above step, I use snapper to take a snapshot
> > of the backup subvolume to create/preserve a history of changes. For
> > example, I can find the version of a file 30 hours prior.
>
> Sounds a bit complex, I still don't get why you need all these snapshot
> creations and deletions, and even still using btrfs send-receive.


Hopefully, my comments below will explain my reasons.

>
> Here is my scheme:
> 
> /mnt/dst <- mounted backup storage volume
> /mnt/dst/backup  <- a subvolume
> /mnt/dst/backup/host1/ <- rsync destination for host1, regular directory
> /mnt/dst/backup/host2/ <- rsync destination for host2, regular directory
> /mnt/dst/backup/host3/ <- rsync destination for host3, regular directory
> etc.
>
> /mnt/dst/backup/host1/bin/
> /mnt/dst/backup/host1/etc/
> /mnt/dst/backup/host1/home/
> ...
> Self explanatory. All regular directories, not subvolumes.
>
> Snapshots:
> /mnt/dst/snaps/backup <- a regular directory
> /mnt/dst/snaps/backup/2017-11-14T12:00/ <- snapshot 1 of /mnt/dst/backup
> /mnt/dst/snaps/backup/2017-11-14T13:00/ <- snapshot 2 of /mnt/dst/backup
> /mnt/dst/snaps/backup/2017-11-14T14:00/ <- snapshot 3 of /mnt/dst/backup
>
> Accessing historic data:
> /mnt/dst/snaps/backup/2017-11-14T12:00/host1/bin/bash
> ...
> /bin/bash for host1 as of 2017-11-14 12:00 (time on the backup system).
> 
>
> No need for btrfs send-receive, only plain rsync is used, directly from
> hostX:/ to /mnt/dst/backup/host1/;


I prefer to start with a BTRFS snapshot at the backup destination. I
think that's the most "accurate" starting point.

>
> No need to create or delete snapshots during the actual backup process;


Then you can't guarantee consistency of the backed up information.

>
> A single common timeline is kept for all hosts to be backed up, snapshot count
> not multiplied by the number of hosts (in my case the backup location is
> multi-purpose, so I somewhat care about total number of snapshots there as
> well);
>
> Also, all of this works even with source hosts which do not use Btrfs.


That's not a concern for me because I prefer to use BTRFS everywhere.


Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-11-13 Thread Dave
On Wed, Nov 1, 2017 at 1:15 AM, Roman Mamedov <r...@romanrm.net> wrote:
> On Wed, 1 Nov 2017 01:00:08 -0400
> Dave <davestechs...@gmail.com> wrote:
>
>> To reconcile those conflicting goals, the only idea I have come up
>> with so far is to use btrfs send-receive to perform incremental
>> backups as described here:
>> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup .
>
> Another option is to just use the regular rsync to a designated destination
> subvolume on the backup host, AND snapshot that subvolume on that host from
> time to time (or on backup completions, if you can synchronize that).
>
> rsync --inplace will keep space usage low as it will not reupload entire files
> in case of changes/additions to them.
>
> Yes rsync has to traverse both directory trees to find changes, but that's
> pretty fast (couple of minutes at most, for a typical root filesystem),
> especially if you use SSD or SSD caching.

Hello. I am implementing this suggestion. So far, so good. However, I
need some further recommendations on rsync options to use for this
purpose.

My rsync command currently looks like this:

rsync -axAHv --inplace --delete-delay --exclude-from="/some/file"
"$source_snapshop/" "$backup_location"

In particular, I want to know if I should or should not be using these options:

-H, --hard-linkspreserve hard links
-A, --acls  preserve ACLs (implies -p)
-X, --xattrspreserve extended attributes
-x, --one-file-system   don't cross filesystem boundaries

I had to use the "x" option to prevent rsync from deleting files in
snapshots in the backup location (as the source location does not
retain any snapshots). Is there a better way?

I have my live system on one block device and a backup snapshot of it
on another block device. I am keeping them in sync with hourly rsync
transfers.

Here's how this system works in a little more detail:

1. I establish the baseline by sending a full snapshot to the backup
block device using btrfs send-receive.
2. Next, on the backup device I immediately create a rw copy of that
baseline snapshot.
3. I delete the source snapshot to keep the live filesystem free of
all snapshots (so it can be optimally defragmented, etc.)
4. hourly, I take a snapshot of the live system, rsync all changes to
the backup block device, and then delete the source snapshot. This
hourly process takes less than a minute currently. (My test system has
only moderate usage.)
5. hourly, following the above step, I use snapper to take a snapshot
of the backup subvolume to create/preserve a history of changes. For
example, I can find the version of a file 30 hours prior.
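
For reference, here is a rough sketch of what steps 4 and 5 look like
as an hourly script on my side (the paths, the snapshot name and the
snapper config name "backup" are illustrative placeholders, not my
exact setup):

    #!/bin/sh
    SRC=/                        # live filesystem
    SNAP=/.hourly-backup-snap    # temporary read-only snapshot on the live volume
    DST=/mnt/backup/current      # the rw copy created in step 2

    btrfs subvolume snapshot -r "$SRC" "$SNAP"
    rsync -axAHv --inplace --delete-delay "$SNAP/" "$DST/"
    btrfs subvolume delete "$SNAP"
    snapper -c backup create -d "hourly backup"   # step 5: history kept on the backup volume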

The backup volume contains up to 100 snapshots while the live volume
has no snapshots. Best of both worlds? I guess I'll find out over
time.


Re: [PATCH 2/6] writeback: allow for dirty metadata accounting

2017-11-09 Thread Dave Chinner
On Thu, Nov 09, 2017 at 02:30:57PM -0500, Josef Bacik wrote:
> From: Josef Bacik <jba...@fb.com>
> 
> Provide a mechanism for file systems to indicate how much dirty metadata they
> are holding.  This introduces a few things
> 
> 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> 2) WB stat for dirty metadata.  This way we know if we need to try and call 
> into
> the file system to write out metadata.  This could potentially be used in the
> future to make balancing of dirty pages smarter.

Ok, so when you have 64k page size and 4k metadata block size and
you're using kmalloc() to allocate the storage for the metadata,
how do we make use of all this page-based metadata accounting
stuff?

We could use a dirty metadata accounting infrastructure in
XFS so that the mm/ subsystem can push on dirty metadata before we
get into reclaim situations, but I just can't see how this code
works when raw pages are not used to back the metadata cache.

That is, XFS can use various different sizes of metadata buffers in
the same filesystem. For example, we use sector sized buffers for
static AG metadata, filesystem blocks for per-AG metadata, some
multiple (1-16) of filesystem blocks for inode buffers, and some
multiple (1-128) of filesystem blocks for directory blocks.

This means we have a combination of buffers  we need to account for
that are made up of:
heap memory when buffer size < page size,
single pages when buffer size == page size, and
multiple pages when buffer size > page size.

The default filesystem config on a 4k page machine with 512 byte
sectors will create buffers in all three categories above which
tends to indicate we can't use this new generic infrastructure as
proposed.

Any thoughts about how we could efficiently support accounting for
variable sized, non-page based metadata with this generic
infrastructure?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Problem with file system

2017-11-06 Thread Dave
On Sat, Nov 4, 2017 at 1:25 PM, Chris Murphy <li...@colorremedies.com> wrote:
>
> On Sat, Nov 4, 2017 at 1:26 AM, Dave <davestechs...@gmail.com> wrote:
> > On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy <li...@colorremedies.com> 
> > wrote:
> >>
> >> That is not a general purpose file system. It's a file system for admins 
> >> who understand where the bodies are buried.
> >
> > I'm not sure I understand your comment...
> >
> > Are you saying BTRFS is not a general purpose file system?
>
> I'm suggesting that any file system that burdens the user with more
> knowledge to stay out of trouble than the widely considered general
> purpose file systems of the day, is not a general purpose file system.
>
> And yes, I'm suggesting that Btrfs is at risk of being neither general
> purpose, and not meeting its design goals as stated in Btrfs
> documentation. It is not easy to admin *when things go wrong*. It's
> great before then. It's a butt ton easier to resize, replace devices,
> take snapshots, and so on. But when it comes to fixing it when it goes
> wrong? It is a goddamn Choose Your Own Adventure book. It's way, way
> more complicated than any other file system I'm aware of.

It sounds like a large part of that could be addressed with better
documentation. I know that documentation such as what you are
suggesting would be really valuable to me!

> > If btrfs isn't able to serve as a general purpose file system for
> > Linux going forward, which file system(s) would you suggest can fill
> > that role? (I can't think of any that are clearly all-around better
> > than btrfs now, or that will be in the next few years.)
>
> ext4 and XFS are clearly the file systems to beat. They almost always
> recover from crashes with just a normal journal replay at mount time,
> file system repair is not often needed. When it is needed, it usually
> works, and there is just the one option to repair and go with it.
> Btrfs has piles of repair options, mount time options, btrfs check has
> options, btrfs rescue has options, it's a bit nutty honestly. And
> there's zero guidance in the available docs what order to try things
> in, not least of which some of these repair tools are still considered
> dangerous at least in the man page text, and the order depends on the
> failure. The user is burdened with way too much.

Neither one of those file systems offers snapshots. (And when I
compared LVM snapshots vs BTRFS snapshots, I got the impression BTRFS
is the clear winner.)

Snapshots and volumes have a lot of value to me and I would not enjoy
going back to a file system without those features.

> Even as much as I know about Btrfs having used it since 2008 and my
> list activity, I routinely have WTF moments when people post problems,
> what order to try to get things going again. Easy to admin? Yeah for
> the most part. But stability is still a problem, and it's coming up on
> a 10 year anniversary soon.
>
> If I were equally familiar with ZFS on Linux as I am with Btrfs, I'd
> use ZoL hands down.

Might it be the case that if you were equally familiar with ZFS, you
would become aware of more of its pitfalls? And that greater knowledge
could always lead to a different decision (such as favoring BTRFS).
In my experience the grass is always greener when I am less familiar
with the field.


Re: Problem with file system

2017-11-04 Thread Dave
On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy  wrote:
>
> That is not a general purpose file system. It's a file system for admins who 
> understand where the bodies are buried.

I'm not sure I understand your comment...

Are you saying BTRFS is not a general purpose file system?

If btrfs isn't able to serve as a general purpose file system for
Linux going forward, which file system(s) would you suggest can fill
that role? (I can't think of any that are clearly all-around better
than btrfs now, or that will be in the next few years.)

Or maybe you meant something else?


Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-11-02 Thread Dave
On Thu, Nov 2, 2017 at 4:46 PM, Kai Krakow <hurikha...@gmail.com> wrote:
> On Wed, 1 Nov 2017 02:51:58 -0400, Dave <davestechs...@gmail.com> wrote:
>
>> >
>> >> To reconcile those conflicting goals, the only idea I have come up
>> >> with so far is to use btrfs send-receive to perform incremental
>> >> backups
>> >
>> > As already said by Romain Mamedov, rsync is viable alternative to
>> > send-receive with much less hassle. According to some reports it
>> > can even be faster.
>>
>> Thanks for confirming. I must have missed those reports. I had never
>> considered this idea until now -- but I like it.
>>
>> Are there any blogs or wikis where people have done something similar
>> to what we are discussing here?
>
> I used rsync before, backup source and destination both were btrfs. I
> was experiencing the same btrfs bug from time to time on both devices,
> luckily not at the same time.
>
> I instead switched to using borgbackup, and xfs as the destination (to
> not fall the same-bug-in-two-devices pitfall).

I'm going to stick with btrfs everywhere. My reasoning is that my
biggest pitfalls will be related to lack of knowledge. So focusing on
learning one filesystem better (vs poorly learning two) is the better
strategy for me, given my limited time. (I'm not an IT professional of
any sort.)

Is there any problem with the Borgbackup repository being on btrfs?

> Borgbackup achieves a
> much higher deduplication density and compression, and as such also is
> able to store much more backup history in the same storage space. The
> first run is much slower than rsync (due to enabled compression) but
> successive runs are much faster (like 20 minutes per backup run instead
> of 4-5 hours).
>
> I'm currently storing 107 TB of backup history in just 2.2 TB backup
> space, which counts a little more than one year of history now,
> containing 56 snapshots. This is my retention policy:
>
>   * 5 yearly snapshots
>   * 12 monthly snapshots
>   * 14 weekly snapshots (worth around 3 months)
>   * 30 daily snapshots
>
> Restore is fast enough, and a snapshot can even be fuse-mounted (tho,
> in that case mounted access can be very slow navigating directories).
>
> With latest borgbackup version, the backup time increased to around 1
> hour from 15-20 minutes in the previous version. That is due to
> switching the file cache strategy from mtime to ctime. This can be
> tuned to get back to old performance, but it may miss some files during
> backup if you're doing awkward things to file timestamps.
>
> I'm also backing up some servers with it now, then use rsync to sync
> the borg repository to an offsite location.
>
> Combined with same-fs local btrfs snapshots with short retention times,
> this could be a viable solution for you.

Yes, I appreciate the idea. I'm going to evaluate both rsync and Borgbackup.

The advantage of rsync, I think, is that it will likely run in just a
couple minutes. That will allow me to run it hourly and to keep my
live volume almost entire free of snapshots and fully defragmented.
It's also very simple as I already have rsync. And since I'm going to
run btrfs on the backup volume, I can perform hourly snapshots there
and use Snapper to manage retention. It's all very simple and relies
on tools I already have and know.

However, the advantages of Borgbackup you mentioned (much higher
deduplication density and compression) make it worth considering.
Maybe Borgbackup won't take long to complete successive (incremental)
backups on my system. I'll have to try it to see. It's a very nice
looking project. I'm surprised I never heard of it before.
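
If I do test it, my understanding of the basic commands is roughly the
following (the repository path is an illustrative placeholder and the
retention numbers simply mirror the policy you listed above, so treat
this as an untested sketch):

    borg init --encryption=repokey /mnt/backup/borg         # one-time repository setup
    borg create --compression lz4 \
        /mnt/backup/borg::'{hostname}-{now}' /home /etc     # one archive per run
    borg prune --keep-daily 30 --keep-weekly 14 \
        --keep-monthly 12 --keep-yearly 5 /mnt/backup/borg  # thin out old archives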


Re: defragmenting best practice?

2017-11-02 Thread Dave
On Thu, Nov 2, 2017 at 7:07 AM, Austin S. Hemmelgarn
<ahferro...@gmail.com> wrote:
> On 2017-11-01 21:39, Dave wrote:
>> I'm going to make this change now. What would be a good way to
>> implement this so that the change applies to the $HOME/.cache of each
>> user?
>>
>> The simple way would be to create a new subvolume for each existing
>> user and mount it at $HOME/.cache in /etc/fstab, hard coding that
>> mount location for each user. I don't mind doing that as there are
>> only 4 users to consider. One minor concern is that it adds an
>> unexpected step to the process of creating a new user. Is there a
>> better way?
>>
> The easiest option is to just make sure nobody is logged in and run the
> following shell script fragment:
>
> for dir in /home/* ; do
> rm -rf $dir/.cache
> btrfs subvolume create $dir/.cache
> done
>
> And then add something to the user creation scripts to create that
> subvolume.  This approach won't pollute /etc/fstab, will still exclude the
> directory from snapshots, and doesn't require any hugely creative work to
> integrate with user creation and deletion.
>
> In general, the contents of the .cache directory are just that, cached data.
> Provided nobody is actively accessing it, it's perfectly safe to just nuke
> the entire directory...

I like this suggestion. Thank you. I had intended to mount the .cache
subvolumes with the NODATACOW option. However, with this approach, I
won't be explicitly mounting the .cache subvolumes. Is it possible to
use "chattr +C $dir/.cache" in that loop even though it is a
subvolume? And, is setting the .cache directory to NODATACOW the right
choice given this scenario? From earlier comments, I believe it is,
but I want to be sure I understood correctly.
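
For example, this variation of the loop is what I have in mind
(untested sketch; it assumes that setting +C on the still-empty
subvolume means files created inside it afterwards are NOCOW, and that
each user name matches the home directory name):

    for dir in /home/* ; do
        rm -rf "$dir/.cache"
        btrfs subvolume create "$dir/.cache"
        # +C on the empty directory so new files inherit the NOCOW attribute
        chattr +C "$dir/.cache"
        # assumption: user name == home directory name
        chown "$(basename "$dir")": "$dir/.cache"
    done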


Re: defragmenting best practice?

2017-11-02 Thread Dave
On Thu, Nov 2, 2017 at 5:16 PM, Kai Krakow  wrote:

>
> You may want to try btrfs autodefrag mount option and see if it
> improves things (tho, the effect may take days or weeks to apply if you
> didn't enable it right from the creation of the filesystem).
>
> Also, autodefrag will probably unshare reflinks on your snapshots. You
> may be able to use bees[1] to work against this effect. Its interaction
> with autodefrag is not well tested but it works fine for me. Also, bees
> is able to reduce some of the fragmentation during deduplication
> because it will rewrite extents back into bigger chunks (but only for
> duplicated data).
>
> [1]: https://github.com/Zygo/bees

I will look into bees. And yes, I plan to try autodefrag. (I already
have it enabled now.) However, I need to understand something about
how btrfs send-receive works in regard to reflinks and fragmentation.

Say I have 2 snapshots on my live volume. The earlier one of them has
already been sent to another block device by btrfs send-receive (full
backup). Now defrag runs on the live volume and breaks some percentage
of the reflinks. At this point I do an incremental btrfs send-receive
using "-p" (or "-c") with the diff going to the same other block
device where the prior snapshot was already sent.

Will reflinks be "made whole" (restored) on the receiving block
device? Or is the state of the source volume replicated so closely
that reflink status is the same on the target?

Also, is fragmentation reduced on the receiving block device?

My expectation is that fragmentation would be reduced and duplication
would be reduced too. In other words, does send-receive result in
defragmentation and deduplication too?


Re: defragmenting best practice?

2017-11-02 Thread Dave
On Thu, Nov 2, 2017 at 7:17 AM, Austin S. Hemmelgarn
 wrote:

>> And the worst performing machine was the one with the most RAM and a
>> fast NVMe drive and top of the line hardware.
>
> Somewhat nonsensically, I'll bet that NVMe is a contributing factor in this
> particular case.  NVMe has particularly bad performance with the old block
> IO schedulers (though it is NVMe, so it should still be better than a SATA
> or SAS SSD), and the new blk-mq framework just got scheduling support in
> 4.12, and only got reasonably good scheduling options in 4.13.  I doubt it's
> the entirety of the issue, but it's probably part of it.

Thanks for that news. Based on that, I assume the advice here (to use
noop for NVMe) is now outdated?
https://stackoverflow.com/a/27664577/463994

Is the solution as simple as running a kernel >= 4.13? Or do I need to
specify which scheduler to use?

I just checked one computer:

uname -a
Linux morpheus 4.13.5-1-ARCH #1 SMP PREEMPT Fri Oct 6 09:58:47 CEST
2017 x86_64 GNU/Linux

$ sudo find /sys -name scheduler -exec grep . {} +
/sys/devices/pci:00/:00:1d.0/:08:00.0/nvme/nvme0/nvme0n1/queue/scheduler:[none] mq-deadline kyber bfq

From this article, it sounds like (maybe) I should use kyber. I see
kyber listed in the output above, so I assume that means it is
available. I also think [none] is the current scheduler being used, as
it is in brackets.

I checked this:
https://www.kernel.org/doc/Documentation/block/switching-sched.txt
Based on that, I assume I would do this at runtime:

echo kyber > 
/sys/devices/pci:00/:00:1d.0/:08:00.0/nvme/nvme0/nvme0n1/queue/scheduler

I assume this is equivalent:

echo kyber > /sys/block/nvme0n1/queue/scheduler

How would I set it permanently at boot time?
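
One approach I am considering is a udev rule (untested sketch; the
rule file name and the device match are my own choices, not something
I found documented for this exact case):

    # /etc/udev/rules.d/60-ioscheduler.rules
    ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="kyber"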

>> While Firefox and Linux in general have their performance "issues",
>> that's not relevant here. I'm comparing the same distros, same Firefox
>> versions, same Firefox add-ons, etc. I eventually tested many hardware
>> configurations: different CPU's, motherboards, GPU's, SSD's, RAM, etc.
>> The only remaining difference I can find is that the computer with
>> acceptable performance uses LVM + EXT4 while all the others use BTRFS.
>>
>> With all the great feedback I have gotten here, I'm now ready to
>> retest this after implementing all the BTRFS-related suggestions I
>> have received. Maybe that will solve the problem or maybe this mystery
>> will continue...
>
> Hmm, if you're only using SSD's, that may partially explain things.  I don't
> remember if it was mentioned earlier in this thread, but you might try
> adding 'nossd' to the mount options.  The 'ssd' mount option (which gets set
> automatically if the device reports as non-rotational) impacts how the block
> allocator works, and that can have a pretty insane impact on performance.

I will test the "nossd" mount option.
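
(For the record, my understanding is that this is just a mount-option
change, e.g. an fstab line along these lines, where the UUID and
subvolume are placeholders for my actual values:

    UUID=<filesystem-uuid>  /  btrfs  defaults,noatime,nossd,subvol=@  0  0

followed by a remount or reboot to pick it up.)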

> Additionally, independently from that, try toggling the 'discard' mount
> option.  If you have it enabled, disable it, if you have it disabled, enable
> it.  Inline discards can be very expensive on some hardware, especially
> older SSD's, and discards happen pretty frequently in a COW filesystem.

I have been following this advice, so I have never enabled discard for
an NVMe drive. Do you think it is worth testing?

Solid State Drives/NVMe - ArchWiki
https://wiki.archlinux.org/index.php/Solid_State_Drives/NVMe

Discards:
Note: Although continuous TRIM is an option (albeit not recommended)
for SSDs, NVMe devices should not be issued discards.


Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS and MDADM (Dec 2014) – Ronny Egners Blog

2017-11-01 Thread Dave
Has this been discussed here? Has anything changed since it was written?

Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS
and MDADM (Dec 2014) – Ronny Egners Blog
http://blog.ronnyegner-consulting.de/2014/12/10/parity-based-redundancy-raid56triple-parity-and-beyond-on-btrfs-and-mdadm-dec-2014/comment-page-1/

TL;DR: There are patches to extend the linux kernel to support up to 6
parity disks but BTRFS does not want them because it does not fit
their “business case” and MDADM would want them but somebody needs to
develop patches for the MDADM component. The kernel raid
implementation is ready and usable. If someone volunteers to do this
kind of work I would support with equipment and myself as a test
resource.


Re: defragmenting best practice?

2017-11-01 Thread Dave
On Wed, Nov 1, 2017 at 8:21 AM, Austin S. Hemmelgarn
 wrote:

>> The cache is in a separate location from the profiles, as I'm sure you
>> know.  The reason I suggested a separate BTRFS subvolume for
>> $HOME/.cache is that this will prevent the cache files for all
>> applications (for that user) from being included in the snapshots. We
>> take frequent snapshots and (afaik) it makes no sense to include cache
>> in backups or snapshots. The easiest way I know to exclude cache from
>> BTRFS snapshots is to put it on a separate subvolume. I assumed this
>> would make several things related to snapshots more efficient too.
>
> Yes, it will, and it will save space long-term as well since $HOME/.cache is
> usually the most frequently modified location in $HOME. In addition to not
> including this in the snapshots, it may also improve performance.  Each
> subvolume is it's own tree, with it's own locking, which means that you can
> generally improve parallel access performance by splitting the workload
> across multiple subvolumes.  Whether it will actually provide any real
> benefit in that respect is heavily dependent on the exact workload however,
> but it won't hurt performance.

I'm going to make this change now. What would be a good way to
implement this so that the change applies to the $HOME/.cache of each
user?

The simple way would be to create a new subvolume for each existing
user and mount it at $HOME/.cache in /etc/fstab, hard coding that
mount location for each user. I don't mind doing that as there are
only 4 users to consider. One minor concern is that it adds an
unexpected step to the process of creating a new user. Is there a
better way?


Re: defragmenting best practice?

2017-11-01 Thread Dave
On Wed, Nov 1, 2017 at 1:48 PM, Peter Grandi  wrote:
>> When defragmenting individual files on a BTRFS filesystem with
>> COW, I assume reflinks between that file and all snapshots are
>> broken. So if there are 30 snapshots on that volume, that one
>> file will suddenly take up 30 times more space... [ ... ]
>
> Defragmentation works by effectively making a copy of the file
> contents (simplistic view), so the end result is one copy with
> 29 reflinked contents, and one copy with defragmented contents.

The clarification is much appreciated.

>> Can you also give an example of using find, as you suggested
>> above? [ ... ]
>
> Well, one way is to use 'find' as a filtering replacement for
> 'defrag' option '-r', as in for example:
>
>   find "$HOME" -xdev '(' -name '*.sqlite' -o -name '*.mk4' ')' \
> -type f  -print0 | xargs -0 btrfs fi defrag
>
> Another one is to find the most fragmented files first or all
> files of at least 1M with with at least say 100 fragments as in:
>
>   find "$HOME" -xdev -type f -size +1M -print0 | xargs -0 filefrag \
> | perl -n -e 'print "$1\0" if (m/(.*): ([0-9]+) extents/ && $2 > 100)' \
> | xargs -0 btrfs fi defrag
>
> But there are many 'find' web pages and that is not quite a
> Btrfs related topic.

Your examples were perfect. I have experience using find in similar
ways. I can take it from there. :-)

>> Background: I'm not sure why our Firefox performance is so terrible
>
> As I always say, "performance" is not the same as "speed", and
> probably your Firefox "performance" is sort of OKish even if the
> "speed" is terrile, and neither is likely related to the profile
> or the cache being on Btrfs.

Here's what happened. Two years ago I installed Kubuntu (with Firefox)
on two desktop computers. One machine performed fine. Like you said,
"sort of OKish" and that's what we expect with the current state of
Linux. The other machine was substantially worse. We ran side-by-side
real-world tests on these two machines for months.

Initially I did a lot of testing, troubleshooting and reconfiguration
trying to get the second machine to perform as well as the first. I
never had success. At first I thought it was related to the GPU (or
driver). Then I thought it was because the first machine used the z170
chipset and the second was X99 based. But that wasn't it. I have never
solved the problem and I have been coming back to it periodically
these last two years. In that time I have tried different distros from
opensuse to Arch, and a lot of different hardware.

Furthermore, my new machines have the same performance problem. The
most interesting example is a high end machine with 256 GB of RAM. It
showed substantially worse desktop application performance than any
other computer here. All are running the exact same version of Firefox
with the exact same add-ons. (The installations are carbon copies of
each other.)

What originally caught my attention was earlier information in this thread:

On Wed, 20 Sep 2017 07:46:52 -0400, "Austin S. Hemmelgarn" wrote:

> >  Fragmentation: Files with a lot of random writes can become
> > heavily fragmented (1+ extents) causing excessive multi-second
> > spikes of CPU load on systems with an SSD or large amount a RAM. On
> > desktops this primarily affects application databases (including
> > Firefox). Workarounds include manually defragmenting your home
> > directory using btrfs fi defragment. Auto-defragment (mount option
> > autodefrag) should solve this problem.
> >
> > Upon reading that I am wondering if fragmentation in the Firefox
> > profile is part of my issue. That's one thing I never tested
> > previously. (BTW, this system has 256 GB of RAM and 20 cores.)
> Almost certainly.  Most modern web browsers are brain-dead and insist
> on using SQLite databases (or traditional DB files) for everything,
> including the cache, and the usage for the cache in particular kills
> performance when fragmentation is an issue.

It turns out the the first machine (which performed well enough) was
the last one which was installed using LVM + EXT4. The second machine
(the one with the original performance problem) and all subsequent
machines have used BTRFS.

And the worst performing machine was the one with the most RAM and a
fast NVMe drive and top of the line hardware.

While Firefox and Linux in general have their performance "issues",
that's not relevant here. I'm comparing the same distros, same Firefox
versions, same Firefox add-ons, etc. I eventually tested many hardware
configurations: different CPU's, motherboards, GPU's, SSD's, RAM, etc.
The only remaining difference I can find is that the computer with
acceptable performance uses LVM + EXT4 while all the others use BTRFS.

With all the great feedback I have gotten here, I'm now ready to
retest this after implementing all the BTRFS-related suggestions I
have received. Maybe that will solve the problem or maybe this mystery

Re: defragmenting best practice?

2017-11-01 Thread Dave
On Wed, Nov 1, 2017 at 9:31 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Dave posted on Tue, 31 Oct 2017 17:47:54 -0400 as excerpted:
>
>> 6. Make sure Firefox is running in multi-process mode. (Duncan's
>> instructions, while greatly appreciated and very useful, left me
>> slightly confused about pulseaudio's compatibility with multi-process
>> mode.)
>
> Just to clarify:
>
> There's no problem with native pulseaudio and firefox multi-process
> mode.

Thank you for clarifying. And I appreciate your detailed explanation.

> Back when I posted that, a not e10s-enabled extension was actually quite
> likely, as e10s was still rather new.  It's probably somewhat less so
> now, and firefox is of course on to the next big change, dropping the old
> "legacy chrome" extension support, in favor of the newer and generally
> Chromium-compatible WebExtensions/WE API, with firefox 57, to be released
> mid-month (Nov 14).

I am now running Firefox 57 beta and I'll be doing my testing with
that using only WebExtensions.


Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-11-01 Thread Dave
On Wed, Nov 1, 2017 at 4:34 AM, Marat Khalili  wrote:

>> We do experience severe performance problems now, especially with
>> Firefox. Part of my experiment is to reduce the number of snapshots on
>> the live volumes, hence this question.
>
> Just for statistics, how many snapshots do you have and how often do you
> take them? It's on SSD, right?

I don't think the severe performance problems stem solely from the
number of snapshots. I think it is also related to Firefox stuff
(cache fragmentation, lack of multi-process mode maybe, etc.) I
still have to investigate the Firefox issues, but I'm starting at the
foundation by trying to get a basic BTRFS setup that will support
better desktop application performance first.

The poor performance has existed from the beginning of using BTRFS +
KDE + Firefox (almost 2 years ago), at a point when very few snapshots
had yet been created. A comparison system running similar hardware as
well as KDE + Firefox (and LVM + EXT4) did not have the performance
problems. The difference has been consistent and significant. For a
while I thought the difference was due to the hardware, as one system
used the z170 chipset and the other used the X99 chipset (but were
otherwise equivalent). So I repeated the testing on identical hardware
and the stark performance difference remained. When I realized that, I
began focusing on BTRFS, as it is the only consistent difference I can
recognize.

Sometimes I have used Snapper settings like this:

TIMELINE_MIN_AGE="1800"
TIMELINE_LIMIT_HOURLY="36"
TIMELINE_LIMIT_DAILY="30"
TIMELINE_LIMIT_MONTHLY="12"
TIMELINE_LIMIT_YEARLY="10"

However, I also have some computers set like this:

TIMELINE_MIN_AGE="1800"
TIMELINE_LIMIT_HOURLY="10"
TIMELINE_LIMIT_DAILY="10"
TIMELINE_LIMIT_WEEKLY="0"
TIMELINE_LIMIT_MONTHLY="0"
TIMELINE_LIMIT_YEARLY="0"

> BTW beware of deleting too many snapshots at once with any tool. Delete few
> and let filesystem stabilize before proceeding.

OK, thanks for the tip.

> For deduplication tool to be useful you ought to have some duplicate data on
> your live volume. Do you have any (e.g. many LXC containers with the same
> distribution)?

No, no containers and no duplication to that large extent.

> P.S. I still think you need some off-system backup solution too, either
> rsync+snapshot-based over ssh or e.g. Burp (shameless advertising:
> http://burp.grke.org/ ).

I agree, but that's beyond the scope of the current problem I'm trying
to solve.  However, I'll check out Burp once I have a base
configuration that is working satisfactorily.


Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-11-01 Thread Dave
On Wed, Nov 1, 2017 at 2:19 AM, Marat Khalili  wrote:

> You seem to have two tasks: (1) same-volume snapshots (I would not call them
> backups) and (2) updating some backup volume (preferably on a different
> box). By solving them separately you can avoid some complexity...

Yes, it appears that is a very good strategy -- solve the concerns
separately. Make the live volume performant and the backup volume
historical.

>
>> To reconcile those conflicting goals, the only idea I have come up
>> with so far is to use btrfs send-receive to perform incremental
>> backups
>
> As already said by Romain Mamedov, rsync is viable alternative to
> send-receive with much less hassle. According to some reports it can even be
> faster.

Thanks for confirming. I must have missed those reports. I had never
considered this idea until now -- but I like it.

Are there any blogs or wikis where people have done something similar
to what we are discussing here?

>
>> Given the hourly snapshots, incremental backups are the only practical
>> option. They take mere moments. Full backups could take an hour or
>> more, which won't work with hourly backups.
>
> I don't see much sense in re-doing full backups to the same physical device.
> If you care about backup integrity, it is probably more important to invest
> in backups verification. (OTOH, while you didn't reveal data size, if full
> backup takes just an hour on your system then why not?)

I was saying that a full backup could take an hour or more. That means
full backups are not compatible with an hourly backup schedule. And it
is certainly not a potential solution to making the system perform
better because the system will be spending all its time running
backups -- it would be never ending. With hourly backups, they should
complete in just a few moments, which is the case with incremental
backups. (It sounds like this will be the case with rsync as well.)
>
>> We will delete most snapshots on the live volume, but retain many (or
>> all) snapshots on the backup block device. Is that a good strategy,
>> given my goals?
>
> Depending on the way you use it, retaining even a dozen snapshots on a live
> volume might hurt performance (for high-performance databases) or be
> completely transparent (for user folders). You may want to experiment with
> this number.

We do experience severe performance problems now, especially with
Firefox. Part of my experiment is to reduce the number of snapshots on
the live volumes, hence this question.

>
> In any case I'd not recommend retaining ALL snapshots on backup device, even
> if you have infinite space. Such filesystem would be as dangerous as the
> demon core, only good for adding more snapshots (not even deleting them),
> and any little mistake will blow everything up. Keep a few dozen, hundred at
> most.

The intention -- if we were to keep all snapshots on a backup device
-- would be to never ever try to delete them. However, with the
suggestion to separate the concerns and use rsync, we could also
easily run the Snapper timeline cleanup on the backup volume, thereby
limiting the retained snapshots to some reasonable number.

> Unlike other backup systems, you can fairly easily remove snapshots in the
> middle of sequence, use this opportunity. My thinout rule is: remove
> snapshot if resulting gap will be less than some fraction (e.g. 1/4) of its
> age. One day I'll publish portable solution on github.

Thanks. I hope you do find time to publish it. (And what do you mean
by portable?) For now, Snapper has a cleanup algorithm that we can
use. At least one of the tools listed here has a thinout algorithm
too: https://btrfs.wiki.kernel.org/index.php/Incremental_Backup

>> Given this minimal retention of snapshots on the live volume, should I
>> defrag it (assuming there is at least 50% free space available on the
>> device)? (BTW, is defrag OK on an NVMe drive? or an SSD?)
>>
>> In the above procedure, would I perform that defrag before or after
>> taking the snapshot? Or should I use autodefrag?
>
> I ended up using autodefrag, didn't try manual defragmentation. I don't use
> SSDs as backup volumes.

I don't use SSD's as backup volumes either. I was asking about the live volume.
>
>> Should I consider a dedup tool like one of these?
>
> Certainly NOT for snapshot-based backups: it is already deduplicated almost
> as much as possible, dedup tools can only make it *less* deduplicated.

The question is whether to use a dedup tool on the live volume which
has a few snapshots. Even with the new strategy (based on rsync), the
live volume may sometimes have two snapshots (pre- and post- pacman
upgrades).

I still wish to know, in that case, about using both a dedup tool and
defragmenting the btrfs filesystem.

Also still wondering about these options: no-holes, skinny metadata,
or extended inode refs?

This is a very helpful discussion. Thank you.

Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-11-01 Thread Dave
On Wed, Nov 1, 2017 at 1:15 AM, Roman Mamedov <r...@romanrm.net> wrote:
> On Wed, 1 Nov 2017 01:00:08 -0400
> Dave <davestechs...@gmail.com> wrote:
>
>> To reconcile those conflicting goals, the only idea I have come up
>> with so far is to use btrfs send-receive to perform incremental
>> backups as described here:
>> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup .
>
> Another option is to just use the regular rsync to a designated destination
> subvolume on the backup host, AND snapshot that subvolume on that host from
> time to time (or on backup completions, if you can synchronize that).
>
> rsync --inplace will keep space usage low as it will not reupload entire files
> in case of changes/additions to them.
>

This seems like a brilliant idea, something that has a lot of potential...

On a system where the root filesystem is on an SSD and the backup
volume on an HDD, I could rsync hourly, and then run Snapper on the
backup volume hourly, as well as using Snapper's timeline cleanup on
the backup volume. The live filesystem would have zero snapshots and
could be optimized for performance. The backup volume could retain a
large number of snapshots (even more than several hundred) because
performance would not be very important (as far as I can guess). This
seems to resolve our conflict.

How about on a system (such as a laptop) with only a single SSD? Would
this same idea work where the backup volume is on the same block
device? I know that is not technically a backup, but what it does
accomplish is separation of the live filesystem from the snapshotted
backup volume for performance reasons -- yet the hourly snapshot
history is still available. That would seem to meet our use case too.
(An external backup disk would be connected to the laptop
periodically, of course, too.)

Currently, for most btrfs volumes, I have three volumes: the main
volume, a snapshot subvolume which contains all the individual
snapshots, and a backup volume* (on a different block device but on
the same machine).

With this new idea, I would have a main volume without any snapshots
and a backup volume which contains all the snapshots. It simplifies
things on that level and it also simplifies performance tuning on the
main volume. In fact it simplifies backup snapshot management too.

My initial impression is that this simplifies everything as well as
optimizing everything. So surely it must have some disadvantages
compared to btrfs send-receive incremental backups
(https://btrfs.wiki.kernel.org/index.php/Incremental_Backup). What
would those disadvantages be?

The first one that comes to mind is that I would lose the
functionality of pre- and post- upgrade snapshots on the root
filesystem. But I think that's minor. I could either keep those two
snapshots for a few hours or days after major upgrades or maybe I
could find a pacman hook that uses rsync to make pre- and post-
upgrade copies...

* Footnote: on some workstation computers, we have 2 or 3 separate
backup block devices (e..g, external USB hard drives, etc.). Laptops,
however, generally only have a single block device and are not always
connected to an external USB hard drive for backup as often as would
be ideal. But we also don't keep any critical data on laptops.


Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-10-31 Thread Dave
Our use case requires snapshots. btrfs snapshots are best solution we
have found for our requirements, and over the last year snapshots have
proven their value to us.

(For this discussion I am considering both the "root" volume and the
"home" volume on a typical desktop workstation. Also, all btfs volumes
are mounted with noatime and nodiratime flags.)

For performance reasons, I now wish to minimize the number of
snapshots retained on the live btrfs volume.

However, for backup purposes, I wish to maximize the number of
snapshots retained over time. We'll keep yearly, monthly, weekly,
daily and hourly snapshots for as long as possible.

To reconcile those conflicting goals, the only idea I have come up
with so far is to use btrfs send-receive to perform incremental
backups as described here:
https://btrfs.wiki.kernel.org/index.php/Incremental_Backup .

Given the hourly snapshots, incremental backups are the only practical
option. They take mere moments. Full backups could take an hour or
more, which won't work with hourly backups.

We will delete most snapshots on the live volume, but retain many (or
all) snapshots on the backup block device. Is that a good strategy,
given my goals?

The steps:

I know step one is to do the "bootstrapping" where a full initial copy
of the live volume is sent to the backup volume. I also know the steps
for doing incremental backups.

However, the first problem I see is that performing incremental
backups requires both the live volume and the backup volume to have an
identical "parent" snapshot before each new incremental can be sent. I
have found it easy to accidentally delete that specific required
parent snapshot when hourly snapshots are being taken and many
snapshots exist.

Given that I want to retain the minimum number of snapshots on the
live volume, how do I ensure that a valid "parent" subvolume exists
there in order to perform the incremental backup? (Again, I have often
run into the error "no valid parent exists" when doing incremental
backups.)

I think the rule is like this:

Do not delete a snapshot from the live volume until the next snapshot
based on it has been sent to the backup volume.

In other words, always retain the *exact* snapshot that was the last
one sent to the backup volume. Deleting that one then taking another
one does not seem sufficient. BTRFS does not seem to recognize
parent-child-grandchild relationships of snapshots when doing
send-receive incremental backups.

However, maybe I'm wrong. Would it be sufficient to first take another
snapshot, then delete the prior snapshot? Will the send-receive
algorithm be able to infer a parent exists on the backup volume when
it receives an incremental based on a child snapshot? (My experience
says "no", but I'd like a more authoritative answer.)

The next step in my proposed procedure is to take a new snapshot, send
it to the backup volume, and only then delete the prior snapshot ( and
only from the live volume* ).

Using this strategy, the live volume will always have the current
snapshot (which I guess should not be called a snapshot -- it's the
live volume) plus at least one more snapshot. Briefly, during the
incremental backup, it will have an additional snapshot until the
older one gets deleted.
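
(A minimal sketch of that rotation, assuming the invented paths below and
that "prev" is the last snapshot already received on the backup side:)

    btrfs subvolume snapshot -r /live /live/.snapshots/new
    btrfs send -p /live/.snapshots/prev /live/.snapshots/new | btrfs receive /backup/snapshots/
    btrfs subvolume delete /live/.snapshots/prev    # only once the receive has completed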

Given this minimal retention of snapshots on the live volume, should I
defrag it (assuming there is at least 50% free space available on the
device)? (BTW, is defrag OK on an NVMe drive? or an SSD?)

In the above procedure, would I perform that defrag before or after
taking the snapshot? Or should I use autodefrag?
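
(For reference, the two approaches being weighed look roughly like this;
the target extent size and path are only examples:)

    btrfs filesystem defragment -r -t 32M /live    # one-off recursive pass
    mount -o remount,autodefrag /live              # or let the kernel defragment in the background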

Should I consider a dedup tool like one of these?

g2p/bedup: Btrfs deduplication
https://github.com/g2p/bedup

markfasheh/duperemove: Tools for deduping file systems
https://github.com/markfasheh/duperemove

Zygo/bees: Best-Effort Extent-Same, a btrfs dedup agent
https://github.com/Zygo/bees

Does anyone care to elaborate on the relationship between a dedup tool
like Bees and defragmenting a btrfs filesystem with snapshots? I
understand they do opposing things, but I think it was suggested in
another thread on defragmenting that they can be combined to good
effect. Should I consider this as a possible solution for my
situation?

Should I consider any of these options: no-holes, skinny metadata, or
extended inode refs?

Finally, are there any good BTRFS performance wiki articles or blogs I
should refer to for my situation?

* Footnote: On the backup device, maybe we will never delete
snapshots. In any event, that's not a concern now. We'll retain many,
many snapshots on the backup device.


Re: defragmenting best practice?

2017-10-31 Thread Dave
On Tue, Oct 31, 2017 at 7:06 PM, Peter Grandi  
wrote:
>
> Also nothing forces you to defragment a whole filesystem, you
> can just defragment individual files or directories by using
> 'find' with it.

Thanks for that info. When defragmenting individual files on a BTRFS
filesystem with COW, I assume reflinks between that file and all
snapshots are broken. So if there are 30 snapshots on that volume,
that one file will suddenly take up 30 times more space... Is that
correct? Or are the reflinks only broken between the live file and the
latest snapshot? Or is it something between, based on how many times
the file has changed?

>
> My top "$HOME" fragmented files are the aKregator RSS feed
> databases, usually a few hundred fragments each, and the
> '.sqlite' files for Firefox. Occasionally like just now I do
> this:
>
>   tree$  sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
>   .firefox/default/cleanup.sqlite: 43 extents found
>   .firefox/default/content-prefs.sqlite: 67 extents found
>   .firefox/default/formhistory.sqlite: 87 extents found
>   .firefox/default/places.sqlite: 3879 extents found
>
>   tree$  sudo btrfs fi defrag .firefox/default/*.sqlite
>
>   tree$  sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
>   .firefox/default/webappsstore.sqlite: 1 extent found
>   .firefox/default/favicons.sqlite: 2 extents found
>   .firefox/default/kinto.sqlite: 2 extents found
>   .firefox/default/places.sqlite: 44 extents found

That's a very helpful example.

Can you also give an example of using find, as you suggested above?
I'm generally familiar with using find to execute specific commands,
but an example is appreciated in this case.
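
(For illustration only, and not taken from Peter's reply: the kind of
invocation being asked about might look like this.)

    find ~/.firefox/default -xdev -type f -name '*.sqlite' \
        -exec btrfs filesystem defragment -t 32M {} +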

> > 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted nocow -- 
> > it will NOT be snapshotted

> Also, you can declare the '.firefox/default/' directory to be NOCOW, and that 
> "just works".

The cache is in a separate location from the profiles, as I'm sure you
know.  The reason I suggested a separate BTRFS subvolume for
$HOME/.cache is that this will prevent the cache files for all
applications (for that user) from being included in the snapshots. We
take frequent snapshots and (afaik) it makes no sense to include cache
in backups or snapshots. The easiest way I know to exclude cache from
BTRFS snapshots is to put it on a separate subvolume. I assumed this
would make several things related to snapshots more efficient too.

As far as the Firefox profile being declared NOCOW, as soon as we take
the first snapshot, I understand that it will become COW again. So I
don't see any point in making it NOCOW.
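
(A sketch of the two mechanisms being discussed, with example paths. Since
nodatacow as a mount option applies to the whole filesystem rather than to a
single subvolume, the per-directory attribute is the usual way to get NOCOW
on just the cache:)

    btrfs subvolume create /home/user/.cache    # keeps cache out of snapshots of the home subvolume
    chattr +C /home/user/.cache                 # new files created here inherit NOCOW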

Thanks for your reply. I appreciate any other feedback or suggestions.

Background: I'm not sure why our Firefox performance is so terrible
but here's my original post from Sept 20. (I could repost the earlier
replies too if needed.) I've been waiting to have a window of
opportunity to try to fix our Firefox performance again, and now I
have that chance.

>On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>> When I do a
>> btrfs filesystem defragment -r /directory
>> does it defragment really all files in this directory tree, even if it
>> contains subvolumes?
>> The man page does not mention subvolumes on this topic.
>
>No answer so far :-(
>
>But I found another problem in the man-page:
>
>  Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>  with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>  will break up the ref-links of COW data (for example files copied with
>  cp --reflink, snapshots or de-duplicated data). This may cause
>  considerable increase of space usage depending on the broken up
>  ref-links.
>
>I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
>snapshots.
>Therefore, I better should avoid calling "btrfs filesystem defragment -r"?
>
>What is the defragmenting best practice?
>Avoid it completly?

My question is the same as the OP in this thread, so I came here to
read the answers before asking.

Based on the answers here, it sounds like I should not run defrag at
all. However, I have a performance problem I need to solve, so if I
don't defrag, I need to do something else.

Here's my scenario. Some months ago I built an over-the-top powerful
desktop computer / workstation and I was looking forward to really
fantastic performance improvements over my 6 year old Ubuntu machine.
I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
shock, it was no faster than my old machine. I focused a lot on
Firefox performance because I use Firefox a lot and that was one of
the applications in which I was most looking forward to better
performance.

I tried everything I could think of and everything recommended to me
in various forums (except switching to Windows) and the performance
remained very disappointing.

Then today I read the following:

Gotchas - btrfs Wiki

Re: defragmenting best practice?

2017-10-31 Thread Dave
I'm following up on all the suggestions regarding Firefox performance
on BTRFS. I have time to make these changes now, but I am having
trouble figuring out what to do. The constraints are:

1. BTRFS snapshots have proven to be too useful (and too important to
our overall IT approach) to forego.
2. We do not see any practical alternative (for us) to the incremental
backup strategy
(https://btrfs.wiki.kernel.org/index.php/Incremental_Backup)
3. We have large amounts of storage space (and can add more), but not
enough to break all reflinks on all snapshots.
4. We can transfer snapshots to backup storage (and thereby retain
minimal snapshots on the live volume)
5. Our team is standardized on Firefox. (Switching to Chromium is not
an option for us.)
6. Firefox profile sync has not worked well for us in the past, so we
don't use it.
7. Our machines generally have plenty of RAM so we could put the
Firefox cache (and maybe profile) into RAM using a technique such as
https://wiki.archlinux.org/index.php/Firefox/Profile_on_RAM. However,
profile persistence is important.

The most common recommendations were to switch to Chromium, defragment
and don't use snapshots. As the constraints above illustrate, we
cannot do those things.

The tentative solution I have come up with is:

1. Continue using snapshots, but retain the minimal number possible on
the live volume. Move historical snapshots to a backup device using
btrfs send-receive.
(https://btrfs.wiki.kernel.org/index.php/Incremental_Backup)

2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted
nocow -- it will NOT be snapshotted

3. Put most of $HOME on a "home" volume but separate all user
documents to another volume (i.e., "documents").

3.a. The "home" volume will retain only the one most recent snapshot
on that live volume. (More backup history will be retained on a backup
volume. ) This home volume can be defragmented. With one snapshot,
that will double our space usage, which is acceptable.

3.b. The documents volume will be snapshotted hourly and 36 hourly
snapshots plus daily, weekly and monthly snapshots retained. Therefore
it will NOT be defragmented, as that would not be practical or
space-wise possible.

3.c. The root volume (operating system, etc.) will follow a strategy
similar to home, but will also retain pre- and post- update snapshots.

4. Put the Firefox cache in RAM (see the tmpfs sketch after this list)

5. If needed, consider putting the Firefox profile in RAM

6. Make sure Firefox is running in multi-process mode. (Duncan's
instructions, while greatly appreciated and very useful, left me
slightly confused about pulseaudio's compatibility with multi-process
mode.)

7. Check various Firefox performance tweaks such as these:
https://wiki.archlinux.org/index.php/Firefox/Tweaks
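
(A sketch of item 4, assuming a tmpfs mount and invented size/uid values;
the cache is rebuilt from scratch after every reboot:)

    # /etc/fstab
    tmpfs  /home/user/.cache/mozilla  tmpfs  noatime,size=512m,uid=1000,gid=1000,mode=0700  0  0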

Can anyone guess whether this will be sufficient to solve our severe
performance problems? Do these steps make sense? Will any of these
steps lead to new problems? Should I proceed to give them a try? Or
can anyone suggest a better set of steps to test?

Notes:

In regard to snapshots, we must retain about 36 hourly snapshots of
user documents, for example. We have to have pre- and post- package
upgrade snapshots from at least the most recent operating system &
application package update. And we have to retain several daily,
weekly and monthly snapshots of system directories and some other
locations.) Most of these snapshots can be retained on backup storage
devices.

Regarding Firefox profile sync, it does not have an intelligent method
for resolving conflicts, for example. We found too many unexpected
changes when using sync, so we do not use it.

On Thu, Sep 21, 2017 at 7:09 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Dave posted on Wed, 20 Sep 2017 02:38:13 -0400 as excerpted:
>
>> Here's my scenario. Some months ago I built an over-the-top powerful
>> desktop computer / workstation and I was looking forward to really
>> fantastic performance improvements over my 6 year old Ubuntu machine. I
>> installed Arch Linux on BTRFS on the new computer (on an SSD). To my
>> shock, it was no faster than my old machine. I focused a lot on Firefox
>> performance because I use Firefox a lot and that was one of the
>> applications in which I was most looking forward to better performance.
>>
>> I tried everything I could think of and everything recommended to me in
>> various forums (except switching to Windows) and the performance
>> remained very disappointing.
>>
>> Then today I read the following:
>>
>> Gotchas - btrfs Wiki https://btrfs.wiki.kernel.org/index.php/Gotchas
>>
>> Fragmentation: Files with a lot of random writes can become
>> heavily fragmented (1+ extents) causing excessive multi-second
>> spikes of CPU load on systems with an SSD or large amount of RAM. On
>> desktops this primarily affects application databases (i

Re: Problem with file system

2017-10-29 Thread Dave
This is a very helpful thread. I want to share an interesting related story.

We have a machine with 4 btrfs volumes and 4 Snapper configs. I
recently discovered that Snapper timeline cleanup been turned off for
3 of those volumes. In the Snapper configs I found this setting:

TIMELINE_CLEANUP="no"

Normally that would be set to "yes". So I corrected the issue and set
it to "yes" for the 3 volumes where it had not been set correctly.

I suppose it was turned off temporarily and then somebody forgot to
turn it back on.
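
(For context, the timeline-cleanup knobs in a snapper config look roughly
like this; the limits shown here are invented:)

    # /etc/snapper/configs/root (excerpt)
    TIMELINE_CREATE="yes"
    TIMELINE_CLEANUP="yes"
    TIMELINE_LIMIT_HOURLY="10"
    TIMELINE_LIMIT_DAILY="7"
    TIMELINE_LIMIT_MONTHLY="3"
    TIMELINE_LIMIT_YEARLY="0"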

What I did not know, and what I did not realize was a critical piece
of information, was how long timeline cleanup had been turned off and
how many snapshots had accumulated on each volume in that time.

I naively re-enabled Snapper timeline cleanup. The instant I started
the  snapper-cleanup.service  the system was hosed. The ssh session
became unresponsive, no other ssh sessions could be established and it
was impossible to log into the system at the console.

My subsequent investigation showed that the root filesystem volume
accumulated more than 3000 btrfs snapshots. The two other affected
volumes also had very large numbers of snapshots.

Deleting a single snapshot in that situation would likely require
hours. (I set up a test, but I ran out of patience before I was able
to delete even a single snapshot.) My guess is that if we had been
patient enough to wait for all the snapshots to be deleted, the
process would have finished in some number of months (or maybe a
year).

We did not know most of this at the time, so we did what we usually do
when a system becomes totally unresponsive -- we did a hard reset. Of
course, we could never get the system to boot up again.

Since we had backups, the easiest option became to replace that system
-- not unlike what the OP decided to do. In our case, the hardware was
not old, so we simply reformatted the drives and reinstalled Linux.

That's a drastic consequence of changing TIMELINE_CLEANUP="no" to
TIMELINE_CLEANUP="yes" in the snapper config.

It's all part of the process of gaining critical experience with
BTRFS. Whether or not BTRFS is ready for production use is (it seems
to me) mostly a question of how knowledgeable and experienced are the
people administering it.

In the various online discussions on this topic, all the focus is on
whether or not BTRFS itself is production-ready. At the current
maturity level of BTRFS, I think that's the wrong focus. The right
focus is on how production-ready is the admin person or team (with
respect to their BTRFS knowledge and experience). When a filesystem
has been around for decades, most of the critical admin issues become
fairly common knowledge, fairly widely known and easy to find. When a
filesystem is newer, far fewer people understand the gotchas. Also, in
older or widely used filesystems, when someone hits a gotcha, the
response isn't "that filesystem is not ready for production". Instead
the response is, "you should have known not to do that."

On Wed, Apr 26, 2017 at 12:43 PM, Fred Van Andel  wrote:
> Yes I was running qgroups.
> Yes the filesystem is highly fragmented.
> Yes I have way too many snapshots.
>
> I think it's clear that the problem is on my end. I simply placed too
> many demands on the filesystem without fully understanding the
> implications.  Now I have to deal with the consequences.
>
> It was decided today to replace this computer due to its age.  I will
> use the recover command to pull the needed data off this system and
> onto the new one.
>
>
> Thank you everyone for your assistance and the education.
>
> Fred


Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework

2017-10-09 Thread Dave Chinner
On Mon, Oct 09, 2017 at 09:00:51AM -0400, Josef Bacik wrote:
> On Mon, Oct 09, 2017 at 04:17:31PM +1100, Dave Chinner wrote:
> > On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote:
> > > > Integrating into fstests means it will be immediately available to
> > > > all fs developers, it'll run on everything that everyone already has
> > > > setup for filesystem testing, and it will have familiar mkfs/mount
> > > > option setup behaviour so there's no new hoops for everyone to jump
> > > > through to run it...
> > > > 
> > > 
> > > TBF I specifically made it as easy as possible because I know we all hate 
> > > trying
> > > to learn new shit.
> > 
> > Yeah, it's also hard to get people to change their workflows to add
> > a whole new test harness into them. It's easy if it's just a new
> > command to an existing workflow :P
> > 
> 
> Agreed, so if you probably won't run this outside of fstests then I'll add it 
> to
> xfstests.  I envision this tool as being run by maintainers to verify their 
> pull
> requests haven't regressed since the last set of patches, as well as by 
> anybody
> trying to fix performance problems.  So it's way more important to me that 
> you,
> Ted, and all the various btrfs maintainers will run it than anybody else.
> 
> > > I figured this was different enough to warrant a separate
> > > project, especially since I'm going to add block device jobs so Jens can 
> > > test
> > > block layer things.  If we all agree we'd rather see this in fstests then 
> > > I'm
> > > happy to do that too.  Thanks,
> > 
> > I'm not fussed either way - it's a good discussion to have, though.
> > 
> > If I want to add tests (e.g. my time-honoured fsmark tests), where
> > should I send patches?
> > 
> 
> I beat you to that!  I wanted to avoid adding fs_mark to the suite because it
> means parsing another different set of outputs, so I added a new ioengine to 
> fio
> for this
> 
> http://www.spinics.net/lists/fio/msg06367.html
> 
> and added a fio job to do 500k files
> 
> https://github.com/josefbacik/fsperf/blob/master/tests/500kemptyfiles.fio
> 
> The test is disabled by default for now because obviously the fio support 
> hasn't
> landed yet.

That seems  misguided. fio is good, but it's not a universal
solution.

> I'd _like_ to expand fio for cases we come up with that aren't possible, as
> there's already a ton of measurements that are taken, especially around
> latencies.

To be properly useful it needs to support more than just fio to run
tests. Indeed, it's largely useless to me if that's all it can do or
it's a major pain to add support for different tools like fsmark.

e.g. my typical perf regression test that you see as the concurrent
fsmark create workload is actually a lot more. It does:

fsmark to create 50m zero length files
umount,
run parallel xfs_repair (excellent mmap_sem/page fault punisher)
mount
run parallel find -ctime (readdir + lookup traversal)
unmount, mount
run parallel ls -R (readdir + dtype traversal)
unmount, mount
parallel rm -rf of 50m files

I have variants that use small 4k files or large files rather than
empty files, that use different fsync patterns to stress the
log, use grep -R to traverse the data as well as
the directory/inode structure instead of find, etc.
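
(Purely as an illustration of that sequence, not the actual script; the
device, mount point and fs_mark parameters here are invented:)

    fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \
        -d /mnt/scratch/0 -d /mnt/scratch/1      # zero-length files, one worker per -d
    umount /mnt/scratch
    xfs_repair /dev/scratchdev                   # multi-threaded repair pass
    mount /dev/scratchdev /mnt/scratch
    find /mnt/scratch -ctime +1 > /dev/null      # readdir + lookup traversal
    umount /mnt/scratch && mount /dev/scratchdev /mnt/scratch
    ls -R /mnt/scratch > /dev/null               # readdir + d_type traversal
    umount /mnt/scratch && mount /dev/scratchdev /mnt/scratch
    rm -rf /mnt/scratch/*                        # the real runs do this in parallel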

> That said I'm not opposed to throwing new stuff in there, it just
> means we have to add stuff to parse the output and store it in the database 
> in a
> consistent way, which seems like more of a pain than just making fio do what 
> we
> need it to.  Thanks,

fio is not going to be able to replace the sort of perf tests I run
from week to week. If that's all it's going to do then it's not
directly useful to me...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework

2017-10-08 Thread Dave Chinner
On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote:
> On Mon, Oct 09, 2017 at 11:51:37AM +1100, Dave Chinner wrote:
> > On Fri, Oct 06, 2017 at 05:09:57PM -0400, Josef Bacik wrote:
> > > Hello,
> > > 
> > > One thing that comes up a lot every LSF is the fact that we have no 
> > > general way
> > > that we do performance testing.  Every fs developer has a set of scripts 
> > > or
> > > things that they run with varying degrees of consistency, but nothing 
> > > central
> > > that we all use.  I for one am getting tired of finding regressions when 
> > > we are
> > > deploying new kernels internally, so I wired this thing up to try and 
> > > address
> > > this need.
> > > 
> > > We all hate convoluted setups, the more brain power we have to put in to 
> > > setting
> > > something up the less likely we are to use it, so I took the xfstests 
> > > approach
> > > of making it relatively simple to get running and relatively easy to add 
> > > new
> > > tests.  For right now the only thing this framework does is run fio 
> > > scripts.  I
> > > chose fio because it already gathers loads of performance data about it's 
> > > runs.
> > > We have everything we need there, latency, bandwidth, cpu time, and all 
> > > broken
> > > down by reads, writes, and trims.  I figure most of us are familiar 
> > > enough with
> > > fio and how it works to make it relatively easy to add new tests to the
> > > framework.
> > > 
> > > I've posted my code up on github, you can get it here
> > > 
> > > https://github.com/josefbacik/fsperf
> > > 
> > > All (well most) of the results from fio are stored in a local sqlite 
> > > database.
> > > Right now the comparison stuff is very crude, it simply checks against the
> > > previous run and it only checks a few of the keys by default.  You can 
> > > check
> > > latency if you want, but while writing this stuff up it seemed that 
> > > latency was
> > > too variable from run to run to be useful in a "did my thing regress or 
> > > improve"
> > > sort of way.
> > > 
> > > The configuration is brain dead simple, the README has examples.  All you 
> > > need
> > > to do is make your local.cfg, run ./setup and then run ./fsperf and you 
> > > are good
> > > to go.
> > 
> > Why re-invent the test infrastructure? Why not just make it a
> > tests/perf subdir in fstests?
> > 
> 
> Probably should have led with that shouldn't I have?  There's nothing keeping 
> me
> from doing it, but I didn't want to try and shoehorn in a python thing into
> fstests.  I need python to do the sqlite and the json parsing to dump into the
> sqlite database.
> 
> Now if you (and others) are not opposed to this being dropped into tests/perf
> then I'll work that up.  But it's definitely going to need to be done in 
> python.
> I know you yourself have said you aren't opposed to using python in the past, 
> so
> if that's still the case then I can definitely wire it all up.

I have no problems with people using python for stuff like this but,
OTOH, I'm not the fstests maintainer anymore :P

> > > The plan is to add lots of workloads as we discover regressions and such. 
> > >  We
> > > don't want anything that takes too long to run otherwise people won't run 
> > > this,
> > > so the existing tests don't take much longer than a few minutes each.  I 
> > > will be
> > > adding some more comparison options so you can compare against averages 
> > > of all
> > > previous runs and such.
> > 
> > Yup, that fits exactly into what fstests is for... :P
> > 
> > Integrating into fstests means it will be immediately available to
> > all fs developers, it'll run on everything that everyone already has
> > setup for filesystem testing, and it will have familiar mkfs/mount
> > option setup behaviour so there's no new hoops for everyone to jump
> > through to run it...
> > 
> 
> TBF I specifically made it as easy as possible because I know we all hate 
> trying
> to learn new shit.

Yeah, it's also hard to get people to change their workflows to add
a whole new test harness into them. It's easy if it's just a new
command to an existing workflow :P

> I figured this was different enough to warrant a separate
> project, especially since I'm going to add block device jobs so Jens can test
> block layer things.  If we all agree we'd rather see this in fstests then I'm
> happy to do that too.  Thanks,

I'm not fussed either way - it's a good discussion to have, though.

If I want to add tests (e.g. my time-honoured fsmark tests), where
should I send patches?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework

2017-10-08 Thread Dave Chinner
On Fri, Oct 06, 2017 at 05:09:57PM -0400, Josef Bacik wrote:
> Hello,
> 
> One thing that comes up a lot every LSF is the fact that we have no general 
> way
> that we do performance testing.  Every fs developer has a set of scripts or
> things that they run with varying degrees of consistency, but nothing central
> that we all use.  I for one am getting tired of finding regressions when we 
> are
> deploying new kernels internally, so I wired this thing up to try and address
> this need.
> 
> We all hate convoluted setups, the more brain power we have to put in to 
> setting
> something up the less likely we are to use it, so I took the xfstests approach
> of making it relatively simple to get running and relatively easy to add new
> tests.  For right now the only thing this framework does is run fio scripts.  
> I
> chose fio because it already gathers loads of performance data about it's 
> runs.
> We have everything we need there, latency, bandwidth, cpu time, and all broken
> down by reads, writes, and trims.  I figure most of us are familiar enough 
> with
> fio and how it works to make it relatively easy to add new tests to the
> framework.
> 
> I've posted my code up on github, you can get it here
> 
> https://github.com/josefbacik/fsperf
> 
> All (well most) of the results from fio are stored in a local sqlite database.
> Right now the comparison stuff is very crude, it simply checks against the
> previous run and it only checks a few of the keys by default.  You can check
> latency if you want, but while writing this stuff up it seemed that latency 
> was
> too variable from run to run to be useful in a "did my thing regress or 
> improve"
> sort of way.
> 
> The configuration is brain dead simple, the README has examples.  All you need
> to do is make your local.cfg, run ./setup and then run ./fsperf and you are 
> good
> to go.

Why re-invent the test infrastructure? Why not just make it a
tests/perf subdir in fstests?

> The plan is to add lots of workloads as we discover regressions and such.  We
> don't want anything that takes too long to run otherwise people won't run 
> this,
> so the existing tests don't take much longer than a few minutes each.  I will 
> be
> adding some more comparison options so you can compare against averages of all
> previous runs and such.

Yup, that fits exactly into what fstests is for... :P

Integrating into fstests means it will be immediately available to
all fs developers, it'll run on everything that everyone already has
setup for filesystem testing, and it will have familiar mkfs/mount
option setup behaviour so there's no new hoops for everyone to jump
through to run it...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Something like ZFS Channel Programs for BTRFS & probably XFS or even VFS?

2017-10-03 Thread Dave Chinner
On Tue, Oct 03, 2017 at 01:40:51PM -0700, Matthew Wilcox wrote:
> On Wed, Oct 04, 2017 at 07:10:35AM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 03:19:18PM +0200, Martin Steigerwald wrote:
> > > [repost. I didn´t notice autocompletion gave me wrong address for 
> > > fsdevel, 
> > > blacklisted now]
> > > 
> > > Hello.
> > > 
> > > What do you think of
> > > 
> > > http://open-zfs.org/wiki/Projects/ZFS_Channel_Programs
> > 
> > Domain not found.
> 
> Must be an Australian problem ...

Probably, I forgot to stand on my head so everything must have been
sent to the server upside down

Though it is a curious failure - it failed until I went to
"openzfs.org" and that redirected to "open-zfs.org" and now it all
works. Somewhat bizarre.

> A ZFS channel program (ZCP) is a small script written in a domain specific
> language that manipulate ZFS internals in a single, atomically-visible
> operation.  For instance, to delete all snapshots of a filesystem a ZCP
> could be written which 1) generates the list of snapshots, 2) traverses
> that list, and 3) destroys each snapshot unconditionally. Because
> each of these statements would be evaluated from within the kernel,
> ZCPs can guarantee safety from interference with other concurrent ZFS
> modifications. Executing from inside the kernel allows us to guarantee
> atomic visibility of these operations (correctness) and allows them to
> be performed in a single transaction group (performance).
>
> A successful implementation of ZCP will:
> 
> 1. Support equivalent functionality for all of the current ZFS commands
> with improved performance and correctness from the point of view of the
> user of ZFS.
> 
> 2. Facilitate the quick addition of new and useful commands as
> ZCP enables the implementation of more powerful operations which
> previously would have been unsafe to implement in user programs, or
> would require modifications to the kernel for correctness. Since the
> ZCP layer guarantees the atomicity of each ZCP, we only need to write
> new sync_tasks for individual simple operations, then can use ZCPs to
> chain those simple operations together into more complicated operations.
> 
> 3. Allow ZFS users to safely implement their own ZFS operations without
> performing operations they don’t have the privileges for.
> 
> 4. Improve the performance and correctness of existing applications
> built on ZFS operations.

/me goes and looks at the slides

Seems like they are trying to solve a problem of their own making,
in that admin operations are run by the kernel from a separate task
that is really, really slow. So this scripting is a method of aggregating
multiple "sync tasks" into a single operation so there isn't delays
between tasks.

/me chokes on slide 8/8

"Add a Lua interpreter to the kernel, implement ZFS intrinsics (...)
as extensions to the Lua language"

Somehow, I don't see that happening in Linux.

Yes, I can see us potentially adding some custom functionality in
filesystems with eBPF (e.g. custom allocation policies), but I think
admin operations need to be done from userspace through a clear,
stable interface that supports all the necessary primitives to
customise admin operations for different needs.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Something like ZFS Channel Programs for BTRFS & probably XFS or even VFS?

2017-10-03 Thread Dave Chinner
On Tue, Oct 03, 2017 at 03:19:18PM +0200, Martin Steigerwald wrote:
> [repost. I didn´t notice autocompletion gave me wrong address for fsdevel, 
> blacklisted now]
> 
> Hello.
> 
> What do you think of
> 
> http://open-zfs.org/wiki/Projects/ZFS_Channel_Programs

Domain not found.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: defragmenting best practice?

2017-09-21 Thread Dave
These are great suggestions. I will test several of them (or all of
them) and report back with my results once I have done the testing.
Thank you! This is a fantastic mailing list.

P.S. I'm inclined to stay with Firefox, but I will definitely test
Chromium vs Firefox after making a series of changes based on the
suggestions here. I would hate to see the market lose the option of
Firefox because everyone goes to Chrome/Chromium.


Re: [PATCH] fstests: btrfs/150 regression test for reading compressed data

2017-09-20 Thread Dave Chinner
On Wed, Sep 20, 2017 at 05:52:43PM -0600, Liu Bo wrote:
> We had a bug in btrfs compression code which could end up with a
> kernel panic.
> 
> This is adding a regression test for the bug and I've also sent a
> kernel patch to fix the bug.
> 
> The patch is "Btrfs: fix kernel oops while reading compressed data".
> 
> Signed-off-by: Liu Bo <bo.li@oracle.com>
> ---
>  tests/btrfs/150 | 102 
> 
>  tests/btrfs/150.out |   3 ++
>  tests/btrfs/group   |   1 +
>  3 files changed, 106 insertions(+)
>  create mode 100755 tests/btrfs/150
>  create mode 100644 tests/btrfs/150.out
> 
> diff --git a/tests/btrfs/150 b/tests/btrfs/150
> new file mode 100755
> index 000..834be51
> --- /dev/null
> +++ b/tests/btrfs/150
> @@ -0,0 +1,102 @@
> +#! /bin/bash
> +# FS QA Test btrfs/150
> +#
> +# This is a regression test which ends up with a kernel oops in btrfs.

group += dangerous

> +# It occurs when btrfs's read repair happens while reading a compressed
> +# extent.
> +# The patch for this is 
> +# x

Incomplete?

> +#
> +#---
> +# Copyright (c) 2017 Liu Bo.  All Rights Reserved.

You're signing off this patch an Oracle employee, but claiming
personal copyright. Please clarify who owns the copyright - if it's
your personal copyright then please sign off with a personal email
address, not your employer's...

Also, I note that these recently added tests from you:

tests/btrfs/140:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
tests/btrfs/141:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
tests/btrfs/142:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
tests/btrfs/143:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
tests/generic/406:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.

all have this same ambiguity - personal copyright with employer
signoff in the commit. This definitely needs clarification and
fixing if it is wrong


> +disable_io_failure()
> +{
> +echo 0 > $SYSFS_BDEV/make-it-fail
> +echo 0 > $DEBUGFS_MNT/fail_make_request/probability
> +echo 0 > $DEBUGFS_MNT/fail_make_request/times
> +}
> +
> +_scratch_pool_mkfs "-d raid1 -b 1G" >> $seqres.full 2>&1
> +
> +# It doesn't matter which compression algorithm we use.
> +_scratch_mount -ocompress
> +
> +# Create a file with all data being compressed
> +$XFS_IO_PROG -f -c "pwrite -W 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

needs an fsync to reach disk.

> +# Raid1 consists of two copies and btrfs decides which copy to read by 
> reader's
> +# %pid.  Now we inject errors to copy #1 and copy #0 is good.  We want to 
> read
> +# the bad copy to trigger read-repair.
> +while true; do
> + disable_io_failure
> + # invalidate the page cache
> + $XFS_IO_PROG -f -c "fadvise -d 0 128K" $SCRATCH_MNT/foobar | 
> _filter_xfs_io
> +
> + enable_io_failure
> + od -x $SCRATCH_MNT/foobar > /dev/null &

why are you using od to read the data when the output is piped to
dev/null? why not just xfs_io -c "pread 0 8k" ?
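
i.e. something along these lines (a sketch, not taken from the patch):

    $XFS_IO_PROG -c "pread 0 8K" $SCRATCH_MNT/foobar > /dev/null &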

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: ERROR: parent determination failed (btrfs send-receive)

2017-09-20 Thread Dave
On Tue, Sep 19, 2017 at 3:37 PM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
> 18.09.2017 09:10, Dave wrote:
>> I use snap-sync to create and send snapshots.
>>
>> GitHub - wesbarnett/snap-sync: Use snapper snapshots to backup to external 
>> drive
>> https://github.com/wesbarnett/snap-sync
>>
>
> Are you trying to backup top-level subvolume?

No. I never mount the top-level subvolume.

> I just reproduced this behavior with this tool. The problem is, snapshots
> of top-level
> subvolume do not have parent UUID (I am not even sure if UUID exists at
> all TBH). If you mount any other subvolume, it will work.

Well, that's not exactly my issue but it still helps me gain
additional understanding.  It may help me reproduce my original
problem too. Thank you.

>
>
> As I told you in the first reply, showing output of "btrfs su li -qu
> /path/to/src" would explain your problem much earlier.
>
> Actually if snap-sync used "btrfs send -p" instead of "btrfs send -c" it
> would work as well, as then no parent search would be needed (and as I
> mentioned in another mail both commands are functionally equivalent).
> But this becomes really off-topic on this list. As already suggested,
> open issue for snap-sync.

Thank you for this also. I will simply edit my current copy of the
snap-sync script to use "-p" now. That's a simple change I can
implement immediately.


Re: defragmenting best practice?

2017-09-20 Thread Dave
>On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>> When I do a
>> btrfs filesystem defragment -r /directory
>> does it defragment really all files in this directory tree, even if it
>> contains subvolumes?
>> The man page does not mention subvolumes on this topic.
>
>No answer so far :-(
>
>But I found another problem in the man-page:
>
>  Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>  with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>  will break up the ref-links of COW data (for example files copied with
>  cp --reflink, snapshots or de-duplicated data). This may cause
>  considerable increase of space usage depending on the broken up
>  ref-links.
>
>I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
>snapshots.
>Therefore, I better should avoid calling "btrfs filesystem defragment -r"?
>
>What is the defragmenting best practice?
>Avoid it completly?

My question is the same as the OP in this thread, so I came here to
read the answers before asking. However, it turns out that I still
need to ask something. Should I ask here or start a new thread? (I'll
assume here, since the topic is the same.)

Based on the answers here, it sounds like I should not run defrag at
all. However, I have a performance problem I need to solve, so if I
don't defrag, I need to do something else.

Here's my scenario. Some months ago I built an over-the-top powerful
desktop computer / workstation and I was looking forward to really
fantastic performance improvements over my 6 year old Ubuntu machine.
I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
shock, it was no faster than my old machine. I focused a lot on
Firefox performance because I use Firefox a lot and that was one of
the applications in which I was most looking forward to better
performance.

I tried everything I could think of and everything recommended to me
in various forums (except switching to Windows) and the performance
remained very disappointing.

Then today I read the following:

Gotchas - btrfs Wiki
https://btrfs.wiki.kernel.org/index.php/Gotchas

Fragmentation: Files with a lot of random writes can become
heavily fragmented (1+ extents) causing excessive multi-second
spikes of CPU load on systems with an SSD or large amount of RAM. On
desktops this primarily affects application databases (including
Firefox). Workarounds include manually defragmenting your home
directory using btrfs fi defragment. Auto-defragment (mount option
autodefrag) should solve this problem.

Upon reading that I am wondering if fragmentation in the Firefox
profile is part of my issue. That's one thing I never tested
previously. (BTW, this system has 256 GB of RAM and 20 cores.)

Furthermore, on the same BTRFS Wiki page, it mentions the performance
penalties of many snapshots. I am keeping 30 to 50 snapshots of the
volume that contains the Firefox profile.

Would these two things be enough to turn top-of-the-line hardware into
a mediocre-preforming desktop system? (The system performs fine on
benchmarks -- it's real life usage, particularly with Firefox where it
is disappointing.)

After reading the info here, I am wondering if I should make a new
subvolume just for my Firefox profile(s) and not use COW and/or not
keep snapshots on it and mount it with the autodefrag option.

As part of this strategy, I could send snapshots to another disk using
btrfs send-receive. That way I would have the benefits of snapshots
(which are important to me), but by not keeping any snapshots on the
live subvolume I could avoid the performance problems.

What would you guys do in this situation?


Re: snapshots of encrypted directories?

2017-09-19 Thread Dave
On Fri, Sep 15, 2017 at 6:01 AM, Ulli Horlacher
 wrote:
>
> On Fri 2017-09-15 (06:45), Andrei Borzenkov wrote:
>
> > The actual question is - do you need to mount each individual btrfs
> > subvolume when using encfs?
>
> And even worse it goes with ecryptfs: I do not know at all how to mount a
> snapshot, so that the user has access to it.
>
> It seems snapshots are incompatible with encrypted filesystems :-(


My experience is the opposite. I use dm-crypt as well as encfs with
BTRFS and everything, including snapshots, works as I would expect it
to work.

I have been able to successfully restore snapshots that contained
encrypted data.

I think the other answers have already provided more details than I
could provide, so I just wanted to add the fact that my experience has
been positive with BTRFS snapshots and encryption.


Re: ERROR: parent determination failed (btrfs send-receive)

2017-09-19 Thread Dave
On Mon, Sep 18, 2017 at 12:23 AM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
>
> 18.09.2017 05:31, Dave wrote:
> > Sometimes when using btrfs send-receive, I get errors like this:
> >
> > ERROR: parent determination failed for 
> >
> > When this happens, btrfs send-receive backups fail. And all subsequent
> > backups fail too.
> >
> > The issue seems to stem from the fact that an automated cleanup
> > process removes certain earlier subvolumes. (I'm using Snapper.)
> >
> > I'd like to understand exactly what is happening so that my backups do
> > not unexpectedly fail.
> >
>
> You try to send incremental changes but you deleted subvolume to compute
> changes against. It is hard to tell more without seeing subvolume list
> with uuid/parent uuid.
>

Thanks to Duncan <1i5t5.dun...@cox.net> I have a bit better
understanding of -c and -p now.

My backup tool is using only the -c option. (The tool is GitHub -
wesbarnett/snap-sync https://github.com/wesbarnett/snap-sync .)

No subvolume at the destination had ever been deleted.

At the source, a number (about 30 in this case) preceding snapshots
(subvolumes) exist, but some others have been cleaned up with
Snapper's timeline algorithm.

I think I understand that with the -c option, it is **not** strictly
necessary that the specified snapshots exist on both the source and
destination. It seems I had sufficient subvolumes available for the
incremental send to work with the -c option.

Therefore, it still isn't clear why I got the error: ERROR: parent
determination failed.

Further general comments will be helpful.

I understand the limits in getting too specific in replies because I
cannot give subvolume lists with uuid's.

As mentioned, I cannot give that info because I tried to fix the above
error by sending a subvolume from the destination back to the target.
This resulted in my source volume having a "Received UUID" which
wrecked all subsequent backups.

I now understand that (for my use cases) a source subvolume for a
send-receive should **never** have a Received UUID. (If that is indeed
a general rule, it seems btrfs tools should check it. Or possibly
something about this the pitfalls of a "Received UUID" in a source
could be listed on the BTRFS incremental backup wiki page?)
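
(A quick way to check for this before sending; a sketch, and the path is
only an example:)

    btrfs subvolume show /path/to/source-snapshot | grep -i 'received uuid'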

I was previously referred to the FAQ here:
https://github.com/digint/btrbk/blob/master/doc/FAQ.md#im-getting-an-error-aborted-received-uuid-is-set

But in the end, I decided the safest strategy was to delete all prior
backup subvolumes so I could be sure my backups were valid going
forward.

I am asking these questions now to avoid getting into a situation like
this again (hopefully). Any general comments are appreciated.


difference between -c and -p for send-receive?

2017-09-18 Thread Dave
new subject for new question

On Mon, Sep 18, 2017 at 1:37 PM, Andrei Borzenkov  wrote:

> >> What scenarios can lead to "ERROR: parent determination failed"?
> >
> > The man page for btrfs-send is reasonably clear on the requirements
> > btrfs imposes. If you want to use incremental sends (i.e. the -c or -p
> > options) then the specified snapshots must exist on both the source and
> > destination. If you don't have a suitable existing snapshot then don't
> > use -c or -p and just do a full send.
> >
>
> Well, I do not immediately see why -c must imply incremental send. We
> want to reduce amount of data that is transferred, so reuse data from
> existing snapshots, but it is really orthogonal to whether we send full
> subvolume or just changes since another snapshot.
>

Starting months ago when I began using btrfs seriously, I have been
reading, rereading and trying to understand this:

FAQ - btrfs Wiki
https://btrfs.wiki.kernel.org/index.php/FAQ#What_is_the_difference_between_-c_and_-p_in_send.3F

The comment above suddenly gives me another clue...

However, I still don't understand terms like "clone range ioctl",
although I can guess it is something like a hard link.

Would it be correct to say the following?

1. "-c" causes (appropriate) files in the newly transferred snapshot
to be "hard linked" to existing files in another snapshot on the
destination. Doesn't "-p" do something equivalent though?

2. The -c and -p options can be used together or individually.
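
(For concreteness, the command forms being compared; the snapshot names
follow the wiki's @A.1/@B.1 example and the paths are invented:)

    btrfs send -p /snaps/@A.1 /snaps/@B.1 | btrfs receive /backup   # incremental, @A.1 is the explicit base
    btrfs send -c /snaps/@A.1 /snaps/@B.1 | btrfs receive /backup   # send picks a base from the -c snapshots
    btrfs send -p /snaps/@A.1 -c /snaps/@other /snaps/@B.1 | btrfs receive /backup   # combined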

Questions:

If "-c" "will send all of the metadata of @B.1, but will leave out the
data for @B.1/bigfile, because it's already in the backups filesystem,
and can be reflinked from there" what will -p do in contrast?

Will "-p" not send all the metadata?

Will "-p" also leave out the data for @B.1/bigfile, when it's also
already in the backups?

What would make me choose one of these options over the other? I still
struggle to see the difference.


Re: ERROR: parent determination failed (btrfs send-receive)

2017-09-18 Thread Dave
On Mon, Sep 18, 2017 at 12:23 AM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
> 18.09.2017 05:31, Dave wrote:
>> Sometimes when using btrfs send-receive, I get errors like this:
>>
>> ERROR: parent determination failed for 
>>
>> When this happens, btrfs send-receive backups fail. And all subsequent
>> backups fail too.
>>
>> The issue seems to stem from the fact that an automated cleanup
>> process removes certain earlier subvolumes. (I'm using Snapper.)
>>
>> I'd like to understand exactly what is happening so that my backups do
>> not unexpectedly fail.
>>
>
> You try to send incremental changes but you deleted subvolume to compute
> changes against. It is hard to tell more without seeing subvolume list
> with uuid/parent uuid.

I do not have a current subvolume list to provide UUID's. To ensure
integrity of my backups, I was forced to delete all backup snapshots
and start over. (After this initial parent determination error, my
attempted solution created a new problem related to Received UUID's,
so removing all backups was the best solution in the end.)

For my understanding, what are the restrictions on deleting snapshots?

What scenarios can lead to "ERROR: parent determination failed"?

After all, it should not be necessary to retain every snapshot ever
made. We have to delete snapshots periodically.

In my case, I still retained every snapshot on the target volume. None
had ever been deleted (yet). And I also retained around 30 recent
snapshots on the source, according to the snapper timeline cleanup
config. Yet I still ran into "ERROR: parent determination failed".

>
>> In my scenario, no parent subvolumes have been deleted from the
>> target. Some subvolumes have been deleted from the source, but why
>> does that matter? I am able to take a valid snapshot at this time and
>> every snapshot ever taken continues to reside at the target backup
>> destination (seemingly meaning that a parent subvolume can be found at
>> the target).
>>
>> This issue seems to make btrfs send-receive a very fragile backup
>> solution.
>
> btrfs send/receive is not backup solution - it is low level tool that
> does exactly what it is told to do. You may create backup solution that
> is using btrfs send/receive to transfer data stream, but then do not
> blame tool for incorrect usage.
>
> To give better advice how to fix your situation you need to describe
> your backup solution - how exactly you select/create snapshots.

I use snap-sync to create and send snapshots.

GitHub - wesbarnett/snap-sync: Use snapper snapshots to backup to external drive
https://github.com/wesbarnett/snap-sync

>
> I hope, instead, there is some knowledge I'm missing, that
>> when learned, will make this a robust backup solution.
>>
>> Thanks
>> --


ERROR: parent determination failed (btrfs send-receive)

2017-09-17 Thread Dave
Sometimes when using btrfs send-receive, I get errors like this:

ERROR: parent determination failed for 

When this happens, btrfs send-receive backups fail. And all subsequent
backups fail too.

The issue seems to stem from the fact that an automated cleanup
process removes certain earlier subvolumes. (I'm using Snapper.)

I'd like to understand exactly what is happening so that my backups do
not unexpectedly fail.

In my scenario, no parent subvolumes have been deleted from the
target. Some subvolumes have been deleted from the source, but why
does that matter? I am able to take a valid snapshot at this time and
every snapshot ever taken continues to reside at the target backup
destination (seemingly meaning that a parent subvolume can be found at
the target).

This issue seems to make btrfs send-receive a very fragile backup
solution. I hope, instead, there is some knowledge I'm missing, that
when learned, will make this a robust backup solution.

Thanks


Re: send | receive: received snapshot is missing recent files

2017-09-13 Thread Dave
On Mon, Sep 11, 2017 at 11:19 PM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
> 11.09.2017 20:53, Axel Burri wrote:
>> On 2017-09-08 06:44, Dave wrote:
>>> I'm referring to the link below. Using "btrfs subvolume snapshot -r"
>>> copies the Received UUID from the source into the new snapshot. The
>>> btrbk FAQ entry suggests otherwise. Has something changed?
>>
>> I don't think something has changed, the description for the read-only
>> subvolumes on the btrbk FAQ was just wrong (fixed now).
>>
>>> The only way I see to remove a Received UUID is to create a rw
>>> snapshot (above command without the "-r"), which is not ideal in this
>>> situation when cleaning up readonly source snapshots.
>>>
>>> Any suggestions? Thanks
>>
>> No suggestions from my part, as far as I know there is no way to easily
>> remove/change a received_uuid from a subvolume.
>>
>
> There is BTRFS_IOC_SET_RECEIVED_SUBVOL IOCTL which is used by "btrfs
> received". My understanding is that it can also be set to empty (this
> clearing it). You could write small program to do it.
>
> In general it sounds like a bug - removing read-only flag from subvolume
> by any means should also clear Received UUID as we cannot anymore
> guarantee that subvolume content is the same.

Yes! That makes a great deal of sense.


Re: send | receive: received snapshot is missing recent files

2017-09-07 Thread Dave
I'm referring to the link below. Using "btrfs subvolume snapshot -r"
copies the Received UUID from the source into the new snapshot. The
btrbk FAQ entry suggests otherwise. Has something changed?

The only way I see to remove a Received UUID is to create a rw
snapshot (above command without the "-r"), which is not ideal in this
situation when cleaning up readonly source snapshots.
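
(The workaround being described would look something like this; the paths
are invented, and it relies on the behavior above, i.e. a read-write
snapshot not carrying over the Received UUID:)

    btrfs subvolume snapshot /snaps/old-ro /snaps/tmp-rw       # rw snapshot: Received UUID is not copied
    btrfs subvolume delete /snaps/old-ro
    btrfs subvolume snapshot -r /snaps/tmp-rw /snaps/old-ro    # read-only again, now without Received UUID
    btrfs subvolume delete /snaps/tmp-rw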

Any suggestions? Thanks

On Thu, Sep 7, 2017 at 10:33 AM, Axel Burri <a...@tty0.ch> wrote:
>
> Having a received_uuid set on the source volume ("/home" in your case)
> is indeed a bad thing when it comes to send/receive. You probably
> restored a backup with send/receive, and made it read/write using "btrfs
> property set -ts /home ro false". This is an evil thing, as it leaves
> received_uuid intact. In order to make a subvolume read-write, I
> recommend to use "btrfs subvolume snapshot  ".
>
> There is a FAQ entry on btrbk on how to fix this:
>
> https://github.com/digint/btrbk/blob/master/doc/FAQ.md#im-getting-an-error-aborted-received-uuid-is-set
>
>
> On 2017-09-07 15:34, Dave wrote:
> > I just ran a test. The btrfs send - receive problem I described is
> > indeed fully resolved by removing the "problematic" snapshot on the
> > target device. I did not make any changes to the source volume. I did
> > not make any other changes in my steps (see earlier message for my
> > exact steps).
> >
> > Therefore, the problem I described in my earlier message is not due
> > exclusively to having a Received UUID on the source volume (or to any
> > other feature of the source volume). It is not related to any feature
> > of the directly specified parent volume either. More details are
> > included in my earlier email.
> >
> > Thanks for any further feedback, including answers to my questions and
> > comments about whether this is a known issue.
> >
> >
> > On Thu, Sep 7, 2017 at 8:39 AM, Dave <davestechs...@gmail.com> wrote:
> >>
> >> Hello. Can anyone further explain this issue ("you have a Received UUID on 
> >> the source volume")?
> >>
> >> How does it happen?
> >> How does one remove a Received UUID from the source volume?
> >>
> >> And how does that explain my results where I showed that the problem
> >> is not dependent upon the source volume but is instead dependent upon
> >> some existing snapshot on the target volume?
> >>
> >> My results do not appear to be fully explained by a Received UUID on the 
> >> source volume, as my prior message hopefully shows clearly.
> >>
> >> Thank you.
> >>
> >> On Thu, Sep 7, 2017 at 2:24 AM, A L <crimsoncott...@gmail.com> wrote:
> >>> The problem can be that you have a Received UUID on the source volume. 
> >>> This breaks send-receive.
> >>>
> >>>  From: Dave <davestechs...@gmail.com> -- Sent: 2017-09-07 - 06:43 
> >>>
> >>>> Here is more info and a possible (shocking) explanation. This
> >>>> aggregates my prior messages and it provides an almost complete set of
> >>>> steps to reproduce this problem.
> >>>>
> >>>> Linux srv 4.9.41-1-lts #1 SMP Mon Aug 7 17:32:35 CEST 2017 x86_64 
> >>>> GNU/Linux
> >>>> btrfs-progs v4.12
> >>>>
> >>>> My steps:
> >>>>
> >>>> [root@srv]# sync
> >>>> [root@srv]# mkdir /home/.snapshots/test1
> >>>> [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test1/
> >>>> Create a readonly snapshot of '/home/' in '/home/.snapshots/test1//home'
> >>>> [root@srv]# sync
> >>>> [root@srv]# mkdir /mnt/x5a/home/test1
> >>>> [root@srv]# btrfs send /home/.snapshots/test1/home/ | btrfs receive
> >>>> /mnt/x5a/home/test1/
> >>>> At subvol /home/.snapshots/test1/home/
> >>>> At subvol home
> >>>> [root@srv]# ls -la /mnt/x5a/home/test1/home/user1/
> >>>> NOTE: all recent files are present
> >>>> [root@srv]# ls -la /mnt/x5a/home/test1/home/user2/Documents/
> >>>> NOTE: all recent files are present
> >>>> [root@srv]# mkdir /home/.snapshots/test2
> >>>> [root@srv]# mkdir /mnt/x5a/home/test2
> >>>> [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test2/
> >>>> Create a readonly snapshot of '/home/' in '/home/.snapshots/test2//home'
> >>>> [root@srv]# sync
> >>>> [root@srv]# btrfs send -p 

Re: send | receive: received snapshot is missing recent files

2017-09-07 Thread Dave
I just ran a test. The btrfs send - receive problem I described is
indeed fully resolved by removing the "problematic" snapshot on the
target device. I did not make any changes to the source volume. I did
not make any other changes in my steps (see earlier message for my
exact steps).

Therefore, the problem I described in my earlier message is not due
exclusively to having a Received UUID on the source volume (or to any
other feature of the source volume). It is not related to any feature
of the directly specified parent volume either. More details are
included in my earlier email.

Thanks for any further feedback, including answers to my questions and
comments about whether this is a known issue.


On Thu, Sep 7, 2017 at 8:39 AM, Dave <davestechs...@gmail.com> wrote:
>
> Hello. Can anyone further explain this issue ("you have a Received UUID on 
> the source volume")?
>
> How does it happen?
> How does one remove a Received UUID from the source volume?
>
> And how does that explain my results where I showed that the problem
> is not dependent upon the source volume but is instead dependent upon
> some existing snapshot on the target volume?
>
> My results do not appear to be fully explained by a Received UUID on the 
> source volume, as my prior message hopefully shows clearly.
>
> Thank you.
>
> On Thu, Sep 7, 2017 at 2:24 AM, A L <crimsoncott...@gmail.com> wrote:
> > The problem can be that you have a Received UUID on the source volume. This 
> > breaks send-receive.
> >
> >  From: Dave <davestechs...@gmail.com> -- Sent: 2017-09-07 - 06:43 
> >
> >> Here is more info and a possible (shocking) explanation. This
> >> aggregates my prior messages and it provides an almost complete set of
> >> steps to reproduce this problem.
> >>
> >> Linux srv 4.9.41-1-lts #1 SMP Mon Aug 7 17:32:35 CEST 2017 x86_64 GNU/Linux
> >> btrfs-progs v4.12
> >>
> >> My steps:
> >>
> >> [root@srv]# sync
> >> [root@srv]# mkdir /home/.snapshots/test1
> >> [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test1/
> >> Create a readonly snapshot of '/home/' in '/home/.snapshots/test1//home'
> >> [root@srv]# sync
> >> [root@srv]# mkdir /mnt/x5a/home/test1
> >> [root@srv]# btrfs send /home/.snapshots/test1/home/ | btrfs receive
> >> /mnt/x5a/home/test1/
> >> At subvol /home/.snapshots/test1/home/
> >> At subvol home
> >> [root@srv]# ls -la /mnt/x5a/home/test1/home/user1/
> >> NOTE: all recent files are present
> >> [root@srv]# ls -la /mnt/x5a/home/test1/home/user2/Documents/
> >> NOTE: all recent files are present
> >> [root@srv]# mkdir /home/.snapshots/test2
> >> [root@srv]# mkdir /mnt/x5a/home/test2
> >> [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test2/
> >> Create a readonly snapshot of '/home/' in '/home/.snapshots/test2//home'
> >> [root@srv]# sync
> >> [root@srv]# btrfs send -p /home/.snapshots/test1/home/
> >> /home/.snapshots/test2/home/ | btrfs receive /mnt/x5a/home/test2/
> >> At subvol /home/.snapshots/test2/home/
> >> At snapshot home
> >> [root@srv]# ls -la /mnt/x5a/home/test2/home/user1/
> >> NOTE: all recent files are MISSING
> >> [root@srv]# ls -la /mnt/x5a/home/test2/home/user2/Documents/
> >> NOTE: all recent files are MISSING
> >>
> >> Below I am including some rsync output to illustrate when a snapshot
> >> is missing files (or not):
> >>
> >> [root@srv]# rsync -aniv /home/.snapshots/test1/home/
> >> /home/.snapshots/test2/home/
> >> sending incremental file list
> >>
> >> sent 1,143,286 bytes  received 1,123 bytes  762,939.33 bytes/sec
> >> total size is 3,642,972,271  speedup is 3,183.28 (DRY RUN)
> >>
> >> This indicates that these two subvolumes contain the same files, which
> >> they should because test2 is a snapshot of test1 without any changes
> >> to files, and it was not sent to another physical device.
> >>
> >> The problem is when test2 is sent to another device as shown by the
> >> rsync results below.
> >>
> >> [root@srv]# rsync -aniv /home/.snapshots/test2/home/ 
> >> /mnt/x5a/home/test2/home/
> >> sending incremental file list
> >> .d..t.. ./
> >> .d..t.. user1/
> >>>f.st.. user1/.bash_history
> >>>f.st.. user1/.bashrc
> >>>f+ user1/test2017-09-06.txt
> >> ...
> >> and a long list of other missing files
> >>
> >> The incrementally sent 

Re: send | receive: received snapshot is missing recent files

2017-09-07 Thread Dave
Hello. Can anyone further explain this issue ("you have a Received
UUID on the source volume")?

How does it happen?
How does one remove a Received UUID from the source volume?

And how does that explain my results where I showed that the problem
is not dependent upon the source volume but is instead dependent upon
some existing snapshot on the target volume?

My results do not appear to be fully explained by a Received UUID on
the source volume, as my prior message hopefully shows clearly.

Thank you.

On Thu, Sep 7, 2017 at 2:24 AM, A L <crimsoncott...@gmail.com> wrote:
> The problem can be that you have a Received UUID on the source volume. This 
> breaks send-receive.
>
>  From: Dave <davestechs...@gmail.com> -- Sent: 2017-09-07 - 06:43 
>
>> Here is more info and a possible (shocking) explanation. This
>> aggregates my prior messages and it provides an almost complete set of
>> steps to reproduce this problem.
>>
>> Linux srv 4.9.41-1-lts #1 SMP Mon Aug 7 17:32:35 CEST 2017 x86_64 GNU/Linux
>> btrfs-progs v4.12
>>
>> My steps:
>>
>> [root@srv]# sync
>> [root@srv]# mkdir /home/.snapshots/test1
>> [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test1/
>> Create a readonly snapshot of '/home/' in '/home/.snapshots/test1//home'
>> [root@srv]# sync
>> [root@srv]# mkdir /mnt/x5a/home/test1
>> [root@srv]# btrfs send /home/.snapshots/test1/home/ | btrfs receive
>> /mnt/x5a/home/test1/
>> At subvol /home/.snapshots/test1/home/
>> At subvol home
>> [root@srv]# ls -la /mnt/x5a/home/test1/home/user1/
>> NOTE: all recent files are present
>> [root@srv]# ls -la /mnt/x5a/home/test1/home/user2/Documents/
>> NOTE: all recent files are present
>> [root@srv]# mkdir /home/.snapshots/test2
>> [root@srv]# mkdir /mnt/x5a/home/test2
>> [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test2/
>> Create a readonly snapshot of '/home/' in '/home/.snapshots/test2//home'
>> [root@srv]# sync
>> [root@srv]# btrfs send -p /home/.snapshots/test1/home/
>> /home/.snapshots/test2/home/ | btrfs receive /mnt/x5a/home/test2/
>> At subvol /home/.snapshots/test2/home/
>> At snapshot home
>> [root@srv]# ls -la /mnt/x5a/home/test2/home/user1/
>> NOTE: all recent files are MISSING
>> [root@srv]# ls -la /mnt/x5a/home/test2/home/user2/Documents/
>> NOTE: all recent files are MISSING
>>
>> Below I am including some rsync output to illustrate when a snapshot
>> is missing files (or not):
>>
>> [root@srv]# rsync -aniv /home/.snapshots/test1/home/
>> /home/.snapshots/test2/home/
>> sending incremental file list
>>
>> sent 1,143,286 bytes  received 1,123 bytes  762,939.33 bytes/sec
>> total size is 3,642,972,271  speedup is 3,183.28 (DRY RUN)
>>
>> This indicates that these two subvolumes contain the same files, which
>> they should because test2 is a snapshot of test1 without any changes
>> to files, and it was not sent to another physical device.
>>
>> The problem is when test2 is sent to another device as shown by the
>> rsync results below.
>>
>> [root@srv]# rsync -aniv /home/.snapshots/test2/home/ 
>> /mnt/x5a/home/test2/home/
>> sending incremental file list
>> .d..t.. ./
>> .d..t.. user1/
>>>f.st.. user1/.bash_history
>>>f.st.. user1/.bashrc
>>>f+ user1/test2017-09-06.txt
>> ...
>> and a long list of other missing files
>>
>> The incrementally sent snapshot at /mnt/x5a/home/test2/home/ is
>> missing all recent files (any files from the month of August or
>> September), as my prior visual inspections had indicated. The same
>> files are missing every time. There is no randomness to the missing
>> data.
>>
>> The problem does not happen for me if the receive command target is
>> located on the same physical device as shown next. (However, I suspect
>> there's more to it than that, as explained further below.)
>>
>> [root@srv]# mkdir /home/.snapshots/test2rec
>> [root@srv]# btrfs send -p /home/.snapshots/test1/home/
>> /home/.snapshots/test2/home/ | btrfs receive
>> /home/.snapshots/test2rec/
>> At subvol /home/.snapshots/test2/home/
>>
>> # rsync -aniv /home/.snapshots/test2/home/ /home/.snapshots/test2rec/home/
>> sending incremental file list
>>
>> sent 1,143,286 bytes  received 1,123 bytes  2,288,818.00 bytes/sec
>> total size is 3,642,972,271  speedup is 3,183.28 (DRY RUN)
>>
>> The above (as well as visual inspection of files) indicates that these
>> two subvolumes contain the same files, which was not the case when the
>> 

Re: send | receive: received snapshot is missing recent files

2017-09-06 Thread Dave
Here is more info and a possible (shocking) explanation. This
aggregates my prior messages and it provides an almost complete set of
steps to reproduce this problem.

Linux srv 4.9.41-1-lts #1 SMP Mon Aug 7 17:32:35 CEST 2017 x86_64 GNU/Linux
btrfs-progs v4.12

My steps:

[root@srv]# sync
[root@srv]# mkdir /home/.snapshots/test1
[root@srv]# btrfs su sn -r /home/ /home/.snapshots/test1/
Create a readonly snapshot of '/home/' in '/home/.snapshots/test1//home'
[root@srv]# sync
[root@srv]# mkdir /mnt/x5a/home/test1
[root@srv]# btrfs send /home/.snapshots/test1/home/ | btrfs receive
/mnt/x5a/home/test1/
At subvol /home/.snapshots/test1/home/
At subvol home
[root@srv]# ls -la /mnt/x5a/home/test1/home/user1/
NOTE: all recent files are present
[root@srv]# ls -la /mnt/x5a/home/test1/home/user2/Documents/
NOTE: all recent files are present
[root@srv]# mkdir /home/.snapshots/test2
[root@srv]# mkdir /mnt/x5a/home/test2
[root@srv]# btrfs su sn -r /home/ /home/.snapshots/test2/
Create a readonly snapshot of '/home/' in '/home/.snapshots/test2//home'
[root@srv]# sync
[root@srv]# btrfs send -p /home/.snapshots/test1/home/
/home/.snapshots/test2/home/ | btrfs receive /mnt/x5a/home/test2/
At subvol /home/.snapshots/test2/home/
At snapshot home
[root@srv]# ls -la /mnt/x5a/home/test2/home/user1/
NOTE: all recent files are MISSING
[root@srv]# ls -la /mnt/x5a/home/test2/home/user2/Documents/
NOTE: all recent files are MISSING

Below I am including some rsync output to illustrate when a snapshot
is missing files (or not):

[root@srv]# rsync -aniv /home/.snapshots/test1/home/
/home/.snapshots/test2/home/
sending incremental file list

sent 1,143,286 bytes  received 1,123 bytes  762,939.33 bytes/sec
total size is 3,642,972,271  speedup is 3,183.28 (DRY RUN)

This indicates that these two subvolumes contain the same files, which
they should because test2 is a snapshot of test1 without any changes
to files, and it was not sent to another physical device.

The problem is when test2 is sent to another device as shown by the
rsync results below.

[root@srv]# rsync -aniv /home/.snapshots/test2/home/ /mnt/x5a/home/test2/home/
sending incremental file list
.d..t.. ./
.d..t.. user1/
>f.st.. user1/.bash_history
>f.st.. user1/.bashrc
>f+ user1/test2017-09-06.txt
...
and a long list of other missing files

The incrementally sent snapshot at /mnt/x5a/home/test2/home/ is
missing all recent files (any files from the month of August or
September), as my prior visual inspections had indicated. The same
files are missing every time. There is no randomness to the missing
data.

The problem does not happen for me if the receive command target is
located on the same physical device as shown next. (However, I suspect
there's more to it than that, as explained further below.)

[root@srv]# mkdir /home/.snapshots/test2rec
[root@srv]# btrfs send -p /home/.snapshots/test1/home/
/home/.snapshots/test2/home/ | btrfs receive
/home/.snapshots/test2rec/
At subvol /home/.snapshots/test2/home/

# rsync -aniv /home/.snapshots/test2/home/ /home/.snapshots/test2rec/home/
sending incremental file list

sent 1,143,286 bytes  received 1,123 bytes  2,288,818.00 bytes/sec
total size is 3,642,972,271  speedup is 3,183.28 (DRY RUN)

The above (as well as visual inspection of files) indicates that these
two subvolumes contain the same files, which was not the case when the
same command had a target located on another physical device. Of
course, a snapshot which resides on the same physical device is not a
very good backup. So I do need to send it to another device, but that
results in missing files when the -p or -c options are used with btrfs
send. (Non-incremental sending to another physical device does work.)

I can think of a couple possible explanations.

One is that there is a problem when using the -p or -c options with
btrfs send when the target is another physical device. However, I
suspect this is not the actual explanation.

A second possibility is that the presence of prior existing snapshots
at the target location (even if old and not referenced in any current
btrfs command) can determine the outcome and final contents of an
incremental send operation. I believe the info below suggests this to
be the case.

[root@srv]# btrfs su show /home/.snapshots/test2/home/
test2/home
Name:   home
UUID:   292e8bbf-a95f-2a4e-8280-129202d389dc
Parent UUID:62418df6-a1f8-d74a-a152-11f519593053
Received UUID:  e00d5318-6efd-824e-ac91-f25efa5c2a74
Creation time:  2017-09-06 15:38:16 -0400
Subvolume ID:   2000
Generation: 5020
Gen at creation:5020
Parent ID:  257
Top level ID:   257
Flags:  readonly
Snapshot(s):

[root@srv]# btrfs su show /mnt/x5a/home/test1/home
home/test1/home
Name:   home
UUID: 

Re: send | receive: received snapshot is missing recent files

2017-09-06 Thread Dave
This is an even better set of steps for reproducing the problem.

[root@srv]# sync
[root@srv]# mkdir /home/.snapshots/test1
[root@srv]# btrfs su sn -r /home/ /home/.snapshots/test1/
Create a readonly snapshot of '/home/' in '/home/.snapshots/test1//home'
[root@srv]# sync
[root@srv]# mkdir /mnt/x5a/home/test1
[root@srv]# btrfs send /home/.snapshots/test1/home/ | btrfs receive
/mnt/x5a/home/test1/
At subvol /home/.snapshots/test1/home/
At subvol home
[root@srv]# ls -la /mnt/x5a/home/test1/home/user1/
NOTE: all recent files are present
[root@srv]# ls -la /mnt/x5a/home/test1/home/user2/Documents/
NOTE: all recent files are present
[root@srv]# mkdir /home/.snapshots/test2
[root@srv]# mkdir /mnt/x5a/home/test2
[root@srv]# btrfs su sn -r /home/ /home/.snapshots/test2/
Create a readonly snapshot of '/home/' in '/home/.snapshots/test2//home'
[root@srv]# sync
[root@srv]# btrfs send -p /home/.snapshots/test1/home/
/home/.snapshots/test2/home/ | btrfs receive /mnt/x5a/home/test2/
At subvol /home/.snapshots/test2/home/
At snapshot home
[root@srv]# ls -la /mnt/x5a/home/test2/home/user1/
NOTE: all recent files are MISSING
[root@srv]# ls -la /mnt/x5a/home/test2/home/user2/Documents/
NOTE: all recent files are MISSING

Any ideas what could be causing this problem with incremental backups?


On Wed, Sep 6, 2017 at 3:23 PM, Dave <davestechs...@gmail.com> wrote:
>
> Here is more info on this problem. I can reproduce this without using my 
> script. Simple btrfs commands will reproduce the problem every time. The same 
> files are missing every time. There is no randomness to the missing data.
>
> Here are my steps:
>
> 1. snapper -c home create
> result is a valid snapshot at /home/.snapshots/1704/snapshot
> 2. btrfs send /home/.snapshots/1704/snapshot | btrfs receive 
> /mnt/x5a/home/1704
> 3. snapper -c home create
> result is a valid snapshot at /home/.snapshots/1716/snapshot
> 4. btrfs send -c /home/.snapshots/1704/snapshot/ 
> /home/.snapshots/1716/snapshot/ | btrfs receive /mnt/x5a/home/1716/
>
> I expect /mnt/x5a/home/1716/snapshot to be identical to 
> /home/.snapshots/1716/snapshot. However, it is not.
> The result is that /mnt/x5a/home/1716/snapshot is missing all recent files.
>
> Next step was to delete snapshot 1716 (in both locations) and repeat the send 
> | receive using -p
>
> btrfs su del /mnt/x5a/home/1716/snapshot
> snapper -c home delete 1716
> snapper -c home create
> btrfs send -p /home/.snapshots/1704/snapshot/ /home/.snapshots/1716/snapshot/ 
> | btrfs receive /mnt/x5a/home/1716/
>
> The result is once again that /mnt/x5a/home/1716/snapshot is missing all 
> recent files. However, the other snapshots are all valid:
> /home/.snapshots/1704/snapshot is valid & complete
> /mnt/x5a/home/1704/snapshot -- non-incremental send: snapshot is valid & 
> complete
> /home/.snapshots/1716/snapshot is valid & complete
> /mnt/x5a/home/1716/snapshot -- incrementally sent snapshot is missing all 
> recent files whether sent with -c or -p
>
> The incrementally sent snapshot is even missing files that are present in the 
> reference snapshot /mnt/x5a/home/1704/snapshot.
>
>
>
> On Wed, Sep 6, 2017 at 1:37 AM, Dave <davestechs...@gmail.com> wrote:
>>
>> I'm running Arch Linux on BTRFS. I use Snapper to take hourly
>> snapshots and it works without any issues.
>>
>> I have a bash script that uses send | receive to transfer snapshots to
>> a couple external HDD's. The script runs daily on a systemd timer. I
>> set all this up recently and I first noticed that it runs every day
>> and that the expected snapshots are received.
>>
>> At a glance, everything looked correct. However, today was my day to
>> drill down and really make sure everything was working.
>>
>> To my surprise, the newest received incremental snapshots are missing
>> all recent files. These new snapshots reflect the system state from
>> weeks ago and no files more recent than a certain date are in the
>> snapshots.
>>
>> However, the snapshots are newly created and newly received. The work
>> is being done fresh each day when my script runs, but the results are
>> anchored back in time at this earlier date. Weird.
>>
>> I'm not really sure where to start troubleshooting, so I'll start by
>> sharing part of my script. I'm sure the problem is in my script, and
>> is not related to BTRFS or snapper functionality. (As I said, the
>> Snapper snapshots are totally OK before being sent | received.
>>
>> These are the key lines of the script I'm using to send | receive a snapshot:
>>
>> old_num=$(snapper -c "$config" list -t single | awk
>> '/'"$selected_uuid"'/ {print $1}')
>> old_snap=$SUBVOLUME/.s

send | receive: received snapshot is missing recent files

2017-09-05 Thread Dave
I'm running Arch Linux on BTRFS. I use Snapper to take hourly
snapshots and it works without any issues.

I have a bash script that uses send | receive to transfer snapshots to
a couple external HDD's. The script runs daily on a systemd timer. I
set all this up recently and I first noticed that it runs every day
and that the expected snapshots are received.

At a glance, everything looked correct. However, today was my day to
drill down and really make sure everything was working.

To my surprise, the newest received incremental snapshots are missing
all recent files. These new snapshots reflect the system state from
weeks ago and no files more recent than a certain date are in the
snapshots.

However, the snapshots are newly created and newly received. The work
is being done fresh each day when my script runs, but the results are
anchored back in time at this earlier date. Weird.

I'm not really sure where to start troubleshooting, so I'll start by
sharing part of my script. I'm sure the problem is in my script, and
is not related to BTRFS or snapper functionality. (As I said, the
Snapper snapshots are totally OK before being sent | received.)

These are the key lines of the script I'm using to send | receive a snapshot:

old_num=$(snapper -c "$config" list -t single | awk
'/'"$selected_uuid"'/ {print $1}')
old_snap=$SUBVOLUME/.snapshots/$old_num/snapshot
new_num=$(snapper -c "$config" create --print-number)
new_snap=$SUBVOLUME/.snapshots/$new_num/snapshot
btrfs send -c "$old_snap" "$new_snap" | $ssh btrfs receive
"$backup_location"

I have to admit that even after reading the following page half a
dozen times, I barely understand the difference between -c and -p.
https://btrfs.wiki.kernel.org/index.php/FAQ#What_is_the_difference_between_-c_and_-p_in_send.3F

After reading that page again today, I feel like I should switch to -p
(maybe). However, the -c vs -p choice probably isn't my problem.

Any ideas what my problem could be?


  1   2   3   4   5   6   7   >