Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-09-20 Thread Zygo Blaxell
On Fri, Sep 21, 2018 at 12:59:31PM +1000, Dave Chinner wrote:
> On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
[...]
> With no DMAPI in the future, people with custom HSM-like interfaces
> based on dmapi are starting to turn to fanotify and friends to
> provide them with the change notifications they require

I had a fanotify-based scanner once, before I noticed btrfs effectively
had timestamps all over its metadata.

fanotify won't tell me which parts of a file were modified (unless it
got that feature in the last few years?).  fanotify was pretty useless
when the only file on the system that was being modified was a 13TB
VM image.  Or even a little 16GB one.  Has to scan the whole file to
find the one new byte.  Even on desktops the poor thing spends most of
its time looping over /var/log/messages.  It was sad.

If fanotify gave me (inode, offset, length) tuples of dirty pages in
cache, I could look them up and use a dedupe_file_range call to replace
the dirty pages with a reference to an existing disk block.  If my
listener can do that fast enough, it's in-band dedupe; if it doesn't,
the data gets flushed to disk as normal, and I fall back to a scan of
the filesystem to clean it up later.

> > > e.g. a soft requirement is that we need to scan the entire fs at
> > > least once a month. 
> > 
> > I have to scan and dedupe multiple times per hour.  OK, the first-ever
> > scan of a non-empty filesystem is allowed to take much longer, but after
> > that, if you have enough spare iops for continuous autodefrag you should
> > also have spare iops for continuous dedupe.
> 
> Yup, but using notifications avoids the need for even these scans - you'd
> know exactly what data has changed, when it changed, and know
> exactly what you needed to read to calculate the new hashes.

...if the scanner can keep up with the notifications; otherwise, the
notification receiver has to log them somewhere for the scanner to
catch up.  If there are missed or dropped notifications--or 23 hours a
day we're not listening for notifications because we only have an hour
a day maintenance window--some kind of filesystem scan has to be done
after the fact anyway.

> > > A simple piece-wise per-AG scanning algorithm (like we use in
> > > xfs_repair) could easily work within a 3GB RAM per AG constraint and
> > > would scale very well. We'd only need to scan 30-40 AGs in the hour,
> > > and a single AG at 1GB/s will only take 2 minutes to scan. We can
> > > then do the processing while the next AG gets scanned. If we've got
> > > 10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
> > > then we can scan 5-10AGs at once to keep the IO rate up, and process
> > > them in bulk as we scan more.
> > 
> > How do you match dupe blocks from different AGs if you only keep RAM for
> > the duration of one AG scan?  Do you not dedupe across AG boundaries?
> 
> We could, but do we need to? There's a heap of runtime considerations
> at the filesystem level we need to take into consideration here, and
> there's every chance that too much consolidation creates
> unpredictable bottlenecks in overwrite workloads that need to break
> the sharing (i.e. COW operations).

I'm well aware of that.  I have a bunch of hacks in bees to not be too
efficient lest it push the btrfs reflink bottlenecks too far.

> e.g. An AG contains up to 1TB of data which is more than enough to
> get decent AG-internal dedupe rates. If we've got 1PB of data spread
> across 1000AGs, deduping a million copies of a common data pattern
> spread across the entire filesystem down to one per AG (i.e. 10^6
> copies down to 10^3) still gives a massive space saving.

That's true for 1000+ AG filesystems, but it's a bigger problem for
filesystems of 2-5 AGs, where each AG holds one copy of 20-50% of the
duplicates on the filesystem.

OTOH, a filesystem that small could just be done in one pass with a
larger but still reasonable amount of RAM.

> > What you've described so far means the scope isn't limited anyway.  If the
> > call is used to dedupe two heavily-reflinked extents together (e.g.
> > both duplicate copies are each shared by thousands of snapshots that
> > have been created during the month-long period between dedupe runs),
> > it could always be stuck doing a lot of work updating dst owners.
> > Was there an omitted detail there?
> 
> As I said early in the discussion - if both copies of identical data
> are already shared hundreds or thousands of times each, then it
> makes no sense to dedupe them again. All that does is create huge
> amounts of work updating metadata for very little additional gain.

I've had a user complain about the existing 2560-reflink limit in bees,
because they were starting with 3000 

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-09-18 Thread Zygo Blaxell
On Mon, Sep 10, 2018 at 07:06:46PM +1000, Dave Chinner wrote:
> On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> > On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > For future development I've abandoned the entire dedupe_file_range
> > > > approach.  I need to be able to read and dedupe the data blocks of
> > > > the filesystem directly without having to deal with details like which
> > > > files those blocks belong to, especially on filesystems with lots of
> > > > existing deduped blocks and snapshots. 
> > > 
> > > IOWs, your desired OOB dedupe algorithm is:
> > > 
> > >   a) ask the filesystem where all it's file data is
> > 
> > Actually, it's "ask the filesystem where all the *new* file data is"
> > since we don't want to read any unique data twice on subsequent runs.
> 
> Sorry, how do you read "unique data" twice? By definition, unique
> data only occurs once

...but once it has been read, we don't want to read it again.  Ever.
Even better would be to read unique data less than 1.0 times on average.

> Oh, and you still need to hash the old data so you can find
> collisions with the new data that got written. Unless, of course,
> you are keeping your hash tree in a persistent database 

I do that.

> and can work out how to prune stale entries out of it efficiently

I did that first.

Well, more like I found that even a bad algorithm can still find
most of the duplicate data in a typical filesystem, and there's a
steep diminishing returns curve the closer you get to 100% efficiency.
So I just used a bad algorithm (random drop with a bias toward keeping
hashes that matched duplicate blocks).  There's room to improve that,
but the possible gains are small, so it's at least #5 on the performance
whack-a-mole list and probably lower.

The randomness means each full-filesystem sweep finds a different subset
of duplicates, so I can arbitrarily cut hash table size in half and get
almost all of the match rate back by doing two full scans.  Or I cut
the filesystem up into a few large pieces and feed the pieces through in
different orders on different scan runs, so different subsets of data in
the hash table meet different subsets of data on disk during each scan.
An early prototype of bees worked that way, but single-digit efficiency
gains were not worth doubling iops, so I stopped.
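The "bad algorithm" above can be sketched in a few lines (this is an
illustration of the idea, not bees' actual code; the structure and table
size are made up):

```c
/* Random-drop hash table with a bias toward keeping useful entries:
 * when full, evict a random slot, but give entries that have already
 * matched a duplicate one re-roll so they survive more often. */
#include <stdint.h>
#include <stdlib.h>

struct hash_entry {
    uint64_t hash;      /* block content hash */
    uint64_t addr;      /* physical block address */
    uint8_t  matched;   /* 1 if this hash ever matched a duplicate */
};

#define TABLE_SIZE 1024
static struct hash_entry table[TABLE_SIZE];
static size_t used;

static void insert(uint64_t hash, uint64_t addr)
{
    if (used < TABLE_SIZE) {
        table[used++] = (struct hash_entry){ hash, addr, 0 };
        return;
    }
    /* Table full: random victim, biased to spare proven-duplicate hashes. */
    size_t victim = (size_t)rand() % TABLE_SIZE;
    if (table[victim].matched)
        victim = (size_t)rand() % TABLE_SIZE;
    table[victim] = (struct hash_entry){ hash, addr, 0 };
}
```

Because eviction is random, each sweep retains a different subset of
hashes, which is what makes the two-scans-with-half-the-table trick work.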

> [...]I thought that "details omitted for
> reasons of brevity" would be understood, not require omitted details
> to be explained to me.

Sorry.  I don't know what you already know.

> > Bees also operates under a constant-RAM constraint, so it doesn't operate
> > in two distinct "collect data" and "act on data collected" passes,
> > and cannot spend memory to store data about more than a few extents at
> > any time.
> 
> I suspect that I'm thinking at a completely different scale to you.
> I don't really care for highly constrained or optimal dedupe
> algorithms  because those last few dedupe percentages really don't
> matter that much to me. 

At large scales RAM is always constrained.  It's the dedupe triangle of
RAM, iops, and match hit rate--any improvement in one comes at the cost
of the others.  Any dedupe can go faster or use less RAM by raising the
block size or partitioning the input data set to make it smaller.

bees RAM usage is a bit more explicitly controlled--the admin tells bees
how much RAM to use, and bees scales the other parameters to fit that.
Other dedupe engines make the admin do math to set parameters to avoid
overflowing RAM with dynamic memory allocations, or leave the admin to
discover what their RAM constraint is the hard way.
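The "admin sets RAM, tool derives the rest" arithmetic is simple; here is
a sketch with assumed (not bees' actual) entry and block sizes:

```c
/* Given a RAM budget, derive how many hash entries fit and how much
 * unique data they can index at full coverage.  ENTRY_SIZE and
 * BLOCK_SIZE are illustrative assumptions. */
#include <stdint.h>

#define ENTRY_SIZE 16      /* bytes per hash-table entry (assumed) */
#define BLOCK_SIZE 4096    /* bytes of data each entry covers (assumed) */

static uint64_t entries_for_ram(uint64_t ram_bytes)
{
    return ram_bytes / ENTRY_SIZE;
}

static uint64_t data_indexed(uint64_t ram_bytes)
{
    return entries_for_ram(ram_bytes) * (uint64_t)BLOCK_SIZE;
}
```

With these assumed sizes, a 1 GiB table indexes about 256 GiB of unique
data; on a filesystem larger than that, the table necessarily holds only
a sample of the hashes, trading hit rate for RAM as described above.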

One big difference I am noticing in our approaches is latency.  ZFS (and
in-kernel btrfs dedupe) provides minimal dedupe latency (duplicate
data occupies disk space for zero time as it is never written to disk
at all) but it requires more RAM for a given dedupe hit rate than any
other dedupe implementation I've seen.  What you've written tells me
XFS saves RAM by partitioning the data and relying on an existing but
very large source of iops (sharing scrub reads with dedupe), but then
the dedupe latency is the same as the scrub interval (the worst so far).
bees aims to have latency of a few minutes (ideally scanning data while
it's still dirty in cache, but there's no good userspace API for that)
though it's obviously not there yet.

> I care much more about using all the
> resources we can and running as fast as we possibly can, then
> pr

Re: dduper - Offline btrfs deduplication tool

2018-09-07 Thread Zygo Blaxell
On Fri, Sep 07, 2018 at 09:27:28AM +0530, Lakshmipathi.G wrote:
> > 
> > One question:
> > Why not ioctl_fideduperange?
> > i.e. you kill most of benefits from that ioctl - atomicity.
> > 
> I plan to add fideduperange as an option too. User can
> choose between fideduperange and ficlonerange call.
> 
> If I'm not wrong, with fideduperange, kernel performs
> comparsion check before dedupe. And it will increase
> time to dedupe files.

Creating the backup reflink file takes far more time than you will ever
save from fideduperange.

You don't need the md5sum either, unless you have a data set that is
full of crc32 collisions (e.g. a file format that puts a CRC32 at the
end of each 4K block).  The few people who have such a data set can
enable md5sums, everyone else can have md5sums disabled by default.

> I believe the risk involved with ficlonerange is  minimized 
> by having a backup file(reflinked). We can revert to older 
> original file, if we encounter some problems.

With fideduperange the risk is more than minimized--it's completely
eliminated.

If you don't use fideduperange you can't use the tool on a live data
set at all.
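For reference, a single fideduperange call looks like this (a sketch using
the uapi structs from <linux/fs.h>, available with kernel >= 4.5 headers).
The kernel locks and byte-compares the ranges itself, which is where the
safety on live data comes from: if the data differs or changes mid-call,
the request reports FILE_DEDUPE_RANGE_DIFFERS instead of sharing anything:

```c
/* Dedupe one range of src_fd into dst_fd via FIDEDUPERANGE.
 * Returns bytes deduped, 0 if the ranges differ, -1 on error. */
#include <linux/fs.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

static long dedupe_range(int src_fd, uint64_t src_off,
                         int dst_fd, uint64_t dst_off, uint64_t len)
{
    struct file_dedupe_range *arg;
    long ret = -1;

    arg = calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
    if (!arg) return -1;
    arg->src_offset = src_off;
    arg->src_length = len;
    arg->dest_count = 1;
    arg->info[0].dest_fd = dst_fd;
    arg->info[0].dest_offset = dst_off;

    if (ioctl(src_fd, FIDEDUPERANGE, arg) == 0) {
        if (arg->info[0].status == FILE_DEDUPE_RANGE_SAME)
            ret = (long)arg->info[0].bytes_deduped;   /* shared */
        else if (arg->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
            ret = 0;                                  /* data not identical */
        /* negative status values are errnos for this dest */
    }
    free(arg);
    return ret;
}
```

The in-kernel comparison is the cost Lakshmipathi is worried about, but it
is also the entire point: it replaces the backup file and the md5sums.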

> > 
> > -- 
> > Have a nice day,
> > Timofey.
> 
> Cheers.
> Lakshmipathi.G




Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-30 Thread Zygo Blaxell
On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > > - is documenting rejection on request alignment grounds
> > > >   (i.e. EINVAL) in the man page sufficient for app
> > > >   developers to understand what is going on here?
> > > 
> > > I think so.  The manpage says: "The filesystem does not support
> > > reflinking the ranges of the given files", which (to my mind) covers
> > > this case of not supporting dedupe of EOF blocks.
> > 
> > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> > this; however, on btrfs, not supporting dedupe of EOF blocks means small
> > files (one extent) cannot be deduped at all, because the EOF block holds
> > a reference to the entire dst extent.  If a dedupe app doesn't go all the
> > way to EOF on btrfs, then it should not attempt to dedupe any part of the
> > last extent of the file as the benefit would be zero or slightly negative.
> 
> That's a filesystem implementation issue, not an API or application
> issue.

The API and application issue remains even if btrfs is not considered.
btrfs is just the worst case outcome.  Other filesystems still have
fragmentation issues, and applications have efficiency-vs-capability
tradeoffs to make if they can't rely on dedupe-to-EOF being available.

Tools like 'cp --reflink=auto' work by trying the best case, then falling
back to a second choice if the first choice returns an error.  If the
second choice fails too, the surprising behavior can make inattentive
users lose data.

> > The app developer would need to be aware that such a restriction could
> > exist on some filesystems, and be able to distinguish this from other
> > cases that could lead to EINVAL.  Portable code would have to try a dedupe
> > up to EOF, then if that failed, round down and retry, and if that failed
> > too, the app would have to figure out which filesystem it's running on
> > to know what to do next.  Performance demands the app know what the FS
> > will do in advance, and avoid a whole class of behavior.
> 
> Nobody writes "portable" applications like that. 

As an app developer, and having studied other applications' revision
histories, and having followed IRC and mailing list conversations
involving other developers writing these applications, I can assure
you that is _exactly_ how portable applications get written around
the dedupe function.

Usually people start from experience with tools that use hardlinks to
implement dedupe, so the developer's mental model starts with deduping
entire files.  Their first attempt does this:

stat(fd, &st);
dedupe( ..., src_offset = 0, dst_offset = 0, length = st.st_size);

then subsequent revisions of their code cope with limits on length,
and then deal with EINVAL on odd lengths, because those are the problems
that are encountered as the code runs for the first time on an expanding
set of filesystems.  After that, they deal with implementation-specific
performance issues.

Other app developers start by ignoring incomplete blocks, then compare
their free-space-vs-time graphs with other dedupe apps on the same
filesystem, then either adapt to handle EOF properly, or just accept
being uncompetitive.

> They read the man
> page first, and work out what the common subset of functionality is
> and then code from that. 

> Man page says:
> 
> "Disk filesystems generally require the offset and length arguments
> to be aligned to the fundamental block size."

> IOWs, code compatible with starts with supporting the general case.
> i.e. a range rounded to filesystem block boundaries (it's already
> run fstat() on the files it wants to dedupe to find their size,
> yes?), hence ignoring the partial EOF block. Will just work on
> everything.

Will cause a significant time/space performance hit too.  EOFs are
everywhere, and they have a higher-than-average duplication rate
for their size.  If an application assumes EOF can't be deduped on
every filesystem, then it leaves a non-trivial amount of free space
unrecovered on filesystems that can dedupe EOF.  It also necessarily
increases fragmentation unless the filesystem implements file tails
(where it keeps fragmentation constant as the tail won't be stored
contiguously in any case).

> Code that then wants to optimise for btrfs/xfs/ocfs quirks runs
> fstatvfs to determine what fs it's operating on and applies the
> necessary quirks. For btrfs it can extend the range to include the
> partial EOF block, and hence will handle the implem

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-23 Thread Zygo Blaxell
On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > >   - should we just round down the EOF dedupe request to the
> > > block before EOF so dedupe still succeeds?
> > 
> > I've often wondered if the interface should (have) be(en) that we start
> > at src_off/dst_off and share as many common blocks as possible until we
> > find a mismatch, then tell userspace where we stopped... instead of like
> > now where we compare the entire extent and fail if any part of it
> > doesn't match.
> 
> The usefulness or harmfulness of that approach depends a lot on what
> the application expects the filesystem to do.

Here are some concrete examples.

In the following, letters are 4K disk blocks and also inode offsets
(i.e. "A" means a block containing 4096 x "A" located at inode offset 0,
"B" contains "B" located at inode offset 1, etc).  "|" indicates
a physical discontinuity of the blocks on disk.  Lowercase "a" has
identical content to uppercase "A", but they are located in different
physical blocks on disk.

Suppose you have two identical files with different write histories,
so they have different on-disk layouts:

Inode 1:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

Inode 2:  a|b|c|d|e|f|g|hijklmnopqrstuvwxyz

A naive dedupe app might pick src and dst at random, and do this:

// dedupe(length, src_ino, src_off, dst_ino, dst_off)

dedupe(length 26, Inode 1, Offset 0, Inode 2, Offset 0)

with the result having 11 fragments in each file, all from the
original Inode 1:

Inode 1:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

Inode 2:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

A smarter dedupe app might choose src and dst based on logical proximity
and/or physical seek distance, or the app might choose dst with the
smallest number of existing references in the filesystem, or the app might
simply choose the longest available src extents to minimize fragmentation:

dedupe(length 7, Inode 1, Offset 0, Inode 2, Offset 0)

dedupe(length 19, Inode 2, Offset 7, Inode 1, Offset 7)

with the result having 2 fragments in each file, each chosen
from a different original inode:

Inode 1:  ABCDEFG|hijklmnopqrstuvwxyz

Inode 2:  ABCDEFG|hijklmnopqrstuvwxyz

If the kernel continued past the 'length 7' size specified in the first
dedupe, then the 'hijklmnopqrstuvwxyz' would be *lost*, and the second
dedupe would be an expensive no-op because both Inode 1 and Inode 2
refer to the same physical blocks:

Inode 1:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

  [---] - app asked for this
Inode 2:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ
kernel does this too - [-]
and "hijklmnopqrstuvwxyz" no longer exists for second dedupe

A dedupe app willing to spend more on IO can create its own better src
with only one fragment:

open(with O_TMPFILE) -> Inode 3

copy(length 7, Inode 1, Offset 0, Inode 3, Offset 0)

copy(length 19, Inode 2, Offset 7, Inode 3, Offset 7)

dedupe(length 26, Inode 3, Offset 0, Inode 1, Offset 0)

dedupe(length 26, Inode 3, Offset 0, Inode 2, Offset 0)

close(Inode 3)

Now there is just one fragment referenced from two places:

Inode 1:  αβξδεφγηιςκλμνοπθρστυвшχψζ

Inode 2:  αβξδεφγηιςκλμνοπθρστυвшχψζ

[If encoding goes horribly wrong, the above are a-z transcoded as a mix
of Greek and Cyrillic Unicode characters.]

Real filesystems sometimes present thousands of possible dedupe
src/dst permutations to choose from.  The kernel shouldn't be trying to
second-guess an application that may have access to external information
to make better decisions (e.g. the full set of src extents available,
or knowledge of other calls the app will issue in the future).

> In btrfs, the dedupe operation acts on references to data, not the
> underlying data blocks.  If there are 1000 slightly overlapping references
> to a single contiguous range of data blocks in dst on disk, each dedupe
> operation acts on only one of those, leaving the other 999 untouched.
> If the app then submits 999 other dedupe requests, no references to the
> dst blocks remain and the underlying data blocks can be deleted.
> 
> In a parallel universe (or a better filesystem, or a userspace emulation
> built out of dedupe and other ioctls), dedupe could work at the extent
> data (physical) level.  The app points at src and dst extent references
> (inode/offset/length tuples), and the filesystem figures out which
> physical blocks these point to, then adjusts all the references to the
> dst blocks at once, dealing with partial overlaps and

Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition"

2018-08-23 Thread Zygo Blaxell
On Thu, Aug 23, 2018 at 01:10:48PM +0800, Qu Wenruo wrote:
> On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
> > This is a repro script for a btrfs bug that causes corrupted data reads
> > when reading a mix of compressed extents and holes.  The bug is
> > reproducible on at least kernels v4.1..v4.18.
> 
> This bug already sounds more serious than previous nodatasum +
> compression bug.

Maybe.  "compression + holes corruption bug 2017" could be avoided with
the max-inline=0 mount option without disabling compression.  This time,
the workaround is more intrusive:  avoid all applications that use dedup
or hole-punching.

> > Some more observations and background follow, but first here is the
> > script and some sample output:
> > 
> > root@rescue:/test# cat repro-hole-corruption-test
> > #!/bin/bash
> > 
> > # Write a 4096 byte block of something
> > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> > 
> > # Here is some test data with holes in it:
> > for y in $(seq 0 100); do
> > for x in 0 1; do
> > block 0;
> > block 21;
> > block 0;
> > block 22;
> > block 0;
> > block 0;
> > block 43;
> > block 44;
> > block 0;
> > block 0;
> > block 61;
> > block 62;
> > block 63;
> > block 64;
> > block 65;
> > block 66;
> > done
> 
> Does the content has any difference on this bug?
> It's just 16 * 4K * 2 * 101 data write *without* any hole so far.

The content of the extents doesn't seem to matter, other than it needs to
be compressible so that the extents on disk are compressed.  The bug is
also triggered by writing non-zero data to all blocks, and then punching
the holes later with "fallocate -p -l 4096 -o $(( insert math here ))".

The layout of the extents matters a lot.  I have to loop hundreds or
thousands of times to hit the bug if the first block in the pattern is
not a hole, or if the non-hole extents are different sizes or positions
than above.

I tried random patterns of holes and extent refs, and most of them have
an order of magnitude lower hit rates than the above.  This might be due
to some relationship between the alignment of read() request boundaries
with extent boundaries, but I haven't done any tests designed to detect
such a relationship.

In the wild, corruption happens on some files much more often than others.
This seems to be correlated with the extent layout as well.

I discovered the bug by examining files that were intermittently but
repeatedly failing routine data integrity checks, and found that in every
case they had similar hole + extent patterns near the point where data
was corrupted.

I did a search on some big filesystems for the
hole-refExtentA-hole-refExtentA pattern and found several files with
this pattern that had passed previous data integrity checks, but would
fail randomly in the sha1sum/drop-caches loop.

> This should indeed cause 101 128K compressed data extent.
> But I'm wondering the description about 'holes'.

The holes are coming, wait for it... ;)

> > done > am
> > sync
> > 
> > # Now replace those 101 distinct extents with 101 references to the first extent
> > btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
> 
> Will this bug still happen by creating one extent and then reflink it
> 101 times?

Yes.  I used btrfs-extent-same because a binary is included in the
Debian duperemove package, but I use it only for convenience.

It's not necessary to have hundreds of references to the same extent--even
two refs to a single extent plus a hole can trigger the bug sometimes.
100 references in a single file will trigger the bug so often that it
can be detected within the first 20 sha1sum loops.

When the corruption occurs, it affects around 90 of the original 101
extents.  The different sha1sum results are due to different extents
giving bad data on different runs.

> > # Punch holes into the extent refs
> > fallocate -v -d am
> 
> Hole-punch in fact happens here.
> 
> BTW, will add a "sync" here change the result?

No.  You can reboot the machine here if you like, it does not change
anything that happens during reads later.

Looking at the extent tree in btrfs-debug-tree, the data on disk
looks correct, and btrfs does read it correctly most of the time (the
correct sha1sum below is 6926a34e0ab

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-23 Thread Zygo Blaxell
On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > - is documenting rejection on request alignment grounds
> >   (i.e. EINVAL) in the man page sufficient for app
> >   developers to understand what is going on here?
> 
> I think so.  The manpage says: "The filesystem does not support
> reflinking the ranges of the given files", which (to my mind) covers
> this case of not supporting dedupe of EOF blocks.

Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
this; however, on btrfs, not supporting dedupe of EOF blocks means small
files (one extent) cannot be deduped at all, because the EOF block holds
a reference to the entire dst extent.  If a dedupe app doesn't go all the
way to EOF on btrfs, then it should not attempt to dedupe any part of the
last extent of the file as the benefit would be zero or slightly negative.

The app developer would need to be aware that such a restriction could
exist on some filesystems, and be able to distinguish this from other
cases that could lead to EINVAL.  Portable code would have to try a dedupe
up to EOF, then if that failed, round down and retry, and if that failed
too, the app would have to figure out which filesystem it's running on
to know what to do next.  Performance demands the app know what the FS
will do in advance, and avoid a whole class of behavior.
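The try-EOF-then-round-down sequence described above can be written as
pure fallback logic (a sketch: the dedupe_fn callback stands in for a
real FIDEDUPERANGE wrapper, and 4096 is an assumed block size):

```c
#include <errno.h>
#include <stdint.h>

typedef int (*dedupe_fn)(int src_fd, int dst_fd, uint64_t len);

/* Try to dedupe all the way to EOF first (btrfs needs this to free the
 * last extent of small files); only on EINVAL round the length down to a
 * block boundary and retry for filesystems that reject unaligned ranges. */
static int dedupe_portable(dedupe_fn dedupe, int src_fd, int dst_fd,
                           uint64_t file_size)
{
    if (dedupe(src_fd, dst_fd, file_size) == 0)
        return 0;                       /* EOF dedupe supported: best case */
    if (errno != EINVAL)
        return -1;                      /* real error, don't retry */
    uint64_t rounded = file_size & ~(uint64_t)4095;
    if (rounded == 0)
        return -1;   /* file is a single partial block: nothing alignable */
    return dedupe(src_fd, dst_fd, rounded);
}

/* Stand-in backend used only to exercise the logic: behaves like a
 * filesystem that rejects lengths not aligned to 4096 with EINVAL. */
static int aligned_only_backend(int src_fd, int dst_fd, uint64_t len)
{
    (void)src_fd; (void)dst_fd;
    if (len % 4096) { errno = EINVAL; return -1; }
    return 0;
}
```

Even this small amount of fallback logic is per-filesystem behavior the
application has to probe for at runtime, which is the complaint above.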

btrfs dedupe reports success if the src extent is inline and the same
size as the dst extent (i.e. file is smaller than one page).  No dedupe
can occur in such cases--a clone results in a simple copy, so the best
a dedupe could do would be a no-op.  Returning EINVAL there would break
a few popular tools like "cp --reflink".  Returning OK but doing nothing
seems to be the best option in that case.

> > - should we just round down the EOF dedupe request to the
> >   block before EOF so dedupe still succeeds?
> 
> I've often wondered if the interface should (have) be(en) that we start
> at src_off/dst_off and share as many common blocks as possible until we
> find a mismatch, then tell userspace where we stopped... instead of like
> now where we compare the entire extent and fail if any part of it
> doesn't match.

The usefulness or harmfulness of that approach depends a lot on what
the application expects the filesystem to do.

In btrfs, the dedupe operation acts on references to data, not the
underlying data blocks.  If there are 1000 slightly overlapping references
to a single contiguous range of data blocks in dst on disk, each dedupe
operation acts on only one of those, leaving the other 999 untouched.
If the app then submits 999 other dedupe requests, no references to the
dst blocks remain and the underlying data blocks can be deleted.

In a parallel universe (or a better filesystem, or a userspace emulation
built out of dedupe and other ioctls), dedupe could work at the extent
data (physical) level.  The app points at src and dst extent references
(inode/offset/length tuples), and the filesystem figures out which
physical blocks these point to, then adjusts all the references to the
dst blocks at once, dealing with partial overlaps and snapshots and
nodatacow and whatever other exotic features might be lurking in the
filesystem, ending with every reference to every part of dst replaced
by the longest possible contiguous reference(s) to src.

Problems arise if the length deduped is not exactly the length requested.
If the search continues until a mismatch is found, where does the search
for a mismatch lead?  Does the search follow physically contiguous
blocks on disk, or would dedupe follow logically contiguous blocks in
the src and dst files?  Or the intersection of those, i.e. physically
contiguous blocks that are logically contiguous in _any_ two files,
not limited to src and dst.

There is also the problem where the files could have been previously
deduped and then partially overwritten with identical data.  If the
application cannot control where the dedupe search for identical data
ends, it can end up accidentally creating new references to extents
while it is trying to eliminate those extents.  The kernel might do a
lot of extra work from looking ahead that the application has to undo
immediately (e.g. after the first few blocks of dst, the app wants to
do another dedupe with a better src extent elsewhere on the filesystem,
but the kernel goes ahead and dedupes with an inferior src beyond the
end of what the app asked for).

bees tries to determine exactly the set of dedupe requests required to
remove all references to duplicate extents (and maybe someday do defrag
as well).  If the kernel deviates from the requested sizes (e.g. because
the data changed on the filesystem between dedup requests), the final
extent layout after the dedupe requests are finished won't match what
bees expected it to be, so bees has to reexamine the filesystem and
either retry with a fresh set of exact 

Reproducer for "compressed data + hole data corruption bug, 2018 edition"

2018-08-22 Thread Zygo Blaxell
This is a repro script for a btrfs bug that causes corrupted data reads
when reading a mix of compressed extents and holes.  The bug is
reproducible on at least kernels v4.1..v4.18.

Some more observations and background follow, but first here is the
script and some sample output:

root@rescue:/test# cat repro-hole-corruption-test
#!/bin/bash

# Write a 4096 byte block of something
block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }

# Here is some test data with holes in it:
for y in $(seq 0 100); do
for x in 0 1; do
block 0;
block 21;
block 0;
block 22;
block 0;
block 0;
block 43;
block 44;
block 0;
block 0;
block 61;
block 62;
block 63;
block 64;
block 65;
block 66;
done
done > am
sync

# Now replace those 101 distinct extents with 101 references to the first extent
btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail

# Punch holes into the extent refs
fallocate -v -d am

# Do some other stuff on the machine while this runs, and watch the sha1sums change!
while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done

root@rescue:/test# ./repro-hole-corruption-test
i: 91, status: 0, bytes_deduped: 131072
i: 92, status: 0, bytes_deduped: 131072
i: 93, status: 0, bytes_deduped: 131072
i: 94, status: 0, bytes_deduped: 131072
i: 95, status: 0, bytes_deduped: 131072
i: 96, status: 0, bytes_deduped: 131072
i: 97, status: 0, bytes_deduped: 131072
i: 98, status: 0, bytes_deduped: 131072
i: 99, status: 0, bytes_deduped: 131072
13107200 total bytes deduped in this operation
am: 4.8 MiB (4964352 bytes) converted to sparse holes.
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
072a152355788c767b97e4e4c0e4567720988b84 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
bf00d862c6ad436a1be2be606a8ab88d22166b89 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
0d44cdf030fb149e103cfdc164da3da2b7474c17 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
60831f0e7ffe4b49722612c18685c09f4583b1df am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
a19662b294a3ccdf35dbb18fdd72c62018526d7d am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
^C

Corruption occurs most often when there is a sequence like this in a file:

ref 1: hole
ref 2: extent A, offset 0
ref 3: hole
ref 4: extent A, offset 8192

This scenario typically arises due to hole-punching or deduplication.
Hole-punching replaces one extent ref with two references to the same
extent with a hole between them, so:

ref 1:  extent A, offset 0, length 16384

becomes:

ref 1:  extent A, offset 0, length 4096
ref 2:  hole, length 8192
ref 3:  extent A, offset 12288, length 4096
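The hole-punching transformation can be reproduced with fallocate.  This is
an illustrative sketch (not part of the original repro); it assumes a
filesystem that supports FALLOC_FL_PUNCH_HOLE (btrfs, ext4, xfs) and the
util-linux fallocate tool:

```shell
# Punch an 8KiB hole in the middle of a 16KiB file.  The file keeps its
# size, but the punched range reads back as zeros -- and on btrfs the
# original single extent ref becomes two refs with a hole between them.
f=$(mktemp)
head -c 16384 /dev/urandom > "$f"
fallocate --punch-hole --offset 4096 --length 8192 "$f"
echo "size after punch: $(stat -c %s "$f")"      # prints 16384
nonzero=$(dd if="$f" bs=4096 skip=1 count=2 2>/dev/null | tr -d '\0' | wc -c)
echo "nonzero bytes in punched range: $nonzero"  # prints 0
rm -f "$f"
```

On btrfs, running filefrag -v on the file afterwards would show the extent
reference split around the hole.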

Deduplication replaces two distinct extent refs surrounding a hole with
two references to one of the duplicate extents, turning this:

ref 1:  extent A, offset 0, length 4096
ref 2:  hole, length 8192
ref 3:  extent B, offset 0, length 4096

into this:

ref 1:  extent A, offset 0, length 4096
ref 2:  hole, length 8192
ref 3:  extent A, offset 0, length 4096

Compression is required (zlib, zstd, or lzo) for corruption to occur.
I am not able to reproduce the issue with an uncompressed extent, nor
have I observed any such corruption in the wild.

The presence or absence of the no-holes filesystem feature has no effect.

Ordinary writes can lead to pairs of extent references to 

Deadlock between dedup and rename

2018-08-15 Thread Zygo Blaxell
Every month or two I hit a btrfs deadlock like this:

dedup and rsync are both operating on the same file when the filesystem
locked up.  The deadlock happens at the moment when rsync renames its
temporary file (the dedup dst file) to replace the old version of the
file (the dedup src file).

Dedup ended up stuck with this stack trace:

[] call_rwsem_down_write_failed+0x13/0x20
[] down_write_nested+0x87/0xb0
[] btrfs_dedupe_file_range+0xdc/0x5f0
[] vfs_dedupe_file_range+0x210/0x240
[] do_vfs_ioctl+0x236/0x6b0
[] SyS_ioctl+0x76/0x90
[] do_syscall_64+0x70/0x190
[] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[] 0x

and rsync ended up stuck with this stack trace:

[] call_rwsem_down_write_failed+0x13/0x20
[] down_write_nested+0x87/0xb0
[] vfs_rename+0x18e/0x8c0
[] SyS_renameat2+0x4ce/0x520
[] do_syscall_64+0x70/0x190
[] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[] 0x

The file in question was somewhat large (>4GB) so there was probably some
dirty page flushing going on in the background, which may or may not
matter for reproducing the bug.

This is a fairly common occurrence when rsyncing large files while
bees is running, as the rsync temporary file is often a copy of its own
previous version, and bees will start deduplication at the head of the
temporary file before rsync finishes writing at the tail end.


signature.asc
Description: PGP signature


Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-14 Thread Zygo Blaxell
> /dev/sda   3.19MiB
> /dev/sdb   3.19MiB
> /dev/sdc   3.19MiB
> /dev/sdd   3.19MiB
> /dev/sde   3.19MiB
> 
> Unallocated:
> /dev/sda   5.63TiB
> /dev/sdb   5.63TiB
> /dev/sdc   5.63TiB
> /dev/sdd   5.63TiB
> /dev/sde   5.63TiB
> menion@Menionubuntu:~$
> menion@Menionubuntu:~$ sf -h
> The program 'sf' is currently not installed. You can install it by typing:
> sudo apt install ruby-sprite-factory
> menion@Menionubuntu:~$ df -h
> Filesystem  Size  Used Avail Use% Mounted on
> udev934M 0  934M   0% /dev
> tmpfs   193M   22M  171M  12% /run
> /dev/mmcblk0p3   28G   12G   15G  44% /
> tmpfs   962M 0  962M   0% /dev/shm
> tmpfs   5,0M 0  5,0M   0% /run/lock
> tmpfs   962M 0  962M   0% /sys/fs/cgroup
> /dev/mmcblk0p1  188M  3,4M  184M   2% /boot/efi
> /dev/mmcblk0p3   28G   12G   15G  44% /home
> /dev/sda 37T  6,6T   29T  19% /media/storage/das1
> tmpfs   193M 0  193M   0% /run/user/1000
> menion@Menionubuntu:~$ btrfs --version
> btrfs-progs v4.17
> 
> So I don't fully understand where the scrub data size comes from
> On Mon, Aug 13, 2018 at 23:56,  wrote:
> >
> > Running time of 55:06:35 indicates that the counter is right, it is not 
> > enough time to scrub the entire array using hdd.
> >
> > 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub start 
> > /dev/sdx1" only scrubs the selected partition,
> > whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual 
> > array.
> >
> > Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics and 
> > post the output.
> > For live statistics, use "sudo watch -n 1".
> >
> > By the way:
> > 0 errors despite multiple unclean shutdowns? I assumed that the write hole 
> > would corrupt parity the first time around, was i wrong?
> >
> > On 13-Aug-2018 09:20:36 +0200, men...@gmail.com wrote:
> > > Hi
> > > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> > > there are contradicting opinions by the, well, "several" ways to check
> > > the used space on a BTRFS RAID5 array, but I should be around 8TB of
> > > data.
> > > This array is running on kernel 4.17.3 and it definitely experienced
> > > power loss while data was being written.
> > > I can say that it went through at least a dozen unclean shutdowns.
> > > So following this thread I started my first scrub on the array, and
> > > this is the outcome (after having resumed it 4 times, two after a
> > > power loss...):
> > >
> > > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> > > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> > > total bytes scrubbed: 2.59TiB with 0 errors
> > >
> > > So, there are 0 errors, but I don't understand why it says 2.59TiB of
> > > scrubbed data. Is it possible that also this values is crap, as the
> > > non zero counters for RAID5 array?
> > > On Sat, Aug 11, 2018 at 17:29, Zygo Blaxell
> > >  wrote:
> > > >
> > > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> > > > > I guess that covers most topics, two last questions:
> > > > >
> > > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> > > >
> > > > Not really. It changes the probability distribution (you get an extra
> > > > chance to recover using a parity block in some cases), but there are
> > > > still cases where data gets lost that didn't need to be.
> > > >
> > > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> > > >
> > > > There may be benefits of raid5 metadata, but they are small compared to
> > > > the risks.
> > > >
> > > > In some configurations it may not be possible to allocate the last
> > > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > > > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > > > N is an odd number there could be one chunk left over in the array that
> > > > is unusable. Most users will find this irrelevant because a large disk
> > > > array that is filled to the last GB will become quite slow due

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-13 Thread Zygo Blaxell
On Mon, Aug 13, 2018 at 11:56:05PM +0200, erentheti...@mail.de wrote:
> Running time of 55:06:35 indicates that the counter is right, it is
> not enough time to scrub the entire array using hdd.
> 
> 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub
> start /dev/sdx1" only scrubs the selected partition,
> whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array.
> 
> Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics
> and post the output.
> For live statistics, use "sudo watch -n 1".
> 
> By the way:
> 0 errors despite multiple unclean shutdowns? I assumed that the write
> hole would corrupt parity the first time around, was i wrong?

You won't see the write hole from just a power failure.  You need a
power failure *and* a disk failure, and writes need to be happening at
the moment power fails.

Write hole breaks parity.  Scrub silently(!) fixes parity.  Scrub reads
the parity block and compares it to the computed parity, and if it's
wrong, scrub writes the computed parity back.  Normal RAID5 reads with
all disks online read only the data blocks, so they won't read the parity
block and won't detect wrong parity.

I did a couple of order-of-magnitude estimations of how likely a power
failure is to trash a btrfs RAID system and got a probability between 3%
and 30% per power failure if there were writes active at the time, and
a disk failed to join the array after boot.  That was based on 5 disks
having 31 writes queued with one of the disks being significantly slower
than the others (as failing disks often are) with continuous write load.

If you have a power failure on an array that isn't writing anything at
the time, nothing happens.

> 
> On 13-Aug-2018 09:20:36 +0200, men...@gmail.com wrote:
> > Hi
> > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> > there are contradicting opinions by the, well, "several" ways to check
> > the used space on a BTRFS RAID5 array, but I should be around 8TB of
> > data.
> > This array is running on kernel 4.17.3 and it definitely experienced
> > power loss while data was being written.
> > I can say that it went through at least a dozen unclean shutdowns.
> > So following this thread I started my first scrub on the array, and
> > this is the outcome (after having resumed it 4 times, two after a
> > power loss...):
> > 
> > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> > total bytes scrubbed: 2.59TiB with 0 errors
> > 
> > So, there are 0 errors, but I don't understand why it says 2.59TiB of
> > scrubbed data. Is it possible that also this values is crap, as the
> > non zero counters for RAID5 array?
> > On Sat, Aug 11, 2018 at 17:29, Zygo Blaxell
> >  wrote:
> > >
> > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> > > > I guess that covers most topics, two last questions:
> > > >
> > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> > >
> > > Not really. It changes the probability distribution (you get an extra
> > > chance to recover using a parity block in some cases), but there are
> > > still cases where data gets lost that didn't need to be.
> > >
> > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> > >
> > > There may be benefits of raid5 metadata, but they are small compared to
> > > the risks.
> > >
> > > In some configurations it may not be possible to allocate the last
> > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > > N is an odd number there could be one chunk left over in the array that
> > > is unusable. Most users will find this irrelevant because a large disk
> > > array that is filled to the last GB will become quite slow due to long
> > > free space search and seek times--you really want to keep usage below 95%,
> > > maybe 98% at most, and that means the last GB will never be needed.
> > >
> > > Reading raid5 metadata could theoretically be faster than raid1, but that
> > > depends on a lot of variables, so you can't assume it as a rule of thumb.
> > >
> > > Raid6 metadata is more interesting because it's the only currently
> > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately
> > > that b

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-13 Thread Zygo Blaxell
On Mon, Aug 13, 2018 at 09:20:22AM +0200, Menion wrote:
> Hi
> I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> there are contradicting opinions by the, well, "several" ways to check
> the used space on a BTRFS RAID5 array, but I should be around 8TB of
> data.
> This array is running on kernel 4.17.3 and it definitely experienced
> power loss while data was being written.
> I can say that it went through at least a dozen unclean shutdowns.
> So following this thread I started my first scrub on the array, and
> this is the outcome (after having resumed it 4 times, two after a
> power loss...):
> 
> menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> total bytes scrubbed: 2.59TiB with 0 errors
> 
> So, there are 0 errors, but I don't understand why it says 2.59TiB of
> scrubbed data. Is it possible that also this values is crap, as the
> non zero counters for RAID5 array?

I just tested a quick scrub with injected errors on 4.18.0 and it looks
like the garbage values are finally fixed (yay!).

I never saw invalid values for 'total bytes' from raid5; however, scrub
has (had?) trouble resuming, especially if the system was rebooted between
cancel and resume, but sometimes simply because the scrub had been suspended
too long (maybe if there are changes to the chunk tree...?).

55 hours for 2600 GB is just under 50GB per hour, which doesn't sound
too unreasonable for btrfs, though it is known to be a bit slow compared
to other raid5 implementations.

> On Sat, Aug 11, 2018 at 17:29, Zygo Blaxell
>  wrote:
> >
> > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> > > I guess that covers most topics, two last questions:
> > >
> > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> >
> > Not really.  It changes the probability distribution (you get an extra
> > chance to recover using a parity block in some cases), but there are
> > still cases where data gets lost that didn't need to be.
> >
> > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> >
> > There may be benefits of raid5 metadata, but they are small compared to
> > the risks.
> >
> > In some configurations it may not be possible to allocate the last
> > gigabyte of space.  raid1 will allocate 1GB chunks from 2 disks at a
> > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > N is an odd number there could be one chunk left over in the array that
> > is unusable.  Most users will find this irrelevant because a large disk
> > array that is filled to the last GB will become quite slow due to long
> > free space search and seek times--you really want to keep usage below 95%,
> > maybe 98% at most, and that means the last GB will never be needed.
> >
> > Reading raid5 metadata could theoretically be faster than raid1, but that
> > depends on a lot of variables, so you can't assume it as a rule of thumb.
> >
> > Raid6 metadata is more interesting because it's the only currently
> > supported way to get 2-disk failure tolerance in btrfs.  Unfortunately
> > that benefit is rather limited due to the write hole bug.
> >
> > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > or 4 mirror copies instead of just 2).  This would be much better for
> > metadata than raid6--more flexible, more robust, and my guess is that
> > it will be faster as well (no need for RMW updates or journal seeks).
> >
> > > -
> > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> > >
> 



Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-11 Thread Zygo Blaxell
On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> I guess that covers most topics, two last questions:
> 
> Will the write hole behave differently on Raid 6 compared to Raid 5 ?

Not really.  It changes the probability distribution (you get an extra
chance to recover using a parity block in some cases), but there are
still cases where data gets lost that didn't need to be.

> Is there any benefit of running Raid 5 Metadata compared to Raid 1 ? 

There may be benefits of raid5 metadata, but they are small compared to
the risks.

In some configurations it may not be possible to allocate the last
gigabyte of space.  raid1 will allocate 1GB chunks from 2 disks at a
time while raid5 will allocate 1GB chunks from N disks at a time, and if
N is an odd number there could be one chunk left over in the array that
is unusable.  Most users will find this irrelevant because a large disk
array that is filled to the last GB will become quite slow due to long
free space search and seek times--you really want to keep usage below 95%,
maybe 98% at most, and that means the last GB will never be needed.
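As a toy illustration of the stranded-chunk arithmetic (hypothetical sizes,
ignoring metadata chunks and the allocator's most-free-space heuristic):

```shell
# raid1 consumes 1GiB chunks from 2 disks at a time, so an odd total
# chunk count leaves one chunk that can never be paired.
disks=3; chunks_per_disk=3            # e.g. three hypothetical 3GiB devices
total=$((disks * chunks_per_disk))
paired=$((total / 2 * 2))             # chunks usable in raid1 pairs
echo "total chunks: $total, stranded: $((total - paired))"
# prints: total chunks: 9, stranded: 1
```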

Reading raid5 metadata could theoretically be faster than raid1, but that
depends on a lot of variables, so you can't assume it as a rule of thumb.

Raid6 metadata is more interesting because it's the only currently
supported way to get 2-disk failure tolerance in btrfs.  Unfortunately
that benefit is rather limited due to the write hole bug.

There are patches floating around that implement multi-disk raid1 (i.e. 3
or 4 mirror copies instead of just 2).  This would be much better for
metadata than raid6--more flexible, more robust, and my guess is that
it will be faster as well (no need for RMW updates or journal seeks).

> -
> FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> 



Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread Zygo Blaxell
On Sat, Aug 11, 2018 at 04:18:35AM +0200, erentheti...@mail.de wrote:
> Write hole:
> 
> 
> > The data will be readable until one of the data blocks becomes
> > inaccessible (bad sector or failed disk). This is because it is only the
> > parity block that is corrupted (old data blocks are still not modified
> > due to btrfs CoW), and the parity block is only required when recovering
> > from a disk failure.
> 
> I am unsure about your meaning. 
> Assuming you perform an unclean shutdown (eg. crash), and after restart
> perform a scrub, with no additional error (bad sector, bit-rot) before
> or after the crash:
> will you lose data? 

No, the parity blocks will be ignored and RAID5 will act like slow RAID0
if no other errors occur.

> Will you be able to mount the filesystem like normal? 

Yes.

> Additionaly, will the crash create additional errors like bad
> sectors and or bit-rot aside from the parity-block corruption?

No, only parity-block corruptions should occur.

> It's actually part of my first mail, where the btrfs Raid5/6 page
> assumes no data damage while the spinics comment implies the opposite.

The above assumes no drive failures or data corruption; however, if this
were the case, you could use RAID0 instead of RAID5.

The only reason to use RAID5 is to handle cases where at least one block
(or an entire disk) fails, so the behavior of RAID5 when all disks are
working is almost irrelevant.

A drive failure could occur at any time, so even if you mount successfully,
if a disk fails immediately after, any stripes affected by write hole will
be unrecoverably corrupted.

> The write hole does not seem as dangerous if you could simply scrub
> to repair damage (On smaller discs that is, where scrub doesn't take
> enough time for additional errors to occur)

Scrub can repair parity damage on normal data and metadata--it recomputes
parity from data if the data passes a CRC check.

No repair is possible for data in nodatasum files--the parity can be
recomputed, but there is no way to determine if the result is correct.

Metadata is always checksummed and transid verified; alas, there isn't
an easy way to get btrfs to perform an urgent scrub on metadata only.

> > Put another way: if all disks are online then RAID5/6 behaves like a slow
> > RAID0, and RAID0 does not have the partial stripe update problem because
> > all of the data blocks in RAID0 are independent. It is only when a disk
> > fails in RAID5/6 that the parity block is combined with data blocks, so
> > it is only in this case that the write hole bug can result in lost data.
> 
> So data will not be lost if no drive has failed?

Correct, but the array will have reduced failure tolerance, and RAID5
only matters when a drive has failed.  It is effectively operating in
degraded mode on parts of the array affected by write hole, and no single
disk failure can be tolerated there.

It is possible to recover the parity by performing an immediate scrub
after reboot, but this cannot be as effective as a proper RAID5 update
journal which avoids making the parity bad in the first place.

> > > > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable
> > > > to the write hole, but data is. In this configuration you can determine
> > > > with high confidence which files you need to restore from backup, and
> > > > the filesystem will remain writable to replace the restored data, 
> > > > because
> > > > raid1 does not have the write hole bug.
> 
> In regards to my earlier questions, what would change if i do -draid5 -mraid1?

Metadata would be using RAID1 which is not subject to the RAID5 write
hole issue.  It is much more tolerant of unclean shutdowns especially
in degraded mode.

Data in RAID5 may be damaged when the array is in degraded mode and
a write hole occurs (in either order as long as both occur).  Due to
RAID1 metadata, the filesystem will continue to operate properly,
allowing the damaged data to be overwritten or deleted.

> Lost Writes:
> 
> > Hotplugging causes an effect (lost writes) which can behave similarly
> > to the write hole bug in some instances. The similarity ends there.
> 
> Are we speaking about the same problem that is causing transid mismatch? 

Transid mismatch is usually caused by lost writes, by any mechanism
that prevents a write from being completed after the disk reports that
it was completed.

Drives may report that data is "in stable storage", i.e. the drive
believes it can complete the write in the future even if power is lost
now because the drive or controller has capacitors or NVRAM or similar.
If the drive is reset by the SATA host because of a cable disconnect
event, the drive may forget that it has promised to do writes in the
future.  Drives may simply lie, and claim that data has been written to
disk when the data is actually in volatile RAM and will disappear in a
power failure.

btrfs uses a transaction mechanism and CoW metadata to handle lost writes
within an interrupted transaction. 

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread Zygo Blaxell
On Fri, Aug 10, 2018 at 06:55:58PM +0200, erentheti...@mail.de wrote:
> Did i get you right?
> Please correct me if i am wrong:
> 
> Scrubbing seems to have been fixed, you only have to run it once.

Yes.

There is one minor bug remaining here:  when scrub detects an error
on any disk in a raid5/6 array, the error counts are garbage (random
numbers on all the disks).  You will need to inspect btrfs dev stats
or the kernel log messages to learn which disks are injecting errors.

This does not impair the scrubbing function, only the detailed statistics
report (scrub status -d).

If there are no errors, scrub correctly reports 0 for all error counts.
Only raid5/6 is affected this way--other RAID profiles produce correct
scrub statistics.

> Hotplugging (temporary connection loss) is affected by the write hole
> bug, and will create undetectable errors every 16 TB (crc32 limitation).

Hotplugging causes an effect (lost writes) which can behave similarly
to the write hole bug in some instances.  The similarity ends there.

They are really two distinct categories of problem.  Temporary connection
loss can do bad things to all RAID profiles on btrfs (not just RAID5/6)
and the btrfs requirements for handling connection loss and write holes
are very different.

> The write Hole Bug can affect both old and new data. 

Normally, only old data can be affected by the write hole bug.

The "new" data is not committed before the power failure (otherwise we
would call it "old" data), so any corrupted new data will be inaccessible
as a result of the power failure.  The filesytem will roll back to the
last complete committed data tree (discarding all new and modified data
blocks), then replay the fsync log (which repeats and completes some
writes that occurred since the last commit).  This process eliminates
new data from the filesystem whether the new data was corrupted by the
write hole or not.

Only corruptions that affect old data will remain, because old data is
not overwritten by data saved in the fsync log, and old data is not part
of the incomplete data tree that is rolled back after power failure.

Exception:  new data in nodatasum files can also be corrupted, but since
nodatasum disables all data integrity or recovery features it's hard to
define what "corrupted" means for a nodatasum file.

> Reason: BTRFS saves data in fixed size stripes, if the write operation
> fails midway, the stripe is lost.
> This does not matter much for Raid 1/10, data always uses a full stripe,
> and stripes are copied on write. Only new data could be lost.

This is incorrect.  Btrfs saves data in variable-sized extents (between
1 and 32768 4K data blocks) and btrfs has no concept of stripes outside of
its raid layer.  Stripes are never copied.

In RAID 1/10/DUP all data blocks are fully independent of each other,
i.e. writing to any block on these RAID profiles does not corrupt data in
any other block.  As a result these RAID profiles do not allow old data
to be corrupted by partially completed writes of new data.

There is striping in some profiles, but it is only used for performance
in these cases, and has no effect on data recovery.

> However, for some reason Raid 5/6 works with partial stripes, meaning
> that data is stored in stripes not completley filled by prior data,

In RAID 5/6 each data block is related to all other data blocks in the
same stripe with the parity block(s).  If any individual data block in the
stripe is updated, the parity block(s) must also be updated atomically,
or the wrong data will be reconstructed during RAID5/6 recovery.

Because btrfs does nothing to prevent it, some writes will occur
to RAID5/6 stripes that are already partially occupied by old data.
btrfs also does nothing to ensure that parity block updates are atomic,
so btrfs has the write hole bug as a result.

> and stripes are removed on write.

Stripes are never removed...?  A stripe is just a group of disk blocks
divided on 64K boundaries, same as mdadm and many hardware RAID5/6
implementations.

> Result: If the operation fails midway, the stripe is lost as is all
> data previously stored it.

You can only lose as many data blocks in each stripe as there are parity
disks (i.e. raid5 can lose 0 or 1 block, while raid6 can lose 0, 1, or 2
blocks); however, multiple writes can be lost affecting multiple stripes
in a single power loss event.  Losing even 1 block is often too much.  ;)

The data will be readable until one of the data blocks becomes
inaccessible (bad sector or failed disk).  This is because it is only the
parity block that is corrupted (old data blocks are still not modified
due to btrfs CoW), and the parity block is only required when recovering
from a disk failure.

Put another way:  if all disks are online then RAID5/6 behaves like a slow
RAID0, and RAID0 does not have the partial stripe update problem because
all of the data blocks in RAID0 are independent.  It is only when a disk
fails in RAID5/6 that the parity block is 

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread Zygo Blaxell
On Fri, Aug 10, 2018 at 03:40:23AM +0200, erentheti...@mail.de wrote:
> I am searching for more information regarding possible bugs related to
> BTRFS Raid 5/6. All sites i could find are incomplete and information
> contradicts itself:
>
> The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
> warns of the write hole bug, stating that your data remains safe
> (except data written during power loss, obviously) upon unclean shutdown
> unless your data gets corrupted by further issues like bit-rot, drive
> failure etc.

The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there are
no mitigations to prevent or avoid it in mainline kernels.

The write hole results from allowing a mixture of old (committed) and
new (uncommitted) writes to the same RAID5/6 stripe (i.e. a group of
blocks consisting of one related data or parity block from each disk
in the array, such that writes to any of the data blocks affect the
correctness of the parity block and vice versa).  If the writes were
not completed and one or more of the data blocks are not online, the
data blocks reconstructed by the raid5/6 algorithm will be corrupt.

If all disks are online, the write hole does not immediately
damage user-visible data as the old data blocks can still be read
directly; however, should a drive failure occur later, old data may
not be recoverable because the parity block will not be correct for
reconstructing the missing data block.  A scrub can fix write hole
errors if all disks are online, and a scrub should be performed after
any unclean shutdown to recompute parity data.
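The failure mode can be sketched as a toy model with one-byte "blocks" and
XOR parity (illustrative arithmetic only, not btrfs code):

```shell
# Toy write-hole model: a 3-disk raid5 "stripe" of one byte per disk.
d1=0x5a; d2=0x3c
p=$((d1 ^ d2))             # parity consistent with the old data
d1_new=0x7e                # new write lands on d1, but power fails
                           # before the parity block is rewritten
# Later the disk holding d2 fails; reconstruction XORs the new data
# block with the stale parity and produces garbage:
d2_rebuilt=$((d1_new ^ p))
printf 'expected d2=0x%02x, reconstructed d2=0x%02x\n' "$d2" "$d2_rebuilt"
# prints: expected d2=0x3c, reconstructed d2=0x18
```

A scrub with all disks online would recompute p from d1_new and d2 before
any disk failure, which is why scrubbing after an unclean shutdown closes
the window.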

The write hole always puts both old and new data at risk of damage;
however, due to btrfs's copy-on-write behavior, only the old damaged
data can be observed after power loss.  The damaged new data will have
no references to it written to the disk due to the power failure, so
there is no way to observe the new damaged data using the filesystem.
Not every interrupted write causes damage to old data, but some will.

Two possible mitigations for the write hole are:

- modify the btrfs allocator to prevent writes to partially filled
raid5/6 stripes (similar to what the ssd mount option does, except
with the correct parameters to match RAID5/6 stripe boundaries),
and advise users to run btrfs balance much more often to reclaim
free space in partially occupied raid stripes

- add a stripe write journal to the raid5/6 layer (either in
btrfs itself, or in a lower RAID5 layer).
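The first mitigation amounts to rounding allocations up to full-stripe
boundaries.  A toy calculation with an assumed geometry (5 disks = 4 data
+ 1 parity, 64KiB per-disk strips, so a full stripe holds 256KiB of data):

```shell
# Stripe-boundary arithmetic: writes rounded to full stripes never do
# read-modify-write on a stripe that already holds committed data.
data_disks=4; strip_kib=64
stripe_kib=$((data_disks * strip_kib))
off_kib=1000                          # hypothetical allocation offset
next=$(( (off_kib + stripe_kib - 1) / stripe_kib * stripe_kib ))
echo "stripe size: ${stripe_kib}KiB, next aligned offset: ${next}KiB"
# prints: stripe size: 256KiB, next aligned offset: 1024KiB
```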

There are assorted other ideas (e.g. copy the RAID-Z approach from zfs
to btrfs or dramatically increase the btrfs block size) that also solve
the write hole problem but are somewhat more invasive and less practical
for btrfs.

Note that the write hole also affects btrfs on top of other similar
raid5/6 implementations (e.g. mdadm raid5 without stripe journal).
The btrfs CoW layer does not understand how to allocate data to avoid RMW
raid5 stripe updates without corrupting existing committed data, and this
limitation applies to every combination of unjournalled raid5/6 and btrfs.

> The Wiki Gotchas Page (https://btrfs.wiki.kernel.org/index.php/Gotchas)
> warns of possible incorrigible "transid" mismatch, not stating which
> versions are affected or what transid mismatch means for your data. It
> does not mention the write hole at all.

Neither raid5 nor write hole are required to produce a transid mismatch
failure.  transid mismatch usually occurs due to a lost write.  Write hole
is a specific case of lost write, but write hole does not usually produce
transid failures (it produces header or csum failures instead).

During real disk failure events, multiple distinct failure modes can
occur concurrently.  i.e. both transid failure and write hole can occur
at different places in the same filesystem as a result of attempting to
use a failing disk over a long period of time.

A transid verify failure is metadata damage.  It will make the filesystem
readonly and make some data inaccessible as described below.

> This Mail Archive
> (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html)
> states that scrubbing BTRFS Raid 5/6 will always repair Data Corruption,
> but may corrupt your Metadata while trying to do so - meaning you have
> to scrub twice in a row to ensure data integrity.

Simple corruption (without write hole errors) has been fixed by scrubbing
for at least the last six months or so; kernel v4.14.xx and later can
definitely do it these days.  Both data and metadata.

If the metadata is damaged in any way (corruption, write hole, or transid
verify failure) on btrfs and btrfs cannot use the raid profile for
metadata to recover the damaged data, the filesystem is usually forever
readonly, and anywhere from 0 to 100% of the filesystem may be readable
depending on where in the metadata tree structure the error occurs (the
closer to the root of the tree the error is, the more of the filesystem
becomes inaccessible).

Re: RAID-1 refuses to balance large drive

2018-06-07 Thread Zygo Blaxell
On Sat, May 26, 2018 at 06:27:57PM -0700, Brad Templeton wrote:
> A few years ago, I encountered an issue (halfway between a bug and a
> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
> fairly full.   The problem was that after replacing (by add/delete) a
> small drive with a larger one, there were now 2 full drives and one
> new half-full one, and balance was not able to correct this situation
> to produce the desired result, which is 3 drives, each with a roughly
> even amount of free space.  It can't do it because the 2 smaller
> drives are full, and it doesn't realize it could just move one of the
> copies of a block off the smaller drive onto the larger drive to free
> space on the smaller drive, it wants to move them both, and there is
> nowhere to put them both.
> 
> I'm about to do it again, taking my nearly full array which is 4TB,
> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
> repeat the very time consuming situation, so I wanted to find out if
> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
> consider the upgrade to  bionic (4.15) though that adds a lot more to
> my plate before a long trip and I would prefer to avoid if I can.
> 
> So what is the best strategy:
> 
> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" 
> strategy)
> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
> from 4TB but possibly not enough)
> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
> recently vacated 6TB -- much longer procedure but possibly better

d) Run "btrfs balance start -dlimit=3 /fs" to make some unallocated
space on all drives *before* adding disks.  Then replace, resize up,
and balance until unallocated space on all disks is equal.  There is
no need to continue balancing after that, so once that point is reached
you can cancel the balance.

A number of bad things can happen when unallocated space goes to zero,
and being unable to expand a raid1 array is only one of them.  Avoid that
situation even when not resizing the array, because some cases can be
very difficult to get out of.

Assuming your disk is not filled to the last gigabyte, you'll be able
to keep at least 1GB unallocated on every disk at all times.  Monitor
the amount of unallocated space and balance a few data block groups
(e.g. -dlimit=3) whenever unallocated space gets low.
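The monitoring rule above amounts to a small policy function.  Here is a
hedged sketch of it in Python; the per-device numbers would come from
something like `btrfs filesystem usage -b` (parsing is left out since
the output format varies between btrfs-progs versions), and the 1GiB
floor is just the figure suggested above:

```python
# Sketch of the "keep unallocated space above a floor" rule.  The caller
# supplies per-device unallocated byte counts; when any device falls
# below the floor, it is time to run a small limited balance such as
# `btrfs balance start -dlimit=3 <mnt>` to reclaim space from partially
# filled block groups.

GIB = 1024 ** 3

def balance_needed(unallocated_by_dev, floor_bytes=1 * GIB):
    """Return the devices whose unallocated space fell below the floor."""
    return [dev for dev, free in unallocated_by_dev.items()
            if free < floor_bytes]

# Example: one disk has dropped under 1GiB unallocated.
low = balance_needed({'/dev/sda': 120 * GIB,
                      '/dev/sdb': 512 * 1024 ** 2,
                      '/dev/sdc': 80 * GIB})
assert low == ['/dev/sdb']   # this one needs a limited balance
```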

A potential btrfs enhancement area:  allow the 'devid' parameter of
balance to specify two disks to balance block groups that contain chunks
on both disks.  We want to balance only those block groups that consist of
one chunk on each smaller drive.  This redistributes those block groups
to have one chunk on the large disk and one chunk on one of the smaller
disks, freeing space on the other small disk for the next block group.
Block groups that consist of a chunk on the big disk and one of the
small disks are already in the desired configuration, so rebalancing
them is just a waste of time.  Currently it's only possible to do this
by writing a script to select individual block groups with python-btrfs
or similar--much faster than plain btrfs balance for this case, but more
involved to set up.
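The selection logic such a script would implement is simple; what is
involved is extracting the chunk-to-device map.  Below is an
illustration over a made-up chunk map (in practice the map would come
from a tool like python-btrfs; the data structures here are hypothetical,
invented for the example):

```python
# Pick out block groups striped across *both* smaller devices.  Only
# those are worth balancing: relocating one moves a chunk onto the big
# disk and frees space on one of the small disks.  Block groups that
# already have a chunk on the big disk are left alone.

def groups_to_balance(block_groups, small_devs):
    """block_groups: {bg_start_address: set of devids holding its chunks}."""
    small = set(small_devs)
    return [bg for bg, devs in block_groups.items()
            if small <= devs]

chunk_map = {
    0x10000000: {1, 2},   # one chunk on each small disk -> balance this
    0x20000000: {1, 3},   # already has a chunk on the big disk -> skip
    0x30000000: {2, 3},   # likewise -> skip
}
# devids 1 and 2 are the small disks, 3 is the new large one
assert groups_to_balance(chunk_map, [1, 2]) == [0x10000000]
```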

> Or has this all been fixed and method A will work fine and get to the
> ideal goal -- 3 drives, with available space suitably distributed to
> allow full utilization over time?

Re: Any chance to get snapshot-aware defragmentation?

2018-05-31 Thread Zygo Blaxell
On Mon, May 21, 2018 at 11:38:28AM -0400, Austin S. Hemmelgarn wrote:
> On 2018-05-21 09:42, Timofey Titovets wrote:
> > Mon, 21 May 2018 at 16:16, Austin S. Hemmelgarn :
> > > On 2018-05-19 04:54, Niccolò Belli wrote:
> > > > On venerdì 18 maggio 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:
> > > > > With a bit of work, it's possible to handle things sanely.  You can
> > > > > deduplicate data from snapshots, even if they are read-only (you need
> > > > > to pass the `-A` option to duperemove and run it as root), so it's
> > > > > perfectly reasonable to only defrag the main subvolume, and then
> > > > > deduplicate the snapshots against that (so that they end up all being
> > > > > reflinks to the main subvolume).  Of course, this won't work if you're
> > > > > short on space, but if you're dealing with snapshots, you should have
> > > > > enough space that this will work (because even without defrag, it's
> > > > > fully possible for something to cause the snapshots to suddenly take
> > > > > up a lot more space).
> > > > 
> > > > Been there, tried that. Unfortunately even if I skip the defreg a simple
> > > > 
> > > > duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs
> > > > 
> > > > is going to eat more space than it was previously available (probably
> > > > due to autodefrag?).
> > > It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME
> > > ioctl).  There's two things involved here:
> > 
> > > * BTRFS has somewhat odd and inefficient handling of partial extents.
> > > When part of an extent becomes unused (because of a CLONE ioctl, or an
> > > EXTENT_SAME ioctl, or something similar), that part stays allocated
> > > until the whole extent would be unused.
> > > * You're using the default deduplication block size (128k), which is
> > > larger than your filesystem block size (which is at most 64k, most
> > > likely 16k, but might be 4k if it's an old filesystem), so deduplicating
> > > can split extents.
> > 
> > That's a metadata node leaf != fs block size.
> > btrfs fs block size == machine page size currently.
> You're right, I keep forgetting about that (probably because BTRFS is pretty
> much the only modern filesystem that doesn't let you change the block size).
> > 
> > > Because of this, if a duplicate region happens to overlap the front of
> > > an already shared extent, and the end of said shared extent isn't
> > > aligned with the deduplication block size, the EXTENT_SAME call will
> > > deduplicate the first part, creating a new shared extent, but not the
> > > tail end of the existing shared region, and all of that original shared
> > > region will stick around, taking up extra space that it wasn't before.
> > 
> > > Additionally, if only part of an extent is duplicated, then that area of
> > > the extent will stay allocated, because the rest of the extent is still
> > > referenced (so you won't necessarily see any actual space savings).
> > 
> > > You can mitigate this by telling duperemove to use the same block size
> > > as your filesystem using the `-b` option.   Note that using a smaller
> > > block size will also slow down the deduplication process and greatly
> > > increase the size of the hash file.
> > 
> > duperemove -b only controls how the data is hashed, nothing more or
> > less, and it only supports 4KiB..1MiB
> And you can only deduplicate the data at the granularity you hashed it at.
> In particular:
> 
> * The total size of a region being deduplicated has to be an exact multiple
> of the hash block size (what you pass to `-b`).  So for the default 128k
> size, you can only deduplicate regions that are multiples of 128k long
> (128k, 256k, 384k, 512k, etc).   This is a simple limit derived from how
> blocks are matched for deduplication.
> * Because duperemove uses fixed hash blocks (as opposed to using a rolling
> hash window like many file synchronization tools do), the regions being
> deduplicated also have to be exactly aligned to the hash block size.  So,
> with the default 128k size, you can only deduplicate regions starting at 0k,
> 128k, 256k, 384k, 512k, etc, but not ones starting at, for example, 64k into
> the file.
> > 
> > And the dedupe block size will change the efficiency of deduplication,
> > while the count of hash-block pairs will change the hash file size and
> > time complexity.
> > 
> > Let's assume that: 'A' - 1KiB of data, 'AAAA' - 4KiB with repeated pattern.
> > 
> > So, example, you have 2 of 2x4KiB blocks:
> > 1: 'AAAABBBB'
> > 2: 'BBBBAAAA'
> > 
> > With -b 8KiB hash of first block not same as second.
> > But with -b 4KiB duperemove will see both 'AAAA' and 'BBBB'
> > And then those blocks will be deduped.
> This supports what I'm saying though.  Your deduplication granularity is
> bounded by your hash granularity.  If in addition to the above you have a
> file that looks like:
> 
> AABBBBAA
> 
> It would not get deduplicated against the first two at either `-b 4k` or
> `-b 8k` despite the middle 4k of the file being an exact duplicate.
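The alignment point is easy to demonstrate with a toy fixed-block hasher
(one character standing in for 1KiB of data, as in the examples above).
This is hypothetical illustration code, not duperemove's actual
implementation:

```python
# Toy fixed-block deduper: hash aligned, full-sized blocks and look for
# matching blocks between files.

def block_hashes(data, bsize):
    # only aligned, full-sized blocks are hashed; a tail shorter than
    # bsize can never be matched at this granularity
    return {data[i:i + bsize]
            for i in range(0, len(data) - bsize + 1, bsize)}

f1 = b'AAAABBBB'   # 8KiB file with a 4KiB run of B at offset 4KiB
f2 = b'AABBBBAA'   # the same 4KiB of B, but starting at offset 2KiB

# With 4KiB blocks, the duplicate run in f2 straddles two aligned blocks,
# so no hashes match even though 4KiB of identical data exists.
assert block_hashes(f1, 4) & block_hashes(f2, 4) == set()

# A rolling-hash scan (which duperemove does not do) would find it:
assert b'BBBB' in f2
```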

Re: [PATCH 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-16 Thread Zygo Blaxell
On Sun, May 13, 2018 at 11:26:39AM -0700, Darrick J. Wong wrote:
> On Sun, May 13, 2018 at 06:21:52PM +, Mark Fasheh wrote:
> > On Fri, May 11, 2018 at 05:06:34PM -0700, Darrick J. Wong wrote:
> > > On Fri, May 11, 2018 at 12:26:51PM -0700, Mark Fasheh wrote:
> > > > Right now we return EINVAL if a process does not have permission to 
> > > > dedupe a
> > > > file. This was an oversight on my part. EPERM gives a true description 
> > > > of
> > > > the nature of our error, and EINVAL is already used for the case that 
> > > > the
> > > > filesystem does not support dedupe.
> > > > 
> > > > Signed-off-by: Mark Fasheh 
> > > > ---
> > > >  fs/read_write.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > index 77986a2e2a3b..8edef43a182c 100644
> > > > --- a/fs/read_write.c
> > > > +++ b/fs/read_write.c
> > > > @@ -2038,7 +2038,7 @@ int vfs_dedupe_file_range(struct file *file, 
> > > > struct file_dedupe_range *same)
> > > > info->status = -EINVAL;
> > > > } else if (!(is_admin || (dst_file->f_mode & 
> > > > FMODE_WRITE) ||
> > > >  uid_eq(current_fsuid(), dst->i_uid))) {
> > > > -   info->status = -EINVAL;
> > > > +   info->status = -EPERM;
> > > 
> > > Hmm, are we allowed to change this aspect of the kabi after the fact?
> > > 
> > > Granted, we're only trading one error code for another, but will the
> > > existing users of this care?  xfs_io won't and I assume duperemove won't
> > > either, but what about bees? :)
> > 
> > Yeah if you see my initial e-mail I check bees and also rust-btrfs. I think
> > this is fine as we're simply expanding on an error code return. There's no
> > magic behavior expected with respect to these error codes either.
> 
> Ok.  No objections from me, then.
> 
> Acked-by: Darrick J. Wong 

For what it's worth, no objection from me either.  ;)

bees runs only with admin privilege and will never hit the modified line.

If bees is started without admin privilege, the TREE_SEARCH_V2 ioctl
fails.  bees uses this ioctl to walk over all the data in the filesystem,
so without admin privilege, bees never opens, reads, or dedupes anything.

bees relies on having an accurate internal model of btrfs structure and
behavior to issue dedup commands that will work and do useful things;
however, unexpected kernel behavior or concurrent user data changes
will make some dedups fail.  When that happens bees just abandons the
extent immediately:  a user data change will be handled in the next pass
over the filesystem, but an unexpected kernel behavior needs bees code
changes to correctly predict the new kernel behavior before the dedup
can be reattempted.

> --D
> 
> > --Mark
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


signature.asc
Description: PGP signature


Re: Hard link not persisted on fsync

2018-04-19 Thread Zygo Blaxell
On Mon, Apr 16, 2018 at 09:35:24AM -0500, Jayashree Mohan wrote:
> Hi,
> 
> The following seems to be a crash consistency bug on btrfs, where in
> the link count is not persisted even after a fsync on the original
> file.
> 
> Consider the following workload :
> creat foo
> link (foo, A/bar)
> fsync(foo)
> ---Crash---
> 
> Now, on recovery we expect the metadata of foo to be persisted i.e
> have a link count of 2. However in btrfs, the link count is 1 and file
> A/bar is not persisted. The expected behaviour would be to persist the
> dependencies of inode foo. That is to say, shouldn't fsync of foo
> persist A/bar and correctly update the link count?

Those dependencies are backward.  foo's inode doesn't depend on anything
but the data in the file foo, and foo's inode itself.

"foo" and "A/bar" are dirents that both depend on the inode of foo, which
implies that "A" and "." must be updated atomically with foo's inode.
If you had called fsync(A) then we'd expect A/bar to exist and the inode
to have a link count of 2.  If you'd called fsync(.) then...well, you
didn't modify "." at all, so I guess either outcome is valid as long as
the inode link count matches the number of dirents referencing the inode.

But then...why does foo exist at all?  I'd expect at least some tests
would end without foo on disk either, since all that was fsync()ed was the
foo inode, not the foo dirent in the directory '.'.  Does btrfs combine
creating foo and updating foo's inode into a single atomic operation?
I vaguely recall that it does exactly that, in order to solve a bug
some years ago.  What happens if you add a rename, e.g.

unlink foo2 # make sure foo2 doesn't exist
creat foo
rename(foo, foo2)
link(foo2, A/bar)
fsync(foo2)

Do you get foo or foo2?  I'd expect foo since you didn't fsync '.',
but maybe rename implies flush and you get foo2.
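For anyone who wants to poke at these workloads, here is the sequence
from the report as runnable Python (minus the crash itself, which needs
something like dm-log-writes or a VM to test for real; this only shows
the live-filesystem state, where the outcome is not in dispute):

```python
# The reported workload: creat foo; link(foo, A/bar); fsync(foo).
# On a live filesystem the link count is of course 2; the question in
# the thread is only what must survive a crash right after the fsync.

import os
import tempfile

d = tempfile.mkdtemp()
os.mkdir(os.path.join(d, 'A'))

fd = os.open(os.path.join(d, 'foo'), os.O_CREAT | os.O_WRONLY, 0o644)
os.link(os.path.join(d, 'foo'), os.path.join(d, 'A', 'bar'))
os.fsync(fd)   # does this have to persist A/bar and nlink=2?
os.close(fd)

assert os.stat(os.path.join(d, 'foo')).st_nlink == 2
```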

That's not to say that fsync is not a rich source of filesystem bugs.
btrfs did once have (and maybe still has?) a bug where renames and fsync
can create a dirent with no inode, e.g.

loop continuously:
creat foo
write(foo, data)
fsync(foo)
rename(foo, bar)

and crash somewhere in the middle of the loop, which will create a
dirent "foo" that points to a non-existent inode.

Removing the "fsync" works around the bug.  rename() does a flush anyway,
so the fsync() wasn't needed, but fsync() shouldn't _create_ a filesystem
inconsistency, especially when Googling recommends app developers to
sprinkle fsync()s indiscriminately in their code to prevent their data
from being mangled.

I haven't been tracking to see if that's fixed yet.  I last saw it on
4.11, but I have been aggressively avoiding fsync with eatmydata for
some years now.

> Note that ext4, xfs and f2fs recover to the correct link count of 2
> for the above workload.

Do those filesystems also work if you remove the fsync?  That may be
your answer:  they could be flushing the other metadata earlier, before
you call fsync().

> Let us know what you think about this behavior.
> 
> Thanks,
> Jayashree Mohan




Re: Status of RAID5/6

2018-04-04 Thread Zygo Blaxell
On Wed, Apr 04, 2018 at 11:31:33PM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 08:01 AM, Zygo Blaxell wrote:
> > On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> >> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> [...]
> >> Before you pointed out that the non-contiguous block written has
> >> an impact on performance. I am replying that the switching from a
> >> different BG happens at the stripe-disk boundary, so in any case the
> >> block is physically interrupted and switched to another disk
> > 
> > The difference is that the write is switched to a different local address
> > on the disk.
> > 
> > It's not "another" disk if it's a different BG.  Recall in this plan
> > there is a full-width BG that is on _every_ disk, which means every
> > small-width BG shares a disk with the full-width BG.  Every extent tail
> > write requires a seek on a minimum of two disks in the array for raid5,
> > three disks for raid6.  A tail that is strip-width minus one will hit
> > N - 1 disks twice in an N-disk array.
> 
> Below I made a little simulation; my results telling me another thing:
> 
> Current BTRFS (w/write hole)
> 
> Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)
> 
> Case A.1): extent size = 192kb:
> 5 writes of 64kb spread on 5 disks (3data + 2 parity)
> 
> Case A.2.2): extent size = 256kb: (optimistic case: contiguous space 
> available)
> 5 writes of 64kb spread on 5 disks (3 data + 2 parity)
> 2 reads of 64 kb spread on 2 disks (two old data of the stripe) [**]
> 3 writes of 64 kb spread on 3 disks (data + 2 parity)
> 
> Note that the two reads are contiguous to the 5 writes both in term of
> space and time. The three writes are contiguous only in terms of space,
> but not in terms of time, because these could happen only after the 2
> reads and the consequent parities computations. So we should consider
> that between these two events, some disk activities happen; this means
> seeks between the 2 reads and the 3 writes
> 
> 
> BTRFS with multiple BG (wo/write hole)
> 
> Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)
> 
> Case B.1): extent size = 192kb:
> 5 writes of 64kb spread on 5 disks
> 
> Case B.2): extent size = 256kb:
> 5 writes of 64kb spread on 5 disks in BG#1
> 3 writes of 64 kb spread on 3 disks in BG#2 (which requires 3 seeks)
> 
> So if I count correctly:
> - case B1 vs A1: these are equivalent
> - case B2 vs A2.1/A2.2:
>   8 writes vs 8 writes
>   3 seeks vs 3 seeks
>   0 reads vs 2 reads
> 
> So to me it seems that the cost of doing a RMW cycle is worse than
> seeking to another BG.

Well, RMW cycles are dangerous, so being slow as well is just a second
reason never to do them.
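For anyone checking the quoted arithmetic, here is a sketch that
reproduces the operation counts under the assumptions stated above
(5-disk raid6, 64KiB per-disk strips, so 192KiB of data per full
stripe).  The formulas are my own generalization of the enumerated
cases, not code from btrfs:

```python
# Operation counts for writing one extent, under two layouts.

STRIP = 64                   # KiB written to each disk per stripe
DATA_DISKS, PARITY = 3, 2    # 5-disk raid6
FULL = STRIP * DATA_DISKS    # 192 KiB of data per full stripe

def rmw_cost(extent_kib):
    """Current btrfs: the partial-stripe tail does read-modify-write."""
    full, tail = divmod(extent_kib, FULL)
    writes = full * (DATA_DISKS + PARITY)
    reads = 0
    if tail:
        tail_strips = tail // STRIP
        reads = DATA_DISKS - tail_strips   # old data strips, for parity
        writes += tail_strips + PARITY
    return writes, reads

def multi_bg_cost(extent_kib):
    """Multiple-BG plan: the tail goes to a narrower BG, no reads."""
    full, tail = divmod(extent_kib, FULL)
    writes = full * (DATA_DISKS + PARITY)
    if tail:
        writes += tail // STRIP + PARITY
    return writes, 0

assert rmw_cost(192) == (5, 0)        # case A.1: 5 writes
assert rmw_cost(256) == (8, 2)        # case A.2.2: 8 writes, 2 reads
assert multi_bg_cost(192) == (5, 0)   # case B.1: equivalent
assert multi_bg_cost(256) == (8, 0)   # case B.2: 8 writes, no reads
```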

> Anyway I am reaching the conclusion, also thanks of this discussion,
> that this is not enough. Even if we had solve the problem of the
> "extent smaller than stripe" write, we still face gain this issue when
> part of the file is changed.
> In this case the file update breaks the old extent and will create a
> three extents: the first part, the new part, the last part. Until that
> everything is OK. However the "old" part of the file would be marked
> as free space. But using this part could require a RMW cycle

You cannot use that free space within RAID stripes because it would
require RMW, and RMW causes write hole.  The space would have to be kept
unavailable until the rest of the RAID stripe was deleted.

OTOH, if you can solve that free space management problem, you don't
have to do anything else to solve write hole.  If you never RMW then
you never have the write hole in the first place.

> I am concluding that the only two reliable solutions are 
> a) variable stripe size (like ZFS does) 
> or b) logging the RMW cycle of a stripe 

Those are the only solutions that don't require a special process for
reclaiming unused space in RAID stripes.  If you have that, you have a
few more options; however, they all involve making a second copy of the
data at a later time (as opposed to option b, which makes a second
copy of the data during the original write).

a) also doesn't support nodatacow files (AFAIK ZFS doesn't have those)
and it would require defrag to get the inefficiently used space back.

b) is the best of the terrible options.  It minimizes the impact on the
rest of the filesystem since it can fix RMW inconsistency without having
to eliminate the RMW cases.  It doesn't require rewriting the allocator
nor does it require users to run defrag or balance periodically.

> [**] Does someone know if the checksum are checked during this read ?
> [...]
>  
> BR
> G.Baroncelli
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




Re: Status of RAID5/6

2018-04-04 Thread Zygo Blaxell
On Tue, Apr 03, 2018 at 09:08:01PM -0600, Chris Murphy wrote:
> On Tue, Apr 3, 2018 at 11:03 AM, Goffredo Baroncelli <kreij...@inwind.it> 
> wrote:
> > On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> >> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> >>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> >>>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> >>>>> I thought that a possible solution is to create BG with different
> >>>> number of data disks. E.g. supposing to have a raid 6 system with 6
> >>>> disks, where 2 are parity disk; we should allocate 3 BG
> >>>>> BG #1: 1 data disk, 2 parity disks
> >>>>> BG #2: 2 data disks, 2 parity disks,
> >>>>> BG #3: 4 data disks, 2 parity disks
> >>>>>
> >>>>> For simplicity, the disk-stripe length is assumed = 4K.
> >>>>>
> >>>>> So If you have a write with a length of 4 KB, this should be placed
> >>>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
> >>>> should be placed in in BG#2, then in BG#1.
> >>>>> This would avoid space wasting, even if the fragmentation will
> >>>> increase (but shall the fragmentation matters with the modern solid
> >>>> state disks ?).
> >>> I don't really see why this would increase fragmentation or waste space.
> >
> >> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> >> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> >> remaining 2 blocks).  It also flips the usual order of "determine size
> >> of extent, then allocate space for it" which might require major surgery
> >> on the btrfs allocator to implement.
> >
> > I have to point out that in any case the extent is physically interrupted 
> > at the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 
> > 128KB, the first half is written in the first disk, the other in the 2nd 
> > disk.  If you want to write 96kb, the first 64 are written in the first 
> > disk, the last part in the 2nd, only on a different BG.
> > So yes there is a fragmentation from a logical point of view; from a 
> > physical point of view the data is spread on the disks in any case.
> >
> > In any case, you are right, we should gather some data, because the 
> > performance impact are no so clear.
> 
> They're pretty clear, and there's a lot written about small file size
> and parity raid performance being shit, no matter the implementation
> (md, ZFS, Btrfs, hardware maybe less so just because of all the
> caching and extra processing hardware that's dedicated to the task).

Pretty much everything goes fast if you put a faster non-volatile cache
in front of it.

> The linux-raid@ list is full of optimizations for this that are use
> case specific. One of those that often comes up is how badly suited
> raid56 are for e.g. mail servers, tons of small file reads and writes,
> and all the disk contention that comes up, and it's even worse when
> you lose a disk, and even if you're running raid 6 and lose two disk
> it's really god awful. It can be unexpectedly a disqualifying setup
> without prior testing in that condition: can your workload really be
> usable for two or three days in a double degraded state on that raid6?
> *shrug*
> 
> Parity raid is well suited for full stripe reads and writes, lots of
> sequential writes. Ergo a small file is anything less than a full
> stripe write. Of course, delayed allocation can end up making for more
> full stripe writes. But now you have more RMW which is the real
> performance killer, again no matter the raid.

RMW isn't necessary if you have properly configured COW on top.
ZFS doesn't do RMW at all.  OTOH for some workloads COW is a step in a
different wrong direction--the btrfs raid5 problems with nodatacow
files can be solved by stripe logging and nothing else.

Some equivalent of autodefrag that repacks your small RAID stripes
into bigger ones will burn 3x your write IOPS eventually--it just
lets you defer the inevitable until a hopefully more convenient time.
A continuously loaded server never has a more convenient time, so it
needs a different solution.

> > I am not worried abut having different BG; we have problem with these 
> > because we never developed tool to handle this issue properly (i.e. a 
> > daemon which starts a balance when needed). But I hope that this will be 
> > solved in future.
> >
> > In any case, the all solutions proposed have their trade off:
> >
> > - a) as is: write hole 

Re: Status of RAID5/6

2018-04-04 Thread Zygo Blaxell
On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> >> I have to point out that in any case the extent is physically
> >> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> >> you want to write 128KB, the first half is written in the first disk,
> >> the other in the 2nd disk.  If you want to write 96kb, the first 64
> >> are written in the first disk, the last part in the 2nd, only on a
> >> different BG.
> > The "only on a different BG" part implies something expensive, either
> > a seek or a new erase page depending on the hardware.  Without that,
> > nearby logical blocks are nearby physical blocks as well.
> 
> In any case it happens on a different disk

No it doesn't.  The small-BG could be on the same disk(s) as the big-BG.

> >> So yes there is a fragmentation from a logical point of view; from a
> >> physical point of view the data is spread on the disks in any case.
> 
> > What matters is the extent-tree point of view.  There is (currently)
> > no fragmentation there, even for RAID5/6.  The extent tree is unaware
> > of RAID5/6 (to its peril).
> 
> Before you pointed out that the non-contiguous block written has
> an impact on performance. I am replying that the switching from a
> different BG happens at the stripe-disk boundary, so in any case the
> block is physically interrupted and switched to another disk

The difference is that the write is switched to a different local address
on the disk.

It's not "another" disk if it's a different BG.  Recall in this plan
there is a full-width BG that is on _every_ disk, which means every
small-width BG shares a disk with the full-width BG.  Every extent tail
write requires a seek on a minimum of two disks in the array for raid5,
three disks for raid6.  A tail that is strip-width minus one will hit
N - 1 disks twice in an N-disk array.

> However yes: from an extent-tree point of view there will be an increase
> of number extents, because the end of the writing is allocated to
> another BG (if the size is not stripe-boundary)
> 
> > If an application does a loop writing 68K then fsync(), the multiple-BG
> > solution adds two seeks to read every 68K.  That's expensive if sequential
> > read bandwidth is more scarce than free space.
> 
> Why you talk about an additional seeks? In any case (even without the
> additional BG) the read happens from another disks

See above:  not another disk, usually a different location on two or
more of the same disks.

> >> * c),d),e) are applied only for the tail of the extent, in case the
> > size is less than the stripe size.
> > 
> > It's only necessary to split an extent if there are no other writes
> > in the same transaction that could be combined with the extent tail
> > into a single RAID stripe.  As long as everything in the RAID stripe
> > belongs to a single transaction, there is no write hole
> 
> May be that a more "simpler" optimization would be close the transaction
> when the data reach the stripe boundary... But I suspect that it is
> not so simple to implement.

Transactions exist in btrfs to batch up writes into big contiguous extents
already.  The trick is to _not_ do that when one transaction ends and
the next begins, i.e. leave a space at the end of the partially-filled
stripe so that the next transaction begins in an empty stripe.

This does mean that there will only be extra seeks during transaction
commit and fsync()--which were already very seeky to begin with.  It's
not necessary to write a partial stripe when there are other extents to
combine.

So there will be double the amount of seeking, but depending on the
workload, it could double a very small percentage of writes.

> > Not for d.  Balance doesn't know how to get rid of unreachable blocks
> > in extents (it just moves the entire extent around) so after a balance
> > the writes would still be rounded up to the stripe size.  Balance would
> > never be able to free the rounded-up space.  That space would just be
> > gone until the file was overwritten, deleted, or defragged.
> 
> If balance is capable to move the extent, why not place one near the
> other during a balance ? The goal is not to limit the the writing of
> the end of a extent, but avoid writing the end of an extent without
> further data (e.g. the gap to the stripe has to be filled in the
> same transaction)

That's plan f (leave gaps in RAID stripes empty).  Balance will repack
short extents into RAID stripes nicely.

Plan d can't do that because plan d overallocates the extent so that
the extent fills the stripe (only some of the extent is used for data).
Small but important difference.

> BR
> G.Baroncelli
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


signature.asc
Description: PGP signature


Re: Status of RAID5/6

2018-04-03 Thread Zygo Blaxell
On Tue, Apr 03, 2018 at 07:03:06PM +0200, Goffredo Baroncelli wrote:
> On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> > On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> >> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> >>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> >>>> I thought that a possible solution is to create BG with different
> >>> number of data disks. E.g. supposing to have a raid 6 system with 6
> >>> disks, where 2 are parity disk; we should allocate 3 BG
> >>>> BG #1: 1 data disk, 2 parity disks
> >>>> BG #2: 2 data disks, 2 parity disks,
> >>>> BG #3: 4 data disks, 2 parity disks
> >>>>
> >>>> For simplicity, the disk-stripe length is assumed = 4K.
> >>>>
> >>>> So If you have a write with a length of 4 KB, this should be placed
> >>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
> >>> should be placed in in BG#2, then in BG#1.
> >>>> This would avoid space wasting, even if the fragmentation will
> >>> increase (but shall the fragmentation matters with the modern solid
> >>> state disks ?).
> >> I don't really see why this would increase fragmentation or waste space.
> 
> > Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> > to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> > remaining 2 blocks).  It also flips the usual order of "determine size
> > of extent, then allocate space for it" which might require major surgery
> > on the btrfs allocator to implement.
> 
> I have to point out that in any case the extent is physically
> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> you want to write 128KB, the first half is written on the first disk,
> the other on the 2nd disk.  If you want to write 96KB, the first 64KB
> are written on the first disk, the last part on the 2nd, only on a
> different BG.

The "only on a different BG" part implies something expensive, either
a seek or a new erase page depending on the hardware.  Without that,
nearby logical blocks are nearby physical blocks as well.

> So yes there is a fragmentation from a logical point of view; from a
> physical point of view the data is spread on the disks in any case.

What matters is the extent-tree point of view.  There is (currently)
no fragmentation there, even for RAID5/6.  The extent tree is unaware
of RAID5/6 (to its peril).

ZFS makes its thing-like-the-extent-tree aware of RAID5/6, and it can
put a stripe of any size anywhere.  If we're going to do that in btrfs,
you might as well just do what ZFS does.

OTOH, variable-size block groups give us read-compatibility with old
kernel versions (and write-compatibility for that matter--a kernel that
didn't know about the BG separation would just work but have write hole).

If an application does a loop writing 68K then fsync(), the multiple-BG
solution adds two seeks to read every 68K.  That's expensive if sequential
read bandwidth is more scarce than free space.

> In any case, you are right, we should gather some data, because the
> performance impact is not so clear.
> 
> I am not worried about having different BGs; we have problems with these
> because we never developed tools to handle this issue properly (i.e. a
> daemon which starts a balance when needed). But I hope that this will
> be solved in the future.

Balance daemons are easy to the point of being trivial to write in Python.

The balancing itself is quite expensive and invasive:  can't usefully
ionice it, can only abort it on block group boundaries, can't delete
snapshots while it's running.

If balance could be given a vrange that was the size of one extent...then
we could talk about daemons.

> In any case, the all solutions proposed have their trade off:
> 
> - a) as is: write hole bug
> - b) variable stripe size (like ZFS): big impact on how btrfs handle
> the extent. limited waste of space
> - c) logging data before writing: we write the data two times in a
> short time window. Moreover the log area is written several orders of
> magnitude more than the other areas; there were some patches around
> - d) rounding the writing to the stripe size: waste of space; simple
> to implement;
> - e) different BG with different stripe size: limited waste of space;
> logical fragmentation.

Also:

  - f) avoiding writes to partially filled stripes: free space
  fragmentation; simple to implement (ssd_spread does it accidentally)

The difference between d) and f) is that d) allocates the space to the
extent while f) leaves the space unallocated, but skips any free space
fragments smaller than the stripe size when allocating.

Re: Status of RAID5/6

2018-04-02 Thread Zygo Blaxell
On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> > On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > > I thought that a possible solution is to create BG with different
> > number of data disks. E.g. supposing to have a raid 6 system with 6
> > disks, where 2 are parity disk; we should allocate 3 BG
> > > 
> > > BG #1: 1 data disk, 2 parity disks
> > > BG #2: 2 data disks, 2 parity disks,
> > > BG #3: 4 data disks, 2 parity disks
> > > 
> > > For simplicity, the disk-stripe length is assumed = 4K.
> > > 
> > > So If you have a write with a length of 4 KB, this should be placed
> > in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
> > should be placed in in BG#2, then in BG#1.
> > > 
> > > This would avoid space wasting, even if the fragmentation will
> > increase (but shall the fragmentation matters with the modern solid
> > state disks ?).
> 
> I don't really see why this would increase fragmentation or waste space.

Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
remaining 2 blocks).  It also flips the usual order of "determine size
of extent, then allocate space for it" which might require major surgery
on the btrfs allocator to implement.

If we round that write up to 8 blocks (so we can put both pieces in
BG #3), it degenerates into the "pretend partially filled RAID stripes
are completely full" case, something like what ssd_spread already does.
That trades less file fragmentation for more free space fragmentation.
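The split-versus-round-up trade-off can be sketched as follows; the
block-group widths and the greedy policy are illustrative assumptions,
not a btrfs implementation:

```python
# Sketch of the proposal above: decompose a write into extents placed in
# block groups whose data width is 4, 2, or 1 disk-stripes, versus
# rounding the write up to the widest group. Widths are illustrative.

BG_WIDTHS = [4, 2, 1]  # data stripes per BG: BG#3, BG#2, BG#1

def split_write(blocks: int) -> list[int]:
    """Greedy decomposition, largest BG first. A 6-block write becomes
    one 4-block piece (BG#3) plus one 2-block piece (BG#2)."""
    pieces = []
    for width in BG_WIDTHS:
        while blocks >= width:
            pieces.append(width)
            blocks -= width
    return pieces

def round_up(blocks: int, width: int = BG_WIDTHS[0]) -> int:
    """Alternative: pretend the partial stripe is full (ssd_spread-like)."""
    return -(-blocks // width) * width  # ceiling division

assert split_write(6) == [4, 2]  # the extent is split across two BGs
assert round_up(6) == 8          # or padded to a full BG#3 stripe
```

Either policy keeps each RAID stripe within one transaction; they differ
only in whether the cost is extent fragmentation or wasted blocks.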

> The extent size is determined before allocation anyway, all that changes
> in this proposal is where those small extents ultimately land on the disk.
> 
> If anything, it might _reduce_ fragmentation since everything in BG #1
> and BG #2 will be of uniform size.
> 
> It does solve write hole (one transaction per RAID stripe).
> 
> > Also, you're still going to be wasting space, it's just that less space will
> > be wasted, and it will be wasted at the chunk level instead of the block
> > level, which opens up a whole new set of issues to deal with, most
> > significantly that it becomes functionally impossible without brute-force
> > search techniques to determine when you will hit the common-case of -ENOSPC
> > due to being unable to allocate a new chunk.
> 
> Hopefully the allocator only keeps one of each size of small block groups
> around at a time.  The allocator can take significant short cuts because
> the size of every extent in the small block groups is known (they are
> all the same size by definition).
> 
> When a small block group fills up, the next one should occupy the
> most-empty subset of disks--which is the opposite of the usual RAID5/6
> allocation policy.  This will probably lead to "interesting" imbalances
> since there are now two allocators on the filesystem with different goals
> (though it is no worse than -draid5 -mraid1, and I had no problems with
> free space when I was running that).
> 
> There will be an increase in the amount of allocated but not usable space,
> though, because now the amount of free space depends on how much data
> is batched up before fsync() or sync().  Probably best to just not count
> any space in the small block groups as 'free' in statvfs terms at all.
> 
> There are a lot of variables implied there.  Without running some
> simulations I have no idea if this is a good idea or not.
> 
> > > Time to time, a re-balance should be performed to empty the BG #1,
> > and #2. Otherwise a new BG should be allocated.
> 
> That shouldn't be _necessary_ (the filesystem should just allocate
> whatever BGs it needs), though it will improve storage efficiency if it
> is done.
> 
> > > The cost should be comparable to the logging/journaling (each
> > data shorter than a full-stripe, has to be written two times); the
> > implementation should be quite easy, because already NOW btrfs support
> > BG with different set of disks.
> 






Re: Status of RAID5/6

2018-04-02 Thread Zygo Blaxell
On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
> > [...]
> > > It is possible to combine writes from a single transaction into full
> > > RMW stripes, but this *does* have an impact on fragmentation in btrfs.
> > > Any partially-filled stripe is effectively read-only and the space within
> > > it is inaccessible until all data within the stripe is overwritten,
> > > deleted, or relocated by balance.
> > > 
> > > btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
> > > update, but that has a significant write magnification effect (and before
> > > kernel 4.14, non-trivial CPU load as well).
> > > 
> > > btrfs could also just allocate the full stripe to an extent, but emit
> > > only extent ref items for the blocks that are in use.  No fragmentation
> > > but lots of extra disk space used.  Also doesn't quite work the same
> > > way for metadata pages.
> > > 
> > > If btrfs adopted the ZFS approach, the extent allocator and all higher
> > > layers of the filesystem would have to know about--and skip over--the
> > > parity blocks embedded inside extents.  Making this change would mean
> > > that some btrfs RAID profiles start interacting with stuff like balance
> > > and compression which they currently do not.  It would create a new
> > > block group type and require an incompatible on-disk format change for
> > > both reads and writes.
> > 
> > I thought that a possible solution is to create BG with different
> number of data disks. E.g. supposing to have a raid 6 system with 6
> disks, where 2 are parity disk; we should allocate 3 BG
> > 
> > BG #1: 1 data disk, 2 parity disks
> > BG #2: 2 data disks, 2 parity disks,
> > BG #3: 4 data disks, 2 parity disks
> > 
> > For simplicity, the disk-stripe length is assumed = 4K.
> > 
> > So If you have a write with a length of 4 KB, this should be placed
> in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
> should be placed in in BG#2, then in BG#1.
> > 
> > This would avoid space wasting, even if the fragmentation will
> increase (but shall the fragmentation matters with the modern solid
> state disks ?).

I don't really see why this would increase fragmentation or waste space.
The extent size is determined before allocation anyway, all that changes
in this proposal is where those small extents ultimately land on the disk.

If anything, it might _reduce_ fragmentation since everything in BG #1
and BG #2 will be of uniform size.

It does solve write hole (one transaction per RAID stripe).

> Also, you're still going to be wasting space, it's just that less space will
> be wasted, and it will be wasted at the chunk level instead of the block
> level, which opens up a whole new set of issues to deal with, most
> significantly that it becomes functionally impossible without brute-force
> search techniques to determine when you will hit the common-case of -ENOSPC
> due to being unable to allocate a new chunk.

Hopefully the allocator only keeps one of each size of small block groups
around at a time.  The allocator can take significant short cuts because
the size of every extent in the small block groups is known (they are
all the same size by definition).

When a small block group fills up, the next one should occupy the
most-empty subset of disks--which is the opposite of the usual RAID5/6
allocation policy.  This will probably lead to "interesting" imbalances
since there are now two allocators on the filesystem with different goals
(though it is no worse than -draid5 -mraid1, and I had no problems with
free space when I was running that).

There will be an increase in the amount of allocated but not usable space,
though, because now the amount of free space depends on how much data
is batched up before fsync() or sync().  Probably best to just not count
any space in the small block groups as 'free' in statvfs terms at all.

There are a lot of variables implied there.  Without running some
simulations I have no idea if this is a good idea or not.

> > Time to time, a re-balance should be performed to empty the BG #1,
> and #2. Otherwise a new BG should be allocated.

That shouldn't be _necessary_ (the filesystem should just allocate
whatever BGs it needs), though it will improve storage efficiency if it
is done.

> > The cost should be comparable to the logging/journaling (each
> data shorter than a full-stripe, has to be written two times); the
> implementation should be quite easy, because already NOW btrfs support
> BG with different set of disks.





Re: Status of RAID5/6

2018-04-01 Thread Zygo Blaxell
On Sun, Apr 01, 2018 at 03:11:04PM -0600, Chris Murphy wrote:
> (I hate it when my palm rubs the trackpad and hits send prematurely...)
> 
> 
> On Sun, Apr 1, 2018 at 2:51 PM, Chris Murphy  wrote:
> 
> >> Users can run scrub immediately after _every_ unclean shutdown to
> >> reduce the risk of inconsistent parity and unrecoverable data should
> >> a disk fail later, but this can only prevent future write hole events,
> >> not recover data lost during past events.
> >
> > Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
> > such a leaf containing EXTENT_CSUM means that EXTENT_CSUM
> 
> means that EXTENT_CSUM is assumed to be correct. But in fact it could
> be stale. It's just as possible the metadata and superblock update is
> what's missing due to the interruption, while both data and parity
> strip writes succeeded. The window for either the data or parity write
> to fail is way shorter of a time interval, than that of the numerous
> metadata writes, followed by superblock update. 

csums cannot be wrong due to write interruption.  The data and metadata
blocks are written first, then barrier, then superblock updates pointing
to the data and csums previously written in the same transaction.
Unflushed data is not included in the metadata.  If there is a write
interruption then the superblock update doesn't occur and btrfs reverts
to the previous unmodified data+csum trees.

This works on non-raid5/6 because all the writes that make up a
single transaction are ordered and independent, and no data from older
transactions is modified during any tree update.

On raid5/6 every RMW operation modifies data from old transactions
by creating data/parity inconsistency.  If there was no data in the
stripe from an old transaction, the operation would be just a write,
no read and modify.  In the write hole case, the csum *is* correct,
it is the data that is wrong.

> In such a case, the
> old metadata is what's pointed to, including EXTENT_CSUM. Therefore
> your scrub would always show csum error, even if both data and parity
> are correct. You'd have to init-csum in this case, I suppose.

No, the csums are correct.  The data does not match the csum because the
data is corrupted.  Assuming barriers work on your disk, and you're not
having some kind of direct IO data consistency bug, and you can read the
csum tree at all, then the csums are correct, even with write hole.

When write holes and other write interruption patterns affect the csum
tree itself, this results in parent transid verify failures, csum tree
page csum failures, or both.  This forces the filesystem read-only so
it's easy to spot when it happens.

Note that the data blocks with wrong csum from raid5/6 reconstruction
after a write hole event always belong to _old_ transactions damaged
by the write hole.  If the writes are interrupted, the new data blocks
in a RMW stripe will not be committed and will have no csums to verify,
so they can't have _wrong_ csums.  The old data blocks do not have their
csum changed by the write hole (the csum is stored on a separate tree
in a different block group) so the csums are intact.  When a write hole
event corrupts the data reconstruction on a degraded array, the csum
doesn't match because the csum is correct and the data is not.

> Pretty much it's RMW with a (partial) stripe overwrite upending COW,
> and therefore upending the atomicity, and thus consistency of Btrfs in
> the raid56 case where any portion of the transaction is interrupted.

Not any portion, only the RMW stripe update can produce data loss due
to write interruption (well, that, and fsync() log-tree replay bugs).

If any other part of the transaction is interrupted then btrfs recovers
just fine with its COW tree update algorithm and write barriers.

> And this is amplified if metadata is also raid56.

Data and metadata are mangled the same way.  The difference is the impact.

btrfs tolerates exactly 0 bits of damaged metadata after RAID recovery,
and enforces this intolerance with metadata transids and csums, so write
hole on metadata _always_ breaks the filesystem.

> ZFS avoids the problem at the expense of probably a ton of
> fragmentation, by taking e.g. 4KiB RMW and writing a full length
> stripe of 8KiB fully COW, rather than doing stripe modification with
> an overwrite. And that's because it has dynamic stripe lengths. 

I think that's technically correct but could be clearer.

ZFS never does RMW.  It doesn't need to.  Parity blocks are allocated
at the extent level and RAID stripes are built *inside* the extents (or
"groups of contiguous blocks written in a single transaction" which
seems to be the closest ZFS equivalent of the btrfs extent concept).

Since every ZFS RAID stripe is bespoke sized to exactly fit a single
write operation, no two ZFS transactions can ever share a RAID stripe.
No transactions sharing a stripe means no write hole.
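A minimal sketch of that bespoke-stripe sizing, assuming a simple
raidz-style layout (the disk counts and the helper are illustrative,
not ZFS code):

```python
# Sketch of the ZFS-style layout described above: parity is allocated
# per write, so every stripe is exactly the size of one write and is
# never shared between transactions. Disk counts are illustrative.

def bespoke_stripe(data_blocks: int, disks: int, parity: int = 1):
    """Return (rows, parity_blocks) for a write of data_blocks blocks
    on an array of `disks` devices with `parity` parity blocks per row."""
    data_per_row = disks - parity
    rows = -(-data_blocks // data_per_row)  # ceiling division
    return rows, rows * parity

# A 4K write on a 5-disk raidz1-like array occupies one row:
# 1 data block + 1 parity block, never touching older stripes.
assert bespoke_stripe(1, disks=5) == (1, 1)
# A 9-block write uses 3 rows (4 data + 1 parity each, last row short).
assert bespoke_stripe(9, disks=5) == (3, 3)
```

The per-write parity overhead is the price: small writes pay nearly 100%
parity overhead, which is where the fragmentation/space trade-off lives.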

There is no impact on fragmentation on ZFS--space is 

Re: Status of RAID5/6

2018-03-31 Thread Zygo Blaxell
On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote:
> On Sat, Mar 31, 2018 at 12:57 AM, Goffredo Baroncelli
> <kreij...@inwind.it> wrote:
> > On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
> >>>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
> >>>> is always a full-device operation.  In theory btrfs could track
> >>>> modifications at the chunk level but this isn't even specified in the
> >>>> on-disk format, much less implemented.
> >>> It could go even further; it would be sufficient to track which
> >>> *partial* stripes update will be performed before a commit, in one
> >>> of the btrfs logs. Then in case of a mount of an unclean filesystem,
> >>> a scrub on these stripes would be sufficient.
> >
> >> A scrub cannot fix a raid56 write hole--the data is already lost.
> >> The damaged stripe updates must be replayed from the log.
> >
> > Your statement is correct, but you don't consider the COW nature of btrfs.
> >
> > The key is that if a data write is interrupted, all the transaction is 
> > interrupted and aborted. And due to the COW nature of btrfs, the "old 
> > state" is restored at the next reboot.
> >
> > What is needed in any case is rebuild of parity to avoid the "write-hole" 
> > bug.
> 
> Write hole happens on disk in Btrfs, but the ensuing corruption on
> rebuild is detected. Corrupt data never propagates. 

Data written with nodatasum or nodatacow is corrupted without detection
(same as running ext3/ext4/xfs on top of mdadm raid5 without a parity
journal device).

Metadata always has csums, and files have checksums if they are created
with default attributes and mount options.  Those cases are covered,
any corrupted data will give EIO on reads (except once per 4 billion
blocks, where the corrupted CRC matches at random).
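The "once per 4 billion blocks" figure is just the 2^-32 collision rate
of a 32-bit checksum such as crc32c:

```python
# The chance that a random corruption still matches a 32-bit checksum
# is 2**-32, i.e. roughly one silently-accepted bad block per ~4.3
# billion corrupted blocks read.

collision_rate = 2 ** -32
blocks_per_false_ok = 2 ** 32

assert blocks_per_false_ok == 4_294_967_296      # ~4.3 billion blocks
assert abs(collision_rate - 2.3283e-10) < 1e-13  # per-block probability
```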

> The problem is that Btrfs gives up when it's detected.

Before recent kernels (4.14 or 4.15), btrfs would not attempt all possible
combinations of recovery blocks for raid6, and kernels earlier than
those would not recover correctly for raid5 either.  I think this has
all been fixed in recent kernels, but I haven't tested it myself, so
don't quote me on that.

Other than that, btrfs doesn't give up in the write hole case.
It rebuilds the data according to the raid5/6 parity algorithm, but
the algorithm doesn't produce correct data for interrupted RMW writes
when there is no stripe update journal.  There is nothing else to try
at that point.  By the time the error is detected the opportunity to
recover the data has long passed.

The data that comes out of the recovery algorithm is a mixture of old
and new data from the filesystem.  The "new" data is something that
was written just before a failure, but the "old" data could be data
of any age, even a block of free space, that previously existed on the
filesystem.  If you bypass the EIO from the failing csums (e.g. by using
btrfs rescue) it will appear as though someone took the XOR of pairs of
random blocks from the disk and wrote it over one of the data blocks
at random.  When this happens to btrfs metadata, it is effectively a
fuzz tester for tools like 'btrfs check' which will often splat after
a write hole failure happens.

> If it assumes just a bit flip - not always a correct assumption but
> might be reasonable most of the time, it could iterate very quickly.

That is not how write hole works (or csum recovery for that matter).
Write hole producing a single bit flip would occur extremely rarely
outside of contrived test cases.

Recall that in a write hole, one or more 4K blocks are updated on some
of the disks in a stripe, but other blocks retain their original values
from prior to the update.  This is OK as long as all disks are online,
since the parity can be ignored or recomputed from the data blocks.  It is
also OK if the writes on all disks are completed without interruption,
since the data and parity eventually become consistent when all writes
complete as intended.  It is also OK if the entire stripe is written at
once, since then there is only one transaction referring to the stripe,
and if that transaction is not committed then the content of the stripe
is irrelevant.
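The mechanism can be shown with a toy XOR-parity model (single-byte
"blocks" and a 3-device stripe, purely illustrative):

```python
# Toy model of the write hole on a 3-device raid5 stripe (2 data + 1
# parity, single-byte "blocks"). An RMW update that lands on only some
# devices leaves parity inconsistent, so a later degraded-mode
# reconstruction returns garbage instead of the old committed data.

def parity(a: int, b: int) -> int:
    return a ^ b

# Committed stripe from an old transaction: consistent.
d0, d1 = 0x11, 0x22
p = parity(d0, d1)

# RMW update of d1 in a new transaction: the data write completes, but
# the crash happens before the parity write (or vice versa).
d1 = 0x33                      # new data hits disk
# p = parity(d0, d1)           # <-- parity update lost in the crash

# Later, disk 0 fails; reconstruct d0 from d1 and the stale parity.
reconstructed_d0 = d1 ^ p
assert reconstructed_d0 != d0  # old, unrelated block d0 is now corrupt
assert reconstructed_d0 == 0x11 ^ 0x22 ^ 0x33  # XOR of stripe blocks
```

Note the victim is d0, a block the interrupted write never touched,
which is why write hole damage is not limited to recently written data.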

The write hole error event is when all of the following occur:

- a stripe containing committed data from one or more btrfs
transactions is modified by raid5/6 RMW update in a new
transaction.  This is the usual case on a btrfs filesystem
with the default, 'nossd' or 'ssd' mount options.

- the write is not completed (due to crash, power failure, disk
failure, bad sector, SCSI timeout, bad cable, firmware bug, etc),
so the parity block is out of sync with modified data blocks
(before or af

Re: Status of RAID5/6

2018-03-31 Thread Zygo Blaxell
On Sat, Mar 31, 2018 at 11:36:50AM +0300, Andrei Borzenkov wrote:
> 31.03.2018 11:16, Goffredo Baroncelli пишет:
> > On 03/31/2018 09:43 AM, Zygo Blaxell wrote:
> >>> The key is that if a data write is interrupted, all the transaction
> >>> is interrupted and aborted. And due to the COW nature of btrfs, the
> >>> "old state" is restored at the next reboot.
> > 
> >> This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
> >> RMW operations which are not COW and don't provide any data integrity
> >> guarantee.  Old data (i.e. data from very old transactions that are not
> >> part of the currently written transaction) can be destroyed by this.
> > 
> > Could you elaborate a bit ?
> > 
> > Generally speaking, updating a part of a stripe require a RMW cycle, because
> > - you need to read all data stripe (with parity in case of a problem)
> > - then you should write
> > - the new data
> > - the new parity (calculated on the basis of the first read, and the 
> > new data)
> > 
> > However the "old" data should be untouched; or you are saying that the 
> > "old" data is rewritten with the same data ? 
> > 
> 
> If the old data block becomes unavailable, it can no longer be
> reconstructed because the old contents of the "new data" and "new
> parity" blocks are lost. Fortunately, if checksums are in use it does
> not cause silent data corruption, but it effectively means data loss.
> 
> Writing of data belonging to unrelated transaction affects previous
> transactions precisely due to RMW cycle. This fundamentally violates
> btrfs claim of always having either old or new consistent state.

Correct.

To fix this, any RMW stripe update on raid56 has to be written to a
log first.  All RMW updates must be logged because a disk failure could
happen at any time.

Full stripe writes don't need to be logged because all the data in the
stripe belongs to the same transaction, so if a disk fails the entire
stripe is either committed or it is not.

One way to avoid the logging is to change the btrfs allocation parameters
so that the filesystem doesn't allocate data in RAID stripes that are
already occupied by data from older transactions.  This is similar to
what 'ssd_spread' does, although the ssd_spread option wasn't designed
for this and won't be effective on large arrays.  This avoids modifying
stripes that contain old committed data, but it also means the free space
on the filesystem will become heavily fragmented over time.  Users will
have to run balance *much* more often to defragment the free space.
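A minimal sketch of such an allocation policy, assuming a hypothetical
free-space representation (none of these names are btrfs internals):

```python
# Sketch of the ssd_spread-like policy described above: never allocate
# into a stripe that already holds data from an older transaction. Free
# space inside partial stripes is skipped, which is why free space
# fragments over time and balance is needed to reclaim it.

STRIPE = 4  # blocks per stripe row (illustrative)

def usable_free_runs(free_runs):
    """Given (start, length) free runs in blocks, keep only the portions
    that are stripe-aligned and span whole stripes."""
    out = []
    for start, length in free_runs:
        aligned = -(-start // STRIPE) * STRIPE      # round start up
        end = (start + length) // STRIPE * STRIPE   # round end down
        if end > aligned:
            out.append((aligned, end - aligned))
    return out

# A 3-block hole inside a stripe is unusable; whole stripes remain usable.
assert usable_free_runs([(1, 3), (8, 10)]) == [(8, 8)]
```

The skipped partial-stripe space is not lost, merely unusable until a
balance repacks the surrounding extents into full stripes.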





Re: Status of RAID5/6

2018-03-31 Thread Zygo Blaxell
On Sat, Mar 31, 2018 at 08:57:18AM +0200, Goffredo Baroncelli wrote:
> On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
> >>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
> >>> is always a full-device operation.  In theory btrfs could track
> >>> modifications at the chunk level but this isn't even specified in the
> >>> on-disk format, much less implemented.
> >> It could go even further; it would be sufficient to track which
> >> *partial* stripes update will be performed before a commit, in one
> >> of the btrfs logs. Then in case of a mount of an unclean filesystem,
> >> a scrub on these stripes would be sufficient.
> 
> > A scrub cannot fix a raid56 write hole--the data is already lost.
> > The damaged stripe updates must be replayed from the log.
> 
> Your statement is correct, but you don't consider the COW nature of btrfs.
> 
> The key is that if a data write is interrupted, all the transaction
> is interrupted and aborted. And due to the COW nature of btrfs, the
> "old state" is restored at the next reboot.

This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
RMW operations which are not COW and don't provide any data integrity
guarantee.  Old data (i.e. data from very old transactions that are not
part of the currently written transaction) can be destroyed by this.

> What is needed in any case is rebuild of parity to avoid the
> "write-hole" bug. And this is needed only for a partial stripe
> write. For a full stripe write, due to the fact that the commit is
> not flushed, it is not needed the scrub at all.
> 
> Of course for the NODATACOW file this is not entirely true; but I
> don't see the gain to switch from the cost of COW to the cost of a log.
> 
> The above sentences are correct (IMHO) if we don't consider a power
> failure+device missing case. However in this case even logging the
> "new data" would be not sufficient.
> 
> BR
> G.Baroncelli
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




Re: Status of RAID5/6

2018-03-30 Thread Zygo Blaxell
On Fri, Mar 30, 2018 at 06:14:52PM +0200, Goffredo Baroncelli wrote:
> On 03/29/2018 11:50 PM, Zygo Blaxell wrote:
> > On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote:
> >> Hey.
> >>
> >> Some things would IMO be nice to get done/clarified (i.e. documented in
> >> the Wiki and manpages) from users'/admin's  POV:
> [...]
> > 
> > btrfs has no optimization like mdadm write-intent bitmaps; recovery
> > is always a full-device operation.  In theory btrfs could track
> > modifications at the chunk level but this isn't even specified in the
> > on-disk format, much less implemented.
> 
> It could go even further; it would be sufficient to track which
> *partial* stripes update will be performed before a commit, in one
> of the btrfs logs. Then in case of a mount of an unclean filesystem,
> a scrub on these stripes would be sufficient.

A scrub cannot fix a raid56 write hole--the data is already lost.
The damaged stripe updates must be replayed from the log.

A scrub could fix raid1/raid10 partial updates but only if the filesystem
can reliably track which blocks failed to be updated by the disconnected
disks.

It would be nice if scrub could be filtered the same way balance is, e.g.
only certain block ranges, or only metadata blocks; however, this is not
presently implemented.

> BR
> G.Baroncelli
> 
> [...]
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




Re: Status of RAID5/6

2018-03-30 Thread Zygo Blaxell
On Fri, Mar 30, 2018 at 09:21:00AM +0200, Menion wrote:
>  Thanks for the detailed explanation. I think that a summary of this
> should go in the btrfs raid56 wiki status page, because now it is
> completely inconsistent and if a user comes there, he may get the
> impression that raid56 is just broken.
> Still I have the 1 billion dollar question: from your words I understand
> that even in RAID56 the metadata are spread on the devices in a complex
> way, but shall I assume that the array can survive the sudden death
> of one (two for raid6) HDD in the array?

I wouldn't assume that.  There is still the write hole, and while there
is a small probability of having a write hole failure, it's a probability
that applies on *every* write in degraded mode, and since disks can fail
at any time, the array can enter degraded mode at any time.

It's similar to lottery tickets--buy one ticket, you probably won't win,
but if you buy millions of tickets, you'll claim the prize eventually.
The "prize" in this case is a severely damaged, possibly unrecoverable
filesystem.

If the data is raid5 and the metadata is raid1, the filesystem can
survive a single disk failure easily; however, some of the data may be
lost if writes to the remaining disks are interrupted by a system crash
or power failure and the write hole issue occurs.  Note that the damage
is not necessarily limited to recently written data--it's any random
data that is merely located adjacent to written data on the filesystem.

I wouldn't use raid6 until the write hole issue is resolved.  There is
no configuration where two disks can fail and metadata can still be
updated reliably.

Some users use the 'ssd_spread' mount option to reduce the probability
of write hole failure, which happens to be helpful by accident on some
array configurations, but it has a fairly high cost when the array is
not degraded due to all the extra balancing required.



> Bye




Re: Status of RAID5/6

2018-03-29 Thread Zygo Blaxell
On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote:
> Hey.
> 
> Some things would IMO be nice to get done/clarified (i.e. documented in
> the Wiki and manpages) from users'/admin's  POV:
> 
> Some basic questions:

I can answer some easy ones:

>   - compression+raid?

There is no interaction between compression and raid.  They happen on
different data trees at different levels of the stack.  So if the raid
works, compression does too.

>   - rebuild / replace of devices?

"replace" needs raid-level-specific support.  If the raid level doesn't
support replace, then users have to do device add followed by device
delete, which is considerably (orders of magnitude) slower.

>   - changing raid lvls?

btrfs uses a brute-force RAID conversion algorithm which always works, but
takes zero short cuts.  e.g. there is no speed optimization implemented
for cases like "convert 2-disk raid1 to 1-disk single" which can be
very fast in theory.  The worst-case running time is the only running
time available in btrfs.

Also, users have to understand how the different raid allocators work
to understand their behavior in specific situations.  Without this
understanding, the set of restrictions that pop up in practice can seem
capricious and arbitrary.  e.g. after adding 1 disk to a nearly-full
raid1, full balance is required to make the new space available, but
adding 2 disks makes all the free space available immediately.

Generally it always works if you repeatedly run full-balances in a loop
until you stop running out of space, but again, this is the worst case.

>   - anything to consider with raid when doing snapshots, send/receive
> or defrag?

Snapshot deletes cannot run at the same time as RAID convert/device
delete/device shrink/resize.  If one is started while the other is
running, it will be blocked until the other finishes.  Internally these
operations block each other on a mutex.

I don't know if snapshot deletes interact with device replace (the case
has never come up for me).  I wouldn't expect it to as device replace
is more similar to scrub than balance, and scrub has no such interaction.

Also note you can only run one balance, device shrink, or device delete
at a time.  If you start one of these three operations while another is
already running, the new request is rejected immediately.

As far as I know there are no other restrictions.

>   => and for each of these: for which raid levels?

Most of those features don't interact with anything specific to a raid
layer, so they work on all raid levels.

Device replace is the exception: all RAID levels in use on the filesystem
must support it, or the user must use device add and device delete instead.

[Aside:  I don't know if any RAID levels that do not support device
replace still exist, which makes my answer longer than it otherwise
would be]

>   Perhaps also confirmation for previous issues:
>   - I vaguely remember there were issues with either device delete or
> replace and that one of them was possibly super-slow?

Device replace is faster than device delete.  Replace does not modify
any metadata, while delete rewrites all the metadata referring to the
removed device.

Delete can be orders of magnitude slower than expected because of the
metadata modifications required.

>   - I also remember there were cases in which a fs could end up in
> permanent read-only state?

Any unrecovered metadata error 1 bit or larger will do that.  RAID level
is relevant only in terms of how well it can recover corrupted or
unreadable metadata blocks.

> - Clarifying questions on what is expected to work and how things are
>   expected to behave, e.g.:
>   - Can one unplug a device (without deleting/removing it first) during
> operation and will btrfs survive it?

On raid1 and raid10, yes.  On raid5/6 you will be at risk of write hole
problems if the filesystem is modified while the device is unplugged.

If the device is later reconnected, you should immediately scrub to
bring the metadata on the devices back in sync.  Data written to the
filesystem while the device was offline will be corrected if the csum
on the removed device is different.  If there is no csum, the data will
be silently corrupted.  If the csum is correct but the data is not
(this occurs with 2^-32 probability on random data where the CRC
happens to be identical), the data will also be silently corrupted.

A full replace of the removed device would be better than a scrub,
as that will get a known good copy of the data.

If the device is offline for a long time, it should be wiped before being
reintroduced to the rest of the array to avoid data integrity issues.

It may be necessary to specify a different device name when mounting
a filesystem that has had a disk removed and later reinserted until
the scrub or replace action above is completed.

btrfs has no optimization like mdadm write-intent bitmaps; recovery
is always a full-device operation.  In theory btrfs 

Re: [RFC PATCH v3 0/7] btrfs-progs: Allow normal user to call "subvolume list/show"

2018-03-28 Thread Zygo Blaxell
On Mon, Mar 19, 2018 at 04:30:17PM +0900, Misono, Tomohiro wrote:
> This is a part of RFC I sent last December[1] whose aim is to improve normal 
> users' usability.
> The remaining works of RFC are: 
>   - Allow "sub delete" for empty subvolume

I don't mean to scope creep on you, but I have a couple of wishes related
to this topic:

  - allow "rmdir" to remove an empty subvolume, i.e. when a subvolume is
detected in rmdir, try switching to subvol delete before returning
an error.  This lets admin tools that are not btrfs-aware do 'rm
-fr' on a user directory when it contains a subvolume.  Legacy admin
tools (or legacy tools in general) can't remove a subvol, and there
is no solution for environments where we can't just fire users who
create them.

  - mount option to restrict "sub create" and "sub snapshot" to root only.
If we get "rmdir" working then this is significantly less important.

>   - Allow "qgroup show" to check quota limit
> 
> [1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg70991.html





[PATCH v2] btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes

2018-01-23 Thread Zygo Blaxell
Until v4.14, this warning was very infrequent:

WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 
find_parent_nodes+0xc41/0x14e0
Modules linked in: [...]
CPU: 3 PID: 18172 Comm: bees Tainted: G  D WL  4.11.9-zb64+ #1
Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, 
BIOS 2101 12/02/2014
Call Trace:
 dump_stack+0x85/0xc2
 __warn+0xd1/0xf0
 warn_slowpath_null+0x1d/0x20
 find_parent_nodes+0xc41/0x14e0
 __btrfs_find_all_roots+0xad/0x120
 ? extent_same_check_offsets+0x70/0x70
 iterate_extent_inodes+0x168/0x300
 iterate_inodes_from_logical+0x87/0xb0
 ? iterate_inodes_from_logical+0x87/0xb0
 ? extent_same_check_offsets+0x70/0x70
 btrfs_ioctl+0x8ac/0x2820
 ? lock_acquire+0xc2/0x200
 do_vfs_ioctl+0x91/0x700
 ? __fget+0x112/0x200
 SyS_ioctl+0x79/0x90
 entry_SYSCALL_64_fastpath+0x23/0xc6
 ? trace_hardirqs_off_caller+0x1f/0x140

Starting with v4.14 (specifically 86d5f9944252 ("btrfs: convert prelimary
reference tracking to use rbtrees")) the WARN_ON occurs three orders of
magnitude more frequently--almost once per second while running workloads
like bees.

Replace the WARN_ON() with a comment rationale for its removal.
The rationale is paraphrased from an explanation by Edmund Nadolski
<enadol...@suse.de> on the linux-btrfs mailing list.

Fixes: 8da6d5815c59 ("Btrfs: added btrfs_find_all_roots()")
Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
v2: 
Replace WARN_ON with rationale instead of merely deleting it.
Trim irrelevant detail from the backtrace.  Add Fixes reference.
Fix subject line (missing "< 0").

 fs/btrfs/backref.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 7d0dc100a09a..06597c5f9f4b 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1263,7 +1263,16 @@ static int find_parent_nodes(struct btrfs_trans_handle 
*trans,
while (node) {
ref = rb_entry(node, struct prelim_ref, rbnode);
node = rb_next(&ref->rbnode);
-   WARN_ON(ref->count < 0);
+   /*
+* ref->count < 0 can happen here if there are delayed
+* refs with a node->action of BTRFS_DROP_DELAYED_REF.
+* prelim_ref_insert() relies on this when merging
+* identical refs to keep the overall count correct.
+* prelim_ref_insert() will merge only those refs
+* which compare identically.  Any refs having
+* e.g. different offsets would not be merged,
+* and would retain their original ref->count < 0.
+*/
if (roots && ref->count && ref->root_id && ref->parent == 0) {
if (sc && sc->root_objectid &&
ref->root_id != sc->root_objectid) {
-- 
2.11.0





Re: [PATCH] btrfs: remove spurious WARN_ON(ref->count) in find_parent_nodes

2018-01-22 Thread Zygo Blaxell
On Mon, Jan 22, 2018 at 11:34:52AM +0800, Lu Fengqi wrote:
> On Sun, Jan 21, 2018 at 02:08:58PM -0500, Zygo Blaxell wrote:
> >This warning appears during execution of the LOGICAL_INO ioctl and
> >appears to be spurious:
> >
> > [ cut here ]
> > WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 
> > find_parent_nodes+0xc41/0x14e0
> > Modules linked in: ib_iser rdma_cm iw_cm ib_cm ib_core configfs 
> > iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi overlay r8169 ufs qnx4 
> > hfsplus hfs minix ntfs vfat msdos fat jfs xfs cpuid rpcsec_gss_krb5 nfsv4 
> > nfsv3 nfs fscache algif_skcipher af_alg softdog nfsd auth_rpcgss nfs_acl 
> > lockd grace sunrpc bnep cpufreq_userspace cpufreq_powersave 
> > cpufreq_conservative nfnetlink_queue nfnetlink_log nfnetlink bluetooth 
> > rfkill snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_oss 
> > snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device binfmt_misc fuse nbd 
> > xt_REDIRECT nf_nat_redirect ipt_REJECT nf_reject_ipv4 xt_nat xt_conntrack 
> > xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG ip6table_nat nf_conntrack_ipv6 
> > nf_defrag_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
> > nf_nat_ipv4 nf_nat nf_conntrack
> >  ip6table_mangle iptable_mangle ip6table_filter ip6_tables 
> > iptable_filter ip_tables x_tables tcp_cubic dummy lp dm_crypt edac_mce_amd 
> > edac_core snd_hda_codec_hdmi ppdev kvm_amd kvm irqbypass crct10dif_pclmul 
> > crc32_pclmul ghash_clmulni_intel snd_hda_codec_via pcbc amdkfd 
> > snd_hda_codec_generic amd_iommu_v2 aesni_intel snd_hda_intel radeon 
> > snd_hda_codec aes_x86_64 snd_hda_core snd_hwdep crypto_simd glue_helper sg 
> > snd_pcm_oss cryptd input_leds joydev pcspkr serio_raw snd_mixer_oss 
> > rtc_cmos snd_pcm parport_pc parport shpchp wmi acpi_cpufreq evdev snd_timer 
> > asus_atk0110 k10temp fam15h_power snd soundcore sp5100_tco hid_generic ipv6 
> > af_packet crc_ccitt raid10 raid456 async_raid6_recov async_memcpy async_pq 
> > async_xor async_tx libcrc32c raid0 multipath linear dm_mod raid1 md_mod 
> > ohci_pci ide_pci_generic
> >  sr_mod cdrom pdc202xx_new ohci_hcd crc32c_intel atiixp ehci_pci 
> > psmouse ide_core i2c_piix4 ehci_hcd xhci_pci mii xhci_hcd [last unloaded: 
> > r8169]
> > CPU: 3 PID: 18172 Comm: bees Tainted: G  D WL  4.11.9-zb64+ #1
> > Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, 
BIOS 2101 12/02/2014
> > Call Trace:
> >  dump_stack+0x85/0xc2
> >  __warn+0xd1/0xf0
> >  warn_slowpath_null+0x1d/0x20
> >  find_parent_nodes+0xc41/0x14e0
> >  __btrfs_find_all_roots+0xad/0x120
> >  ? extent_same_check_offsets+0x70/0x70
> >  iterate_extent_inodes+0x168/0x300
> >  iterate_inodes_from_logical+0x87/0xb0
> >  ? iterate_inodes_from_logical+0x87/0xb0
> >  ? extent_same_check_offsets+0x70/0x70
> >  btrfs_ioctl+0x8ac/0x2820
> >  ? lock_acquire+0xc2/0x200
> >  do_vfs_ioctl+0x91/0x700
> >  ? __fget+0x112/0x200
> >  SyS_ioctl+0x79/0x90
> >  entry_SYSCALL_64_fastpath+0x23/0xc6
> > RIP: 0033:0x7f727b20be07
> > RSP: 002b:7f7279f1e018 EFLAGS: 0246 ORIG_RAX: 0010
> > RAX: ffda RBX: 9c0f4d7f RCX: 7f727b20be07
> > RDX: 7f7279f1e118 RSI: c0389424 RDI: 0003
> > RBP: 0035 R08: 7f72581bf340 R09: 
> > R10: 0020 R11: 0246 R12: 0040
> > R13: 7f725818d230 R14: 7f7279f1b640 R15: 7f725820
> >  ? trace_hardirqs_off_caller+0x1f/0x140
> > ---[ end trace 5de243350f6762c6 ]---
> > [ cut here ]
> >
> >ref->count can be below zero under normal conditions (for delayed refs),
> >so there is no need to spam dmesg when it happens.
> >
> 
> Added Edmund.
> 
> Hi,
> 
> I've also encountered the same problem when running the test case
> xfstests/btrfs/004. However, I'm not sure whether the negative ref->count
> is reasonable.
> 
> IMO, these functions (such as add_delayed_refs, add_delayed_refs,
> add_delayed_refs, add_missing_keys and resolve_indirect_refs) have been
> executed at this point in time. Hence, these references not only include
> these refs in the memory (delayed) but also include those refs in the disk
> (inline/keyed). 

I don't have the complete picture, but while looking at other code, comments,
and git log messages surrounding ref->count in btrfs, I found:

  * ref->count starts off at -1 (for a

Re: [PATCH] btrfs: remove spurious WARN_ON(ref->count) in find_parent_nodes

2018-01-22 Thread Zygo Blaxell
On Mon, Jan 22, 2018 at 09:06:23PM +0800, Lu Fengqi wrote:
> On Mon, Jan 22, 2018 at 02:38:42PM +0200, Nikolay Borisov wrote:
> >
> >
> >On 22.01.2018 14:19, Lu Fengqi wrote:
> >> On 01/22/2018 04:46 PM, Nikolay Borisov wrote:
> >>>
> >>>
> >>> On 22.01.2018 05:34, Lu Fengqi wrote:
>  According to my bisect result, The frequency of the warning occurrence
>  increased to the detectable degree after this patch
> >>>
> >>> That sentence implies that even before Ed's patch it was possible to
> >>> trigger those warnings, is that true? Personally I've never seen such
> >>> warnings while executing btrfs/004. How do you configure the filesystem
> >>> for the test runs?
> >>>
> >> 
> >> Just only default mount option.
> >> 
> >> ➜  xfstests-dev git:(master) for i in $(seq 1 100); do echo $i; if !
> >> sudo ./check btrfs/004; then break; fi; done
> >> 1
> >> 
> >> FSTYP -- btrfs
> >> 
> >> PLATFORM  -- Linux/x86_64 sarch 4.15.0-rc9
> >> 
> >> MKFS_OPTIONS  -- /dev/vdd1
> >> 
> >> MOUNT_OPTIONS -- /dev/vdd1 /mnt/scratch
> >> 
> >> 
> >> 
> >> 
> >> btrfs/004 47s ... 49s
> >> 
> >> Ran: btrfs/004
> >> 
> >> Passed all 1 tests
> >> 
> >> 
> >> 
> >> 
> >> 2
> >> 
> >> FSTYP -- btrfs
> >> 
> >> PLATFORM  -- Linux/x86_64 sarch 4.15.0-rc9
> >> 
> >> MKFS_OPTIONS  -- /dev/vdd1
> >> 
> >> MOUNT_OPTIONS -- /dev/vdd1 /mnt/scratch
> >> 
> >> 
> >> 
> >> 
> >> btrfs/004 49s ... 52s
> >> 
> >> _check_dmesg: something found in dmesg (see
> >> /home/luke/workspace/xfstests-dev/results//btrfs/004.dmesg)
> >> 
> >> Ran: btrfs/004
> >> 
> >> Failures: btrfs/004
> >> 
> >> Failed 1 of 1 tests
> >> 
> >> The probability of this warning appearing is rather low, and I only
> >> encountered 52 warnings when I looped 1008 times btrfs/004 for 20 hours
> >> in 4.15-rc6 (IOW, the probability is nearly 5%). So you want to trigger
> >> warning also need more luck or patience.
> >
> >Thanks but is this before or after the mentioned commit below?
> >
> 
> After this commit. The bisect condition I use to locate this commit is
> to repeat btrfs/004 20 times without warning (This may not be accurate enough,
> can only be used as a reference). 

I have been seeing this warning since at least 2015 (v3.18?),
possibly earlier.  In the past it has never been correlated with any
event I've needed to take action to correct (i.e. no data corruption,
no crashes, no hangs, no filesystem damage, and no obvious functional
failures in userspace).

In v4.14 nothing seems to have changed, except the warning now appears
three orders of magnitude more often.  This spams console terminals and
kernel logs with gigabytes of stacktrace and bumps this phenomenon up
to the top of my priority list.

It looks like the warning has been there with only minor editorial changes
since Jan Schmidt's 2011 commit "Btrfs: added btrfs_find_all_roots()"
in v3.3-rc1.

> Maybe Zygo has found a finer way to reproduce
> it, so he reproduce this warning more frequently than me.

It's not really a finer way, but bees hits this warning most often,
sometimes many times per second in bursts lasting minutes at a time.

btrfs balance also hits the warning occasionally (it was the most common
trigger of that warning in 2015 before I was running bees everywhere).

The net effect of the bees worker loop looks fairly similar to btrfs/004,
basically calling LOGICAL_INO many times per second on a busy filesystem.

bees focuses its activity on active parts of the filesystem, which
means it's more likely to do backref walks against extents that are also
being affected by user activity and therefore more likely to encounter
delayed refs.

Contrast with 'btrfs balance' which spreads its effect across the entire
filesystem and is much less likely to collide with user activity.

Every duplicate extent hit in bees uses LOGICAL_INO at least once to map
a stored duplicate block bytenr back to something that can be passed to
open() and FILE_EXTENT_SAME.  The warnings do arrive in bursts at the
same time as bees hitting clusters of duplicate extents.



> >
> >> 
>  86d5f9944252 ("btrfs: convert prelimary reference tracking to use
>  rbtrees")
>  is committed. I understand that this does not mean that this patch
>  caused
>  the problem, but maybe Edmund can give us some help, so I added him
>  to the
>  recipient.
> >>>
> >>>
> >> 
> >> 
> >
> >
> 
> -- 
> Thanks,
> Lu
> 
> 




[PATCH] btrfs: remove spurious WARN_ON(ref->count) in find_parent_nodes

2018-01-21 Thread Zygo Blaxell
This warning appears during execution of the LOGICAL_INO ioctl and
appears to be spurious:

[ cut here ]
WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 
find_parent_nodes+0xc41/0x14e0
Modules linked in: ib_iser rdma_cm iw_cm ib_cm ib_core configfs 
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi overlay r8169 ufs qnx4 
hfsplus hfs minix ntfs vfat msdos fat jfs xfs cpuid rpcsec_gss_krb5 nfsv4 nfsv3 
nfs fscache algif_skcipher af_alg softdog nfsd auth_rpcgss nfs_acl lockd grace 
sunrpc bnep cpufreq_userspace cpufreq_powersave cpufreq_conservative 
nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill snd_seq_dummy 
snd_hrtimer snd_seq_midi snd_seq_oss snd_seq_midi_event snd_rawmidi snd_seq 
snd_seq_device binfmt_misc fuse nbd xt_REDIRECT nf_nat_redirect ipt_REJECT 
nf_reject_ipv4 xt_nat xt_conntrack xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG 
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
 ip6table_mangle iptable_mangle ip6table_filter ip6_tables 
iptable_filter ip_tables x_tables tcp_cubic dummy lp dm_crypt edac_mce_amd 
edac_core snd_hda_codec_hdmi ppdev kvm_amd kvm irqbypass crct10dif_pclmul 
crc32_pclmul ghash_clmulni_intel snd_hda_codec_via pcbc amdkfd 
snd_hda_codec_generic amd_iommu_v2 aesni_intel snd_hda_intel radeon 
snd_hda_codec aes_x86_64 snd_hda_core snd_hwdep crypto_simd glue_helper sg 
snd_pcm_oss cryptd input_leds joydev pcspkr serio_raw snd_mixer_oss rtc_cmos 
snd_pcm parport_pc parport shpchp wmi acpi_cpufreq evdev snd_timer asus_atk0110 
k10temp fam15h_power snd soundcore sp5100_tco hid_generic ipv6 af_packet 
crc_ccitt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor 
async_tx libcrc32c raid0 multipath linear dm_mod raid1 md_mod ohci_pci 
ide_pci_generic
 sr_mod cdrom pdc202xx_new ohci_hcd crc32c_intel atiixp ehci_pci 
psmouse ide_core i2c_piix4 ehci_hcd xhci_pci mii xhci_hcd [last unloaded: r8169]
CPU: 3 PID: 18172 Comm: bees Tainted: G  D WL  4.11.9-zb64+ #1
Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, 
BIOS 2101 12/02/2014
Call Trace:
 dump_stack+0x85/0xc2
 __warn+0xd1/0xf0
 warn_slowpath_null+0x1d/0x20
 find_parent_nodes+0xc41/0x14e0
 __btrfs_find_all_roots+0xad/0x120
 ? extent_same_check_offsets+0x70/0x70
 iterate_extent_inodes+0x168/0x300
 iterate_inodes_from_logical+0x87/0xb0
 ? iterate_inodes_from_logical+0x87/0xb0
 ? extent_same_check_offsets+0x70/0x70
 btrfs_ioctl+0x8ac/0x2820
 ? lock_acquire+0xc2/0x200
 do_vfs_ioctl+0x91/0x700
 ? __fget+0x112/0x200
 SyS_ioctl+0x79/0x90
 entry_SYSCALL_64_fastpath+0x23/0xc6
RIP: 0033:0x7f727b20be07
RSP: 002b:7f7279f1e018 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 9c0f4d7f RCX: 7f727b20be07
RDX: 7f7279f1e118 RSI: c0389424 RDI: 0003
RBP: 0035 R08: 7f72581bf340 R09: 
R10: 0020 R11: 0246 R12: 0040
R13: 7f725818d230 R14: 7f7279f1b640 R15: 7f725820
 ? trace_hardirqs_off_caller+0x1f/0x140
---[ end trace 5de243350f6762c6 ]---
[ cut here ]

ref->count can be below zero under normal conditions (for delayed refs),
so there is no need to spam dmesg when it happens.

On kernel v4.14 this warning occurs 100-1000 times more frequently than
on kernels v4.2..v4.12.  In the worst case, one test machine had 59020
warnings in 24 hours on v4.14.14 compared to 55 on v4.12.14.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/backref.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 7d0dc100a09a..57e8d2562ed5 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1263,7 +1263,6 @@ static int find_parent_nodes(struct btrfs_trans_handle 
*trans,
while (node) {
ref = rb_entry(node, struct prelim_ref, rbnode);
node = rb_next(&ref->rbnode);
-   WARN_ON(ref->count < 0);
if (roots && ref->count && ref->root_id && ref->parent == 0) {
if (sc && sc->root_objectid &&
ref->root_id != sc->root_objectid) {
-- 
2.11.0



[PATCH 1/3] btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents

2017-09-22 Thread Zygo Blaxell
The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and
offset (encoded as a single logical address) to a list of extent refs.
LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping
(extent ref -> extent bytenr and offset, or logical address).  These are
useful capabilities for programs that manipulate extents and extent
references from userspace (e.g. dedup and defrag utilities).

When the extents are uncompressed (and not encrypted and not other),
check_extent_in_eb performs filtering of the extent refs to remove any
extent refs which do not contain the same extent offset as the 'logical'
parameter's extent offset.  This prevents LOGICAL_INO from returning
references to more than a single block.

To find the set of extent references to an uncompressed extent from [a,
b), userspace has to run a loop like this pseudocode:

for (i = a; i < b; ++i)
extent_ref_set += LOGICAL_INO(i);

At each iteration of the loop (up to 32768 iterations for a 128M extent),
data we are interested in is collected in the kernel, then deleted by
the filter in check_extent_in_eb.

When the extents are compressed (or encrypted or other), the 'logical'
parameter must be an extent bytenr (the 'a' parameter in the loop).
No filtering by extent offset is done (or possible?) so the result is
the complete set of extent refs for the entire extent.  This removes
the need for the loop, since we get all the extent refs in one call.

Add an 'ignore_offset' argument to iterate_inodes_from_logical,
[...several levels of function call graph...], and check_extent_in_eb, so
that we can disable the extent offset filtering for uncompressed extents.
This flag can be set by an improved version of the LOGICAL_INO ioctl to
get either behavior as desired.

There is no functional change in this patch.  The new flag is always
false.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/backref.c| 63 ++-
 fs/btrfs/backref.h|  8 +++---
 fs/btrfs/inode.c  |  2 +-
 fs/btrfs/ioctl.c  |  2 +-
 fs/btrfs/qgroup.c |  8 +++---
 fs/btrfs/scrub.c  |  6 ++---
 fs/btrfs/send.c   |  2 +-
 fs/btrfs/tests/qgroup-tests.c | 20 +++---
 8 files changed, 63 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index b517ef1477ea..a2609786cd86 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -40,12 +40,14 @@ static int check_extent_in_eb(const struct btrfs_key *key,
  const struct extent_buffer *eb,
  const struct btrfs_file_extent_item *fi,
  u64 extent_item_pos,
- struct extent_inode_elem **eie)
+ struct extent_inode_elem **eie,
+ bool ignore_offset)
 {
u64 offset = 0;
struct extent_inode_elem *e;
 
-   if (!btrfs_file_extent_compression(eb, fi) &&
+   if (!ignore_offset &&
+   !btrfs_file_extent_compression(eb, fi) &&
!btrfs_file_extent_encryption(eb, fi) &&
!btrfs_file_extent_other_encoding(eb, fi)) {
u64 data_offset;
@@ -84,7 +86,8 @@ static void free_inode_elem_list(struct extent_inode_elem 
*eie)
 
 static int find_extent_in_eb(const struct extent_buffer *eb,
 u64 wanted_disk_byte, u64 extent_item_pos,
-struct extent_inode_elem **eie)
+struct extent_inode_elem **eie,
+bool ignore_offset)
 {
u64 disk_byte;
struct btrfs_key key;
@@ -113,7 +116,7 @@ static int find_extent_in_eb(const struct extent_buffer *eb,
if (disk_byte != wanted_disk_byte)
continue;
 
-   ret = check_extent_in_eb(&key, eb, fi, extent_item_pos, eie);
+   ret = check_extent_in_eb(&key, eb, fi, extent_item_pos, eie, 
ignore_offset);
if (ret < 0)
return ret;
}
@@ -419,7 +422,7 @@ static int add_indirect_ref(const struct btrfs_fs_info 
*fs_info,
 static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
   struct ulist *parents, struct prelim_ref *ref,
   int level, u64 time_seq, const u64 *extent_item_pos,
-  u64 total_refs)
+  u64 total_refs, bool ignore_offset)
 {
int ret = 0;
int slot;
@@ -472,7 +475,7 @@ static int add_all_parents(struct btrfs_root *root, struct 
btrfs_path *path,
if (extent_item_pos) {
ret = check_extent_in_eb(&key, eb, fi,
 

[PATCH 3/3] btrfs: increase output size for LOGICAL_INO_V2 ioctl

2017-09-22 Thread Zygo Blaxell
Build-server workloads have hundreds of references per file after dedup.
Multiply by a few snapshots and we quickly exhaust the limit of 2730
references per extent that can fit into a 64K buffer.

Raise the limit to 16M to be consistent with other btrfs ioctls
(e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME).

To minimize surprising userspace behavior, apply this change only to
the LOGICAL_INO_V2 ioctl.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f4281ffd1833..1940678fc440 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4554,6 +4554,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
 
if (version == 1) {
ignore_offset = false;
+   size = min_t(u32, loi->size, SZ_64K);
} else {
/* All reserved bits must be 0 for now */
if (memchr_inv(loi->reserved, 0, sizeof(loi->reserved))) {
@@ -4566,6 +4567,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
goto out_loi;
}
ignore_offset = loi->flags & 
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+   size = min_t(u32, loi->size, SZ_16M);
}
 
path = btrfs_alloc_path();
@@ -4574,7 +4576,6 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
goto out;
}
 
-   size = min_t(u32, loi->size, SZ_64K);
inodes = init_data_container(size);
if (IS_ERR(inodes)) {
ret = PTR_ERR(inodes);
-- 
2.11.0



[PATCH 2/3] btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2

2017-09-22 Thread Zygo Blaxell
Now that check_extent_in_eb()'s extent offset filter can be turned off,
we need a way to do it from userspace.

Add a 'flags' field to the btrfs_logical_ino_args structure to disable
extent offset filtering, taking the place of one of the existing
reserved[] fields.

Previous versions of LOGICAL_INO neglected to check whether any of the
reserved fields have non-zero values.  Assigning meaning to those fields
now may change the behavior of existing programs that left these fields
uninitialized.  The lack of a zero check also means that new programs
have no way to know whether the kernel is honoring the flags field.

To avoid these problems, define a new ioctl LOGICAL_INO_V2.  We can
use the same argument layout as LOGICAL_INO, but shorten the reserved[]
array by one element and turn it into the 'flags' field.  The V2 ioctl
explicitly checks that reserved fields and unsupported flag bits are zero
so that userspace can negotiate future feature bits as they are defined.

Since the memory layouts of the two ioctls' arguments are compatible,
there is no need for a separate function for logical_to_ino_v2 (contrast
with tree_search_v2 vs tree_search where the layout and code are quite
different).  A version parameter and an 'if' statement will suffice.

Now that we have a flags field in logical_ino_args, add a flag
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want,
and pass it down the stack to iterate_inodes_from_logical.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/ioctl.c   | 26 +++---
 include/uapi/linux/btrfs.h |  8 +++-
 2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b7de32568082..f4281ffd1833 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4536,13 +4536,14 @@ static int build_ino_list(u64 inum, u64 offset, u64 
root, void *ctx)
 }
 
 static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
-   void __user *arg)
+   void __user *arg, int version)
 {
int ret = 0;
int size;
struct btrfs_ioctl_logical_ino_args *loi;
struct btrfs_data_container *inodes = NULL;
struct btrfs_path *path = NULL;
+   bool ignore_offset;
 
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -4551,6 +4552,22 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
if (IS_ERR(loi))
return PTR_ERR(loi);
 
+   if (version == 1) {
+   ignore_offset = false;
+   } else {
+   /* All reserved bits must be 0 for now */
+   if (memchr_inv(loi->reserved, 0, sizeof(loi->reserved))) {
+   ret = -EINVAL;
+   goto out_loi;
+   }
+   /* Only accept flags we have defined so far */
+   if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
+   ret = -EINVAL;
+   goto out_loi;
+   }
+   ignore_offset = loi->flags & 
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+   }
+
path = btrfs_alloc_path();
if (!path) {
ret = -ENOMEM;
@@ -4566,7 +4583,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
}
 
ret = iterate_inodes_from_logical(loi->logical, fs_info, path,
- build_ino_list, inodes, false);
+ build_ino_list, inodes, 
ignore_offset);
if (ret == -EINVAL)
ret = -ENOENT;
if (ret < 0)
@@ -4580,6 +4597,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
 out:
btrfs_free_path(path);
kvfree(inodes);
+out_loi:
kfree(loi);
 
return ret;
@@ -5550,7 +5568,9 @@ long btrfs_ioctl(struct file *file, unsigned int
case BTRFS_IOC_INO_PATHS:
return btrfs_ioctl_ino_to_path(root, argp);
case BTRFS_IOC_LOGICAL_INO:
-   return btrfs_ioctl_logical_to_ino(fs_info, argp);
+   return btrfs_ioctl_logical_to_ino(fs_info, argp, 1);
+   case BTRFS_IOC_LOGICAL_INO_V2:
+   return btrfs_ioctl_logical_to_ino(fs_info, argp, 2);
case BTRFS_IOC_SPACE_INFO:
return btrfs_ioctl_space_info(fs_info, argp);
case BTRFS_IOC_SYNC: {
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 378230c163d5..99bb7988e6fe 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -608,10 +608,14 @@ struct btrfs_ioctl_ino_path_args {
 struct btrfs_ioctl_logical_ino_args {
__u64   logical;/* in */
__u64   size;   /* in */
-   __u64   reserved[4];
+   __u64   reserved[3];/* must be 0 for now */
+   __u64   flags;  /* in, v2 only */

[PATCH v3] btrfs: LOGICAL_INO enhancements

2017-09-22 Thread Zygo Blaxell
Changelog:

v3-v2:

- Stricter check on reserved[] field - now must be all zero, or
userspace gets EINVAL.  This prevents userspace from setting any
of the reserved bits without the kernel providing an unambiguous
interpretation of them, and doesn't require us to burn a flag
bit for each one.

- Moved 'flags' to the end of the reserved[] array.  This allows
existing source code using version 1 of the ioctl to behave the
same way when using version 2 of the btrfs_ioctl_logical_ino_args
struct definition (i.e. reserved[3] becomes an alias for 'flags',
and the addresses of reserved[0-2] don't change).

- Clarified the reasoning in the commit message for patch 2,
"btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2".

v2:

- added patch series intro text

- rebased on 4.14-rc1.

v1:

This patch series fixes some weaknesses in the btrfs LOGICAL_INO ioctl.

Background:

Suppose we have a file with one extent:

root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a
root@tester:~# sync

Split the extent by overwriting it in the middle:

root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 conv=notrunc of=/test/a

We should now have 3 extent refs to 2 extents, with one block unreachable.
The extent tree looks like:

root@tester:~# btrfs-debug-tree /dev/vdc -t 2
[...]
item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53
extent refs 2 gen 29 flags DATA
extent data backref root 5 objectid 261 offset 0 count 2
[...]
item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53
extent refs 1 gen 30 flags DATA
extent data backref root 5 objectid 261 offset 8192 count 1
[...]

and the ref tree looks like:

root@tester:~# btrfs-debug-tree /dev/vdc -t 5
[...]
item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 0 nr 8192 ram 73728
extent compression(none)
item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53
extent data disk byte 1103175680 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 12288 nr 61440 ram 73728
extent compression(none)
[...]

There are two references to the same extent with different, non-overlapping
byte offsets:

   [------------72K extent at 1103101952------------]
   [--8K--|--4K unreachable--|---------60K----------]
      ^                                ^
      |                                |
   [--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--]
                                 |
                                 v
                        [--4K extent--] at 1103175680

We want to find all of the references to extent bytenr 1103101952.

Without the patch (and without running btrfs-debug-tree), we have to
do it with 18 LOGICAL_INO calls:

root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO
inode 261 offset 0 root 5

root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 4096)) -P /test/; done 2>&1 | grep inode
inode 261 offset 0 root 5
inode 261 offset 4096 root 5   <- same extent ref as offset 0
   (offset 8192 returns empty set, not reachable)
inode 261 offset 12288 root 5
inode 261 offset 16384 root 5  \
inode 261 offset 20480 root 5  |
inode 261 offset 24576 root 5  |
inode 261 offset 28672 root 5  |
inode 261 offset 32768 root 5  |
inode 261 offset 36864 root 5  \
inode 261 offset 40960 root 5   > all the same extent ref as offset 12288.
inode 261 offset 45056 root 5  /  More processing required in userspace
inode 261 offset 49152 root 5  |  to figure out these are all duplicates.
inode 261 offset 53248 root 5  |
inode 261 offset 57344 root 5  |
inode 261 offset 61440 root 5  |
inode 261 offset 65536 root 5  |
inode 261 offset 69632 root 5  /

In the worst case the extents are 128MB long, and we have to do 32768
iterations of the loop to find one 4K extent ref.

With the patch, we just use one call to map all refs to the extent at once:

root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO_V2
inode 261 offset 0 root 5
inode 261 offset 12288 root 5

The TREE_SEARCH ioctl allows userspace to retrieve the offset and
extent bytenr fields easily once the root, inode and offset are known.
This is sufficient information to build a complete map of the extent
and all of its references.  Userspace can use this information to make
better choices to dedup or defrag.

Re: [PATCH 2/3] btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2

2017-09-21 Thread Zygo Blaxell
On Thu, Sep 21, 2017 at 12:59:42PM -0700, Darrick J. Wong wrote:
> On Thu, Sep 21, 2017 at 12:10:15AM -0400, Zygo Blaxell wrote:
> > Now that check_extent_in_eb()'s extent offset filter can be turned off,
> > we need a way to do it from userspace.
> > 
> > Add a 'flags' field to the btrfs_logical_ino_args structure to disable
> > extent offset filtering, taking the place of one of the reserved[] fields.
> > 
> > Previous versions of LOGICAL_INO neglected to check whether any of the
> > reserved fields have non-zero values.  Assigning meaning to those fields
> > now may change the behavior of existing programs that left these fields
> > uninitialized.
> > 
> > To avoid any surprises, define a new ioctl LOGICAL_INO_V2 which uses
> > the same argument layout as LOGICAL_INO, but uses one of the reserved
> > fields for flags.  The V2 ioctl explicitly checks that unsupported flag
> > bits are zero so that userspace can probe for future feature bits as
> > they are defined.  If the other reserved fields are used in the future,
> > one of the remaining flag bits could specify that the other reserved
> > fields are valid, so we don't need to check those for now.
> > 
> > Since the memory layouts and behavior of the two ioctls' arguments
> > are almost identical, there is no need for a separate function for
> > logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search).
> > A version parameter and an 'if' statement will suffice.
> > 
> > Now that we have a flags field in logical_ino_args, add a flag
> > BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want,
> > and pass it down the stack to iterate_inodes_from_logical.
> > 
> > Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
> > ---
> >  fs/btrfs/ioctl.c   | 21 ++---
> >  include/uapi/linux/btrfs.h |  8 +++-
> >  2 files changed, 25 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index b7de32568082..2bc3a9588d1d 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -4536,13 +4536,14 @@ static int build_ino_list(u64 inum, u64 offset, u64 root, void *ctx)
> >  }
> >  
> >  static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
> > -   void __user *arg)
> > +   void __user *arg, int version)
> >  {
> > int ret = 0;
> > int size;
> > struct btrfs_ioctl_logical_ino_args *loi;
> > struct btrfs_data_container *inodes = NULL;
> > struct btrfs_path *path = NULL;
> > +   bool ignore_offset;
> >  
> > if (!capable(CAP_SYS_ADMIN))
> > return -EPERM;
> > @@ -4551,6 +4552,17 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
> > if (IS_ERR(loi))
> > return PTR_ERR(loi);
> >  
> > +   if (version == 1) {
> > +   ignore_offset = false;
> > +   } else {
> > +   /* Only accept flags we have defined so far */
> > +   if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
> > +   ret = -EINVAL;
> > +   goto out_loi;
> > +   }
> > +   ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
> 
> Please check loi->reserved[3] for zeroness so that the next person who
> wants to add a field to btrfs_ioctl_logical_ino_args doesn't have to
> create LOGICAL_INO_V3 for the same reason you're creating V2.

OK now I'm confused, in several distinct ways.

I wonder if you meant reserved[1] and reserved[2] there, since I'm not
checking them (for reasons stated in the commit log--we can use flags
to indicate whether and what values are present there).

But that's not the bigger problem.  Maybe you did mean reserved[3], but
there's no "reserved[3]" any more.  I shortened the reserved array from
4 elements to 3, so "reserved[3]" is no longer a valid memory reference.
Also "reserved[0]" no longer refers to the same thing it once did.

> --D
> 
> > +   }
> > +
> > path = btrfs_alloc_path();
> > if (!path) {
> > ret = -ENOMEM;
> > @@ -4566,7 +4578,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
> > }
> >  
> > ret = iterate_inodes_from_logical(loi->logical, fs_info, path,
> > - build_ino_list, inodes, false);
> > + build_ino_list, inodes, ignore_offset);

[PATCH v2] btrfs: LOGICAL_INO enhancements (this time based on 4.14-rc1)

2017-09-20 Thread Zygo Blaxell
The previous patch series was based on v4.12.14, and this introductory
text was missing.

This patch series fixes some weaknesses in the btrfs LOGICAL_INO ioctl.

Background:

Suppose we have a file with one extent:

root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a
root@tester:~# sync

Split the extent by overwriting it in the middle:

root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 conv=notrunc of=/test/a

We should now have 3 extent refs to 2 extents, with one block unreachable.
The extent tree looks like:

root@tester:~# btrfs-debug-tree /dev/vdc -t 2
[...]
item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53
extent refs 2 gen 29 flags DATA
extent data backref root 5 objectid 261 offset 0 count 2
[...]
item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53
extent refs 1 gen 30 flags DATA
extent data backref root 5 objectid 261 offset 8192 count 1
[...]

and the ref tree looks like:

root@tester:~# btrfs-debug-tree /dev/vdc -t 5
[...]
item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 0 nr 8192 ram 73728
extent compression(none)
item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53
extent data disk byte 1103175680 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 12288 nr 61440 ram 73728
extent compression(none)
[...]

There are two references to the same extent with different, non-overlapping
byte offsets:

   [------------72K extent at 1103101952------------]
   [--8K--|--4K unreachable--|---------60K----------]
      ^                                ^
      |                                |
   [--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--]
                                 |
                                 v
                        [--4K extent--] at 1103175680

We now want to find all of the references to extent bytenr 1103101952.

Without the patch (and without running btrfs-debug-tree), we have to
do it with 18 LOGICAL_INO calls:

root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO
inode 261 offset 0 root 5

root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 4096)) -P /test/; done 2>&1 | grep inode
inode 261 offset 0 root 5
inode 261 offset 4096 root 5   <- same extent ref as offset 0
   (offset 8192 returns empty set, not reachable)
inode 261 offset 12288 root 5
inode 261 offset 16384 root 5  \
inode 261 offset 20480 root 5  |
inode 261 offset 24576 root 5  |
inode 261 offset 28672 root 5  |
inode 261 offset 32768 root 5  |
inode 261 offset 36864 root 5  \
inode 261 offset 40960 root 5   > all the same extent ref as offset 12288.
inode 261 offset 45056 root 5  /  More processing required in userspace
inode 261 offset 49152 root 5  |  to figure out these are all duplicates.
inode 261 offset 53248 root 5  |
inode 261 offset 57344 root 5  |
inode 261 offset 61440 root 5  |
inode 261 offset 65536 root 5  |
inode 261 offset 69632 root 5  /

In the worst case the extents are 128MB long, and we have to do 32768
iterations of the loop to find one 4K extent ref.

With the patch, we just use one call to map all refs to the extent at once:

root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO_V2
inode 261 offset 0 root 5
inode 261 offset 12288 root 5

The TREE_SEARCH ioctl allows userspace to retrieve the offset and
extent bytenr fields easily once the root, inode and offset are known.
This is sufficient information to build a complete map of the extent
and all of its references.  Userspace can use this information to make
better choices to dedup or defrag.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] btrfs: increase output size for LOGICAL_INO_V2 ioctl

2017-09-20 Thread Zygo Blaxell
Build-server workloads have hundreds of references per file after dedup.
Multiply by a few snapshots and we quickly exhaust the limit of 2730
references per extent that can fit into a 64K buffer.

Raise the limit to 16M to be consistent with other btrfs ioctls
(e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME).

To minimize surprising userspace behavior, apply this change only to
the LOGICAL_INO_V2 ioctl.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2bc3a9588d1d..4be9b1791f58 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4554,6 +4554,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 
if (version == 1) {
ignore_offset = false;
+   size = min_t(u32, loi->size, SZ_64K);
} else {
/* Only accept flags we have defined so far */
if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
@@ -4561,6 +4562,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
goto out_loi;
}
ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+   size = min_t(u32, loi->size, SZ_16M);
}
 
path = btrfs_alloc_path();
@@ -4569,7 +4571,6 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
goto out;
}
 
-   size = min_t(u32, loi->size, SZ_64K);
inodes = init_data_container(size);
if (IS_ERR(inodes)) {
ret = PTR_ERR(inodes);
-- 
2.11.0



[PATCH 2/3] btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2

2017-09-20 Thread Zygo Blaxell
Now that check_extent_in_eb()'s extent offset filter can be turned off,
we need a way to do it from userspace.

Add a 'flags' field to the btrfs_logical_ino_args structure to disable extent
offset filtering, taking the place of one of the reserved[] fields.

Previous versions of LOGICAL_INO neglected to check whether any of the
reserved fields have non-zero values.  Assigning meaning to those fields
now may change the behavior of existing programs that left these fields
uninitialized.

To avoid any surprises, define a new ioctl LOGICAL_INO_V2 which uses
the same argument layout as LOGICAL_INO, but uses one of the reserved
fields for flags.  The V2 ioctl explicitly checks that unsupported flag
bits are zero so that userspace can probe for future feature bits as
they are defined.  If the other reserved fields are used in the future,
one of the remaining flag bits could specify that the other reserved
fields are valid, so we don't need to check those for now.

Since the memory layouts and behavior of the two ioctls' arguments
are almost identical, there is no need for a separate function for
logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search).
A version parameter and an 'if' statement will suffice.

Now that we have a flags field in logical_ino_args, add a flag
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want,
and pass it down the stack to iterate_inodes_from_logical.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/ioctl.c   | 21 ++---
 include/uapi/linux/btrfs.h |  8 +++-
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b7de32568082..2bc3a9588d1d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4536,13 +4536,14 @@ static int build_ino_list(u64 inum, u64 offset, u64 root, void *ctx)
 }
 
 static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
-   void __user *arg)
+   void __user *arg, int version)
 {
int ret = 0;
int size;
struct btrfs_ioctl_logical_ino_args *loi;
struct btrfs_data_container *inodes = NULL;
struct btrfs_path *path = NULL;
+   bool ignore_offset;
 
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -4551,6 +4552,17 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
if (IS_ERR(loi))
return PTR_ERR(loi);
 
+   if (version == 1) {
+   ignore_offset = false;
+   } else {
+   /* Only accept flags we have defined so far */
+   if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
+   ret = -EINVAL;
+   goto out_loi;
+   }
+   ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+   }
+
path = btrfs_alloc_path();
if (!path) {
ret = -ENOMEM;
@@ -4566,7 +4578,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
}
 
ret = iterate_inodes_from_logical(loi->logical, fs_info, path,
- build_ino_list, inodes, false);
+ build_ino_list, inodes, ignore_offset);
if (ret == -EINVAL)
ret = -ENOENT;
if (ret < 0)
@@ -4580,6 +4592,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 out:
btrfs_free_path(path);
kvfree(inodes);
+out_loi:
kfree(loi);
 
return ret;
@@ -5550,7 +5563,9 @@ long btrfs_ioctl(struct file *file, unsigned int
case BTRFS_IOC_INO_PATHS:
return btrfs_ioctl_ino_to_path(root, argp);
case BTRFS_IOC_LOGICAL_INO:
-   return btrfs_ioctl_logical_to_ino(fs_info, argp);
+   return btrfs_ioctl_logical_to_ino(fs_info, argp, 1);
+   case BTRFS_IOC_LOGICAL_INO_V2:
+   return btrfs_ioctl_logical_to_ino(fs_info, argp, 2);
case BTRFS_IOC_SPACE_INFO:
return btrfs_ioctl_space_info(fs_info, argp);
case BTRFS_IOC_SYNC: {
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 378230c163d5..0b3de597e04f 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -608,10 +608,14 @@ struct btrfs_ioctl_ino_path_args {
 struct btrfs_ioctl_logical_ino_args {
__u64   logical;/* in */
__u64   size;   /* in */
-   __u64   reserved[4];
+   __u64   flags;  /* in, v2 only */
+   __u64   reserved[3];
/* struct btrfs_data_container  *inodes;out   */
__u64   inodes;
 };
+/* Return every ref to the extent, not just those containing logical block.
+ * Requires logical to be in extent */
+#define BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET   (1ULL << 0)

[PATCH 1/3] btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents

2017-09-20 Thread Zygo Blaxell
The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and
offset (encoded as a single logical address) to a list of extent refs.
LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping
(extent ref -> extent bytenr and offset, or logical address).  These are
useful capabilities for programs that manipulate extents and extent
references from userspace (e.g. dedup and defrag utilities).

When the extents are uncompressed (and not encrypted and not other),
check_extent_in_eb performs filtering of the extent refs to remove any
extent refs which do not contain the same extent offset as the 'logical'
parameter's extent offset.  This prevents LOGICAL_INO from returning
references to more than a single block.

To find the set of extent references to an uncompressed extent from [a,
b), userspace has to run a loop like this pseudocode:

for (i = a; i < b; ++i)
extent_ref_set += LOGICAL_INO(i);

At each iteration of the loop (up to 32768 iterations for a 128M extent),
data we are interested in is collected in the kernel, then deleted by
the filter in check_extent_in_eb.

When the extents are compressed (or encrypted or other), the 'logical'
parameter must be an extent bytenr (the 'a' parameter in the loop).
No filtering by extent offset is done (or possible?) so the result is
the complete set of extent refs for the entire extent.  This removes
the need for the loop, since we get all the extent refs in one call.

Add an 'ignore_offset' argument to iterate_inodes_from_logical,
[...several levels of function call graph...], and check_extent_in_eb, so
that we can disable the extent offset filtering for uncompressed extents.
This flag can be set by an improved version of the LOGICAL_INO ioctl to
get either behavior as desired.

There is no functional change in this patch.  The new flag is always
false.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/backref.c| 63 ++-
 fs/btrfs/backref.h|  8 +++---
 fs/btrfs/inode.c  |  2 +-
 fs/btrfs/ioctl.c  |  2 +-
 fs/btrfs/qgroup.c |  8 +++---
 fs/btrfs/scrub.c  |  6 ++---
 fs/btrfs/send.c   |  2 +-
 fs/btrfs/tests/qgroup-tests.c | 20 +++---
 8 files changed, 63 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index b517ef1477ea..a2609786cd86 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -40,12 +40,14 @@ static int check_extent_in_eb(const struct btrfs_key *key,
  const struct extent_buffer *eb,
  const struct btrfs_file_extent_item *fi,
  u64 extent_item_pos,
- struct extent_inode_elem **eie)
+ struct extent_inode_elem **eie,
+ bool ignore_offset)
 {
u64 offset = 0;
struct extent_inode_elem *e;
 
-   if (!btrfs_file_extent_compression(eb, fi) &&
+   if (!ignore_offset &&
+   !btrfs_file_extent_compression(eb, fi) &&
!btrfs_file_extent_encryption(eb, fi) &&
!btrfs_file_extent_other_encoding(eb, fi)) {
u64 data_offset;
@@ -84,7 +86,8 @@ static void free_inode_elem_list(struct extent_inode_elem *eie)
 
 static int find_extent_in_eb(const struct extent_buffer *eb,
 u64 wanted_disk_byte, u64 extent_item_pos,
-struct extent_inode_elem **eie)
+struct extent_inode_elem **eie,
+bool ignore_offset)
 {
u64 disk_byte;
struct btrfs_key key;
@@ -113,7 +116,7 @@ static int find_extent_in_eb(const struct extent_buffer *eb,
if (disk_byte != wanted_disk_byte)
continue;
 
-   ret = check_extent_in_eb(&key, eb, fi, extent_item_pos, eie);
+   ret = check_extent_in_eb(&key, eb, fi, extent_item_pos, eie, ignore_offset);
if (ret < 0)
return ret;
}
@@ -419,7 +422,7 @@ static int add_indirect_ref(const struct btrfs_fs_info *fs_info,
 static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
   struct ulist *parents, struct prelim_ref *ref,
   int level, u64 time_seq, const u64 *extent_item_pos,
-  u64 total_refs)
+  u64 total_refs, bool ignore_offset)
 {
int ret = 0;
int slot;
@@ -472,7 +475,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
if (extent_item_pos) {
ret = check_extent_in_eb(&key, eb, fi,
 

[PATCH 1/3] btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents

2017-09-20 Thread Zygo Blaxell
The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and
offset (encoded as a single logical address) to a list of extent refs.
LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping
(extent ref -> extent bytenr and offset, or logical address).  These are
useful capabilities for programs that manipulate extents and extent
references from userspace (e.g. dedup and defrag utilities).

When the extents are uncompressed (and not encrypted and not other),
check_extent_in_eb performs filtering of the extent refs to remove any
extent refs which do not contain the same extent offset as the 'logical'
parameter's extent offset.  This prevents LOGICAL_INO from returning
references to more than a single block.

To find the set of extent references to an uncompressed extent from [a,
b), userspace has to run a loop like this pseudocode:

for (i = a; i < b; ++i)
extent_ref_set += LOGICAL_INO(i);

At each iteration of the loop (up to 32768 iterations for a 128M extent),
data we are interested in is collected in the kernel, then deleted by
the filter in check_extent_in_eb.

When the extents are compressed (or encrypted or other), the 'logical'
parameter must be an extent bytenr (the 'a' parameter in the loop).
No filtering by extent offset is done (or possible?) so the result is
the complete set of extent refs for the entire extent.  This removes
the need for the loop, since we get all the extent refs in one call.

Add an 'ignore_offset' argument to iterate_inodes_from_logical,
[...several levels of function call graph...], and check_extent_in_eb, so
that we can disable the extent offset filtering for uncompressed extents.
This flag can be set by an improved version of the LOGICAL_INO ioctl to
get either behavior as desired.

There is no functional change in this patch.  The new flag is always
false.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/backref.c | 62 --
 fs/btrfs/backref.h |  8 ---
 fs/btrfs/inode.c   |  2 +-
 fs/btrfs/ioctl.c   |  2 +-
 fs/btrfs/qgroup.c  |  8 +++
 fs/btrfs/scrub.c   |  6 +++---
 fs/btrfs/send.c|  2 +-
 7 files changed, 52 insertions(+), 38 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 1d71a5a4b1b9..3bffd36c6897 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -302,12 +302,14 @@ static int ref_tree_add(struct ref_root *ref_tree, u64 root_id, u64 object_id,
 static int check_extent_in_eb(struct btrfs_key *key, struct extent_buffer *eb,
struct btrfs_file_extent_item *fi,
u64 extent_item_pos,
-   struct extent_inode_elem **eie)
+   struct extent_inode_elem **eie,
+   bool ignore_offset)
 {
u64 offset = 0;
struct extent_inode_elem *e;
 
-   if (!btrfs_file_extent_compression(eb, fi) &&
+   if (!ignore_offset &&
+   !btrfs_file_extent_compression(eb, fi) &&
!btrfs_file_extent_encryption(eb, fi) &&
!btrfs_file_extent_other_encoding(eb, fi)) {
u64 data_offset;
@@ -346,7 +348,8 @@ static void free_inode_elem_list(struct extent_inode_elem *eie)
 
 static int find_extent_in_eb(struct extent_buffer *eb, u64 wanted_disk_byte,
u64 extent_item_pos,
-   struct extent_inode_elem **eie)
+   struct extent_inode_elem **eie,
+   bool ignore_offset)
 {
u64 disk_byte;
struct btrfs_key key;
@@ -375,7 +378,7 @@ static int find_extent_in_eb(struct extent_buffer *eb, u64 wanted_disk_byte,
if (disk_byte != wanted_disk_byte)
continue;
 
-   ret = check_extent_in_eb(&key, eb, fi, extent_item_pos, eie);
+   ret = check_extent_in_eb(&key, eb, fi, extent_item_pos, eie, ignore_offset);
if (ret < 0)
return ret;
}
@@ -511,7 +514,7 @@ static int __add_prelim_ref(struct list_head *head, u64 root_id,
 static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
   struct ulist *parents, struct __prelim_ref *ref,
   int level, u64 time_seq, const u64 *extent_item_pos,
-  u64 total_refs)
+  u64 total_refs, bool ignore_offset)
 {
int ret = 0;
int slot;
@@ -564,7 +567,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
if (extent_item_pos) {
ret = check_extent_in_eb(&key, eb, fi,
*extent_item_pos,
-   &eie);
+   &eie, ignore_offset);

[PATCH 2/3] btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2

2017-09-20 Thread Zygo Blaxell
Now that check_extent_in_eb()'s extent offset filter can be turned off,
we need a way to do it from userspace.

Add a 'flags' field to the btrfs_logical_ino_args structure to disable extent
offset filtering, taking the place of one of the reserved[] fields.

Previous versions of LOGICAL_INO neglected to check whether any of the
reserved fields have non-zero values.  Assigning meaning to those fields
now may change the behavior of existing programs that left these fields
uninitialized.

To avoid any surprises, define a new ioctl LOGICAL_INO_V2 which uses
the same argument layout as LOGICAL_INO, but uses one of the reserved
fields for flags.  The V2 ioctl explicitly checks that unsupported flag
bits are zero so that userspace can probe for future feature bits as
they are defined.  If the other reserved fields are used in the future,
one of the remaining flag bits could specify that the other reserved
fields are valid, so we don't need to check those for now.

Since the memory layouts and behavior of the two ioctls' arguments
are almost identical, there is no need for a separate function for
logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search).
A version parameter and an 'if' statement will suffice.

Now that we have a flags field in logical_ino_args, add a flag
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want,
and pass it down the stack to iterate_inodes_from_logical.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/ioctl.c   | 21 ++---
 include/uapi/linux/btrfs.h |  8 +++-
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c6787660d91f..def0ab85134a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4542,13 +4542,14 @@ static int build_ino_list(u64 inum, u64 offset, u64 root, void *ctx)
 }
 
 static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
-   void __user *arg)
+   void __user *arg, int version)
 {
int ret = 0;
int size;
struct btrfs_ioctl_logical_ino_args *loi;
struct btrfs_data_container *inodes = NULL;
struct btrfs_path *path = NULL;
+   bool ignore_offset;
 
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -4557,6 +4558,17 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
if (IS_ERR(loi))
return PTR_ERR(loi);
 
+   if (version == 1) {
+   ignore_offset = false;
+   } else {
+   /* Only accept flags we have defined so far */
+   if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
+   ret = -EINVAL;
+   goto out_loi;
+   }
+   ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+   }
+
path = btrfs_alloc_path();
if (!path) {
ret = -ENOMEM;
@@ -4572,7 +4584,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
}
 
ret = iterate_inodes_from_logical(loi->logical, fs_info, path,
- build_ino_list, inodes, false);
+ build_ino_list, inodes, ignore_offset);
if (ret == -EINVAL)
ret = -ENOENT;
if (ret < 0)
@@ -4586,6 +4598,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 out:
btrfs_free_path(path);
vfree(inodes);
+out_loi:
kfree(loi);
 
return ret;
@@ -5559,7 +5572,9 @@ long btrfs_ioctl(struct file *file, unsigned int
case BTRFS_IOC_INO_PATHS:
return btrfs_ioctl_ino_to_path(root, argp);
case BTRFS_IOC_LOGICAL_INO:
-   return btrfs_ioctl_logical_to_ino(fs_info, argp);
+   return btrfs_ioctl_logical_to_ino(fs_info, argp, 1);
+   case BTRFS_IOC_LOGICAL_INO_V2:
+   return btrfs_ioctl_logical_to_ino(fs_info, argp, 2);
case BTRFS_IOC_SPACE_INFO:
return btrfs_ioctl_space_info(fs_info, argp);
case BTRFS_IOC_SYNC: {
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index a456e5309238..a23555026994 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -591,10 +591,14 @@ struct btrfs_ioctl_ino_path_args {
 struct btrfs_ioctl_logical_ino_args {
__u64   logical;/* in */
__u64   size;   /* in */
-   __u64   reserved[4];
+   __u64   flags;  /* in, v2 only */
+   __u64   reserved[3];
/* struct btrfs_data_container  *inodes;out   */
__u64   inodes;
 };
+/* Return every ref to the extent, not just those containing logical block.
+ * Requires logical to be in extent */
+#define BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET   (1ULL << 0)

[PATCH 3/3] btrfs: increase output size for LOGICAL_INO_V2 ioctl

2017-09-20 Thread Zygo Blaxell
Build-server workloads have hundreds of references per file after dedup.
Multiply by a few snapshots and we quickly exhaust the limit of 2730
references per extent that can fit into a 64K buffer.

Raise the limit to 16M to be consistent with other btrfs ioctls
(e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME).

To minimize surprising userspace behavior, apply this change only to
the LOGICAL_INO_V2 ioctl.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index def0ab85134a..e13fea25ecb8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4560,6 +4560,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 
if (version == 1) {
ignore_offset = false;
+   size = min_t(u32, loi->size, SZ_64K);
} else {
/* Only accept flags we have defined so far */
if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
@@ -4567,6 +4568,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
goto out_loi;
}
ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+   size = min_t(u32, loi->size, SZ_16M);
}
 
path = btrfs_alloc_path();
@@ -4575,7 +4577,6 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
goto out;
}
 
-   size = min_t(u32, loi->size, SZ_64K);
inodes = init_data_container(size);
if (IS_ERR(inodes)) {
ret = PTR_ERR(inodes);
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4] btrfs: add missing memset while reading compressed inline extents

2017-03-10 Thread Zygo Blaxell
t each run:

0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
(...)

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
Reviewed-by: Liu Bo <bo.li@oracle.com>
---

v4: remove WARN_ON.  Put in the comment about decompression code
filling in zeros up to the end of max_size, and why we need a
memset here.

 fs/btrfs/inode.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 25ac2cf..f41ef5d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6805,6 +6805,20 @@ static noinline int uncompress_inline(struct btrfs_path *path,
max_size = min_t(unsigned long, PAGE_SIZE, max_size);
ret = btrfs_decompress(compress_type, tmp, page,
   extent_offset, inline_size, max_size);
+
+   /*
+    * decompression code contains a memset to fill in any space between the end
+    * of the uncompressed data and the end of max_size in case the decompressed
+    * data ends up shorter than ram_bytes.  That doesn't cover the hole between
+    * the end of an inline extent and the beginning of the next block, so we
+    * cover that region here.
+    */
+
+   if (max_size + pg_offset < PAGE_SIZE) {
+       char *map = kmap(page);
+       memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - pg_offset);
+       kunmap(page);
+   }
kfree(tmp);
return ret;
 }
-- 
2.1.4



Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents

2017-03-10 Thread Zygo Blaxell
On Fri, Mar 10, 2017 at 02:12:54PM -0500, Chris Mason wrote:
> 
> 
> On 03/10/2017 01:56 PM, Zygo Blaxell wrote:
> >On Fri, Mar 10, 2017 at 11:19:24AM -0500, Chris Mason wrote:
> >>On 03/09/2017 11:41 PM, Zygo Blaxell wrote:
> >>>On Thu, Mar 09, 2017 at 10:39:49AM -0500, Chris Mason wrote:
> >>>>
> >>>>
> >>>>On 03/08/2017 09:12 PM, Zygo Blaxell wrote:
> >>>>>This is a story about 4 distinct (and very old) btrfs bugs.
> >>>>>
> >>>>
> >>>>Really great write up.
> >>>>
> >>>>[ ... ]
> >>>>
> >>>>>
> >>>>>diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >>>>>index 25ac2cf..4d41a31 100644
> >>>>>--- a/fs/btrfs/inode.c
> >>>>>+++ b/fs/btrfs/inode.c
> >>>>>@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct 
> >>>>>btrfs_path *path,
> >>>>> max_size = min_t(unsigned long, PAGE_SIZE, max_size);
> >>>>> ret = btrfs_decompress(compress_type, tmp, page,
> >>>>>extent_offset, inline_size, max_size);
> >>>>>+WARN_ON(max_size + pg_offset > PAGE_SIZE);
> >>>>
> >>>>Can you please drop this WARN_ON and make the math reflect any possible
> >>>>pg_offset?  I do agree it shouldn't be happening, but its easy to correct
> >>>>for and the WARN is likely to get lost.
> >>>
> >>>I'm not sure how to do that.  It looks like I'd have to pass pg_offset
> >>>through btrfs_decompress to the decompress functions?
> >>>
> >>>   ret = btrfs_decompress(compress_type, tmp, page,
> >>>   extent_offset, inline_size, max_size, pg_offset);
> >>>
> >>>and in the compression functions get pg_offset from the argument list
> >>>instead of hardcoding zero.
> >>
> >>Yeah, it's a good point.  Both zlib and lzo are assuming a zero pg_offset
> >>right now, but just like there are wacky corners allowing inline extents
> >>followed by more data, there are a few wacky corners allowing inline extents
> >>at the end of the file.
> >>
> >>Lets not mix that change in with this one though.  For now, just get the
> >>memset right and we can pass pg_offset down in a later patch.
> >
> >Are you saying "fix the memset in the patch" (and if so, what's wrong
> >with it?), or are you saying "let's take the patch with its memset as is,
> >and fix the pg_offset > 0 issues later"?
> 
> Your WARN_ON() would fire when this math is bad:
> 
> memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - pg_offset);
> 
> Instead or warning, just don't memset if pg_offset + max_size >= PAGE_SIZE

OK.

While I was looking at this function I noticed that there doesn't seem to be
a sanity check on the data in the extent ref.  e.g. ram_bytes could be 2GB
and nothing would notice.  I'm pretty sure that's only possible by fuzzing,
but it seemed worthwhile to log it if it ever happened.

I'll take the WARN_ON out, and also put in the comment you asked for in the
other branch of this thread.

> -chris
> 


Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents

2017-03-10 Thread Zygo Blaxell
On Fri, Mar 10, 2017 at 11:19:24AM -0500, Chris Mason wrote:
> On 03/09/2017 11:41 PM, Zygo Blaxell wrote:
> >On Thu, Mar 09, 2017 at 10:39:49AM -0500, Chris Mason wrote:
> >>
> >>
> >>On 03/08/2017 09:12 PM, Zygo Blaxell wrote:
> >>>This is a story about 4 distinct (and very old) btrfs bugs.
> >>>
> >>
> >>Really great write up.
> >>
> >>[ ... ]
> >>
> >>>
> >>>diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >>>index 25ac2cf..4d41a31 100644
> >>>--- a/fs/btrfs/inode.c
> >>>+++ b/fs/btrfs/inode.c
> >>>@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct 
> >>>btrfs_path *path,
> >>>   max_size = min_t(unsigned long, PAGE_SIZE, max_size);
> >>>   ret = btrfs_decompress(compress_type, tmp, page,
> >>>  extent_offset, inline_size, max_size);
> >>>+  WARN_ON(max_size + pg_offset > PAGE_SIZE);
> >>
> >>Can you please drop this WARN_ON and make the math reflect any possible
> >>pg_offset?  I do agree it shouldn't be happening, but its easy to correct
> >>for and the WARN is likely to get lost.
> >
> >I'm not sure how to do that.  It looks like I'd have to pass pg_offset
> >through btrfs_decompress to the decompress functions?
> >
> > ret = btrfs_decompress(compress_type, tmp, page,
> > extent_offset, inline_size, max_size, pg_offset);
> >
> >and in the compression functions get pg_offset from the argument list
> >instead of hardcoding zero.
> 
> Yeah, it's a good point.  Both zlib and lzo are assuming a zero pg_offset
> right now, but just like there are wacky corners allowing inline extents
> followed by more data, there are a few wacky corners allowing inline extents
> at the end of the file.
> 
> Lets not mix that change in with this one though.  For now, just get the
> memset right and we can pass pg_offset down in a later patch.

Are you saying "fix the memset in the patch" (and if so, what's wrong
with it?), or are you saying "let's take the patch with its memset as is,
and fix the pg_offset > 0 issues later"?

> -chris
> 




Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents

2017-03-09 Thread Zygo Blaxell
On Thu, Mar 09, 2017 at 10:39:49AM -0500, Chris Mason wrote:
> 
> 
> On 03/08/2017 09:12 PM, Zygo Blaxell wrote:
> >This is a story about 4 distinct (and very old) btrfs bugs.
> >
> 
> Really great write up.
> 
> [ ... ]
> 
> >
> >diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >index 25ac2cf..4d41a31 100644
> >--- a/fs/btrfs/inode.c
> >+++ b/fs/btrfs/inode.c
> >@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct 
> >btrfs_path *path,
> > max_size = min_t(unsigned long, PAGE_SIZE, max_size);
> > ret = btrfs_decompress(compress_type, tmp, page,
> >extent_offset, inline_size, max_size);
> >+WARN_ON(max_size + pg_offset > PAGE_SIZE);
> 
> Can you please drop this WARN_ON and make the math reflect any possible
> pg_offset?  I do agree it shouldn't be happening, but its easy to correct
> for and the WARN is likely to get lost.

I'm not sure how to do that.  It looks like I'd have to pass pg_offset
through btrfs_decompress to the decompress functions?

ret = btrfs_decompress(compress_type, tmp, page,
extent_offset, inline_size, max_size, pg_offset);

and in the compression functions get pg_offset from the argument list
instead of hardcoding zero.

But how does pg_offset become non-zero for an inline extent?  A micro-hole
before the first byte?  If the offset was >= 4096, the data wouldn't
be in the first block so there would never be an inline extent in the
first place.

> >+if (max_size + pg_offset < PAGE_SIZE) {
> >+char *map = kmap(page);
> >+memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - 
> >pg_offset);
> >+kunmap(page);
> >+}
> 
> Both lzo and zlib have a memset to cover the gap between what they actually
> decompress and the max_size that we pass here.  That's important because
> ram_bytes may not be 100% accurate.
> 
> Can you also please toss in a comment about how the decompression code is
> responsible for the memset up to max_bytes?
> 
> -chris




Re: [PATCH] btrfs: add missing memset while reading compressed inline extents

2017-03-08 Thread Zygo Blaxell
On Wed, Mar 08, 2017 at 10:27:33AM +, Filipe Manana wrote:
> On Wed, Mar 8, 2017 at 3:18 AM, Zygo Blaxell
> <zblax...@waya.furryterror.org> wrote:
> > From: Zygo Blaxell <ce3g8...@umail.furryterror.org>
> >
> > This is a story about 4 distinct (and very old) btrfs bugs.
> >
> > Commit c8b978188c ("Btrfs: Add zlib compression support") added
> > three data corruption bugs for inline extents (bugs #1-3).
> >
> > Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
> > fixed bug #1:  uncompressed inline extents followed by a hole and more
> > extents could get non-zero data in the hole as they were read.  The fix
> > was to add a memset in btrfs_get_extent to zero out the hole.
> >
> > Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
> > fixed bug #2:  compressed inline extents which contained non-zero bytes
> > might be replaced with zero bytes in some cases.  This patch removed an
> > unhelpful memset from uncompress_inline, but the case where memset is
> > required was missed.
> >
> > There is also a memset in the decompression code, but this only covers
> > decompressed data that is shorter than the ram_bytes from the extent
> > ref record.  This memset doesn't cover the region between the end of the
> > decompressed data and the end of the page.  It has also moved around a
> > few times over the years, so there's no single patch to refer to.
> >
> > This patch fixes bug #3:  compressed inline extents followed by a hole
> > and more extents could get non-zero data in the hole as they were read
> > (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
> > The fix is the same:  zero out the hole in the compressed case too,
> > by putting a memset back in uncompress_inline, but this time with
> > correct parameters.
> >
> > The last and oldest bug, bug #0, is the cause of the offending inline
> > extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless quirk
> > of behavior somewhere in the btrfs write code.  In a few special cases,
> > an inline extent and hole are allowed to persist where they normally
> > would be combined with later extents in the file.
> >
> > A fast reproducer for bug #0 is presented below.  A few offending extents
> > are also created in the wild during large rsync transfers with the -S
> > flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
> > will produce a handful of offending files as well.  Once an offending
> > file is created, it can present different content to userspace each
> > time it is read.
> >
> > Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
> > kernel back to v3.5 has this behavior.  There are fossil records of this
> > bug's effects in commits all the way back to v2.6.32.  I have no reason
> > to believe bug #0 wasn't present at the beginning of btrfs compression
> > support in v2.6.29, but I can't easily test kernels that old to be sure.
> >
> > It is not clear whether bug #0 is worth fixing.  A fix would likely
> > require injecting extra reads into currently write-only paths, and most
> > of the exceptional cases caused by bug #0 are already handled now.
> >
> > Whether we like them or not, bug #0's inline extents followed by holes
> > are part of the btrfs de-facto disk format now, and we need to be able
> > to read them without data corruption or an infoleak.  So enough about
> > bug #0, let's get back to bug #3 (this patch).
> >
> > An example of on-disk structure leading to data corruption:
> >
> > item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
> > inode generation 50 transid 50 size 47424 nbytes 49141
> > block group 0 mode 100644 links 1 uid 0 gid 0
> > rdev 0 flags 0x0(none)
> > item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
> > inode ref index 3 namelen 10 name: DB_File.so
> > item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
> > inline extent data size 1341 ram 4085 compress(zlib)
> > item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
> > extent data disk byte 5367308288 nr 20480
> > extent data offset 0 nr 45056 ram 45056
> > extent compression(zlib)
> 
> So this case is actually different from the reproducer below, because
> once a file has prealloc extents, future writes will never be
> compressed. That is, the extent at offset 4096 can not ha

[PATCH v3] btrfs: add missing memset while reading compressed inline extents

2017-03-08 Thread Zygo Blaxell
0

Actual output:  the data from byte 1000 to the end of the first
4096 byte page will be corrupt/infoleak:

0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
(...)

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
Reviewed-by: Liu Bo <bo.li@oracle.com>
---

v3: Clarify that there are two distinct methods to create the hole,
but both lead to the same corruption/infoleak when the hole is read.
No code change.

v2: I'm not able to contrive a test case where pg_offset != 0, but we
might as well handle it anyway.

 fs/btrfs/inode.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 25ac2cf..4d41a31 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
max_size = min_t(unsigned long, PAGE_SIZE, max_size);
ret = btrfs_decompress(compress_type, tmp, page,
   extent_offset, inline_size, max_size);
+   WARN_ON(max_size + pg_offset > PAGE_SIZE);
+   if (max_size + pg_offset < PAGE_SIZE) {
+   char *map = kmap(page);
+   memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - pg_offset);
+   kunmap(page);
+   }
kfree(tmp);
return ret;
 }
-- 
2.1.4



[PATCH] btrfs: add missing memset while reading compressed inline extents

2017-03-07 Thread Zygo Blaxell
From: Zygo Blaxell <ce3g8...@umail.furryterror.org>

This is a story about 4 distinct (and very old) btrfs bugs.

Commit c8b978188c ("Btrfs: Add zlib compression support") added
three data corruption bugs for inline extents (bugs #1-3).

Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
fixed bug #1:  uncompressed inline extents followed by a hole and more
extents could get non-zero data in the hole as they were read.  The fix
was to add a memset in btrfs_get_extent to zero out the hole.

Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
fixed bug #2:  compressed inline extents which contained non-zero bytes
might be replaced with zero bytes in some cases.  This patch removed an
unhelpful memset from uncompress_inline, but the case where memset is
required was missed.

There is also a memset in the decompression code, but this only covers
decompressed data that is shorter than the ram_bytes from the extent
ref record.  This memset doesn't cover the region between the end of the
decompressed data and the end of the page.  It has also moved around a
few times over the years, so there's no single patch to refer to.

This patch fixes bug #3:  compressed inline extents followed by a hole
and more extents could get non-zero data in the hole as they were read
(i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
The fix is the same:  zero out the hole in the compressed case too,
by putting a memset back in uncompress_inline, but this time with
correct parameters.

The last and oldest bug, bug #0, is the cause of the offending inline
extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless quirk
of behavior somewhere in the btrfs write code.  In a few special cases,
an inline extent and hole are allowed to persist where they normally
would be combined with later extents in the file.

A fast reproducer for bug #0 is presented below.  A few offending extents
are also created in the wild during large rsync transfers with the -S
flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
will produce a handful of offending files as well.  Once an offending
file is created, it can present different content to userspace each
time it is read.

Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
kernel back to v3.5 has this behavior.  There are fossil records of this
bug's effects in commits all the way back to v2.6.32.  I have no reason
to believe bug #0 wasn't present at the beginning of btrfs compression
support in v2.6.29, but I can't easily test kernels that old to be sure.

It is not clear whether bug #0 is worth fixing.  A fix would likely
require injecting extra reads into currently write-only paths, and most
of the exceptional cases caused by bug #0 are already handled now.

Whether we like them or not, bug #0's inline extents followed by holes
are part of the btrfs de-facto disk format now, and we need to be able
to read them without data corruption or an infoleak.  So enough about
bug #0, let's get back to bug #3 (this patch).

An example of on-disk structure leading to data corruption:

item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
inode generation 50 transid 50 size 47424 nbytes 49141
block group 0 mode 100644 links 1 uid 0 gid 0
rdev 0 flags 0x0(none)
item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
inode ref index 3 namelen 10 name: DB_File.so
item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
inline extent data size 1341 ram 4085 compress(zlib)
item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
extent data disk byte 5367308288 nr 20480
extent data offset 0 nr 45056 ram 45056
extent compression(zlib)

Different data appears in userspace during each read of the 11 bytes
between 4085 and 4096.  The extent in item 63 is not long enough to
fill the first page of the file, so a memset is required to fill the
space between item 63 (ending at 4085) and item 64 (beginning at 4096)
with zero.

Here is a reproducer from Liu Bo:

Using 'page_poison=on' kernel command line (or enable
CONFIG_PAGE_POISONING) run the following:

# touch foo
# chattr +c foo
# xfs_io -f -c "pwrite -W 0 1000" foo
# xfs_io -f -c "falloc 4 8188" foo
# od -x foo
# echo 3 >/proc/sys/vm/drop_caches
# od -x foo

This produces the following on my box:

0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000
0001760 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000

0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
(...)

v2: I'm not able to contrive a 

Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents

2017-02-18 Thread Zygo Blaxell
Ping?

This is still reproducible on 4.9.8.

On Mon, Nov 28, 2016 at 12:03:12AM -0500, Zygo Blaxell wrote:
> Commit c8b978188c ("Btrfs: Add zlib compression support") produces
> data corruption when reading a file with a hole positioned after an
> inline extent.  btrfs_get_extent will return uninitialized kernel memory
> instead of zero bytes in the hole.
> 
> Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
> fills the hole by memset to zero after *uncompressed* inline extents.
> 
> This patch provides the missing memset for holes after *compressed*
> inline extents.
> 
> The offending holes appear in the wild and will appear during routine
> data integrity audits (e.g. comparing backups against their originals).
> They can also be created intentionally by fuzzing or crafting a filesystem
> image.
> 
> Holes like these are not intended to occur in btrfs; however, I tested
> tagged kernels between v3.5 and the present, and found that all of them
> can create such holes.  Whether we like them or not, this kind of hole
> is now part of the btrfs de-facto on-disk format, and we need to be able
> to read such holes without an infoleak or wrong data.
> 
> An example of a hole leading to data corruption:
> 
> item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
> inode generation 50 transid 50 size 47424 nbytes 49141
> block group 0 mode 100644 links 1 uid 0 gid 0
> rdev 0 flags 0x0(none)
> item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
> inode ref index 3 namelen 10 name: DB_File.so
> item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
> inline extent data size 1341 ram 4085 compress(zlib)
> item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
> extent data disk byte 5367308288 nr 20480
> extent data offset 0 nr 45056 ram 45056
> extent compression(zlib)
> 
> Different data appears in userspace during each uncached read of the 10
> bytes between offset 4085 and 4095.  The extent in item 63 is not long
> enough to fill the first page of the file, so a memset is required to
> fill the space between item 63 (ending at 4085) and item 64 (beginning
> at 4096) with zero.
> 
> Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
> 
> ---
>  fs/btrfs/inode.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8e3a5a2..b1314d6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct 
> btrfs_path *path,
>   max_size = min_t(unsigned long, PAGE_SIZE, max_size);
>   ret = btrfs_decompress(compress_type, tmp, page,
>  extent_offset, inline_size, max_size);
> + WARN_ON(max_size > PAGE_SIZE);
> + if (max_size < PAGE_SIZE) {
> + char *map = kmap(page);
> + memset(map + max_size, 0, PAGE_SIZE - max_size);
> + kunmap(page);
> + }
>   kfree(tmp);
>   return ret;
>  }
> -- 
> 2.1.4
> 


Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?

2017-01-08 Thread Zygo Blaxell
On Wed, Jan 04, 2017 at 07:58:55AM -0500, Austin S. Hemmelgarn wrote:
> On 2017-01-03 16:35, Peter Becker wrote:
> >As i understand the duperemove source-code right (i work on/ try to
> >improve this code since 5 or 6 weeks on multiple parts), duperemove
> >does hashing and calculation before they call extend_same.
> >Duperemove stores all in a hashfile and read this. after all files
> >hashed, and duplicates detected, the progress all in order without
> >reading new data form disk / hashfile. so the byte-by-byte comparison
> >of extend_same ioctl should consume the full possible bandwidth of the
> >disks.
> Not necessarily.  You've actually got a significant amount of processing
> between each disk operation.  General ordering inside the ioctl is:
> 1. Do generic ioctl setup.
> 2. Lock the extents.
> 3. Read the ranges into memory.
> 4. Compare the ranges.
> 5. If the ranges are identical, write out the changes needed to reflink
> them.
> 6. Unlock all the extents.
> 7. Do generic ioctl cleanup.
> 1 and 7 in particular are pretty heavy.  Ioctls were not intended to be
> called with this kind of frequency, and that fact really shows in the setup
> and teardown (overhead is way higher than a syscall).

Steps 1 and 7 are not heavy at all.  ioctl setup is an order of magnitude
higher than other system calls, but still up to 11 orders of magnitude
faster than the other steps.  The other steps are *slow*, and step 5
is orders of magnitude slower than all the others combined.

Most of the time in step 5 is spent deleting the dst extent refs
(or waiting for transaction commits, but everything waits for those).
It gets worse when you have big files (1G and larger), more extents,
and more extent references in the same inode.  On a 100G file the overhead
of manipulating shared extent refs is so large that the rest of the
extent-same ioctl is just noise by comparison (microseconds vs minutes).

The commit 1d57ee9 "btrfs: improve delayed refs iterations" (merged in
v4.10-rc1) helps a bit with this, but deleting shared refs is still
one of the most expensive things you can do in btrfs.

> The operation ended
> up being an ioctl instead of a syscall (or extension to another syscall)
> because:
> 1. Manipulating low-level filesystem state is part of what they're intended
> to be used for.
> 2. Introducing a new FS specific ioctl is a whole lot less controversial
> than introducing a new FS specific syscall.
> >
> >1. dbfile_load_hashes
> >2. find_all_dupes
> >3. dedupe_results
> >-> call the following in N threads:
> >>dedupe_extent_list
> >>>list_for_each_entry
> add_extent_to_dedupe #produce a simple list/queue
> dedupe_extents
> >btrfs_extent_same
> >>BTRFS_IOC_FILE_EXTENT_SAME
> >
> >So if this right, one of this thinks is realy slow:
> >
> >1. byte-per-byte comparison
> There's no way that this part can't be slow.  You need to load the data into
> the registers to do the comparison, you can't just point something at RAM
> and get an answer.  On x86, this in turn means that the comparison amounts
> to a loop of 2 loads followed by a compare and a branch for , repeated once
> for each range beyond the first, and that's assuming that the compiler
> optimizes it to the greatest degree possible.  On some other systems the
> compare and branch are one instruction, on others the second load might be
> eliminated, but overall it's not something that can be sped up all that
> much.

On cheap amd64 machines this can be done at gigabytes per second.  Not much
gain from optimizing this.

> >2. sets up the reflinks
> This actually is not as efficient as it sounds like it should be, adding
> reflinks means updating metadata, which means that there is some unavoidable
> overhead here.  I doubt that it's where the issue is, but I may be wrong.

Most of the time spent here is spent waiting for IO.  extent-same seems to
imply fsync() with all the performance cost thereof.

> >3. unlocks the new extent
> There's one other aspect not listed here, locking the original extents,
> which can actually add quite a lot of overhead if the files are actually
> being used.
> >
> >If i'm not wrong with my understanding of the duperemove source code,
> >this behaivor should also affected the online dedupe feature on with
> >Qu Wenruo works.
> AFAIK, that uses a different code path from the batch deduplication ioctl.
> It also doesn't have the context switches and other overhead from an ioctl
> involved, because it's done in kernel code.

No difference there--the extent-same ioctl is all kernel code too.

> >2017-01-03 21:40 GMT+01:00 Austin S. Hemmelgarn :
> >>On 2017-01-03 15:20, Peter Becker wrote:
> >>>
> >>>I think i understand. The resulting keyquestion is, how i can improve
> >>>the performance of extend_same ioctl.
> >>>I tested it with following results:
> >>>
> >>>enviorment:
> >>>2 files, called "file", size each 100GB, duperemove nofiemap-options
> >>>set, 1MB extend size.
> >>>
> >>>duperemove 

Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents

2016-12-12 Thread Zygo Blaxell
05969:  1:   35913575: last,eof
./drivers/ata/.pata_sis.o.cmd: 9 extents found

Note that corruption can only occur if the first extent (the inline extent
at offset 0) is compressed (encoded).  Uncompressed inline extents (like
the one above) will not be corrupted due to the fix in commit 93c82d5750.

If commit 93c82d5750 is reverted, you can get corruption on uncompressed
files too.

>Thanks,
>Xin
>
>    Sent: Saturday, December 10, 2016 at 9:16 PM
>From: "Zygo Blaxell" <ce3g8...@umail.furryterror.org>
>To: "Roman Mamedov" <r...@romanrm.net>, "Filipe Manana" <fdman...@gmail.com>
>Cc: linux-btrfs@vger.kernel.org
>Subject: Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents
>[...]


signature.asc
Description: Digital signature


Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents

2016-12-10 Thread Zygo Blaxell
Ping?

I know at least two people have read this patch, but it hasn't appeared in
the usual integration branches yet, and I've seen no actionable suggestion
to improve it.  I've provided two non-overlapping rationales for it.
Is there something else you are looking for?

This patch is a fix for a simple data corruption bug.  It (or some
equivalent fix for the same bug) should be on its way to all stable
kernels starting from 2.6.32.

Thanks

On Mon, Nov 28, 2016 at 05:27:10PM +0500, Roman Mamedov wrote:
> On Mon, 28 Nov 2016 00:03:12 -0500
> Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
> 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 8e3a5a2..b1314d6 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
> >  	max_size = min_t(unsigned long, PAGE_SIZE, max_size);
> >  	ret = btrfs_decompress(compress_type, tmp, page,
> >  			       extent_offset, inline_size, max_size);
> > +	WARN_ON(max_size > PAGE_SIZE);
> > +	if (max_size < PAGE_SIZE) {
> > +		char *map = kmap(page);
> > +		memset(map + max_size, 0, PAGE_SIZE - max_size);
> > +		kunmap(page);
> > +	}
> >  	kfree(tmp);
> >  	return ret;
> >  }
> 
> Wasn't this already posted as:
> 
> btrfs: fix silent data corruption while reading compressed inline extents
> https://patchwork.kernel.org/patch/9371971/
> 
> but you don't indicate that's a V2 or something, and in fact the patch seems
> exactly the same, just the subject and commit message are entirely different.
> Quite confusing.
> 
> -- 
> With respect,
> Roman
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


signature.asc
Description: Digital signature


[PATCH v2] btrfs-progs: utils: negative numbers are more plausible than sizes over 8 EiB

2016-12-03 Thread Zygo Blaxell
I got tired of seeing "16.00EiB" whenever btrfs-progs encounters a
negative size value, e.g. during resize:

Unallocated:
   /dev/mapper/datamd18   16.00EiB

This version is much more useful:

Unallocated:
   /dev/mapper/datamd18  -26.29GiB

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>

---
v2: change the function prototype so it's easier to see that the
mangling implied by the name "pretty" includes "reinterpretation
of the u64 value as a signed quantity."
---
 utils.c | 12 ++--
 utils.h |  4 ++--
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/utils.c b/utils.c
index 69b580a..07e8443 100644
--- a/utils.c
+++ b/utils.c
@@ -2575,7 +2575,7 @@ out:
  * Note: this function uses a static per-thread buffer. Do not call this
  * function more than 10 times within one argument list!
  */
-const char *pretty_size_mode(u64 size, unsigned mode)
+const char *pretty_size_mode(s64 size, unsigned mode)
 {
static __thread int ps_index = 0;
static __thread char ps_array[10][32];
@@ -2594,20 +2594,20 @@ static const char* unit_suffix_binary[] =
 static const char* unit_suffix_decimal[] =
{ "B", "kB", "MB", "GB", "TB", "PB", "EB"};
 
-int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mode)
+int pretty_size_snprintf(s64 size, char *str, size_t str_size, unsigned unit_mode)
 {
int num_divs;
float fraction;
-   u64 base = 0;
+   s64 base = 0;
int mult = 0;
const char** suffix = NULL;
-   u64 last_size;
+   s64 last_size;
 
if (str_size == 0)
return 0;
 
if ((unit_mode & ~UNITS_MODE_MASK) == UNITS_RAW) {
-   snprintf(str, str_size, "%llu", size);
+   snprintf(str, str_size, "%lld", size);
return 0;
}
 
@@ -2642,7 +2642,7 @@ int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mod
   num_divs = 0;
   break;
default:
-   while (size >= mult) {
+   while ((size < 0 ? -size : size) >= mult) {
last_size = size;
size /= mult;
num_divs++;
diff --git a/utils.h b/utils.h
index 366ca29..525bde9 100644
--- a/utils.h
+++ b/utils.h
@@ -174,9 +174,9 @@ int check_mounted_where(int fd, const char *file, char *where, int size,
 int btrfs_device_already_in_root(struct btrfs_root *root, int fd,
 int super_offset);
 
-int pretty_size_snprintf(u64 size, char *str, size_t str_bytes, unsigned unit_mode);
+int pretty_size_snprintf(s64 size, char *str, size_t str_bytes, unsigned unit_mode);
 #define pretty_size(size)  pretty_size_mode(size, UNITS_DEFAULT)
-const char *pretty_size_mode(u64 size, unsigned mode);
+const char *pretty_size_mode(s64 size, unsigned mode);
 
 u64 parse_size(char *s);
 u64 parse_qgroupid(const char *p);
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: utils: negative numbers are more plausible than sizes over 8 EiB

2016-12-03 Thread Zygo Blaxell
On Sat, Dec 03, 2016 at 10:25:17AM -0800, Omar Sandoval wrote:
> On Sat, Dec 03, 2016 at 01:19:38AM -0500, Zygo Blaxell wrote:
> > I got tired of seeing "16.00EiB" whenever btrfs-progs encounters a
> > negative size value.
> > 
> > e.g. during filesystem shrink we see:
> > 
> > Unallocated:
> >/dev/mapper/testvol0   16.00EiB
> > 
> > Interpreting this as a signed quantity is much more useful:
> > 
> > Unallocated:
> >/dev/mapper/testvol0  -26.29GiB
> > 
> > Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
> > ---
> >  utils.c | 13 -
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> > 
> > diff --git a/utils.c b/utils.c
> > index 69b580a..bd2b66e 100644
> > --- a/utils.c
> > +++ b/utils.c
> > @@ -2594,20 +2594,23 @@ static const char* unit_suffix_binary[] =
> >  static const char* unit_suffix_decimal[] =
> > { "B", "kB", "MB", "GB", "TB", "PB", "EB"};
> >  
> > -int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mode)
> > +int pretty_size_snprintf(u64 usize, char *str, size_t str_size, unsigned unit_mode)
> >  {
> > int num_divs;
> > float fraction;
> > -   u64 base = 0;
> > +   s64 base = 0;
> > int mult = 0;
> > const char** suffix = NULL;
> > -   u64 last_size;
> > +   s64 last_size;
> >  
> > if (str_size == 0)
> > return 0;
> >  
> > +   /* Negative numbers are more plausible than sizes over 8 EiB. */
> > +   s64 size = (s64)usize;
> 
> Just make pretty_size_snprintf() take an s64 size so it's clear from the
> function signature that it's signed instead of hidden in the definition.

I intentionally buried the unsigned -> signed conversion in the lowest
level function so I wouldn't trigger signed/unsigned conversion warnings
at all 46 call sites for pretty_size_mode.  The btrfs code uses u64
endemically for all size data, and I wasn't about to try to change that.

The word "pretty" in the function name should imply that what comes out
is a possibly lossy transformation of what goes in.  Since "16.00EiB"
is much more lossy than "-29.96GiB", I believe I am merely reducing the
lossiness quantitatively rather than qualitatively.
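The reinterpretation under discussion is plain two's complement: casting the u64 to s64 turns a wrapped-around value back into the small negative it encodes. A standalone sketch (the function name is mine, not btrfs-progs code, and the cast is technically implementation-defined in C, though universal on the platforms btrfs-progs targets):

```c
#include <stdint.h>

/* Sketch: reinterpret a btrfs u64 size as signed, as the patch does, so a
 * wrapped-around value prints as a small negative instead of ~16 EiB. */
double signed_size_gib(uint64_t usize)
{
	int64_t size = (int64_t)usize;  /* two's-complement reinterpretation */
	return (double)size / (double)(1024 * 1024 * 1024);
}
```

For the shrink example above, a u64 holding the two's-complement encoding of a ~26 GiB deficit divides down to roughly -26.29 rather than something near 16 EiB.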

On the other hand, the signed/unsigned warning isn't enabled by default
in this project.  I can certainly do it that way if you prefer.

> > +
> > if ((unit_mode & ~UNITS_MODE_MASK) == UNITS_RAW) {
> > -   snprintf(str, str_size, "%llu", size);
> > +   snprintf(str, str_size, "%lld", size);
> > return 0;
> > }
> >  
> > @@ -2642,7 +2645,7 @@ int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mod
> >num_divs = 0;
> >break;
> > default:
> > -   while (size >= mult) {
> > +   while ((size < 0 ? -size : size) >= mult) {
> > last_size = size;
> > size /= mult;
> > num_divs++;
> > -- 
> > 2.1.4
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


signature.asc
Description: Digital signature


[PATCH] btrfs-progs: utils: negative numbers are more plausible than sizes over 8 EiB

2016-12-02 Thread Zygo Blaxell
I got tired of seeing "16.00EiB" whenever btrfs-progs encounters a
negative size value.

e.g. during filesystem shrink we see:

Unallocated:
   /dev/mapper/testvol0   16.00EiB

Interpreting this as a signed quantity is much more useful:

Unallocated:
   /dev/mapper/testvol0  -26.29GiB

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 utils.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/utils.c b/utils.c
index 69b580a..bd2b66e 100644
--- a/utils.c
+++ b/utils.c
@@ -2594,20 +2594,23 @@ static const char* unit_suffix_binary[] =
 static const char* unit_suffix_decimal[] =
{ "B", "kB", "MB", "GB", "TB", "PB", "EB"};
 
-int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mode)
+int pretty_size_snprintf(u64 usize, char *str, size_t str_size, unsigned unit_mode)
 {
int num_divs;
float fraction;
-   u64 base = 0;
+   s64 base = 0;
int mult = 0;
const char** suffix = NULL;
-   u64 last_size;
+   s64 last_size;
 
if (str_size == 0)
return 0;
 
+   /* Negative numbers are more plausible than sizes over 8 EiB. */
+   s64 size = (s64)usize;
+
if ((unit_mode & ~UNITS_MODE_MASK) == UNITS_RAW) {
-   snprintf(str, str_size, "%llu", size);
+   snprintf(str, str_size, "%lld", size);
return 0;
}
 
@@ -2642,7 +2645,7 @@ int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mod
   num_divs = 0;
   break;
default:
-   while (size >= mult) {
+   while ((size < 0 ? -size : size) >= mult) {
last_size = size;
size /= mult;
num_divs++;
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: raid with a variable stripe size

2016-11-29 Thread Zygo Blaxell
On Tue, Nov 29, 2016 at 02:03:58PM +0800, Qu Wenruo wrote:
> At 11/29/2016 01:51 PM, Chris Murphy wrote:
> >On Mon, Nov 28, 2016 at 5:48 PM, Qu Wenruo  wrote:
> >>
> >>
> >>At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
> >>>
> >>>Hello,
> >>>
> >>>these are only my thoughts; no code here, but I would like to share it
> >>>hoping that it could be useful.
> >>>
> >>>As reported several times by Zygo (and others), one of the problem of
> >>>raid5/6 is the write hole. Today BTRFS is not capable to address it.
> >>
> >>
> >>I'd say, no need to address yet, since current soft RAID5/6 can't handle it
> >>yet.
> >>
> >>Personally speaking, Btrfs should implement RAID56 support just like
> >>Btrfs on mdadm.
> >>See how badly the current RAID56 works?
> >>
> >>The marginal benefit of btrfs RAID56 scrubbing data better than traditional
> >>RAID56 is just a joke in the current code base.
> >
> >Btrfs is subject to the write hole problem on disk, but any read or
> >scrub that needs to reconstruct from parity that is corrupt results in
> >a checksum error and EIO. So corruption is not passed up to user
> >space. Recent versions of md/mdadm support a write journal to avoid
> >the write hole problem on disk in case of a crash.
> 
> That's interesting.
> 
> So I think it's less worthy to support RAID56 in btrfs, especially
> considering the stability.
> 
> My wildest dream is, btrfs calls device mapper to build a micro RAID1/5/6/10
> device for each chunk.
> Which should save us tons of codes and bugs.
> 
> And for better recovery, enhance device mapper to provide interface to judge
> which block is correct.
> 
> Although that's just dream anyway.

It would be nice to do that for balancing.  In many balance cases
(especially device delete and full balance after device add) it's not
necessary to rewrite the data in a block group, only copy it verbatim
to a different physical location (like pvmove does) and update the chunk
tree with the new address when it's done.  No need to rewrite the whole
extent tree.

> Thanks,
> Qu
> >
> >>>The problem is that the stripe size is bigger than the "sector size" (ok
> >>>sector is not the correct word, but I am referring to the basic unit of
> >>>writing on the disk, which is 4k or 16K in btrfs).
> >>>So when btrfs writes less data than the stripe, the stripe is not filled;
> >>>when it is filled by a subsequent write, a RMW of the parity is required.
> >>>
> >>>On the best of my understanding (which could be very wrong) ZFS try to
> >>>solve this issue using a variable length stripe.
> >>
> >>
> >>Did you mean ZFS record size?
> >>IIRC that's file extent minimum size, and I didn't see how that can handle
> >>the write hole problem.
> >>
> >>Or did ZFS handle the problem?
> >
> >ZFS isn't subject to the write hole. My understanding is they get
> >around this because all writes are COW, there is no RMW.
> >But the
> >variable stripe size means they don't have to do the usual (fixed)
> >full stripe write for just, for example a 4KiB change in data for a
> >single file. Conversely Btrfs does do RMW in such a case.
> >
> >
> >>Anyway, it should be a low priority thing, and personally speaking,
> >>any large behavior modification involving  both extent allocator and bg
> >>allocator will be bug prone.
> >
> >I tend to agree. I think the non-scalability of Btrfs raid10, which
> >makes it behave more like raid 0+1, is a higher priority because right
> >now it's misleading to say the least; and then the longer term goal
> >for scaleable huge file systems is how Btrfs can shed irreparably
> >damaged parts of the file system (tree pruning) rather than
> >reconstruction.
> >
> >
> >
> 
> 


signature.asc
Description: Digital signature


Re: RFC: raid with a variable stripe size

2016-11-29 Thread Zygo Blaxell
On Tue, Nov 29, 2016 at 01:49:09PM +0800, Qu Wenruo wrote:
> >>>My proposal requires only a modification to the extent allocator.
> >>>The behavior at the block group layer and scrub remains exactly the same.
> >>>We just need to adjust the allocator slightly to take the RAID5 CoW
> >>>constraints into account.
> >>
> >>Then, you'd need to allow btrfs to split large buffered/direct write into
> >>small extents(not 128M anymore).
> >>Not sure if we need to do extra work for DirectIO.
> >
> >Nope, that's not my proposal.  My proposal is to simply ignore free
> >space whenever it's inside a partially filled raid stripe (optimization:
> >...which was empty at the start of the current transaction).
> 
> Still have problems.
> 
> Allocator must handle fs under device remove or profile converting (from 4
> disks raid5 to 5 disk raid5/6) correctly.
> Which already seems complex for me.

Those would be allocations in separate block groups with different stripe
widths.  Already handled in btrfs.

> And further more, for fs with more devices, for example, 9 devices RAID5.
> It will be a disaster to just write a 4K data and take up the whole 8 * 64K
> space.
> It will  definitely cause huge ENOSPC problem.

If you called fsync() after every 4K, yes; otherwise you can just batch
up small writes into full-size stripes.  The worst case isn't common
enough to be a serious problem for a lot of the common RAID5 use cases
(i.e. non-database workloads).  I wouldn't try running a database on
it--I'd use a RAID1 or RAID10 array for that instead, because the other
RAID5 performance issues would be deal-breakers.

On ZFS the same case degenerates into something like btrfs RAID1 over
the 9 disks, which burns over 50% of the space.  More efficient than 
wasting 99% of the space, but still wasteful.

> If you really think it's easy, make a RFC patch, which should be easy if it
> is, then run fstest auto group on it.

I plan to when I get time; however, that could be some months in the
future and I don't want to "claim" the task and stop anyone else from
taking a crack at it in the meantime.

> Easy words won't turn emails into real patch.
> 
> >That avoids modifying a stripe with committed data and therefore plugs the
> >write hole.
> >
> >For nodatacow, prealloc (and maybe directio?) extents the behavior
> >wouldn't change (you'd have write hole, but only on data blocks not
> >metadata, and only on files that were already marked as explicitly not
> >requiring data integrity).
> >
> >>And in fact, you're going to support variant max file extent size.
> >
> >The existing extent sizing behavior is not changed *at all* in my proposal,
> >only the allocator's notion of what space is 'free'.
> >
> >We can write an extent across multiple RAID5 stripes so long as we
> >finish writing the entire extent before pointing committed metadata to
> >it.  btrfs does that already otherwise checksums wouldn't work.
> >
> >>This makes delalloc more complex (Wang enhanced dealloc support for variant
> >>file extent size, to fix ENOSPC problem for dedupe and compression).
> >>
> >>This is already much more complex than you expected.
> >
> >The complexity I anticipate is having to deal with two implementations
> >of the free space search, one for free space cache and one for free
> >space tree.
> >
> >It could be as simple as calling the existing allocation functions and
> >just filtering out anything that isn't suitably aligned inside a raid56
> >block group (at least for a proof of concept).
> >
> >>And this is the *BIGGEST* problem of current btrfs:
> >>No good enough(if there is any) *ISOLATION* for such a complex fs.
> >>
> >>So even "small" modification can lead to unexpected bugs.
> >>
> >>That's why I want to isolate the fix in RAID56 layer, not any layer upwards.
> >
> >I don't think the write hole is fixable in the current raid56 layer, at
> >least not without a nasty brute force solution like stripe update journal.
> >
> >Any of the fixes I'd want to use fix the problem from outside.
> >
> >>If not possible, I prefer not to do anything yet, until we are sure the very
> >>basic part of RAID56 is stable.
> >>
> >>Thanks,
> >>Qu
> >>
> >>>
> >>>It's not as efficient as the ZFS approach, but it doesn't require an
> >>>incompatible disk format change either.
> >>>
> >On BTRFS this could be achieved using several BGs (== block group or 
> >chunk), one for each stripe size.
> >
> >For example, if a filesystem - RAID5 is composed by 4 DISK, the 
> >filesystem should have three BGs:
> >BG #1,composed by two disks (1 data+ 1 parity)
> >BG #2 composed by three disks (2 data + 1 parity)
> >BG #3 composed by four disks (3 data + 1 parity).
> 
> Too complicated bg layout and further extent allocator modification.
> 
> More code means more bugs, and I'm pretty sure it will be bug prone.
> 
> 
> Although the idea of variable stripe size can somewhat reduce the problem
> under certain situation.

Re: RFC: raid with a variable stripe size

2016-11-28 Thread Zygo Blaxell
On Tue, Nov 29, 2016 at 12:12:03PM +0800, Qu Wenruo wrote:
> 
> 
> At 11/29/2016 11:53 AM, Zygo Blaxell wrote:
> >On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote:
> >>At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
> >>>Hello,
> >>>
> >>>these are only my thoughts; no code here, but I would like to share it 
> >>>hoping that it could be useful.
> >>>
> >>>As reported several times by Zygo (and others), one of the problem
> >>>of raid5/6 is the write hole. Today BTRFS is not capable to address it.
> >>
> >>I'd say, no need to address yet, since current soft RAID5/6 can't handle it
> >>yet.
> >>
> >>Personally speaking, Btrfs should implement RAID56 support just like
> >>Btrfs on mdadm.
> >
> >Even mdadm doesn't implement it the way btrfs does (assuming all bugs
> >are fixed) any more.
> >
> >>See how badly the current RAID56 works?
> >
> >>The marginal benefit of btrfs RAID56 scrubbing data better than traditional
> >>RAID56 is just a joke in the current code base.
> >
> >>>The problem is that the stripe size is bigger than the "sector size"
> >>>(ok sector is not the correct word, but I am referring to the basic
> >>>unit of writing on the disk, which is 4k or 16K in btrfs).
> >>>So when btrfs writes less data than the stripe, the stripe is not filled;
> >>>when it is filled by a subsequent write, a RMW of the parity is required.
> >>>
> >>>On the best of my understanding (which could be very wrong) ZFS try
> >>>to solve this issue using a variable length stripe.
> >>
> >>Did you mean ZFS record size?
> >>IIRC that's file extent minimum size, and I didn't see how that can handle
> >>the write hole problem.
> >>
> >>Or did ZFS handle the problem?
> >
> >ZFS's strategy does solve the write hole.  In btrfs terms, ZFS embeds the
> >parity blocks within extents, so it behaves more like btrfs compression
> >in the sense that the data in a RAID-Z extent is encoded differently
> >from the data in the file, and the kernel has to transform it on reads
> >and writes.
> >
> >No ZFS stripe can contain blocks from multiple different
> >transactions because the RAID-Z stripes begin and end on extent
> >(single-transaction-write) boundaries, so there is no write hole on ZFS.
> >
> >There is some space waste in ZFS because the minimum allocation unit
> >is two blocks (one data one parity) so any free space that is less
> >than two blocks long is unusable.  Also the maximum usable stripe width
> >(number of disks) is the size of the data in the extent plus one parity
> >block.  It means if you write a lot of discontiguous 4K blocks, you
> >effectively get 2-disk RAID1 and that may result in disappointing
> >storage efficiency.
> >
> >(the above is for RAID-Z1.  For Z2 and Z3 add an extra block or two
> >for additional parity blocks).
> >
> >One could implement RAID-Z on btrfs, but it's by far the most invasive
> >proposal for fixing btrfs's write hole so far (and doesn't actually fix
> >anything, since the existing raid56 format would still be required to
> >read old data, and it would still be broken).
> >
> >>Anyway, it should be a low priority thing, and personally speaking,
> >>any large behavior modification involving  both extent allocator and bg
> >>allocator will be bug prone.
> >
> >My proposal requires only a modification to the extent allocator.
> >The behavior at the block group layer and scrub remains exactly the same.
> >We just need to adjust the allocator slightly to take the RAID5 CoW
> >constraints into account.
> 
> Then, you'd need to allow btrfs to split large buffered/direct write into
> small extents(not 128M anymore).
> Not sure if we need to do extra work for DirectIO.

Nope, that's not my proposal.  My proposal is to simply ignore free
space whenever it's inside a partially filled raid stripe (optimization:
...which was empty at the start of the current transaction).

That avoids modifying a stripe with committed data and therefore plugs the
write hole.
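The rule can be sketched as a filter over candidate free ranges. Everything concrete here is an illustrative assumption, not btrfs code: the stripe geometry, the bitmap standing in for per-stripe "held committed data at transaction start" state, and the 64-stripe cap that the bitmap implies:

```c
#include <stdbool.h>
#include <stdint.h>

#define STRIPE_LEN (3 * 65536ULL)  /* assumed geometry: 3 data disks x 64K */

/* Sketch of the proposal: a free byte range may be allocated only if no
 * stripe it touches already held committed data at the start of the
 * transaction.  Bit s of `committed` set means stripe s holds such data. */
bool range_is_allocatable(uint64_t start, uint64_t len, uint64_t committed)
{
	if (len == 0)
		return true;
	uint64_t first = start / STRIPE_LEN;
	uint64_t last = (start + len - 1) / STRIPE_LEN;
	for (uint64_t s = first; s <= last && s < 64; s++)
		if (committed & (1ULL << s))
			return false;  /* would RMW a committed stripe */
	return true;
}
```

A range entirely inside an empty stripe passes; a range overlapping any committed stripe is skipped, so no write ever does read-modify-write against parity that protects committed data.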

For nodatacow, prealloc (and maybe directio?) extents the behavior
wouldn't change (you'd have write hole, but only on data blocks not
metadata, and only on files that were already marked as explicitly not
requiring data integrity).

> And in fact, you're going to support variant max file extent size.

The existing extent sizing behavior is not changed *at all* in my proposal,
only the allocator's notion of what space is 'free'.

We can write an

Re: RFC: raid with a variable stripe size

2016-11-28 Thread Zygo Blaxell
On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote:
> At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
> >Hello,
> >
> >these are only my thoughts; no code here, but I would like to share it 
> >hoping that it could be useful.
> >
> >As reported several times by Zygo (and others), one of the problem
> >of raid5/6 is the write hole. Today BTRFS is not capable to address it.
> 
> I'd say, no need to address yet, since current soft RAID5/6 can't handle it
> yet.
> 
> Personally speaking, Btrfs should implement RAID56 support just like
> Btrfs on mdadm.

Even mdadm doesn't implement it the way btrfs does (assuming all bugs
are fixed) any more.

> See how badly the current RAID56 works?

> The marginal benefit of btrfs RAID56 scrubbing data better than traditional
> RAID56 is just a joke in the current code base.

> >The problem is that the stripe size is bigger than the "sector size"
> >(ok sector is not the correct word, but I am referring to the basic
> >unit of writing on the disk, which is 4k or 16K in btrfs).
> >So when btrfs writes less data than the stripe, the stripe is not filled;
> >when it is filled by a subsequent write, a RMW of the parity is required.
> >
> >On the best of my understanding (which could be very wrong) ZFS try
> >to solve this issue using a variable length stripe.
>
> Did you mean ZFS record size?
> IIRC that's file extent minimum size, and I didn't see how that can handle
> the write hole problem.
> 
> Or did ZFS handle the problem?

ZFS's strategy does solve the write hole.  In btrfs terms, ZFS embeds the
parity blocks within extents, so it behaves more like btrfs compression
in the sense that the data in a RAID-Z extent is encoded differently
from the data in the file, and the kernel has to transform it on reads
and writes.

No ZFS stripe can contain blocks from multiple different
transactions because the RAID-Z stripes begin and end on extent
(single-transaction-write) boundaries, so there is no write hole on ZFS.

There is some space waste in ZFS because the minimum allocation unit
is two blocks (one data one parity) so any free space that is less
than two blocks long is unusable.  Also the maximum usable stripe width
(number of disks) is the size of the data in the extent plus one parity
block.  It means if you write a lot of discontiguous 4K blocks, you
effectively get 2-disk RAID1 and that may result in disappointing
storage efficiency.

(the above is for RAID-Z1.  For Z2 and Z3 add an extra block or two
for additional parity blocks).
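That accounting is concrete enough to compute. The sketch below gives a lower bound for RAID-Z1 (it deliberately ignores RAID-Z's additional padding of allocations to a multiple of parity+1 blocks, and the function name is illustrative):

```c
/* Sketch: blocks consumed writing `data_blocks` blocks to RAID-Z1 across
 * `width` disks: one parity block per stripe row of (width - 1) data
 * columns.  Assumes width >= 2 and data_blocks >= 1. */
unsigned raidz1_blocks_used(unsigned data_blocks, unsigned width)
{
	unsigned data_cols = width - 1;
	unsigned parity = (data_blocks + data_cols - 1) / data_cols;
	return data_blocks + parity;
}
```

A lone 4K block on a 9-disk RAID-Z1 consumes two blocks (one data, one parity), i.e. 50% efficiency like 2-disk RAID1, while a full-width write of 8 data blocks consumes nine, approaching the nominal 8/9 efficiency.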

One could implement RAID-Z on btrfs, but it's by far the most invasive
proposal for fixing btrfs's write hole so far (and doesn't actually fix
anything, since the existing raid56 format would still be required to
read old data, and it would still be broken).

> Anyway, it should be a low priority thing, and personally speaking,
> any large behavior modification involving  both extent allocator and bg
> allocator will be bug prone.

My proposal requires only a modification to the extent allocator.
The behavior at the block group layer and scrub remains exactly the same.
We just need to adjust the allocator slightly to take the RAID5 CoW
constraints into account.

It's not as efficient as the ZFS approach, but it doesn't require an
incompatible disk format change either.

> >On BTRFS this could be achieved using several BGs (== block group or chunk), 
> >one for each stripe size.
> >
> >For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem 
> >should have three BGs:
> >BG #1,composed by two disks (1 data+ 1 parity)
> >BG #2 composed by three disks (2 data + 1 parity)
> >BG #3 composed by four disks (3 data + 1 parity).
> 
> Too complicated bg layout and further extent allocator modification.
> 
> More code means more bugs, and I'm pretty sure it will be bug prone.
> 
> 
> Although the idea of variable stripe size can somewhat reduce the problem
> under certain situation.
> 
> For example, if sectorsize is 64K, and we make stripe len to 32K, and use 3
> disc RAID5, we can avoid such write hole problem.
> Withouth modification to extent/chunk allocator.
> 
> And I'd prefer to make stripe len mkfs time parameter, not possible to
> modify after mkfs. To make things easy.
> 
> Thanks,
> Qu
> 
> >
> >If the data to be written has a size of 4k, it will be allocated to the BG 
> >#1.
> >If the data to be written has a size of 8k, it will be allocated to the BG #2
> >If the data to be written has a size of 12k, it will be allocated to the BG 
> >#3
> >If the data to be written has a size greater than 12k, it will be allocated 
> >to the BG3, until the data fills a full stripes; then the remainder will be 
> >stored in BG #1 or BG #2.
> >
> >
> >To avoid unbalancing of the disk usage, each BG could use all the disks, 
> >even if a stripe uses less disks: i.e
> >
> >DISK1 DISK2 DISK3 DISK4
> >S1    S1    S1    S2
> >S2    S2    S3    S3
> >S3    S4    S4    S4
> >[]
> >
> >Above is show a BG which uses all the four disks, but 

Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-28 Thread Zygo Blaxell
On Tue, Nov 29, 2016 at 02:52:47AM +0100, Christoph Anton Mitterer wrote:
> On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote:
> > If a drive's
> > embedded controller RAM fails, you get corruption on the majority of
> > reads from a single disk, and most writes will be corrupted (even if
> > they
> > were not before).
> 
> Administrating a multi-PiB Tier-2 for the LHC Computing Grid with quite
> a number of disks for nearly 10 years now, I'd have never stumbled on
> such a case of breakage so far...

In data centers you won't see breakages that are common on desktop and
laptop drives.  Laptops in particular sometimes (often?) go to places
that are much less friendly to hardware.

All my NAS and enterprise drives in server racks and data centers just
wake up one morning stone dead or with a few well-behaved bad sectors,
with none of this drama.  Boring!

> Actually most cases are as simple as HDD fails to work and this is
> properly signalled to the controller.
> 
> 
> 
> > If there's a transient failure due to environmental
> > issues (e.g. short-term high-amplitude vibration or overheating) then
> > writes may pause for mechanical retry loops.  If there is bitrot in
> > SSDs
> > (particularly in the address translation tables) it looks like a wall
> > of random noise that only ends when the disk goes offline.  You can
> > get
> > combinations of these (e.g. RAM failures caused by transient
> > overheating)
> > where the drive's behavior changes over time.
> > 
> > When in doubt, don't write.
> 
> Sorry, but these cases as any cases of memory issues (be it main memory
> or HDD controller) would also kick in at any normal writes.

Yes, but in a RAID1 context there will be another disk with a good copy
(or if main RAM is failing, the entire filesystem will be toast no matter
what you do).

> So there's no point in protecting against this on the storage side...
> 
> Either never write at all... or have good backups for these rare cases.
> 
> 
> 
> Cheers,
> Chris.




signature.asc
Description: Digital signature


Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-28 Thread Zygo Blaxell
On Mon, Nov 28, 2016 at 07:32:38PM +0100, Goffredo Baroncelli wrote:
> On 2016-11-28 04:37, Christoph Anton Mitterer wrote:
> > I think for safety it's best to repair as early as possible (and thus
> > on read when a damage is detected), as further  blocks/devices may fail
> > till eventually a scrub(with repair) would be run manually.
> > 
> > However, there may some workloads under which such auto-repair is
> > undesirable as it may cost performance and safety may be less important
> > than that.
> 
> I am assuming that a corruption is a quite rare event. So occasionally
> it could happen that a page is corrupted and the system corrects
> it. This shouldn't have an impact on the workloads.

Depends heavily on the specifics of the failure case.  If a drive's
embedded controller RAM fails, you get corruption on the majority of
reads from a single disk, and most writes will be corrupted (even if they
were not before).  If there's a transient failure due to environmental
issues (e.g. short-term high-amplitude vibration or overheating) then
writes may pause for mechanical retry loops.  If there is bitrot in SSDs
(particularly in the address translation tables) it looks like a wall
of random noise that only ends when the disk goes offline.  You can get
combinations of these (e.g. RAM failures caused by transient overheating)
where the drive's behavior changes over time.

When in doubt, don't write.

> BR
> G.Baroncelli
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 




Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents

2016-11-28 Thread Zygo Blaxell
On Mon, Nov 28, 2016 at 05:27:10PM +0500, Roman Mamedov wrote:
> On Mon, 28 Nov 2016 00:03:12 -0500
> Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
> 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 8e3a5a2..b1314d6 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct 
> > btrfs_path *path,
> > max_size = min_t(unsigned long, PAGE_SIZE, max_size);
> > ret = btrfs_decompress(compress_type, tmp, page,
> >extent_offset, inline_size, max_size);
> > +   WARN_ON(max_size > PAGE_SIZE);
> > +   if (max_size < PAGE_SIZE) {
> > +   char *map = kmap(page);
> > +   memset(map + max_size, 0, PAGE_SIZE - max_size);
> > +   kunmap(page);
> > +   }
> > kfree(tmp);
> > return ret;
> >  }
> 
> Wasn't this already posted as:
> 
> btrfs: fix silent data corruption while reading compressed inline extents
> https://patchwork.kernel.org/patch/9371971/
> 
> but you don't indicate that's a V2 or something, and in fact the patch seems
> exactly the same, just the subject and commit message are entirely different.
> Quite confusing.

The previous commit message discussed the related hole-creation bug,
including a reproducer; however, this patch does not fix the hole-creation
bug and was never intended to.  Despite my follow-up clarification,
reviewers got distracted by the hole-creation bug discussion and didn't
recover, so the patch didn't go anywhere.

This patch only fixes _reading_ the holes after they are created, and
the new commit message and subject line state that much more clearly.

The patch didn't change, so I didn't add 'v2'.  There's no 'v1' with
the same title, so I thought a 'v2' tag would be more confusing than
just starting over.

The hole-creation bug is a very old, low-urgency issue.  btrfs filesystems
in the field have the buggy holes already, and have been creating new
ones from 2009(*) to the present.  I had to ask a few people before I found
one who knew whether it was even a bug, or intentional behavior from
the beginning.


(*) 2009 is the oldest commit date I can find that introduces a change
which would only be necessary in the presence of the hole-creation bug.
I have not been able to test kernels before 2012 because they crash
while running my reproducer.

> -- 
> With respect,
> Roman
> 




[PATCH] btrfs: fix hole read corruption for compressed inline extents

2016-11-27 Thread Zygo Blaxell
Commit c8b978188c ("Btrfs: Add zlib compression support") produces
data corruption when reading a file with a hole positioned after an
inline extent.  btrfs_get_extent will return uninitialized kernel memory
instead of zero bytes in the hole.

Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
fills the hole by memset to zero after *uncompressed* inline extents.

This patch provides the missing memset for holes after *compressed*
inline extents.

The offending holes appear in the wild and will appear during routine
data integrity audits (e.g. comparing backups against their originals).
They can also be created intentionally by fuzzing or crafting a filesystem
image.

Holes like these are not intended to occur in btrfs; however, I tested
tagged kernels between v3.5 and the present, and found that all of them
can create such holes.  Whether we like them or not, this kind of hole
is now part of the btrfs de-facto on-disk format, and we need to be able
to read such holes without an infoleak or wrong data.

An example of a hole leading to data corruption:

item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
inode generation 50 transid 50 size 47424 nbytes 49141
block group 0 mode 100644 links 1 uid 0 gid 0
rdev 0 flags 0x0(none)
item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
inode ref index 3 namelen 10 name: DB_File.so
item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
inline extent data size 1341 ram 4085 compress(zlib)
item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
extent data disk byte 5367308288 nr 20480
extent data offset 0 nr 45056 ram 45056
extent compression(zlib)

Different data appears in userspace during each uncached read of the 10
bytes between offset 4085 and 4095.  The extent in item 63 is not long
enough to fill the first page of the file, so a memset is required to
fill the space between item 63 (ending at 4085) and item 64 (beginning
at 4096) with zero.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>

---
 fs/btrfs/inode.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e3a5a2..b1314d6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct btrfs_path 
*path,
max_size = min_t(unsigned long, PAGE_SIZE, max_size);
ret = btrfs_decompress(compress_type, tmp, page,
   extent_offset, inline_size, max_size);
+   WARN_ON(max_size > PAGE_SIZE);
+   if (max_size < PAGE_SIZE) {
+   char *map = kmap(page);
+   memset(map + max_size, 0, PAGE_SIZE - max_size);
+   kunmap(page);
+   }
kfree(tmp);
return ret;
 }
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-27 Thread Zygo Blaxell
On Sun, Nov 27, 2016 at 12:16:34AM +0100, Goffredo Baroncelli wrote:
> On 2016-11-26 19:54, Zygo Blaxell wrote:
> > On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
> >> On 2016-11-25 05:31, Zygo Blaxell wrote:
> [...]
> >>
> >> BTW Btrfs in RAID1 mode corrects the data even in the read case. So
> > 
> > Have you tested this?  I think you'll find that it doesn't.
> 
> Yes I tested it; and it does the rebuild automatically.
> I corrupted a disk of mirror, then I read the related file. The log  says:
> 
> [   59.287748] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 
> 12813760 expected csum 3114703128
> [   59.291542] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 
> 12813760 expected csum 3114703128
> [   59.294950] BTRFS info (device vdb): read error corrected: ino 257 off 0 
> (dev /dev/vdb sector 2154496)
> ^

> IIRC In case of RAID5/6 the last line is missing. However in both the
> case the data returned is good; but in RAID1 the data is corrected
> also on the disk.
> 
> Where you read that the data is not rebuild automatically ?

Experience?  I have real disk failures all the time.  Errors on RAID1
arrays persist until scrubbed.

No, wait... _transid_ errors always persist until scrubbed.  csum failures
are rewritten in repair_io_failure.  There is a comment earlier in
repair_io_failure that rewrite in RAID56 is not supported yet.

> In fact I was surprised that RAID5/6 behaves differently

The difference is surprising, no matter which strategy you believe
is correct.  ;)

> >> I am still convinced that is the RAID5/6 behavior "strange".
> >>
> >> BR
> >> G.Baroncelli
> >> -- 
> >> gpg @keyserver.linux.it: Goffredo Baroncelli 
> >> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> >>
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 




Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-26 Thread Zygo Blaxell
On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
> On 2016-11-25 05:31, Zygo Blaxell wrote:
> >>> Do you mean, read the corrupted data won't repair it?
> >>>
> >>> IIRC that's the designed behavior.
> >> :O
> >>
> >> You are right... I was unaware of that
> > This is correct.
> > 
> > Ordinary reads shouldn't touch corrupt data, they should only read
> > around it.  Scrubs in read-write mode should write corrected data over
> > the corrupt data.  Read-only scrubs can only report errors without
> > correcting them.
> > 
> > Rewriting corrupt data outside of scrub (i.e. on every read) is a
> > bad idea.  Consider what happens if a RAM controller gets too hot:
> > checksums start failing randomly, but the data on disk is still OK.
> > If we tried to fix the bad data on every read, we'd probably just trash
> > the filesystem in some cases.
> 
> 
> 
> I cant agree. If the filesystem is mounted read-only this behavior may
> be correct; bur in others cases I don't see any reason to not correct
> wrong data even in the read case. If your ram is unreliable you have
> big problem anyway.

If you don't like RAM corruption, pick any other failure mode.  Laptops
have to deal with things like vibration and temperature extremes which
produce the same results (spurious csum failures and IO errors under
conditions where writing will only destroy data that would otherwise
be recoverable).

> The likelihood that the data contained in a disk is "corrupted" is
> higher than the likelihood that the RAM is bad.
>
> BTW Btrfs in RAID1 mode corrects the data even in the read case. So

Have you tested this?  I think you'll find that it doesn't.

> I am still convinced that is the RAID5/6 behavior "strange".
> 
> BR
> G.Baroncelli
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 




Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-24 Thread Zygo Blaxell
On Fri, Nov 25, 2016 at 03:40:36PM +1100, Gareth Pye wrote:
> On Fri, Nov 25, 2016 at 3:31 PM, Zygo Blaxell
> <ce3g8...@umail.furryterror.org> wrote:
> >
> > This risk mitigation measure does rely on admins taking a machine in this
> > state down immediately, and also somehow knowing not to start a scrub
> > while their RAM is failing...which is kind of an annoying requirement
> > for the admin.
> 
> Attempting to detect if RAM is bad when scrub starts is both time
> consuming and not very reliable right.

RAM, like all hardware, could fail at any time, and a scrub could already
be running when it happens.  This is annoying but also a fact of life that
admins have to deal with.

Testing RAM before scrub starts is not more beneficial than testing RAM
at random intervals--but if you are testing RAM at random intervals,
why not do it at the same intervals as scrub?

If I see corruption errors showing up in stats, I will do a basic sanity
test to make sure they're coming from the storage layer and not somewhere
closer to the CPU.  If all errors come from one device and there are clear
log messages showing SCSI device errors and the SMART log matches the
other data, RAM is probably not the root cause of the failures, so scrub away.

If normally reliable programs like /bin/sh start randomly segfaulting,
there's smoke pouring out of the back of the machine, all the disks are
full of csum failures, and the BIOS welcome message has spelling errors
that weren't there before, I would *not* start a scrub.  More like
turn the machine off, take it apart, test all the pieces separately,
and only do a scrub after everything above the storage layer had been
replaced or recertified.  I certainly wouldn't want the filesystem to
try to fix the csum failures it finds in such situations.





Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-24 Thread Zygo Blaxell
On Tue, Nov 22, 2016 at 07:02:13PM +0100, Goffredo Baroncelli wrote:
> On 2016-11-22 01:28, Qu Wenruo wrote:
> > 
> > 
> > At 11/22/2016 02:48 AM, Goffredo Baroncelli wrote:
> >> Hi Qu,
> >>
> >> I tested this succefully for RAID5 when doing a scrub (i.e.: I mount a 
> >> corrupted disks, then I ran "btrfs scrub start ...", then I check the 
> >> disks).
> >>
> >> However if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
> >> 1) the system detect that the file is corrupted   (good :) )
> >> 2) the system return the correct file content (good :) )
> >> 3) the data on the platter are still wrong(no good :( )
> > 
> > Do you mean, read the corrupted data won't repair it?
> > 
> > IIRC that's the designed behavior.
> 
> :O
> 
> You are right... I was unaware of that

This is correct.

Ordinary reads shouldn't touch corrupt data, they should only read
around it.  Scrubs in read-write mode should write corrected data over
the corrupt data.  Read-only scrubs can only report errors without
correcting them.

Rewriting corrupt data outside of scrub (i.e. on every read) is a
bad idea.  Consider what happens if a RAM controller gets too hot:
checksums start failing randomly, but the data on disk is still OK.
If we tried to fix the bad data on every read, we'd probably just trash
the filesystem in some cases.

This risk mitigation measure does rely on admins taking a machine in this
state down immediately, and also somehow knowing not to start a scrub
while their RAM is failing...which is kind of an annoying requirement
for the admin.

> So you can add a "tested-by: Goffredo Baroncelli "
> 
> BR
> G.Baroncelli
> 
> > 
> > For RAID5/6 read, there are several different mode, like READ_REBUILD or 
> > SCRUB_PARITY.
> > 
> > I'm not sure for write, but for read it won't write correct data.
> > 
> > So it's a designed behavior if I don't miss something.
> > 
> > Thanks,
> > Qu
> > 
> >>
> >>
> >> Enclosed the script which reproduces the problem. Note that:
> >> If I corrupt the data, in the dmesg two time appears a line which says:
> >>
> >> [ 3963.763384] BTRFS warning (device loop2): csum failed ino 257 off 0 
> >> csum 2280586218 expected csum 3192393815
> >> [ 3963.766927] BTRFS warning (device loop2): csum failed ino 257 off 0 
> >> csum 2280586218 expected csum 3192393815
> >>
> >> If I corrupt the parity, of course the system doesn't detect the 
> >> corruption nor try to correct it. But this is the expected behavior.
> >>
> >> BR
> >> G.Baroncelli
> >>
> >>
> >>
> >> On 2016-11-21 09:50, Qu Wenruo wrote:
> >>> In the following situation, scrub will calculate wrong parity to
> >>> overwrite correct one:
> >>>
> >>> RAID5 full stripe:
> >>>
> >>> Before
> >>> | Dev 1  | Dev  2 | Dev 3 |
> >>> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> >>> --- 0
> >>> | 0x (Bad)   | 0xcdcd | 0x|
> >>> --- 4K
> >>> | 0xcdcd | 0xcdcd | 0x|
> >>> ...
> >>> | 0xcdcd | 0xcdcd | 0x|
> >>> --- 64K
> >>>
> >>> After scrubbing dev3 only:
> >>>
> >>> | Dev 1  | Dev  2 | Dev 3 |
> >>> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> >>> --- 0
> >>> | 0xcdcd (Good)  | 0xcdcd | 0xcdcd (Bad)  |
> >>> --- 4K
> >>> | 0xcdcd | 0xcdcd | 0x|
> >>> ...
> >>> | 0xcdcd | 0xcdcd | 0x|
> >>> --- 64K
> >>>
> >>> The calltrace of such corruption is as following:
> >>>
> >>> scrub_bio_end_io_worker() get called for each extent read out
> >>> |- scriub_block_complete()
> >>>|- Data extent csum mismatch
> >>>|- scrub_handle_errored_block
> >>>   |- scrub_recheck_block()
> >>>  |- scrub_submit_raid56_bio_wait()
> >>> |- raid56_parity_recover()
> >>>
> >>> Now we have a rbio with correct data stripe 1 recovered.
> >>> Let's call it "good_rbio".
> >>>
> >>> scrub_parity_check_and_repair()
> >>> |- raid56_parity_submit_scrub_rbio()
> >>>|- lock_stripe_add()
> >>>|  |- steal_rbio()
> >>>| |- Recovered data are steal from "good_rbio", stored into
> >>>|rbio->stripe_pages[]
> >>>|Now rbio->bio_pages[] are bad data read from disk.
> >>>|- async_scrub_parity()
> >>>   |- scrub_parity_work() (delayed_call to scrub_parity_work)
> >>>
> >>> scrub_parity_work()
> >>> |- raid56_parity_scrub_stripe()
> >>>|- validate_rbio_for_parity_scrub()
> >>>   |- finish_parity_scrub()
> >>>  |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
> >>> So good parity is overwritten with *BAD* one
> >>>
> >>> The fix is to introduce 2 new members, 

Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe

2016-11-24 Thread Zygo Blaxell
On Wed, Nov 23, 2016 at 05:26:18PM -0800, Darrick J. Wong wrote:
[...]
> Keep in mind that the number of bytes deduped is returned to userspace
> via file_dedupe_range.info[x].bytes_deduped, so a properly functioning
> userspace program actually /can/ detect that its 128MB request got cut
> down to only 16MB and re-issue the request with the offsets moved up by
> 16MB.  The dedupe client in xfs_io (see dedupe_ioctl() in io/reflink.c)
> implements this strategy.  duperemove (the only other user I know of)
> also does this.
> 
> So it's really no big deal to increase the limit beyond 16MB, eliminate
> it entirely, or even change it to cap the total request size while
> dropping the per-item IO limit.
> 
> As I mentioned in my other reply, the only hesitation I have for not
> killing XFS_MAX_DEDUPE_LEN is that I feel that 2GB is enough IO for a
> single ioctl call.

Everything's relative.  btrfs has ioctls that will do hundreds of
terabytes of IO and take months to run.  2GB of data is nothing.

Deduping entire 100TB files with a single ioctl call makes as much
sense to me as reflink copying them with a single ioctl call.  The only
reason I see to keep the limit is to work around something wrong with
the implementation.
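
For what it's worth, the offset-advancing loop Darrick describes in the
quote above is easy to drive from userspace.  A minimal sketch, with the
real FIDEDUPERANGE call hidden behind a `dedupe_once` callable so the
loop logic stands alone (the callable and the clamp constant are
stand-ins for illustration, not a real ioctl wrapper):

```python
MAX_DEDUPE_LEN = 16 * 1024 * 1024   # per-call clamp discussed in this thread

def dedupe_range(dedupe_once, src_off, dst_off, length):
    """Dedupe a large range by re-issuing clamped requests, advancing
    both offsets by bytes_deduped after each call.

    dedupe_once(src_off, dst_off, length) stands in for one
    FIDEDUPERANGE call and returns the bytes deduped by that call.
    Returns the total bytes deduped before the first failed call.
    """
    total = 0
    while length > 0:
        chunk = min(length, MAX_DEDUPE_LEN)
        done = dedupe_once(src_off, dst_off, chunk)
        if done == 0:               # contents differ or no progress: stop
            break
        total += done
        src_off += done
        dst_off += done
        length -= done
    return total
```

As the quote notes, xfs_io's dedupe_ioctl() and duperemove both implement
essentially this strategy around the real ioctl.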





Re: Identifying reflink / CoW files

2016-11-24 Thread Zygo Blaxell
On Fri, Nov 04, 2016 at 03:41:49PM +0100, Saint Germain wrote:
> On Thu, 3 Nov 2016 01:17:07 -0400, Zygo Blaxell
> <ce3g8...@umail.furryterror.org> wrote :
> > [...]
> > The quality of the result therefore depends on the amount of effort
> > put into measuring it.  If you look for the first non-hole extent in
> > each file and use its physical address as a physical file identifier,
> > then you get a fast reflink detector function that has a high risk of
> > false positives.  If you map out two files and compare physical
> > addresses block by block, you get a slow function with a low risk of
> > false positives (but maybe a small risk of false negatives too).
> > 
> > If your dedup program only does full-file reflink copies then the
> > first extent physical address method is sufficient.  If your program
> > does block- or extent-level dedup then it shouldn't be using files in
> > its data model at all, except where necessary to provide a mechanism
> > to access the physical blocks through the POSIX filesystem API.
> > 
> > FIEMAP will tell you about all the extents (physical address for
> > extents that have them, zero for other extent types).  It's also slow
> > and has assorted accuracy problems especially with compressed files.
> > Any user can run FIEMAP, and it uses only standard structure arrays.
> > 
> > SEARCH_V2 is root-only and requires parsing variable-length binary
> > btrfs data encoding, but it's faster than FIEMAP and gives more
> > accurate results on compressed files.
> 
> As the dedup program only does full-file reflink, the first extent
> physical address method can be used as a fast first check to identify
> potential files.
> 
> But how to implement the second check in order to have 0% risk of false
> positive ?
> Because you said that mapping out two files and comparing the physical
> addresses block by block also has a low risk of false positives.

In theory, what you do is call FIEMAP on each file and compare the
physical blocks that come back.  If they are large files you will have
to call FIEMAP multiple times on both files, each time setting the start
position to the end position of the previous run.  Translate each result
record into a range of physical addresses, then compare them.  If there
were no differences, the files are already deduped.
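
The comparison step can be sketched as a pure function over the extent
records, assuming each file's records have already been fetched with
FIEMAP and reduced to (logical, physical, length) tuples; the 4K
granularity and the physical==0 hole convention are simplifying
assumptions here:

```python
BLOCK = 4096

def physical_map(extents):
    """Flatten (logical, physical, length) records into sorted
    (logical block, physical block) pairs; physical 0 marks a hole."""
    pairs = []
    for logical, physical, length in extents:
        for off in range(0, length, BLOCK):
            pairs.append((logical + off, physical + off if physical else 0))
    return sorted(pairs)

def already_deduped(extents_a, extents_b):
    """True if both files map every block to the same physical address."""
    return physical_map(extents_a) == physical_map(extents_b)
```

Normalizing to per-block pairs keeps the comparison correct even when
the same physical ranges are split across differently-sized extent
records in the two files.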

In practice, FIEMAP doesn't provide full accuracy for compressed extents,
and in some cases the physical address data will compare equal when
the files are in fact different.  This is the small risk of false
positives, and the only way to get 100% accuracy is to not use FIEMAP.

Instead you can use the SEARCH ioctl, which dumps out the binary extent
items from btrfs.  If you look up the items corresponding to one inode,
you can get the real physical block address plus the offset from the
beginning of the extent for compressed extents.

In Bees I encode the compressed extent start offset into the same
uint64_t as the physical extent start address using the bottom 6 bits
of the physical (bytenr) address:

https://github.com/Zygo/bees/blob/master/src/bees-types.cc#L744

This fills in an object which uniquely (and reversibly) identifies
the block on the filesystem.
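
As a simplified sketch of that packing (the idea only, not bees' exact
bit layout; the real code linked above also reserves bits for flags):
btrfs physical addresses are at least 4K-aligned, so the low bits of the
bytenr are free to carry the block's offset within its compressed
extent, counted in 4K units:

```python
OFFSET_BITS = 6
OFFSET_MASK = (1 << OFFSET_BITS) - 1   # bottom 6 bits

def encode(bytenr, offset):
    """Pack a 4K-aligned physical address and a 4K-aligned byte offset
    (below 64 * 4K, enough for a 128K compressed extent) into one int."""
    assert bytenr & 0xFFF == 0 and offset & 0xFFF == 0
    blocks = offset >> 12              # offset in 4K blocks
    assert blocks <= OFFSET_MASK
    return bytenr | blocks

def decode(addr):
    """Recover (bytenr, offset); lossless because a 4K-aligned bytenr
    already has zeros in its bottom 12 bits."""
    return addr & ~OFFSET_MASK, (addr & OFFSET_MASK) << 12
```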

The raw btrfs extent data is extracted here:

https://github.com/Zygo/bees/blob/master/lib/extentwalker.cc#L533

BeesAddress gives no false positives, but it's built on top of hundreds
of lines of userspace support code.  :-/

> Thank you very much for the detailed explanation !




Re: Increased disk usage after deduplication and system running out of memory

2016-11-24 Thread Zygo Blaxell
On Thu, Nov 24, 2016 at 03:00:26PM +0100, Niccolò Belli wrote:
> Hi,
> I use snapper, so I have plenty of snapshots in my btrfs partition and most
> of my data is already deduplicated because of that.
> Since long time ago I run offline defragmentation once (because I didn't
> know extents get unshared) I wanted to run offline deduplication to free a
> couple of GBs.
> 
> This is the script I use to stop snapper, set snapshots to rw, balance,
> deduplicate, etc: https://paste.pound-python.org/show/vPUGVNjPQbDvr4HbtMgs/
> 
> $ cat after_balance Overall:
>Device size: 152.36GiB
> Device allocated:136.00GiB
> Device unallocated:   16.35GiB
> Device missing:  0.00B
> Used:133.97GiB
> Free (estimated): 17.17GiB  (min: 17.17GiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:  239.94MiB  (used: 0.00B)
> Data,single: Size:133.00GiB, Used:132.18GiB
> /dev/mapper/cryptroot 133.00GiB
> Metadata,single: Size:3.00GiB, Used:1.79GiB
> /dev/mapper/cryptroot   3.00GiB
> System,single: Size:3.00MiB, Used:16.00KiB
> /dev/mapper/cryptroot   3.00MiB
> Unallocated:
> /dev/mapper/cryptroot  16.35GiB
> 
> 
> $ cat after_duperemove_and_balance
> Overall:
> Device size: 152.36GiB
> Device allocated:136.03GiB
> Device unallocated:   16.33GiB
> Device missing:  0.00B
> Used:133.81GiB
> Free (estimated): 16.55GiB  (min: 16.55GiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:  512.00MiB  (used: 0.00B)
> 
> Data,single: Size:127.00GiB, Used:126.77GiB
>   /dev/mapper/cryptroot 127.00GiB
> 
> Metadata,single: Size:9.00GiB, Used:7.03GiB
>   /dev/mapper/cryptroot   9.00GiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB
>   /dev/mapper/cryptroot  32.00MiB
> 
> Unallocated:
>   /dev/mapper/cryptroot  16.33GiB
> 
> 
> As you can see it freed 5.41 GB of data, but it also added 5.24 GB of
> metadata. The estimated free space is now 16.55 GB, while before the
> deduplication it was higher: 17.17 GB.
> 
> This is when running duperemove git with noblock, but almost nothing changes
> if I omitt it (it defaults to block).
> Why did my metadata increase by a 4x factor? 99% of my data already had
> shared extents because of snapshots, so why such a huge increase?

Sharing by snapshot is different from sharing by dedup.

For snapshots, a new tree node is introduced which shares the entire
rest of the tree.  So you get:

Root 123 -\   /--- Node 85 --- data 84
   >- Node 87 ---<
Root 124 -/   \--- Node 43 --- data 42

This means there's 16K of metadata (actually probably more, but small
nonetheless) that is sharing the entire subvol.

For dedup, each shared data extent is shared individually, and metadata
is not shared at all:

Root 123 -\   /--- Node 85 --- data 84 (shared)
   \- Node 87 ---<
  \--- Node 43 --- data 42 (shared)

  /--- Node 129 --- data 84 (shared)
Root 124 --- Node 131 ---<
  \--- Node 126 --- data 42 (shared)

If you dedup over a set of snapshots, it eventually unshares the metadata.
The data is still shared, but _only_ the data, so it multiplies the
metadata size by the number of snapshots.  It's even worse if you have
dup metadata since the cost of each new metadata page is doubled.
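
A back-of-envelope model of that blow-up (rough, ignoring node fan-out
and partial sharing): dedup across N snapshots multiplies the formerly
shared metadata by N, and dup metadata doubles it again.  The function
below is just that arithmetic:

```python
def metadata_after_dedup(shared_meta_gib, snapshots, dup_profile=False):
    """Rough estimate of metadata size after dedup has unshared the
    pages previously shared by `snapshots` snapshots."""
    unshared = shared_meta_gib * snapshots
    return unshared * 2 if dup_profile else unshared
```

If the effective unsharing factor here were around four, the model lands
near the 1.79 GiB -> 7.03 GiB growth reported above.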

> Deduplication didn't finish up to 100%, because duperemove got killed by OOM
> killer at 99%: https://paste.pound-python.org/show/yUcIOSzXcrfNPkF9rV2L/
> 
> As you can see from dmesg
> (https://paste.pound-python.org/show/eZIkpxUU6QR9ij6Rn1Oq/) there is no
> process stealing so much memory (my system has 8GB): the biggest one takes
> as much as 700MB of vm.
> 
> Another strange thing that you can see from the previous log is that it
> tries to deduplicate /home/niko/nosnap/rootfs/@images/fedora25.qcow2 which
> is a UNIQUE file. Such image is stored in a separate subvolume because I
> don't want it to be snapshotted, so I'm pretty sure there are no other
> copies of this image, but still it tries to deduplicate it.
> 
> Niccolò Belli




Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe

2016-11-23 Thread Zygo Blaxell
On Tue, Nov 22, 2016 at 06:44:19PM -0800, Darrick J. Wong wrote:
> On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> > >implicit EOF gets around this in the existing XFS implementation. I
> > >copied this for the Btrfs implementation.
> > 
> > Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> > or drop that 16MB limit?
> > 
> > The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> > behavior for a 128MB extent is to generate 8x16MB shared extent references
> > with different extent offsets to a single 128MB physical extent.
> > These references no longer look like the original 128MB extent to a
> > userspace dedup tool.  That raises the difficulty level substantially
> > for a userspace dedup tool when it tries to figure out which extents to
> > keep and which to discard or rewrite.
> > 
> > XFS may not have this problem--I haven't checked.  On btrfs it's
> > definitely not as simple as "bang two inode/offset/length pairs together
> > with dedup and disk space will be freed automagically."  If dedup is
> > done incorrectly on btrfs, it can end up just making the filesystem slow
> > without freeing any space.
> 
> I copied the 16M limit into xfs/ocfs2 because btrfs had it. :)

Finally, a clearly stated rationale.  ;)

> The VFS now limits the size of the incoming struct file_dedupe_range to
> whatever a page size is.  On x86 that only allows us 126 dedupe
> candidates, which means that a single call can perform up to ~2GB of IO.
> Storage is getting faster, but 2GB is still a fair amount for a single
> call.  Of course in XFS we do the dedupe one file and one page at a time
> to keep the memory footprint sane.
> 
> On ppc64 with its huge 64k pages that comes out to 32GB of IO.
> 
> One thing we (speaking for XFS, anyway) /could/ do is limit based on the
> theoretical IO count instead of clamping the length, e.g.
> 
> if ((u64)dest_count * len >= (1ULL << 31))
>   return -E2BIG;
> 
> That way you /could/ specify a larger extent size if you pass in fewer
> file descriptors.  OTOH XFS will merge all the records together, so even
> if you deduped the whole 128M in 4k chunks you'll still end up with a
> single block mapping record and a single backref.

This is why I'm mystified that XFS has this limitation.  On btrfs there
were at least _reasons_ for it, even if they were just "we have a v0.3
implementation and nobody's even started optimizing it yet."

The btrfs code calls kzalloc (with size limited by MAX_DEDUPE_LEN) in
the context of the thread executing the ioctl.  It then loads up all the
pages, compares them, then decides whether to continue with clone_range
for the whole extent, or not.  btrfs doesn't seem to ever merge these.

> Were I starting from scratch I'd probably just dump the existing dedupe
> interface in favor of a non-vectorized dedupe_range call taking the same
> parameters as clone_range:
> 
> int dedupe_range(src_fd, src_off, dest_fd, dest_off);
> 
> I'd also change the semantics to "Find and share all identical blocks in
> this subrange.  Differing blocks are left alone." because it seems silly
> that duperemove can issue large requests but a single byte difference in
> the middle causes info->status to be set to FILE_DEDUPE_RANGE_DIFFERS
> and info->bytes_deduped only changes if the entire range was deduped.

It'd also be nice if it replaced all existing shared refs to the dst
blocks at the same time.  On btrfs, dedup agents have to find all the
shared refs (either through brute force or by using LOGICAL_INO to look
them all up through backrefs) and feed each one into extent_same until
the last reference to dst is removed.  But maybe this is only needed to
work around a btrfs thing that never happens on XFS... :-P
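
Sketched in outline, with the LOGICAL_INO lookup and the extent_same
call injected as callables (`lookup_refs` and `extent_same` are
hypothetical stand-ins for the real ioctl plumbing, not actual APIs):

```python
def dedupe_all_refs(lookup_refs, extent_same, dst_bytenr, src, src_off, length):
    """Feed every reference to the physical extent at dst_bytenr into
    extent_same against (src, src_off), so the last reference to the
    dst extent can eventually be dropped.

    lookup_refs(bytenr) -> [(inode, offset), ...] referencing bytenr
    extent_same(src, src_off, inode, offset, length) -> True on success
    """
    replaced = 0
    for inode, offset in lookup_refs(dst_bytenr):
        if (inode, offset) == (src, src_off):
            continue                  # don't dedupe the source onto itself
        if extent_same(src, src_off, inode, offset, length):
            replaced += 1
    return replaced
```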

> > The 16MB limit doesn't seem to be useful in practice.  The two useful
> > effects of the limit seem to be DoS mitigation.  There is no checking of
> > the RAM usage that I can find (i.e. if you fire off 16 dedup threads,
> > they want 256MB of RAM; put another way, if you want to tie up 16GB of
> > kernel RAM, all you have to do is create 1024 dedup threads), so it's
> > not an effective DoS mitigation feature.  Internally dedup could verify
> > blocks in batches of 16MB and check for signals/release and reacquire
> > locks in between, so it wouldn't tie up the kernel or the two inodes
> > for excessively long periods.
> 
> (Does btrfs actually do the extent_same stuff in parallel??)

A btrfs dedup agent can invoke multiple extent_sames 

bees v0.1 - Best-Effort Extent-Same, a btrfs deduplication daemon

2016-11-23 Thread Zygo Blaxell
I made a thing!

Bees ("Best-Effort Extent-Same") is a dedup daemon for btrfs.

Bees is a block-oriented userspace dedup designed to avoid scalability
problems on large filesystems.

Bees is designed to degrade gracefully when underprovisioned with RAM.
Bees does not use more RAM or storage as filesystem data size increases.
The dedup hash table size is fixed at creation time and does not change.
The effective dedup block size is dynamic and adjusts automatically to
fit the hash table into the configured RAM limit.  Hash table overflow
is deliberately not implemented, which avoids the IO overhead that
overflow handling would add.
Hash table entries are only 16 bytes per dedup block to keep the average
dedup block size small.
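
The "effective dedup block size adjusts to fit RAM" property is just
arithmetic over the 16-byte entry size:

```python
ENTRY_BYTES = 16   # per hash table entry, as stated above

def effective_block_size(data_bytes, hash_table_bytes):
    """Average data bytes each hash table entry must cover: the table
    is fixed, so the effective granularity grows with the data."""
    entries = hash_table_bytes // ENTRY_BYTES
    return data_bytes // entries

# e.g. 1 TiB of data against a 1 GiB table: 64Mi entries, 16 KiB each
```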

Bees does not require alignment between dedup blocks or extent boundaries
(i.e. it can handle any multiple-of-4K offset between dup block pairs).
Bees rearranges blocks into shared and unique extents if required to
work within current btrfs kernel dedup limitations.

Bees can dedup any combination of compressed and uncompressed extents.

Bees operates in a single pass which removes duplicate extents immediately
during scan.  There are no separate scanning and dedup phases.

Bees uses only data-safe btrfs kernel operations, so it can dedup live
data (e.g. build servers, sqlite databases, VM disk images).  It does
not modify file attributes or timestamps.

Bees does not store any information about filesystem structure, so it is
not affected by the number or size of files (except to the extent that
these cause performance problems for btrfs in general).  It retrieves such
information on demand through btrfs SEARCH_V2 and LOGICAL_INO ioctls.
This eliminates the storage required to maintain the equivalents of
these functions in userspace.  It's also why bees has no XFS support.

Bees is a daemon designed to run continuously and maintain its state
across crashes and reboots.  Bees uses checkpoints for persistence to
eliminate the IO overhead of a transactional data store.  On restart,
bees will dedup any data that was added to the filesystem since the
last checkpoint.

I use bees to dedup filesystems ranging in size from 16GB to 35TB, with
hash tables ranging in size from 128MB to 11GB.  It's well past time
for a v0.1 release, so here it is!

Bees is available on Github:

https://github.com/Zygo/bees

Please enjoy this code.


signature.asc
Description: Digital signature


Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe

2016-11-23 Thread Zygo Blaxell
On Thu, Nov 24, 2016 at 09:13:28AM +1100, Dave Chinner wrote:
> On Wed, Nov 23, 2016 at 08:55:59AM -0500, Zygo Blaxell wrote:
> > On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote:
> > > On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> > > > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > > > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> > > > >implicit EOF gets around this in the existing XFS implementation. I
> > > > >copied this for the Btrfs implementation.
> > > > 
> > > > Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> > > > or drop that 16MB limit?
> > > > 
> > > > The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> > > > behavior for a 128MB extent is to generate 8x16MB shared extent references
> > > > with different extent offsets to a single 128MB physical extent.
> > > > These references no longer look like the original 128MB extent to a
> > > > userspace dedup tool.  That raises the difficulty level substantially
> > > > for a userspace dedup tool when it tries to figure out which extents to
> > > > keep and which to discard or rewrite.
> > > 
> > > That, IMO, is a btrfs design/implementation problem, not a problem
> > > with the API. Applications are always going to end up doing things
> > > that aren't perfectly aligned to extent boundaries or sizes
> > > regardless of the size limit that is placed on the dedupe ranges.
> > 
> > Given that XFS doesn't have all the problems btrfs does, why does XFS
have the same arbitrary size limit?  Especially since XFS demonstrably
> > doesn't need it?
> 
> Creating a new-but-slightly-incompatible API just for XFS makes no
> sense - we have multiple filesystems that support this functionality
> and so they all should use the same APIs and present (as far as is
> possible) the same behaviour to userspace.

OK.  Let's just remove the limit on all the filesystems then.
XFS doesn't need it, and btrfs can be fixed.

> IOWs it's more important to use existing APIs than to invent a new
> one that does almost the same thing. This way userspace applications
> don't need to be changed to support new XFS functionality and we
> make life easier for everyone. 

Except removing the limit doesn't work that way.  An application that
didn't impose an undocumented limit on itself wouldn't break when moved
to a filesystem that imposed no such limit, i.e. if XFS had no limit,
an application that moved from btrfs to XFS would just work.

> A shiny new API without warts would
> be nice, but we've already got to support the existing one forever,
> it does the job we need and so it's less burden on everyone if we
> just use it as is.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 




Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe

2016-11-23 Thread Zygo Blaxell
On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote:
> On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> > >implicit EOF gets around this in the existing XFS implementation. I
> > >copied this for the Btrfs implementation.
> > 
> > Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> > or drop that 16MB limit?
> > 
> > The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> > behavior for a 128MB extent is to generate 8x16MB shared extent references
> > with different extent offsets to a single 128MB physical extent.
> > These references no longer look like the original 128MB extent to a
> > userspace dedup tool.  That raises the difficulty level substantially
> > for a userspace dedup tool when it tries to figure out which extents to
> > keep and which to discard or rewrite.
> 
> That, IMO, is a btrfs design/implementation problem, not a problem
> with the API. Applications are always going to end up doing things
> that aren't perfectly aligned to extent boundaries or sizes
> regardless of the size limit that is placed on the dedupe ranges.

Given that XFS doesn't have all the problems btrfs does, why does XFS
have the same arbitrary size limit?  Especially since XFS demonstrably
doesn't need it?

> > XFS may not have this problem--I haven't checked.
> 
> It doesn't - it tracks shared blocks exactly and merges adjacent
> extent records whenever possible.
> 
> > Even if we want to keep the 16MB limit, there's also no way to query the
> > kernel from userspace to find out what the limit is, other than by trial
> > and error.  It's not even in a header file, userspace just has to *know*.
> 
> So add a define to the API to make it visible to applications and
> document it in the man page.

To answer some of my own questions on the btrfs side:  It looks like
the btrfs implementation does have a reason for it (fixed-size arrays).

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 




Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe

2016-11-22 Thread Zygo Blaxell
On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
>implicit EOF gets around this in the existing XFS implementation. I
>copied this for the Btrfs implementation.

Somewhat tangential to this patch, but on the dedup topic:  Can we raise
or drop that 16MB limit?

The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
behavior for a 128MB extent is to generate 8x16MB shared extent references
with different extent offsets to a single 128MB physical extent.
These references no longer look like the original 128MB extent to a
userspace dedup tool.  That raises the difficulty level substantially
for a userspace dedup tool when it tries to figure out which extents to
keep and which to discard or rewrite.

XFS may not have this problem--I haven't checked.  On btrfs it's
definitely not as simple as "bang two inode/offset/length pairs together
with dedup and disk space will be freed automagically."  If dedup is
done incorrectly on btrfs, it can end up just making the filesystem slow
without freeing any space.

The 16MB limit doesn't seem to be useful in practice.  The only useful
effect of the limit seems to be DoS mitigation.  There is no checking of
the RAM usage that I can find (i.e. if you fire off 16 dedup threads,
they want 256MB of RAM; put another way, if you want to tie up 16GB of
kernel RAM, all you have to do is create 1024 dedup threads), so it's
not an effective DoS mitigation feature.  Internally dedup could verify
blocks in batches of 16MB and check for signals/release and reacquire
locks in between, so it wouldn't tie up the kernel or the two inodes
for excessively long periods.

Even if we want to keep the 16MB limit, there's also no way to query the
kernel from userspace to find out what the limit is, other than by trial
and error.  It's not even in a header file, userspace just has to *know*.
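To illustrate what the cap means for a userspace tool: any dedupe request
longer than 16MB has to be sliced into multiple calls, which is exactly how
a 128MB extent ends up as 8 separate shared references.  This is a
hypothetical sketch of that slicing, not btrfs or kernel code:

```python
CAP = 16 * 2**20  # the undocumented per-call dedupe limit discussed above

def dedupe_batches(offset, length, cap=CAP):
    """Slice one logical dedupe request into kernel-acceptable calls."""
    while length > 0:
        step = min(length, cap)
        yield offset, step
        offset += step
        length -= step

# A maximum-size 128 MiB btrfs extent takes 8 calls, producing
# 8 distinct shared extent references to one physical extent.
batches = list(dedupe_batches(0, 128 * 2**20))
print(len(batches))  # -> 8
```

Each of those 8 calls carries its own extent offset, which is why the
result no longer looks like the original single extent to a later scan.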





Re: [RFC] btrfs: make max inline data can be equal to sectorsize

2016-11-19 Thread Zygo Blaxell
On Fri, Nov 18, 2016 at 03:58:06PM -0500, Chris Mason wrote:
> 
> 
> On 11/16/2016 11:10 AM, David Sterba wrote:
> >On Mon, Nov 14, 2016 at 09:55:34AM +0800, Qu Wenruo wrote:
> >>At 11/12/2016 04:22 AM, Liu Bo wrote:
> >>>On Tue, Oct 11, 2016 at 02:47:42PM +0800, Wang Xiaoguang wrote:
> If we use mount option "-o max_inline=sectorsize", say 4096, indeed
> even for a fresh fs, say nodesize is 16k, we can not make the first
> 4k data completely inline, I found this condition causing this issue:
>   !compressed_size && (actual_end & (root->sectorsize - 1)) == 0
> 
> If it returns true, we'll not make data inline. For 4k sectorsize, a
> 0~4094 data range can be made inline, but 0~4095 can not.
> I don't think this limitation is useful, so remove it here, which will
> allow max inline data to be equal to sectorsize.
> >>>
> >>>It's difficult to tell whether we need this, I'm not a big fan of using
> >>>max_inline size more than the default size 2048, given that most reports
> >>>about ENOSPC is due to metadata and inline may make it worse.
> >>
> >>IMHO if we can use inline data extents to trigger ENOSPC more easily,
> >>then we should allow it to dig the problem further.
> >>
> >>Just ignoring it because it may cause more bug will not solve the real
> >>problem anyway.
> >
> >Not allowing the full 4k value as max_inline looks artificial to me.
> >We've removed other similar limitation in the past so I'd tend to agree
> >to do the same here. There's no significant use for it as far as I can
> >tell, if you want to exhaust metadata, the difference to max_inline=4095
> >would be really tiny in the end. So, I'm okay with merging it. If
> >anybody feels like adding his reviewed-by, please do so.
> 
> The check is there because in practice it doesn't make sense to inline an
> extent if it fits perfectly in a data block.  You could argue its saving
> seeks, but we're also adding seeks by spreading out the metadata in general.
> So, I'd want to see benchmarks before deciding.

Does that limit kick in before or after compression?  A compressed extent
could easily have 4096 bytes of data in 200 bytes.  If a filesystem
contained a whole lot of exactly-4096-byte compressible files that extra
byte might be worth something.

> If we're using it for debugging, I'd rather stick with max_inline=4095.
> 
> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: RFC: raid with a variable stripe size

2016-11-19 Thread Zygo Blaxell
On Fri, Nov 18, 2016 at 07:15:12PM +0100, Goffredo Baroncelli wrote:
> Hello,
>
> these are only my thoughts; no code here, but I would like to share
> it hoping that it could be useful.
>
> As reported several times by Zygo (and others), one of the problem of
> raid5/6 is the write hole. Today BTRFS is not capable to address it.
>
> The problem is that the stripe size is bigger than the "sector size"
> (ok sector is not the correct word, but I am referring to the basic
> unit of writing on the disk, which is 4k or 16K in btrfs).  So when
> btrfs writes less data than the stripe, the stripe is not filled; when
> it is filled by a subsequent write, a RMW of the parity is required.

The key point in the problem statement is that subsequent writes are
allowed to modify stripes while they contain data.  Proper CoW would
never do that.

Stripes should never contain data from two separate transactions--that
would imply that CoW rules have been violated.

Currently there is no problem for big writes on empty disks because
the data block allocator happens to do the right thing accidentally in
such cases.  It's only when the allocator allocates new data to partially
filled stripes that the problems occur.

For metadata the allocator currently stumbles into RMW writes so badly
that the difference between the current allocator and the worst possible
allocator is only a few percent.

> On the best of my understanding (which could be very wrong) ZFS try
> to solve this issue using a variable length stripe.

ZFS ties the parity blocks to what btrfs would call extents.  It prevents
multiple writes to the same RAID stripe in different transactions by
dynamically defining the RAID stripe boundaries *around* the write
boundaries.  This is very different from btrfs's current on-disk
structure.

e.g. if we were to write:

extent D, 7 blocks
extent E, 3 blocks
extent F, 9 blocks

the disk in btrfs looks something like:

D1 D2 D3 D4 P1
D5 D6 D7 P2 E1
E2 E3 P3 F1 F2
F3 P4 F4 F5 F6
P5 F7 F8 F9 xx

P1 = parity(D1..D4)
P2 = parity(D5..D7, E1)
P3 = parity(E2, E3, F1, F2)
P4 = parity(F3..F6)
P5 = parity(F7..F9)

If D, E, and F were written in different transactions, it could make P2
and P3 invalid.

The disk in ZFS looks something like:

D1 D2 D3 D4 P1
D5 D6 D7 P2 E1
E2 E3 P3 F1 F2
F3 F4 P4 F5 F6
F7 F8 P5 F9 P6

where:

P1 is parity(D1..D4)
P2 is parity(D5..D7)
P3 is parity(E1..E3)
P4 is parity(F1..F4)
P5 is parity(F5..F8)
P6 is parity(F9)

Each parity value contains only data from one extent, which makes it
impossible for any P block to contain data from different transactions.
Every extent is striped across a potentially different number of disks,
so it's less efficient than "pure" raid5 would be with the same quantity
of data.

This would require pushing the parity allocation all the way up into
the extent layer in btrfs, which would be a massive change that could
introduce regressions into all the other RAID levels; on the other hand,
if it was pushed up to that level, it would be possible to checksum the
parity blocks...
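A toy model of the two layouts above makes the space cost concrete
(assumptions: 4 data disks and the extent sizes from the example; this is
arithmetic only, not either filesystem's actual allocator):

```python
import math

DATA_DISKS = 4       # 5-disk RAID5: 4 data blocks + 1 parity per stripe
extents = [7, 3, 9]  # extents D, E, F from the example, in blocks

# btrfs-style: stripes fill across extent boundaries, one parity per
# full stripe of data.
btrfs_parity = math.ceil(sum(extents) / DATA_DISKS)

# ZFS-style: parity groups never span extents, so each extent pays
# ceil(blocks / data_disks) parity blocks on its own.
zfs_parity = sum(math.ceil(n / DATA_DISKS) for n in extents)

print(btrfs_parity, zfs_parity)  # -> 5 6  (P1..P5 vs P1..P6 above)
```

The extra parity block is the efficiency ZFS trades away to guarantee that
no parity group ever mixes data from two transactions.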

> On BTRFS this could be achieved using several BGs (== block group or
> chunk), one for each stripe size.

Actually it's one per *possibly* failed disk (N^2 - N disks for RAID6).
Block groups are composed of *specific* disks...

> For example, if a filesystem - RAID5 is composed by 4 DISK, the
> filesystem should have three BGs: BG #1,composed by two disks (1
> data+ 1 parity) BG #2 composed by three disks (2 data + 1 parity)
> BG #3 composed by four disks (3 data + 1 parity).

...i.e. you'd need block groups for disks ABCD, ABC, ABD, ACD, and BCD.

Btrfs doesn't allocate block groups that way anyway.  A much simpler
version of this is to make two changes:

1.  Identify when disks go offline and mark block groups touching
these disks as 'degraded'.  Currently this only happens at mount
time, so the btrfs change would be to add the detection of state
transition at the instant when a disk fails.

2.  When a block group is degraded (i.e. some of its disks are
missing), mark it strictly read-only and disable nodatacow.

Btrfs can already do #2 when balancing.  I've used this capability to
repair broken raid5 arrays.  Currently btrfs does *not* do this for
ordinary data writes, and that's the required change.

The trade-off for this approach is that if you didn't have any unallocated
space when a disk failed, you'll get ENOSPC for everything, because
there's no disk you could be allocating new metadata pages on.  That
makes it hard to add or replace disks.

> If the data to be written has a size of 4k, it will be allocated to
> the BG #1.  If the data to be written has a size of 8k, it will be
> allocated to the BG #2 If the data to be written has a size of 12k,
> it will be 

Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-18 Thread Zygo Blaxell
On Fri, Nov 18, 2016 at 07:09:34PM +0100, Goffredo Baroncelli wrote:
> Hi Zygo
> On 2016-11-18 00:13, Zygo Blaxell wrote:
> > On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> >> Fix the so-called famous RAID5/6 scrub error.
> >>
> >> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> >> sight.
> >> (Yes, without the Phoronix report on this,
> >> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> >> I won't ever be aware of it)
> > 
> > If you're hearing about btrfs RAID5 bugs for the first time through
> > Phoronix, then your testing coverage is *clearly* inadequate.
> > 
> > Fill up a RAID5 array, start a FS stress test, pull a drive out while
> > that's running, let the FS stress test run for another hour, then try
> > to replace or delete the missing device.  If there are any crashes,
> > corruptions, or EIO during any part of this process (assuming all the
> > remaining disks are healthy), then btrfs RAID5 is still broken, and
> > you've found another bug to fix.
> > 
> > The fact that so many problems in btrfs can still be found this way
> > indicates to me that nobody is doing this basic level of testing
> > (or if they are, they're not doing anything about the results).
> 
> [...]
> 
> Sorry but I don't find useful this kind of discussion.  Yes BTRFS
> RAID5/6 needs a lot of care. Yes, *our* test coverage is far from
> complete; but this is not the fault of a single person; and Qu tried to
> solve one issue and for this we should say only thanks..
>
> Even if you don't find the work of Qu (and my little one :-) )
> valuable, it required some time and needs to be respected.

I do find this work valuable, and I do thank you and Qu for it.
I've been following it with great interest because I haven't had time
to dive into it myself.  It's a use case I used before and would like
to use again.

Most of my recent frustration, if directed at anyone, is really directed
at Phoronix for conflating "one bug was fixed" with "ready for production
use today," and I wanted to ensure that the latter rumor was promptly
quashed.

This is why I'm excited about Qu's work:  on my list of 7 btrfs-raid5
recovery bugs (6 I found plus yours), Qu has fixed at least 2 of them,
maybe as many as 4, with the patches so far.  I can fix 2 of the others,
for a total of 6 fixed out of 7.

Specifically, the 7 bugs I know of are:

1-2. BUG_ONs in functions that should return errors (I had
fixed both already when trying to recover my broken arrays)

3. scrub can't identify which drives or files are corrupted
(Qu might have fixed this--I won't know until I do testing)

4-6. symptom groups related to wrong data or EIO in scrub
recovery, including Goffredo's (Qu might have fixed all of these,
but from a quick read of the patch I think at least two are done).

7. the write hole.

I'll know more after I've had a chance to run Qu's patches through
testing, which I intend to do at some point.

Optimistically, this means there could be only *one* bug remaining
in the critical path for btrfs RAID56 single disk failure recovery.
That last bug is the write hole, which is why I keep going on about it.
It's the only bug I know exists in btrfs RAID56 that has neither an
existing fix nor any evidence of someone actively working on it, even
at the design proposal stage.  Please, I'd love to be wrong about this.

When I described the situation recently as "a thin layer of bugs on
top of a design defect", I was not trying to be mean.  I was trying to
describe the situation *precisely*.

The thin layer of bugs is much thinner thanks to Qu's work, and thanks
in part to his work, I now have confidence that further investment in
this area won't be wasted.

> Finally, I don't think that we should compare the RAID-hole with this
> kind of bug(fix). The former is a design issue, the latter is a bug
> related of one of the basic feature of the raid system (recover from
> the lost of a disk/corruption).
>
> Even the MD subsystem (which is far behind btrfs) had tolerated
> the raid-hole until last year. 

My frustration against this point is the attitude that mdadm was ever
good enough, much less a model to emulate in the future.  It's 2016--there
have been some advancements in the state of the art since the IBM patent
describing RAID5 30 years ago, yet in the btrfs world, we seem to insist
on repeating all the same mistakes in the same order.

"We're as good as some existing broken-by-design thing" isn't a really
useful attitude.  We should aspire to do *better* than the existing
broken-by-design things.  If we didn't, we wouldn't be here, we'd all
be lurking on some other list, running ext4 

Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Zygo Blaxell
On Fri, Nov 18, 2016 at 10:42:23AM +0800, Qu Wenruo wrote:
> 
> 
> At 11/18/2016 09:56 AM, Hugo Mills wrote:
> >On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote:
> >>
> >>
> >>At 11/18/2016 07:13 AM, Zygo Blaxell wrote:
> >>>On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> >>>>Fix the so-called famous RAID5/6 scrub error.
> >>>>
> >>>>Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> >>>>sight.
> >>>>(Yes, without the Phoronix report on this,
> >>>>https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> >>>>I won't ever be aware of it)
> >>>
> >>>If you're hearing about btrfs RAID5 bugs for the first time through
> >>>Phoronix, then your testing coverage is *clearly* inadequate.
> >>
> >>I'm not fixing everything, I'm just focusing on the exact one bug
> >>reported by Goffredo Baroncelli.
> >>
> >>Although it seems that, the bug reported by him is in fact two bugs.
> >>One is race condition I'm fixing, another one is that recovery is
> >>recovering data correctly, but screwing up parity.
> >>
> >>I just don't understand why you always want to fix everything in one step.
> >
> >   Fix the important, fundamental things first, and the others
> >later. This, from my understanding of Zygo's comments, appears to be
> >one of the others.
> >
> >   It's papering over the missing bricks in the wall instead of
> >chipping out the mortar and putting new bricks in. It may need to be
> >fixed, but it's not the fundamental "OMG, everything's totally broken"
> >problem. If anything, it's only a serious problem *because* the other
> >thing (write hole) is still there.
> >
> >   It just seems like a piece of mis-prioritised effort.
> 
> It seems that, we have different standards on the priority.

My concern isn't priority.  Easier bugs often get fixed first.  That's
just the way Linux development works.

I am very concerned by articles like this:

http://phoronix.com/scan.php?page=news_item=Btrfs-RAID5-RAID6-Fixed

with headlines like "btrfs RAID5/RAID6 support is finally fixed" when
that's very much not the case.  Only one bug has been removed for the
key use case that makes RAID5 interesting, and it's just the first of
many that still remain in the path of a user trying to recover from a
normal disk failure.

Admittedly this is Michael's (Phoronix's) problem more than Qu's, but
it's important to always be clear and _complete_ when stating bug status
because people quote statements out of context.  When the article quoted
the text

"it's not a timed bomb buried deeply into the RAID5/6 code,
but a race condition in scrub recovery code"

the commenters on Phoronix are clearly interpreting this to mean "famous
RAID5/6 scrub error" had been fixed *and* the issue reported by Goffredo
was the time bomb issue.  It's more accurate to say something like

"Goffredo's issue is not the time bomb buried deeply in the
RAID5/6 code, but a separate issue caused by a race condition
in scrub recovery code"

Reading the Phoronix article, one might imagine RAID5 is now working
as well as RAID1 on btrfs.  To be clear, it's not--although the gap
is now significantly narrower.

> For me, if some function on the very basic/minimal environment can't work
> reliably, then it's a high priority bug.
> 
> In this case, in a very minimal setup with only 128K of data spread over
> a 3-device RAID5, with one data stripe fully corrupted and nothing else
> interfering,
> scrub can't return the correct csum error count and can even cause a
> false unrecoverable error, so it's a high priority thing.

> If the problem involves too many steps like removing devices, degraded mode,
> fsstress and some time. Then it's not that priority unless one pin-downs the
> root case to, for example, degraded mode itself with special sequenced
> operations.

There are multiple bugs in the stress + remove device case.  Some are
quite easy to isolate.  They range in difficulty from simple BUG_ON
instead of error returns to finally solving the RMW update problem.

Run the test, choose any of the bugs that occur to work on, repeat until
the test stops finding new bugs for a while.  There are currently several
bugs to choose from with various levels of difficulty to fix them, and you
should hit the first level of bugs in a matter of hours if not minutes.

Using this method, you would have discovered Goffredo's bug years ago.
Instead, you only discovered it after Phoronix quoted the conclusion
of an investigation that started because of pro

Re: [PATCH 0/2] RAID5/6 scrub race fix

2016-11-17 Thread Zygo Blaxell
On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote:
> Fix the so-called famous RAID5/6 scrub error.
> 
> Thanks Goffredo Baroncelli for reporting the bug, and make it into our
> sight.
> (Yes, without the Phoronix report on this,
> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad,
> I won't ever be aware of it)

If you're hearing about btrfs RAID5 bugs for the first time through
Phoronix, then your testing coverage is *clearly* inadequate.

Fill up a RAID5 array, start a FS stress test, pull a drive out while
that's running, let the FS stress test run for another hour, then try
to replace or delete the missing device.  If there are any crashes,
corruptions, or EIO during any part of this process (assuming all the
remaining disks are healthy), then btrfs RAID5 is still broken, and
you've found another bug to fix.

The fact that so many problems in btrfs can still be found this way
indicates to me that nobody is doing this basic level of testing
(or if they are, they're not doing anything about the results).

> Unlike many of us(including myself) assumed, it's not a timed bomb buried
> deeply into the RAID5/6 code, but a race condition in scrub recovery
> code.

I don't see how this patch fixes the write hole issue at the core of
btrfs RAID56.  It just makes the thin layer of bugs over that issue a
little thinner.  There's still the metadata RMW update timebomb at the
bottom of the bug pile that can't be fixed by scrub (the filesystem is
unrecoverably damaged when the bomb goes off, so scrub isn't possible).

> The problem is not found because normal mirror based profiles aren't
> affected by the race, since they are independent with each other.

True.

> Although this time the fix doesn't affect the scrub code much, it should
> warn us that current scrub code is really hard to maintain.

This last sentence is true.  I found and fixed three BUG_ONs in RAID5
code on the first day I started testing in degraded mode, then hit
the scrub code and had to give up.  It was like a brick wall made out
of mismatched assumptions and layering inversions, using uninitialized
kernel data as mortar (though I suppose the "uninitialized" data symptom
might just have been an unprotected memory access).

> Abuse of workqueue to delay work items and the full-fs scrub is race prone.
> 
> Xfstests will follow a little later, as we don't have good enough tools
> to corrupt specific data stripes precisely.
> 
> Qu Wenruo (2):
>   btrfs: scrub: Introduce full stripe lock for RAID56
>   btrfs: scrub: Fix RAID56 recovery race condition
> 
>  fs/btrfs/ctree.h   |   4 ++
>  fs/btrfs/extent-tree.c |   3 +
>  fs/btrfs/scrub.c   | 192 +
>  3 files changed, 199 insertions(+)
> 
> -- 
> 2.10.2
> 
> 
> 




Re: Announcing btrfs-dedupe

2016-11-16 Thread Zygo Blaxell
On Wed, Nov 16, 2016 at 11:24:33PM +0100, Niccolò Belli wrote:
> On martedì 15 novembre 2016 18:52:01 CET, Zygo Blaxell wrote:
> >Like I said, millions of extents per week...
> >
> >64K is an enormous dedup block size, especially if it comes with a 64K
> >alignment constraint as well.
> >
> >These are the top ten duplicate block sizes from a sample of 95251
> >dedup ops on a medium-sized production server with 4TB of filesystem
> >(about one machine-day of data):
> 
> Which software do you use to dedupe your data? I tried duperemove but it
> gets killed by the OOM killer because it triggers some kind of memory leak:
> https://github.com/markfasheh/duperemove/issues/163

Duperemove does use a lot of memory, but the logs at that URL only show
2G of RAM in duperemove--not nearly enough to trigger OOM under normal
conditions on an 8G machine.  There's another process with 6G of virtual
address space (although much less than that resident) that looks more
interesting (i.e. duperemove might just be the victim of some interaction
between baloo_file and the OOM killer).

On the other hand, the logs also show kernel 4.8.  100% of my test
machines failed to finish booting before they were cut down by OOM on
4.7.x kernels.  The same problem occurs on early kernels in the 4.8.x
series.  I am having good results with 4.8.6 and later, but you should
be aware that significant changes have been made to the way OOM works
in these kernel versions, and maybe you're hitting a regression for your
use case.

> Niccolò Belli




Re: Announcing btrfs-dedupe

2016-11-15 Thread Zygo Blaxell
On Tue, Nov 15, 2016 at 07:26:53AM -0500, Austin S. Hemmelgarn wrote:
> On 2016-11-14 16:10, Zygo Blaxell wrote:
> >Why is deduplicating thousands of blocks of data crazy?  I already
> >deduplicate four orders of magnitude more than that per week.
> You missed the 'tiny' quantifier.  I'm talking really small blocks, on the
> order of less than 64k (so, IOW, stuff that's not much bigger than a few
> filesystem blocks), and that is somewhat crazy because it ends up not only
> taking _really_ long to do compared to larger chunks (because you're running
> more independent hashes than with bigger blocks), but also because it will
> often split extents unnecessarily and contribute to fragmentation, which
> will lead to all kinds of other performance problems on the FS.

Like I said, millions of extents per week...

64K is an enormous dedup block size, especially if it comes with a 64K
alignment constraint as well.

These are the top ten duplicate block sizes from a sample of 95251
dedup ops on a medium-sized production server with 4TB of filesystem
(about one machine-day of data):

total bytes     extent count    dup size
2750808064      20987           131072
803733504       1533            524288
123801600       975             126976
103575552       8429            12288
97443840        793             122880
82051072        10016           8192
77492224        18919           4096
71331840        645             110592
64143360        540             118784
63897600        650             98304

all bytes       all extents     average dup size
6129995776      95251           64356

128K and 512K are the most common sizes due to btrfs compression (it
limits the block size to 128K for compressed extents and seems to limit
uncompressed extents to 512K for some reason).  12K is #4, and 3 of the
top ten sizes are below 16K.  The average size is just a little below 64K.
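The sample's totals hang together: each row's byte count is exactly its
extent count times its block size, and the overall average falls out of the
totals row.  A quick check on the numbers above:

```python
# Spot-check a few rows of the sample: bytes == count * dup block size.
rows = [
    (2750808064, 20987, 131072),  # the dominant 128K (compressed) size
    (803733504, 1533, 524288),    # the 512K uncompressed size
    (123801600, 975, 126976),
]
for total, count, size in rows:
    assert total == count * size

# Average duplicate block size over the whole 95251-op sample.
print(6129995776 // 95251)  # -> 64356, a little under 64 KiB
```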

These are the duplicates with block sizes of 64K and smaller:

total bytes   extent count  extent size
   41615360            635        65536
   46264320            753        61440
   45817856            799        57344
   41267200            775        53248
   45760512            931        49152
   46948352           1042        45056
   43417600           1060        40960
   47296512           1283        36864
   59277312           1809        32768
   49029120           1710        28672
   43745280           1780        24576
   53616640           2618        20480
   43466752           2653        16384
  103575552           8429        12288
   82051072          10016         8192
   77492224          18919         4096

all bytes <=64K   extents <=64K   average dup size <=64K
      870641664           55212                    15769

14% of my duplicate bytes are in blocks smaller than 64K or blocks not
aligned to a 64K boundary within a file.  It's too large a space saving
to ignore on machines that have constrained storage.

It may be worthwhile skipping 4K and 8K dedups--at 250 ms per dedup,
they're 30% of the total run time and only 2.6% of the total dedup bytes.
On the other hand, this machine is already deduping everything fast enough
to keep up with new data, so there's no performance problem to solve here.
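The trade-off arithmetic can be checked directly from the tables above (a quick sketch using the sample figures from this post; nothing here beyond those numbers):

```python
# Sanity-check the 4K/8K trade-off using the sample numbers above.
ops = {4096: 18919, 8192: 10016}          # extent counts for 4K and 8K dedups
small_bytes = 77492224 + 82051072         # duplicate bytes at those two sizes
total_bytes = 6129995776                  # all duplicate bytes in the sample
total_ops = 95251                         # all dedup ops in the sample

small_ops = sum(ops.values())
# At ~250 ms per dedup op, run time is proportional to op count.
time_share = small_ops / total_ops        # share of total run time
byte_share = small_bytes / total_bytes    # share of total dedup bytes
print(f"{time_share:.0%} of run time, {byte_share:.1%} of bytes")
# -> 30% of run time, 2.6% of bytes
```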



Re: Announcing btrfs-dedupe

2016-11-14 Thread Zygo Blaxell
On Mon, Nov 14, 2016 at 09:07:51PM +0100, James Pharaoh wrote:
> On 14/11/16 20:51, Zygo Blaxell wrote:
> >On Mon, Nov 14, 2016 at 01:39:02PM -0500, Austin S. Hemmelgarn wrote:
> >>On 2016-11-14 13:22, James Pharaoh wrote:
> >>>One thing I am keen to understand is if BTRFS will automatically ignore
> >>>a request to deduplicate a file if it is already deduplicated? Given the
> >>>performance I see when doing a repeat deduplication, it seems to me that
> >>>it can't be doing so, although this could be caused by the CPU usage you
> >>>mention above.
> >>
> >>What's happening is that the dedupe ioctl does a byte-wise comparison of the
> >>ranges to make sure they're the same before linking them.  This is actually
> >>what takes most of the time when calling the ioctl, and is part of why it
> >>takes longer the larger the range to deduplicate is.  In essence, it's
> >>behaving like an OS should and not trusting userspace to make reasonable
> >>requests (which is also why there's a separate ioctl to clone a range from
> >>another file instead of deduplicating existing data).
> >
> > - the extent-same ioctl could check to see which extents
> > are referenced by the src and dst ranges, and return success
> > immediately without reading data if they are the same (but
> > userspace should already know this, or it's wasting a huge amount
> > of time before it even calls the kernel).
> 
> Yes, this is what I am talking about. I believe I should be able to read
> data about the BTRFS data structures and determine if this is the case. I
> don't care if there are false matches, due to concurrent updates, but
> there'll be a /lot/ of repeat deduplications unless I do this, because even
> if the file is identical, the mtime etc hasn't changed, and I have a record
> of previously doing a dedupe, there's no guarantee that the file hasn't been
> rewritten in place (eg by rsync), and no way that I know of to reliably
> detect if a file has been changed.
> 
> I am sure there are libraries out there which can look into the data
> structures of a BTRFS file system, I haven't researched this in detail
> though. I imagine that with some kind of lock on a BTRFS root, this could be
> achieved by simply reading the data from the disk, since I believe that
> everything is copy-on-write, so no existing data should be overwritten until
> all roots referring to it are updated. Perhaps I'm missing something
> though...

FIEMAP (VFS) and SEARCH_V2 (btrfs-specific) will both give you access
to the underlying physical block numbers.  SEARCH_V2 is non-trivial
to use without reverse-engineering significant parts of btrfs-progs.
SEARCH_V2 is a generic tree-searching tool which will give you all kinds
of information about btrfs structures...it's essential for a sophisticated
deduplicator and overkill for a simple one.

For full-file dedup using FIEMAP you only need to look at the "physical"
field of the first extent (if it's zero or the same as the other file, the
files cannot be deduplicated or are already deduplicated, respectively).
The source for 'filefrag' (from e2fsprogs) is good for learning how
FIEMAP works.
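For illustration, a minimal Python sketch of that first-extent check, calling FS_IOC_FIEMAP directly (the function name and error handling are mine; filesystems without FIEMAP support raise OSError, and a zero physical address means hole/inline/unallocated):

```python
import fcntl, struct

FS_IOC_FIEMAP = 0xC020660B          # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x00000001       # flush delalloc data before mapping
FIEMAP_EXTENT_INLINE = 0x00000200   # extent data stored inline in metadata

_FIEMAP_HDR = "=QQIIII"    # fm_start, fm_length, fm_flags, fm_mapped_extents, fm_extent_count, fm_reserved
_FIEMAP_EXT = "=QQQQQIIII" # fe_logical, fe_physical, fe_length, 2x reserved64, fe_flags, 3x reserved

def first_extent_physical(path):
    """Return (physical_address, flags) of the file's first extent,
    or None if the file has no mapped extents."""
    hdr = struct.pack(_FIEMAP_HDR, 0, 2**64 - 1, FIEMAP_FLAG_SYNC, 0, 1, 0)
    buf = bytearray(hdr + b"\0" * struct.calcsize(_FIEMAP_EXT))
    with open(path, "rb") as f:
        fcntl.ioctl(f, FS_IOC_FIEMAP, buf)      # kernel fills the buffer in place
    mapped = struct.unpack_from(_FIEMAP_HDR, buf)[3]
    if mapped == 0:
        return None
    logical, physical, length, _, _, flags, _, _, _ = struct.unpack_from(
        _FIEMAP_EXT, buf, struct.calcsize(_FIEMAP_HDR))
    return physical, flags
```

If the returned physical address is zero or FIEMAP_EXTENT_INLINE is set, skip the file; if it equals the other file's, the pair is already deduplicated.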

For block-level dedup you need to look at each extent individually.
That's much slower and full of additional caveats.  If you're going down
that road it's probably better to just improve duperemove instead.

> James




Re: Announcing btrfs-dedupe

2016-11-14 Thread Zygo Blaxell
On Mon, Nov 14, 2016 at 02:56:51PM -0500, Austin S. Hemmelgarn wrote:
> On 2016-11-14 14:51, Zygo Blaxell wrote:
> >Deduplicating an extent that might be concurrently modified during the
> >dedup is a reasonable userspace request.  In the general case there's
> >no way for userspace to ensure that it's not happening.
> I'm not even talking about the locking, I'm talking about the data
> comparison that the ioctl does to ensure they are the same before
> deduplicating them, and specifically that protecting against userspace just
> passing in two random extents that happen to be the same size but not
> contain the same data (because deduplication _should_ reject such a
> situation, that's what the clone ioctl is for).

If I'm deduping a VM image, and the virtual host is writing to said image
(which is likely since an incremental dedup will be intentionally doing
dedup over recently active data sets), the extent I just compared in
userspace might be different by the time the kernel sees it.

This is an important reason why the whole lock/read/compare/replace step
is an atomic operation from userspace's PoV.

The read also saves having to confirm a short/weak hash isn't a collision.
The RAM savings from using weak hashes (~48 bits) are a huge performance
win.
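A sketch of that weak-hash scheme, assuming a candidate table keyed by 48-bit truncated hashes, with a byte-wise compare standing in for the kernel's confirmation step (names and structure are illustrative, not the actual dedup agent's code):

```python
import hashlib

HASH_BITS = 48  # ~6 bytes of key per entry instead of 32 for full SHA-256

def weak_hash(block: bytes) -> int:
    """Truncate a strong hash to 48 bits; collisions are expected and must
    be resolved by comparing the actual data (the dedupe ioctl does this)."""
    return int.from_bytes(hashlib.sha256(block).digest()[:6], "big")

def find_candidates(blocks):
    """Map weak hash -> indices of blocks sharing it (dedup candidates)."""
    table = {}
    for i, blk in enumerate(blocks):
        table.setdefault(weak_hash(blk), []).append(i)
    return {h: idxs for h, idxs in table.items() if len(idxs) > 1}

def confirm(blocks, i, j):
    """Byte-wise compare; a weak-hash match alone is never trusted."""
    return blocks[i] == blocks[j]
```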

The locking overhead is very small compared to the reading overhead,
and (in the absence of bugs) it will only block concurrent writes to the
same offset range in the src/dst inodes (based on a read of the code...I
don't know if there's also an inode-level or backref-level barrier that
expands the locking scope).

I'm not sure the ioctl is well designed for simply throwing random
data at it, especially not entire files (it can't handle files over
16MB anyway).  It will read more data than it has to compared to a
block-by-block comparison from userspace with prefetches or a pair of
IO threads.  If userspace reads both copies of the data just before
issuing the extent-same call, the kernel will read the data from cache
reasonably quickly.
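Driving the ioctl from userspace might look like this (a Python sketch; the helper name is mine, and on kernels of this era the same structure and number go to BTRFS_IOC_FILE_EXTENT_SAME, which the VFS FIDEDUPERANGE shown here is numerically identical to):

```python
import fcntl, struct

FIDEDUPERANGE = 0xC0189436        # _IOWR(0x94, 54, struct file_dedupe_range)
FILE_DEDUPE_RANGE_SAME = 0
FILE_DEDUPE_RANGE_DIFFERS = 1

def dedupe_range(src_fd, src_off, length, dst_fd, dst_off):
    """Ask the kernel to dedupe one dst range against src.
    Returns bytes actually deduped (0 if the kernel's compare found a diff)."""
    arg = struct.pack("=QQHHI", src_off, length, 1, 0, 0)   # header, dest_count=1
    arg += struct.pack("=qQQiI", dst_fd, dst_off, 0, 0, 0)  # one dest_info
    res = fcntl.ioctl(src_fd, FIDEDUPERANGE, arg)
    _, _, bytes_deduped, status, _ = struct.unpack_from("=qQQiI", res, 24)
    if status < 0:
        raise OSError(-status, "dedupe failed")
    if status == FILE_DEDUPE_RANGE_DIFFERS:
        return 0
    return bytes_deduped
```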

> The locking is perfectly reasonable and shouldn't contribute that much to
> the overhead (unless you're being crazy and deduplicating thousands of tiny
> blocks of data).

Why is deduplicating thousands of blocks of data crazy?  I already
deduplicate four orders of magnitude more than that per week.

> >That said, some optimization is possible (although there are good reasons
> >not to bother with optimization in the kernel):
> >
> > - VFS could recognize when it has two separate references to
> > the same physical extent and not re-read the same data twice
> > (but that requires teaching VFS how to do CoW in general, and is
> > hard for political reasons on top of the obvious technical ones).
> >
> > - the extent-same ioctl could check to see which extents
> > are referenced by the src and dst ranges, and return success
> > immediately without reading data if they are the same (but
> > userspace should already know this, or it's wasting a huge amount
> > of time before it even calls the kernel).
> >
> >>TBH, even though it's kind of annoying from a performance perspective, it's
> >>a rather nice safety net to have.  For example, one of the cases where I do
> >>deduplication is a couple of directories where each directory is an
> >>overlapping partial subset of one large tree which I keep elsewhere.  In
> >>this case, I can tell just by filename exactly what files might be
> >>duplicates, so the ioctl's check lets me just call the ioctl on all
> >>potential duplicates (after checking size, no point in wasting time if the
> >>files obviously aren't duplicates), and have it figure out whether or not
> >>they can be deduplicated.
> >>>
> >>>In any case, I'm considering some digging into the filesystem structures
> >>>to see if I can work this out myself before I do any deduplication. I'm
> >>>fairly sure this should be relatively simple to work out, at least well
> >>>enough for my purposes.
> >>Sadly, there's no way to avoid doing so right now.
> >>


Re: Announcing btrfs-dedupe

2016-11-14 Thread Zygo Blaxell
On Mon, Nov 14, 2016 at 01:39:02PM -0500, Austin S. Hemmelgarn wrote:
> On 2016-11-14 13:22, James Pharaoh wrote:
> >One thing I am keen to understand is if BTRFS will automatically ignore
> >a request to deduplicate a file if it is already deduplicated? Given the
> >performance I see when doing a repeat deduplication, it seems to me that
> >it can't be doing so, although this could be caused by the CPU usage you
> >mention above.
> What's happening is that the dedupe ioctl does a byte-wise comparison of the
> ranges to make sure they're the same before linking them.  This is actually
> what takes most of the time when calling the ioctl, and is part of why it
> takes longer the larger the range to deduplicate is.  In essence, it's
> behaving like an OS should and not trusting userspace to make reasonable
> requests (which is also why there's a separate ioctl to clone a range from
> another file instead of deduplicating existing data).

Deduplicating an extent that might be concurrently modified during the
dedup is a reasonable userspace request.  In the general case there's
no way for userspace to ensure that it's not happening.

That said, some optimization is possible (although there are good reasons
not to bother with optimization in the kernel):

- VFS could recognize when it has two separate references to
the same physical extent and not re-read the same data twice
(but that requires teaching VFS how to do CoW in general, and is
hard for political reasons on top of the obvious technical ones).

- the extent-same ioctl could check to see which extents
are referenced by the src and dst ranges, and return success
immediately without reading data if they are the same (but
userspace should already know this, or it's wasting a huge amount
of time before it even calls the kernel).

> TBH, even though it's kind of annoying from a performance perspective, it's
> a rather nice safety net to have.  For example, one of the cases where I do
> deduplication is a couple of directories where each directory is an
> overlapping partial subset of one large tree which I keep elsewhere.  In
> this case, I can tell just by filename exactly what files might be
> duplicates, so the ioctl's check lets me just call the ioctl on all
> potential duplicates (after checking size, no point in wasting time if the
> files obviously aren't duplicates), and have it figure out whether or not
> they can be deduplicated.
> >
> >In any case, I'm considering some digging into the filesystem structures
> >to see if I can work this out myself before I do any deduplication. I'm
> >fairly sure this should be relatively simple to work out, at least well
> >enough for my purposes.
> Sadly, there's no way to avoid doing so right now.
> 


Re: Announcing btrfs-dedupe

2016-11-14 Thread Zygo Blaxell
On Mon, Nov 14, 2016 at 07:22:59PM +0100, James Pharaoh wrote:
> On 14/11/16 19:07, Zygo Blaxell wrote:
> >There is also a still-unresolved problem where the filesystem CPU usage
> >rises exponentially for some operations depending on the number of shared
> >references to an extent.  Files which contain blocks with more than a few
> >thousand shared references can trigger this problem.  A file over 1TB can
> >keep the kernel busy at 100% CPU for over 40 minutes at a time.
> 
> Yes, I see this all the time. For my use cases, I don't really care about
> "shared references" as blocks of files, but am happy to simply deduplicate
> at the whole-file level. I wonder if this still will have the same effect,
> however. I guess that this could be mitigated in a tool, but this is going
> to be both annoying and not the most elegant solution.

If you have huge files (1TB+) this can be a problem even with whole-file
deduplications (which are really just extent-level deduplications applied
to the entire file).  The CPU time is a product of file size and extent
reference count with some other multipliers on top.

I've hacked around it by timing how long it takes to manipulate the data,
and blacklisting any hash value or block address that takes more than
10 seconds to process (if such a block is found after blacklisting, just
skip processing the block/extent/file entirely).  It turns out there are
very few of these in practice (only a few hundred per TB) but these few
hundred block hash values occur millions of times in a large data corpus.
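In outline, that blacklisting workaround might look like this (illustrative Python; the real thing keys on hash values and block addresses as described, and the limit parameter here exists only so the sketch is easy to exercise):

```python
import time

TIME_LIMIT = 10.0   # seconds; anything slower gets its hash blacklisted

blacklist = set()

def process_extent(hash_value, do_dedupe, limit=TIME_LIMIT):
    """Run one dedupe op unless its hash is blacklisted; blacklist hashes
    whose processing takes longer than `limit`.  Returns True if the op ran."""
    if hash_value in blacklist:
        return False                     # known-toxic hash, skip entirely
    start = time.monotonic()
    do_dedupe()                          # the actual dedupe step, passed in
    if time.monotonic() - start > limit:
        blacklist.add(hash_value)        # too slow; never process it again
    return True
```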

> One thing I am keen to understand is if BTRFS will automatically ignore a
> request to deduplicate a file if it is already deduplicated? Given the
> performance I see when doing a repeat deduplication, it seems to me that it
> can't be doing so, although this could be caused by the CPU usage you
> mention above.

As far as I can tell btrfs doesn't do anything different in this
case--it'll happily repeat the entire lock/read/compare/delete/insert
sequence even if the outcome cannot be different from the initial
conditions.  Due to limitations of VFS caching it'll read the same blocks
from storage hardware twice, too.

> In any case, I'm considering some digging into the filesystem structures to
> see if I can work this out myself before I do any deduplication. I'm fairly
> sure this should be relatively simple to work out, at least well enough for
> my purposes.

I used FIEMAP (then later replaced it with SEARCH_V2 for speed) to map
the extents to physical addresses before deduping them.  If you're only
going to do whole-file dedup then you only need to care about the physical
address of the first non-hole extent.





Re: Announcing btrfs-dedupe

2016-11-14 Thread Zygo Blaxell
On Tue, Nov 08, 2016 at 12:06:01PM +0100, Niccolò Belli wrote:
> Nice, you should probably update the btrfs wiki as well, because there is no
> mention of btrfs-dedupe.
> 
> First question, why this name? Don't you plan to support xfs as well?

Does XFS plan to support LOGICAL_INO, INO_PATHS, and something analogous
to SEARCH_V2?

POSIX API + FILE_EXTENT_SAME is OK for the lowest common denominator
across arbitrary filesystems, but a btrfs-specific tool can do a lot
better.  Especially for incremental dedup and low-RAM algorithms.





Re: Announcing btrfs-dedupe

2016-11-14 Thread Zygo Blaxell
On Mon, Nov 07, 2016 at 07:49:51PM +0100, James Pharaoh wrote:
> Annoyingly I can't find this now, but I definitely remember reading someone,
> apparently someone knowledgable, claim that the latest version of the kernel
> which I was using at the time, still suffered from issues regarding the
> dedupe code.

> This was a while ago, and I would be very pleased to hear that there is high
> confidence in the current implementation! I'll post a link if I manage to
> find the comments.

I've been running the btrfs dedup ioctl 7 times per second on average
over 42TB of test data for most of a year (and at a lower rate for two
years).  I have not found any data corruptions due to _dedup_.  I did find
three distinct data corruption kernel bugs unrelated to dedup, and two
test machines with bad RAM, so I'm pretty sure my corruption detection
is working.

That said, I wouldn't run dedup on a kernel older than 4.4.  LTS kernels
might be OK too, but only if they're up to date with backported btrfs
fixes.

Kernels older than 3.13 lack the FILE_EXTENT_SAME ioctl and can
only deduplicate static data (i.e. data you are certain is not being
concurrently modified).  Before 3.12 there are so many bugs you might
as well not bother.

Older kernels are bad for dedup because of non-corruption reasons.
Between 3.13 and 4.4, the following bugs were fixed:

- false-negative capability checks (e.g. same-inode, EOF extent)
reduce dedup efficiency

- ctime updates (older versions would update ctime when a file was
deduped) mess with incremental backup tools, build systems, etc.

- kernel memory leaks (self-explanatory)

- multiple kernel hang/panic bugs (e.g. a deadlock if two threads
try to read the same extent at the same time, and at least one
of those threads is dedup; and there was some race condition
leading to invalid memory access on dedup's comparison reads)
which won't eat your data, but they might ruin your day anyway.

There is also a still-unresolved problem where the filesystem CPU usage
rises exponentially for some operations depending on the number of shared
references to an extent.  Files which contain blocks with more than a few
thousand shared references can trigger this problem.  A file over 1TB can
keep the kernel busy at 100% CPU for over 40 minutes at a time.

There might also be a correlation between delalloc data and hangs in
extent-same, but I have NOT been able to confirm this.  All I know
at this point is that doing a fsync() on the source FD just before
doing the extent-same ioctl dramatically reduces filesystem hang rates:
several weeks between hangs (or no hangs at all) with fsync, vs. 18 hours
or less without.

> James
> 
> On 07/11/16 18:59, Mark Fasheh wrote:
> >Hi James,
> >
> >Re the following text on your project page:
> >
> >"IMPORTANT CAVEAT — I have read that there are race and/or error
> >conditions which can cause filesystem corruption in the kernel
> >implementation of the deduplication ioctl."
> >
> >Can you expound on that? I'm not aware of any bugs right now but if
> >there is any it'd absolutely be worth having that info on the btrfs
> >list.
> >
> >Thanks,
> >--Mark
> >
> >
> >On Sun, Nov 6, 2016 at 7:30 AM, James Pharaoh
> > wrote:
> >>Hi all,
> >>
> >>I'm pleased to announce my btrfs deduplication utility, written in Rust.
> >>This operates on whole files, is fast, and I believe complements the
> >>existing utilities (duperemove, bedup), which exist currently.
> >>
> >>Please visit the homepage for more information:
> >>
> >>http://btrfs-dedupe.com
> >>
> >>James Pharaoh


Re: Identifying reflink / CoW files

2016-11-02 Thread Zygo Blaxell
On Thu, Oct 27, 2016 at 01:30:11PM +0200, Saint Germain wrote:
> Hello,
> 
> Following the previous discussion:
> https://www.spinics.net/lists/linux-btrfs/msg19075.html
> 
> I would be interested in finding a way to reliably identify reflink /
> CoW files in order to use deduplication programs (like fdupes, jdupes,
> rmlint) efficiently.
> 
> Using FIEMAP doesn't seem to be reliable according to this discussion
> on rmlint:
> https://github.com/sahib/rmlint/issues/132#issuecomment-157665154

Inline extents have no physical address (FIEMAP returns 0 in that field).
You can't dedup them and each file can have only one, so if you see
the FIEMAP_EXTENT_INLINE bit set, you can just skip processing the entire
file immediately.

You can create a separate non-inline extent in a temporary file then
use dedup to replace _both_ copies of the original inline extent.
Or don't bother, as the savings are negligible.

> Is there another way that deduplication programs can easily use ?

The problem is that it's not files that are reflinked--individual extents
are.  "reflink file copy" really just means "a file whose extents are
100% shared with another file." It's possible for files on btrfs to have
any percentage of shared extents from 0 to 100% in increments of the
host page size.  It's also possible for the blocks to be shared with
different extent boundaries.

The quality of the result therefore depends on the amount of effort
put into measuring it.  If you look for the first non-hole extent in
each file and use its physical address as a physical file identifier,
then you get a fast reflink detector function that has a high risk of
false positives.  If you map out two files and compare physical addresses
block by block, you get a slow function with a low risk of false positives
(but maybe a small risk of false negatives too).
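A sketch of that block-by-block comparison, assuming extent maps already decoded into (logical, physical, length) tuples (e.g. from FIEMAP); the function name and tuple shape are mine:

```python
def shared_fraction(map_a, map_b, block=4096):
    """Fraction of file A's mapped blocks whose physical address also backs
    the same logical offset in file B.  Extents with physical == 0
    (holes, inline data) are ignored, since they cannot be shared."""
    def blocks(extmap):
        out = {}
        for logical, physical, length in extmap:
            if physical == 0:
                continue
            for off in range(0, length, block):
                out[logical + off] = physical + off
        return out
    a, b = blocks(map_a), blocks(map_b)
    if not a:
        return 0.0
    shared = sum(1 for logical, phys in a.items() if b.get(logical) == phys)
    return shared / len(a)
```

Because the comparison is per block rather than per extent, it still works when the same physical blocks are shared under different extent boundaries.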

If your dedup program only does full-file reflink copies then the first
extent physical address method is sufficient.  If your program does
block- or extent-level dedup then it shouldn't be using files in its
data model at all, except where necessary to provide a mechanism to
access the physical blocks through the POSIX filesystem API.

FIEMAP will tell you about all the extents (physical address for extents
that have them, zero for other extent types).  It's also slow and has
assorted accuracy problems especially with compressed files.  Any user
can run FIEMAP, and it uses only standard structure arrays.

SEARCH_V2 is root-only and requires parsing variable-length binary
btrfs data encoding, but it's faster than FIEMAP and gives more accurate
results on compressed files.

> Thanks


Re: Monitoring Btrfs

2016-10-17 Thread Zygo Blaxell
On Mon, Oct 17, 2016 at 06:44:14PM +0200, Stefan Malte Schumacher wrote:
> Hello
> 
> I would like to monitor my btrfs-filesystem for missing drives. On
> Debian mdadm uses a script in /etc/cron.daily, which calls mdadm and
> sends an email if anything is wrong with the array. I would like to do
> the same with btrfs. In my first attempt I grepped and cut the
> information from "btrfs fi show" and let the script send an email if
> the number of devices was not equal to the preselected number.
> 
> Then I saw this:
> 
> ubuntu@ubuntu:~$ sudo btrfs filesystem show
> Label: none  uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
> Total devices 6 FS bytes used 5.47TiB
devid    1 size 1.81TiB used 1.71TiB path /dev/sda3
devid    2 size 1.81TiB used 1.71TiB path /dev/sdb3
devid    3 size 1.82TiB used 1.72TiB path /dev/sdc1
devid    4 size 1.82TiB used 1.72TiB path /dev/sdd1
devid    5 size 2.73TiB used 2.62TiB path /dev/sde1
> *** Some devices missing
> 
> on this page: 
> https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
> The number of devices is still at 6, despite the fact that one of the
> drives is missing, which means that my first idea doesn't work.

Using fi show for this isn't a good idea.  By the time btrfs fi show
tells you something is different from the norm, you've probably already
crashed at least once and are now mounting with the 'degraded' option.

> I have
> two questions:
> 1) Has anybody already written a script like this? After all, there is
> no need to reinvent the wheel a second time.
> 2) What should I best grep for? In this case I would just go for the
> "missing". Does this cover all possible outputs of btrfs fi show in
> case of a damaged array? What other outputs do I need to consider for
> my script.

I monitor the device error counters, i.e. the output of

for fs in /fs1 /fs2 /fs3... ; do
btrfs dev stat "$fs" | grep -v " 0$"
done

and send an email when it isn't empty.

When there are errors I investigate in more detail (is it a failing disk?
failed disk?  bad cables?  bad RAM?  One-off UNC sector that can be
ignored?), fix any problems (i.e. replace hardware, run scrub), and
reset the counters to zero with 'btrfs dev stat -z'.
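The same check can be done by parsing the counter lines instead of grepping them away (a sketch; the input is the line format `btrfs dev stat` prints, and the function name is mine):

```python
import re

def nonzero_counters(dev_stat_output):
    """Pick out non-zero error counters from `btrfs dev stat` output,
    mirroring the grep -v ' 0$' loop above but returning structured
    (device, counter, value) tuples suitable for an email report."""
    bad = []
    for line in dev_stat_output.splitlines():
        m = re.match(r"\[([^\]]+)\]\.(\w+)\s+(\d+)$", line.strip())
        if m and int(m.group(3)) != 0:
            bad.append((m.group(1), m.group(2), int(m.group(3))))
    return bad
```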

> Yours sincerely
> Stefan


Re: [RFC] btrfs: make max inline data can be equal to sectorsize

2016-10-15 Thread Zygo Blaxell
On Wed, Oct 12, 2016 at 11:35:46AM +0800, Wang Xiaoguang wrote:
> hi,
> 
> On 10/11/2016 11:49 PM, Chris Murphy wrote:
> >On Tue, Oct 11, 2016 at 12:47 AM, Wang Xiaoguang
> > wrote:
> >>If we use mount option "-o max_inline=sectorsize", say 4096, indeed
> >>even for a fresh fs, say nodesize is 16k, we can not make the first
> >>4k data completely inline, I found this condition causing this issue:
> >>   !compressed_size && (actual_end & (root->sectorsize - 1)) == 0
> >>
> >>If it returns true, we'll not make data inline. For 4k sectorsize,
> >>0~4094 data range, we can make it inline, but 0~4095, it can not.
> >>I don't think this limition is useful, so here remove it which will
> >>make max inline data can be equal to sectorsize.
> >>
> >>Signed-off-by: Wang Xiaoguang 
> >>---
> >>  fs/btrfs/inode.c | 2 --
> >>  1 file changed, 2 deletions(-)
> >>
> >>diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >>index ea15520..c0db393 100644
> >>--- a/fs/btrfs/inode.c
> >>+++ b/fs/btrfs/inode.c
> >>@@ -267,8 +267,6 @@ static noinline int cow_file_range_inline(struct 
> >>btrfs_root *root,
> >> if (start > 0 ||
> >> actual_end > root->sectorsize ||
> >> data_len > BTRFS_MAX_INLINE_DATA_SIZE(root) ||
> >>-   (!compressed_size &&
> >>-   (actual_end & (root->sectorsize - 1)) == 0) ||
> >> end + 1 < isize ||
> >> data_len > root->fs_info->max_inline) {
> >> return 1;
> >>--
> >>2.9.0
> >
> >Before making any further changes to inline data, does it make sense
> >to find the source of corruption Zygo has been experiencing? That's in
> >the "btrfs rare silent data corruption with kernel data leak" thread.
> Yes, agree.
> Also Zygo has sent a patch to fix that bug this morning :)

FWIW I don't see any connection between this and the problem I found.
A page-sized inline extent wouldn't have any room for uninitialized
bytes.  If anything, it's the one rare case that already worked.  ;)

> Regards,
> XIaoguang Wang
> 
> >
> >
> 
> 
> 

