date:20141009

On 2014-10-08 15:11, Eric Sandeen wrote:

I was looking at Marc's post:

http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html

and it feels like there isn't exactly a cohesive, overarching vision for
repair of a corrupted btrfs filesystem.

In other words - I'm an admin cruising along, when the kernel throws some
fs corruption error, or for whatever reason btrfs fails to mount.
What should I do?

Marc lays out several steps, but to me this highlights that there seem to
be a lot of disjoint mechanisms out there to deal with these problems;
mostly from Marc's blog, with some bits of my own:

* btrfs scrub
Errors are corrected along if possible (what *is* possible?)
* mount -o recovery
Enable autorecovery attempts if a bad tree root is found at mount
time.
* mount -o degraded
Allow mounts to continue with missing devices.
(This isn't really a way to recover from corruption, right?)
* btrfs-zero-log
remove the log tree if log tree is corrupt
* btrfs rescue
Recover a damaged btrfs filesystem
chunk-recover
super-recover
How does this relate to btrfs check?
* btrfs check
repair a btrfs filesystem
--repair
--init-csum-tree
--init-extent-tree
How does this relate to btrfs rescue?
* btrfs restore
try to salvage files from a damaged filesystem
(not really repair, it's disk-scraping)

What's the vision for, say, scrub vs. check vs. rescue? Should they repair the
same errors, only online vs. offline? If not, what class of errors does one
fix vs.
the other? How would an admin know? Can btrfs check recover a bad tree root
in the same way that mount -o recovery does? How would I know if I should use
--init-*-tree, or chunk-recover, and what are the ramifications of using
these options?

It feels like recovery tools have been badly splintered, and if there's an
overarching design or vision for btrfs fs repair, I can't tell what it is.
Can anyone help me?

Well, based on my understanding:
* btrfs scrub is intended to be almost exactly equivalent to scrubbing a
RAID volume; that is, it fixes disparity between multiple copies of the
same block. IOW, it isn't really repair per se, but more preventative
maintnence. Currently, it only works for cases where you have multiple
copies of a block (dup, raid1, and raid10 profiles), but support is
planned for error correction of raid5 and raid6 profiles.
* mount -o recovery I don't know much about, but AFAICT, it s more for
dealing with metadata related FS corruption.
* mount -o degraded is used to mount a fs configured for a raid storage
profile with fewer devices than the profile minimum. It's primarily so
that you can get the fs into a state where you can run 'btrfs device
replace'
* btrfs-zero-log only deals with log tree corruption. This would be
roughly equivalent to zeroing out the journal on an XFS or ext4
filesystem, and should almost never be needed.
* btrfs rescue is intended for low level recovery corruption on an
offline fs.
* chunk-recover I'm not entirely sure about, but I believe it's
like scrub for a single chunk on an offline fs
* super-recover is for dealing with corrupted superblocks, and
tries to replace it with one of the other copies (which hopefully isn't
corrupted)
* btrfs check is intended to (eventually) be equivalent to the fsck
utility for most other filesystems. Currently, it's relatively good at
identifying corruption, but less so at actually fixing it. There are
however, some things that it won't catch, like a superblock pointing to
a corrupted root tree.
* btrfs restore is essentially disk scraping, but with built-in
knowledge of the filesystem's on-disk structure, which makes it more
reliable than more generic tools like scalpel for files that are too big
to fit in the metadata blocks, and it is pretty much essential for
dealing with transparently compressed files.

In general, my personal procedure for handling a misbehaving BTRFS
filesystem is:
* Run btrfs check on it WITHOUT ANY OTHER OPTIONS to try to identify
what's wrong

* Try mounting it using -o recovery
* Try mounting it using -o ro,recovery
* Use -o degraded only if it's a BTRFS raid set that lost a disk
* If btrfs check AND dmesg both seem to indicate that the log tree is
corrupt, try btrfs-zero-log
* If btrfs check indicated a corrupt superblock, try btrfs rescue
super-recover

* If all of the above fails, ask for advice on the mailing list or IRC
Also, you should be running btrfs scrub regularly to correct bit-rot and
force remapping of blocks with read errors. While BTRFS technically
handles both transparently on reads, it only corrects thing on disk when
you do a scrub.

smime.p7s
Description: S/MIME Cryptographic Signature

Re: What is the vision for btrfs fs repair?

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
excerpted:

 Also, you should be running btrfs scrub regularly to correct bit-rot
 and force remapping of blocks with read errors.  While BTRFS
 technically handles both transparently on reads, it only corrects thing
 on disk when you do a scrub.

AFAIK that isn't quite correct.  Currently, the number of copies is 
limited to two, meaning if one of the two is bad, there's a 50% chance of 
btrfs reading the good one on first try.

If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad 
one, it checks the other one and assuming it's good, replaces the bad one 
with the good one both for the read (which otherwise errors out), and by 
overwriting the bad one.

But here's the rub.  The chances of detecting that bad block are 
relatively low in most cases.  First, the system must try reading it for 
some reason, but even then, chances are 50% it'll pick the good one and 
won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it with 
the good copy, scrub is the only way to systematically detect and (if 
there's a good copy) fix these checksum errors.  It's not that btrfs 
doesn't do it if it finds them, it's that the chances of finding them are 
relatively low, unless you do a scrub, which systematically checks the 
entire filesystem (well, other than files marked nocsum, or nocow, which 
implies nocsum, or files written when mounted with nodatacow or 
nodatasum).

At least that's the way it /should/ work.  I guess it's possible that 
btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but 
if so, that's the first /I/ remember reading of it.

Other than that detail, what you posted matches my knowledge and 
experience, such as it may be as a non-dev list regular, as well.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: What is the vision for btrfs fs repair?

2014-10-09 Thread Hugo Mills

On Thu, Oct 09, 2014 at 11:53:23AM +, Duncan wrote:
 Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
 excerpted:
 
  Also, you should be running btrfs scrub regularly to correct bit-rot
  and force remapping of blocks with read errors.  While BTRFS
  technically handles both transparently on reads, it only corrects thing
  on disk when you do a scrub.
 
 AFAIK that isn't quite correct.  Currently, the number of copies is 
 limited to two, meaning if one of the two is bad, there's a 50% chance of 
 btrfs reading the good one on first try.

   Scrub checks both copies, though. It's ordinary reads that don't.

   Hugo.

 If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad 
 one, it checks the other one and assuming it's good, replaces the bad one 
 with the good one both for the read (which otherwise errors out), and by 
 overwriting the bad one.
 
 But here's the rub.  The chances of detecting that bad block are 
 relatively low in most cases.  First, the system must try reading it for 
 some reason, but even then, chances are 50% it'll pick the good one and 
 won't even notice the bad one.
 
 Thus, while btrfs may randomly bump into a bad block and rewrite it with 
 the good copy, scrub is the only way to systematically detect and (if 
 there's a good copy) fix these checksum errors.  It's not that btrfs 
 doesn't do it if it finds them, it's that the chances of finding them are 
 relatively low, unless you do a scrub, which systematically checks the 
 entire filesystem (well, other than files marked nocsum, or nocow, which 
 implies nocsum, or files written when mounted with nodatacow or 
 nodatasum).
 
 At least that's the way it /should/ work.  I guess it's possible that 
 btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but 
 if so, that's the first /I/ remember reading of it.
 
 Other than that detail, what you posted matches my knowledge and 
 experience, such as it may be as a non-dev list regular, as well.
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- Great oxymorons of the world, no. 7: The Simple Truth ---  


signature.asc
Description: Digital signature

Re: What is the vision for btrfs fs repair?


On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
excerpted:


Also, you should be running btrfs scrub regularly to correct bit-rot
and force remapping of blocks with read errors.  While BTRFS
technically handles both transparently on reads, it only corrects thing
on disk when you do a scrub.


AFAIK that isn't quite correct.  Currently, the number of copies is
limited to two, meaning if one of the two is bad, there's a 50% chance of
btrfs reading the good one on first try.

If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad
one, it checks the other one and assuming it's good, replaces the bad one
with the good one both for the read (which otherwise errors out), and by
overwriting the bad one.

But here's the rub.  The chances of detecting that bad block are
relatively low in most cases.  First, the system must try reading it for
some reason, but even then, chances are 50% it'll pick the good one and
won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it with
the good copy, scrub is the only way to systematically detect and (if
there's a good copy) fix these checksum errors.  It's not that btrfs
doesn't do it if it finds them, it's that the chances of finding them are
relatively low, unless you do a scrub, which systematically checks the
entire filesystem (well, other than files marked nocsum, or nocow, which
implies nocsum, or files written when mounted with nodatacow or
nodatasum).

At least that's the way it /should/ work.  I guess it's possible that
btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but
if so, that's the first /I/ remember reading of it.


I'm not 100% certain, but I believe it doesn't actually fix things on 
disk when it detects an error during a read, I know it doesn't it the fs 
is mounted ro (even if the media is writable), because I did some 
testing to see how 'read-only' mounting a btrfs filesystem really is.


Also, that's a much better description of how multiple copies work than 
I could probably have ever given.





smime.p7s
Description: S/MIME Cryptographic Signature

Re: What is the vision for btrfs fs repair?

2014-10-09 Thread Hugo Mills

On Thu, Oct 09, 2014 at 08:07:51AM -0400, Austin S Hemmelgarn wrote:
 On 2014-10-09 07:53, Duncan wrote:
 Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
 excerpted:
 
 Also, you should be running btrfs scrub regularly to correct bit-rot
 and force remapping of blocks with read errors.  While BTRFS
 technically handles both transparently on reads, it only corrects thing
 on disk when you do a scrub.
 
 AFAIK that isn't quite correct.  Currently, the number of copies is
 limited to two, meaning if one of the two is bad, there's a 50% chance of
 btrfs reading the good one on first try.
 
 If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad
 one, it checks the other one and assuming it's good, replaces the bad one
 with the good one both for the read (which otherwise errors out), and by
 overwriting the bad one.
 
 But here's the rub.  The chances of detecting that bad block are
 relatively low in most cases.  First, the system must try reading it for
 some reason, but even then, chances are 50% it'll pick the good one and
 won't even notice the bad one.
 
 Thus, while btrfs may randomly bump into a bad block and rewrite it with
 the good copy, scrub is the only way to systematically detect and (if
 there's a good copy) fix these checksum errors.  It's not that btrfs
 doesn't do it if it finds them, it's that the chances of finding them are
 relatively low, unless you do a scrub, which systematically checks the
 entire filesystem (well, other than files marked nocsum, or nocow, which
 implies nocsum, or files written when mounted with nodatacow or
 nodatasum).
 
 At least that's the way it /should/ work.  I guess it's possible that
 btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but
 if so, that's the first /I/ remember reading of it.
 
 I'm not 100% certain, but I believe it doesn't actually fix things on disk
 when it detects an error during a read,

   I'm fairly sure it does, as I've had it happen to me. :)

 I know it doesn't it the fs is
 mounted ro (even if the media is writable), because I did some testing to
 see how 'read-only' mounting a btrfs filesystem really is.

   If the FS is RO, then yes, it won't fix things.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- Great films about cricket:  Interview with the Umpire ---  


signature.asc
Description: Digital signature

Re: What is the vision for btrfs fs repair?


On 2014-10-09 08:12, Hugo Mills wrote:

On Thu, Oct 09, 2014 at 08:07:51AM -0400, Austin S Hemmelgarn wrote:

On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
excerpted:


Also, you should be running btrfs scrub regularly to correct bit-rot
and force remapping of blocks with read errors.  While BTRFS
technically handles both transparently on reads, it only corrects thing
on disk when you do a scrub.


AFAIK that isn't quite correct.  Currently, the number of copies is
limited to two, meaning if one of the two is bad, there's a 50% chance of
btrfs reading the good one on first try.

If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad
one, it checks the other one and assuming it's good, replaces the bad one
with the good one both for the read (which otherwise errors out), and by
overwriting the bad one.

But here's the rub.  The chances of detecting that bad block are
relatively low in most cases.  First, the system must try reading it for
some reason, but even then, chances are 50% it'll pick the good one and
won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it with
the good copy, scrub is the only way to systematically detect and (if
there's a good copy) fix these checksum errors.  It's not that btrfs
doesn't do it if it finds them, it's that the chances of finding them are
relatively low, unless you do a scrub, which systematically checks the
entire filesystem (well, other than files marked nocsum, or nocow, which
implies nocsum, or files written when mounted with nodatacow or
nodatasum).

At least that's the way it /should/ work.  I guess it's possible that
btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but
if so, that's the first /I/ remember reading of it.


I'm not 100% certain, but I believe it doesn't actually fix things on disk
when it detects an error during a read,


I'm fairly sure it does, as I've had it happen to me. :)
I probably just misinterpreted the source code, while I know enough C to 
generally understand things, I'm by far no expert.



I know it doesn't it the fs is
mounted ro (even if the media is writable), because I did some testing to
see how 'read-only' mounting a btrfs filesystem really is.


If the FS is RO, then yes, it won't fix things.

Hugo.






smime.p7s
Description: S/MIME Cryptographic Signature

Re: What is the vision for btrfs fs repair?

On Thu, 09 Oct 2014 08:07:51 -0400
Austin S Hemmelgarn ahferro...@gmail.com wrote:

 On 2014-10-09 07:53, Duncan wrote:
  Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
  excerpted:
 
  Also, you should be running btrfs scrub regularly to correct
  bit-rot and force remapping of blocks with read errors.  While
  BTRFS technically handles both transparently on reads, it only
  corrects thing on disk when you do a scrub.
 
  AFAIK that isn't quite correct.  Currently, the number of copies is
  limited to two, meaning if one of the two is bad, there's a 50%
  chance of btrfs reading the good one on first try.
 
  If btrfs reads the good copy, it simply uses it.  If btrfs reads
  the bad one, it checks the other one and assuming it's good,
  replaces the bad one with the good one both for the read (which
  otherwise errors out), and by overwriting the bad one.
 
  But here's the rub.  The chances of detecting that bad block are
  relatively low in most cases.  First, the system must try reading
  it for some reason, but even then, chances are 50% it'll pick the
  good one and won't even notice the bad one.
 
  Thus, while btrfs may randomly bump into a bad block and rewrite it
  with the good copy, scrub is the only way to systematically detect
  and (if there's a good copy) fix these checksum errors.  It's not
  that btrfs doesn't do it if it finds them, it's that the chances of
  finding them are relatively low, unless you do a scrub, which
  systematically checks the entire filesystem (well, other than files
  marked nocsum, or nocow, which implies nocsum, or files written
  when mounted with nodatacow or nodatasum).
 
  At least that's the way it /should/ work.  I guess it's possible
  that btrfs isn't doing those routine bump-into-it-and-fix-it
  fixes yet, but if so, that's the first /I/ remember reading of it.
 
 I'm not 100% certain, but I believe it doesn't actually fix things on 
 disk when it detects an error during a read, I know it doesn't it the
 fs is mounted ro (even if the media is writable), because I did some 
 testing to see how 'read-only' mounting a btrfs filesystem really is.

Definitely it won't with a read-only mount.  But then scrub shouldn't
be able to write to a read-only mount either.  The only way a read-only
mount should be writable is if it's mounted (bind-mounted or
btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
that mount, not the read-only mounted location.

There's even debate about replaying the journal or doing orphan-delete
on read-only mounts (at least on-media, the change could, and arguably
should, occur in RAM and be cached, marking the cache dirty at the
same time so it's appropriately flushed if/when the filesystem goes
writable), with some arguing read-only means just that, don't
write /anything/ to it until it's read-write mounted.

But writable-mounted, detected checksum errors (with a good copy
available) should be rewritten as far as I know.  If not, I'd call it
a bug.  The problem is in the detection, not in the rewriting.  Scrub's
the only way to reliably detect these errors since it's the only thing
that systematically checks /everything/.

 Also, that's a much better description of how multiple copies work
 than I could probably have ever given.

Thanks.  =:^)

-- 
Duncan - No HTML messages please, as they are filtered as spam.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: What is the vision for btrfs fs repair?

On Thu, 9 Oct 2014 12:55:50 +0100
Hugo Mills h...@carfax.org.uk wrote:

 On Thu, Oct 09, 2014 at 11:53:23AM +, Duncan wrote:
  Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
  excerpted:
  
   Also, you should be running btrfs scrub regularly to correct
   bit-rot and force remapping of blocks with read errors.  While
   BTRFS technically handles both transparently on reads, it only
   corrects thing on disk when you do a scrub.
  
  AFAIK that isn't quite correct.  Currently, the number of copies is 
  limited to two, meaning if one of the two is bad, there's a 50%
  chance of btrfs reading the good one on first try.
 
Scrub checks both copies, though. It's ordinary reads that don't.

While I believe I was clear in full context (see below), agreed.  I was
talking about normal reads in the above, not scrub, as the full quote
should make clear.  I guess I could have made it clearer in the
immediate context, however.  Thanks.

  Thus, while btrfs may randomly bump into a bad block and rewrite it
  with the good copy, scrub is the only way to systematically detect
  and (if there's a good copy) fix these checksum errors.



-- 
Duncan - No HTML messages please, as they are filtered as spam.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: What is the vision for btrfs fs repair?


On 2014-10-09 08:34, Duncan wrote:

On Thu, 09 Oct 2014 08:07:51 -0400
Austin S Hemmelgarn ahferro...@gmail.com wrote:


On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
excerpted:


Also, you should be running btrfs scrub regularly to correct
bit-rot and force remapping of blocks with read errors.  While
BTRFS technically handles both transparently on reads, it only
corrects thing on disk when you do a scrub.


AFAIK that isn't quite correct.  Currently, the number of copies is
limited to two, meaning if one of the two is bad, there's a 50%
chance of btrfs reading the good one on first try.

If btrfs reads the good copy, it simply uses it.  If btrfs reads
the bad one, it checks the other one and assuming it's good,
replaces the bad one with the good one both for the read (which
otherwise errors out), and by overwriting the bad one.

But here's the rub.  The chances of detecting that bad block are
relatively low in most cases.  First, the system must try reading
it for some reason, but even then, chances are 50% it'll pick the
good one and won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it
with the good copy, scrub is the only way to systematically detect
and (if there's a good copy) fix these checksum errors.  It's not
that btrfs doesn't do it if it finds them, it's that the chances of
finding them are relatively low, unless you do a scrub, which
systematically checks the entire filesystem (well, other than files
marked nocsum, or nocow, which implies nocsum, or files written
when mounted with nodatacow or nodatasum).

At least that's the way it /should/ work.  I guess it's possible
that btrfs isn't doing those routine bump-into-it-and-fix-it
fixes yet, but if so, that's the first /I/ remember reading of it.


I'm not 100% certain, but I believe it doesn't actually fix things on
disk when it detects an error during a read, I know it doesn't it the
fs is mounted ro (even if the media is writable), because I did some
testing to see how 'read-only' mounting a btrfs filesystem really is.


Definitely it won't with a read-only mount.  But then scrub shouldn't
be able to write to a read-only mount either.  The only way a read-only
mount should be writable is if it's mounted (bind-mounted or
btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
that mount, not the read-only mounted location.

In theory yes, but there are caveats to this, namely:
* atime updates still happen unless you have mounted the fs with noatime
* The superblock gets updated if there are 'any' writes
* The free space cache 'might' be updated if there are any writes

All in all, a BTRFS filesystem mounted ro is much more read-only than 
say ext4 (which at least updates the sb, and old versions replayed the 
journal, in addition to the atime updates).


There's even debate about replaying the journal or doing orphan-delete
on read-only mounts (at least on-media, the change could, and arguably
should, occur in RAM and be cached, marking the cache dirty at the
same time so it's appropriately flushed if/when the filesystem goes
writable), with some arguing read-only means just that, don't
write /anything/ to it until it's read-write mounted.

But writable-mounted, detected checksum errors (with a good copy
available) should be rewritten as far as I know.  If not, I'd call it
a bug.  The problem is in the detection, not in the rewriting.  Scrub's
the only way to reliably detect these errors since it's the only thing
that systematically checks /everything/.


Also, that's a much better description of how multiple copies work
than I could probably have ever given.


Thanks.  =:^)






smime.p7s
Description: S/MIME Cryptographic Signature

Re: What is the vision for btrfs fs repair?

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 09:18:22 -0400 as
excerpted:

 On 2014-10-09 08:34, Duncan wrote:

 The only way a read-only
 mount should be writable is if it's mounted (bind-mounted or
 btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
 that mount, not the read-only mounted location.

 In theory yes, but there are caveats to this, namely:
 * atime updates still happen unless you have mounted the fs with noatime

I've been mounting noatime for well over a decade now, exactly due to 
such problems.  But I believe at least /some/ filesystems are truly read-
only when they're mounted as such, and atime updates don't happen on them.

These days I actually apply a patch that changes the default relatime to 
noatime, so I don't even have to have it in my mount-options. =:^)

 * The superblock gets updated if there are 'any' writes

Yeah.  At least in theory, there shouldn't be, however.  As I said, in 
theory, even journal replay and orphan delete shouldn't hit media, altho 
handling it in memory and dirtying the cache, so if the filesystem is 
ever remounted read-write they get written, is reasonable.

 * The free space cache 'might' be updated if there are any writes

Makes sense.  But of course that's what I'm arguing, there shouldn't /be/ 
any writes.  Read-only should mean exactly that, don't touch media, 
period.

I remember at one point activating an mdraid1 degraded, read-only, just a 
single device of the 4-way raid1 I was running at the time, to recover 
data from it after the system it was running in died.  The idea was don't 
write to the device at all, because I was still testing the new system, 
and in case I decided to try to reassemble the raid at some point.  Read-
only really NEEDS to be read-only, under such conditions.

Similarly for forensic examination, of course.  If there's a write, any 
write, it's evidence tampering.  Read-only needs to MEAN read-only!

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

deadlock with 3.16.3

2014-10-09 Thread E V

running wheezy Debian 3.16.3-2~bpo70+1 system has locked up 2 nights
in a row running rsync copying from remote to a ~100TB btrfs. Only job
running on the server, no interactive users or anything. soft locks
showed up in kern.log across many CPUs shortly before system became
non-responsive. First lines of the call traces via: grep '8 17:12'
/var/log/kern.log | cut -d  --complement -f 1,2,3,4,5,6 | grep -A1
'Call Trace:' | egrep -v '\-\-|Call Trace:'

I've gone back to 3.14, which I've never had issues with. sar report
from 5 minutes prior looked pretty normal, no swap depletion. Let me
know if you want more info

[30328.735511]  [a0227df8] ?
btrfs_lookup_file_extent+0x38/0x40 [btrfs]
[30328.747283]  [a026d353] ? btrfs_tree_lock+0xd3/0x1e0 [btrfs]
[30328.759289]  [a026d353] ? btrfs_tree_lock+0xd3/0x1e0 [btrfs]
[30328.775303]  [a026d353] ? btrfs_tree_lock+0xd3/0x1e0 [btrfs]
[30328.787311]  [a026d353] ? btrfs_tree_lock+0xd3/0x1e0 [btrfs]
[30328.799324]  [a026cfaf] ? btrfs_tree_read_lock+0x3f/0x110 [btrfs]
[30328.839351]  [a026cfaf] ? btrfs_tree_read_lock+0x3f/0x110 [btrfs]
[30328.855363]  [a026d353] ? btrfs_tree_lock+0xd3/0x1e0 [btrfs]
[30328.867373]  [a026cecc] ?
btrfs_clear_lock_blocking_rw+0x4c/0xf0 [btrfs]
[30328.879384]  [a026cfaf] ? btrfs_tree_read_lock+0x3f/0x110 [btrfs]
[30328.891394]  [a026cecc] ?
btrfs_clear_lock_blocking_rw+0x4c/0xf0 [btrfs]
[30328.907407]  [a026d353] ? btrfs_tree_lock+0xd3/0x1e0 [btrfs]
[30328.919396]  [a026cfaf] ? btrfs_tree_read_lock+0x3f/0x110 [btrfs]
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

btrfs balance segfault, kernel BUG at fs/btrfs/extent-tree.c:7727

2014-10-09 Thread Petr Janecek

Hello,

  I have trouble finishing btrfs balance on five disk raid10 fs.
I added a disk to 4x3TB raid10 fs and run btrfs balance start
/mnt/b3, which segfaulted after few hours, probably because of the BUG
below. btrfs check does not find any errors, both before the balance
and after reboot (the fs becomes un-umountable).

This was second attempt of balance run, the first one ended the same,
see Comment 10 on bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=64961

There are ~7.5M files on /mnt/b3; one subvolume with 4.8M files has been
snapshot 85 times.

root@fs0:~# uname -a
Linux fs0 3.17.0 #10 SMP Mon Oct 6 11:31:13 CEST 2014 x86_64 GNU/Linux
root@fs0:~# btrfs fi show /mnt/b3
Label: 'BTR3'  uuid: f181dd81-c219-44fc-b113-3a8cfd0d3295
Total devices 5 FS bytes used 2.35TiB
devid1 size 2.73TiB used 1.05TiB path /dev/sde
devid2 size 2.73TiB used 1.05TiB path /dev/sdf
devid3 size 2.73TiB used 1.05TiB path /dev/sdg
devid4 size 2.73TiB used 1.05TiB path /dev/sdh
devid5 size 3.64TiB used 524.03GiB path /dev/sdp

Btrfs v3.16
root@fs0:~# btrfs fi df /mnt/b3
Data, RAID10: total=2.34TiB, used=2.34TiB
System, RAID1: total=32.00MiB, used=304.00KiB
Metadata, RAID1: total=15.00GiB, used=13.75GiB
unknown, single: total=512.00MiB, used=496.00KiB


[22717.728944] BTRFS info (device sdp): relocating block group 299458816 
flags 65
[22735.276539] BTRFS info (device sdp): found 60187 extents
[22744.233882] [ cut here ]
[22744.238559] WARNING: CPU: 0 PID: 4211 at fs/btrfs/extent-tree.c:876 
btrfs_lookup_extent_info+0x292/0x30a [btrfs]()
[22744.248953] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs 
lockd fscache sunrpc xfs libcrc32c loop raid10 md_mod iTCO_wdt 
x86_pkg_temp_thermal coretemp kvm_intel kvm crc32_pclmul ghash_clmulni_intel 
iTCO_vendor_support aesni_intel aes_x86_64 ablk_helper cryptd lrw gf128mul 
glue_helper lpc_ich mfd_core i2c_i801 i2c_core pcspkr psmouse evdev microcode 
serio_raw battery ipmi_si ipmi_msghandler video tpm_tis tpm button acpi_cpufreq 
processor ie31200_edac edac_core btrfs xor raid6_pq sg sd_mod uas usb_storage 
hid_generic usbhid hid ahci libahci mpt2sas raid_class libata 
scsi_transport_sas crc32c_intel ehci_pci e1000e ehci_hcd ptp pps_core scsi_mod 
thermal fan thermal_sys usbcore usb_common
[22744.312827] CPU: 0 PID: 4211 Comm: btrfs Not tainted 3.17.0 #10
[22744.318770] Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.0a 
06/08/2012
[22744.326475]   0009 813a6a46 

[22744.333983]  8103b591  c0593cc4 
88027beef380
[22744.341503]  88028da8ff50   
880136d1a000
[22744.349019] Call Trace:
[22744.351493]  [813a6a46] ? dump_stack+0x41/0x51
[22744.356827]  [8103b591] ? warn_slowpath_common+0x78/0x90
[22744.363037]  [c0593cc4] ? btrfs_lookup_extent_info+0x292/0x30a 
[btrfs]
[22744.370471]  [c0593cc4] ? btrfs_lookup_extent_info+0x292/0x30a 
[btrfs]
[22744.377912]  [c059438f] ? walk_down_proc+0xaf/0x1e3 [btrfs]
[22744.384373]  [8110bc2a] ? kmem_cache_alloc+0x91/0x104
[22744.390321]  [c05965b8] ? walk_down_tree+0x40/0xa9 [btrfs]
[22744.396706]  [c0598f3e] ? btrfs_drop_snapshot+0x2c4/0x656 [btrfs]
[22744.403702]  [c05e6297] ? merge_reloc_roots+0xf0/0x1ca [btrfs]
[22744.410434]  [c05e6972] ? relocate_block_group+0x445/0x4bd [btrfs]
[22744.417520]  [c05e6b39] ? btrfs_relocate_block_group+0x14f/0x267 
[btrfs]
[22744.425138]  [c05c56b7] ? btrfs_relocate_chunk.isra.58+0x58/0x5e2 
[btrfs]
[22744.432862]  [c0586786] ? btrfs_item_key_to_cpu+0x12/0x30 [btrfs]
[22744.439851]  [c05ba695] ? btrfs_get_token_64+0x76/0xc6 [btrfs]
[22744.446590]  [c05c190b] ? release_extent_buffer+0x9d/0xa4 [btrfs]
[22744.453585]  [c05c8186] ? btrfs_balance+0x9b0/0xb9d [btrfs]
[22744.460064]  [c05cf646] ? btrfs_ioctl_balance+0x21a/0x297 [btrfs]
[22744.467057]  [c05d2462] ? btrfs_ioctl+0x10f4/0x20a5 [btrfs]
[22744.473531]  [81121b9e] ? path_openat+0x233/0x4c5
[22744.479129]  [81030620] ? __do_page_fault+0x339/0x3df
[22744.485072]  [810f2b9c] ? __vma_link_rb+0x58/0x73
[22744.490668]  [810f2c22] ? vma_link+0x6b/0x8a
[22744.495824]  [811237f8] ? do_vfs_ioctl+0x3ec/0x435
[22744.501509]  [8105b9e0] ? vtime_account_user+0x35/0x40
[22744.507539]  [8112388a] ? SyS_ioctl+0x49/0x77
[22744.512781]  [813aafac] ? tracesys+0x7e/0xe2
[22744.517935]  [813ab00b] ? tracesys+0xdd/0xe2
[22744.523092] ---[ end trace fac5e12cd6384894 ]---
22744.527735] [ cut here ]
[22744.532378] kernel BUG at fs/btrfs/extent-tree.c:7727!
[22744.537532] invalid opcode:  [#1] SMP 
[22744.541684] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs 
lockd fscache sunrpc xfs

Re: Fwd: Re: [PATCH] btrfs: add more superblock checks

2014-10-09 Thread David Sterba

On Tue, Oct 07, 2014 at 04:51:11PM +0800, Qu Wenruo wrote:
  +   struct btrfs_super_block *sb = fs_info-super_copy;
  +   int ret = 0;
  +
  +   if (sb-root_level  BTRFS_MAX_LEVEL) {
  +   printk(KERN_ERR BTRFS: tree_root level too big: %d  %d\n,
  +   sb-root_level, BTRFS_MAX_LEVEL);
  +   ret = -EINVAL;
  +   }
  +   if (sb-chunk_root_level  BTRFS_MAX_LEVEL) {
  +   printk(KERN_ERR BTRFS: chunk_root level too big: %d  %d\n,
  +   sb-chunk_root_level, BTRFS_MAX_LEVEL);
  +   ret = -EINVAL;
  +   }
  +   if (sb-log_root_level  BTRFS_MAX_LEVEL) {
  +   printk(KERN_ERR BTRFS: log_root level too big: %d  %d\n,
  +   sb-log_root_level, BTRFS_MAX_LEVEL);
  +   ret = -EINVAL;
  +   }
  +
  /*
  -* Placeholder for checks
  +* The common minimum, we don't know if we can trust the 
  nodesize/sectorsize
  +* items yet, they'll be verified later. Issue just a warning.
   */
  -   return 0;
  +   if (!IS_ALIGNED(sb-root, 4096))
  +   printk(KERN_WARNING BTRFS: tree_root block unaligned: %llu\n,
  +   sb-root);
  +   if (!IS_ALIGNED(sb-chunk_root, 4096))
  +   printk(KERN_WARNING BTRFS: tree_root block unaligned: %llu\n,
  +   sb-chunk_root);
  +   if (!IS_ALIGNED(sb-log_root, 4096))
  +   printk(KERN_WARNING BTRFS: tree_root block unaligned: %llu\n,
  +   sb-log_root);
 1) it is better not to call IS_ALIGNED to immediate value//.
 Although current btrfs implement ensures that all sectorsize is larger 
 equal than page_size,
 but Chandan Rajendra is trying to support subpage-sized blocksize,
 which may cause false alert later.

The patch reflects current state, so when the subpage blocksize patches
are merged, this will have to be changed accordingly.

 It would be much better using btrfs_super_sectorsize() instead to 
 improve extendability.

See the comment above, we don't trust the superblock yet and cannot use
the sectorsize reliably.

 2) missing endian convert.
 On big endian system it would be a disaster
 btrfs_super_* marco should be used.

Thanks, will fix it.

  +   if (memcmp(fs_info-fsid, sb-dev_item.fsid, BTRFS_UUID_SIZE) != 0) {
  +   printk(KERN_ERR BTRFS: dev_item UUID does not match fsid: %pU 
  != %pU\n,
  +   fs_info-fsid, sb-dev_item.fsid);
  +   ret = -EINVAL;
  +   }
  +
  +   /*
  +* Hint to catch really bogus numbers, bitflips or so, more exact 
  checks are
  +* done later
  +*/
  +   if (sb-num_devices  (1UL  31))
  +   printk(KERN_WARNING BTRFS: suspicious number of devices: 
  %llu\n,
  +   sb-num_devices);

 What about also check the devid with sb-num_devices too?
 Every valid devid should be less equal than sb-num_devices if I am right.
 Although iterate dev_item here may be overkilled...

This could be done of course, I've tried to keep the checks very small
and using only directly accessible information. More is possible of
course.

  +
  +   if (sb-bytenr != BTRFS_SUPER_INFO_OFFSET) {
  +   printk(KERN_ERR BTRFS: super offset mismatch %llu != %u\n,
  +   sb-bytenr, BTRFS_SUPER_INFO_OFFSET);
  +   ret = -EINVAL;
  +   }
  +
  +   /*
  +* The generation is a global counter, we'll trust it more than the 
  others
  +* but it's still possible that it's the one that's wrong.
  +*/
  +   if (sb-generation  sb-chunk_root_generation)
  +   printk(KERN_WARNING
  +   BTRFS: suspicious: generation  chunk_root_generation: 
  %llu  %llu\n,
  +   sb-generation, sb-chunk_root_generation);
  +   if (sb-generation  sb-cache_generation  sb-cache_generation != 
  (u64)-1)
  +   printk(KERN_WARNING
  +   BTRFS: suspicious: generation  cache_generation: %llu 
   %llu\n,
  +   sb-generation, sb-cache_generation);
  +
  +   return ret;
}

 Still the endian problem.

Will fix, thanks.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: move struct btrfs_ioctl_defrag_range_args from ctree.h to linux/btrfs.h

2014-10-09 Thread David Sterba

On Wed, Oct 08, 2014 at 01:23:41AM +0100, Marios Titas wrote:
 include/uapi/linux/btrfs.h is a more logical place to put the struct
 btrfs_ioctl_defrag_range_args as it is being used by the
 BTRFS_IOC_DEFRAG_RANGE IOCTL which is defined in that file.
 Additionally, this is where the btrfs-progs defines that struct. Thus
 this patches reduces the gap between the btrfs-progs headers and the
 kernel headers.
 
 Signed-off-by: Marios Titas red...@gmx.com

Reviewed-by: David Sterba dste...@suse.cz
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: What is the vision for btrfs fs repair?

2014-10-09 Thread Eric Sandeen

On 10/9/14 8:49 AM, Duncan wrote:
 Austin S Hemmelgarn posted on Thu, 09 Oct 2014 09:18:22 -0400 as
 excerpted:
 
 On 2014-10-09 08:34, Duncan wrote:
 
 The only way a read-only
 mount should be writable is if it's mounted (bind-mounted or
 btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
 that mount, not the read-only mounted location.
 
 In theory yes, but there are caveats to this, namely:
 * atime updates still happen unless you have mounted the fs with noatime

Getting off the topic a bit, but that really shouldn't happen:

#define IS_NOATIME(inode)   __IS_FLG(inode, MS_RDONLY|MS_NOATIME)

and in touch_atime():

if (IS_NOATIME(inode))
return;

-Eric
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Confusion with newly converted filesystem

2014-10-09 Thread Tim Cuthbertson

I will try to explain what I have done, but also try to keep it fairly
short. I installed a Linux distro that does not support installing to
btrfs to an ext4. I ran dist-upgrade to ensure that I have the latest
btrfs-tools. I upgraded the Debian kernel from 3.13 to 3.16.3. When
all this was completed, there was something like 900 MB used on a 40
GB partition.

Next, I booted to another distro (Arch Linux) which also has the
latest kernel and btrfs-progs. I ran btrfs-convert /dev/sda6. When I
rebooted to the new Debian system, the btrfs was mounted read-only.
btrfs fi show / showed all 40 GB as used. I did some internet
research, then I remounted the filesystem as rw and added another 40
GB partition on a separate disk drive. Then I ran btrfs balance start
-dusage=30. This seemed to stabilize the filesystem to the point that
it is usable.

I proceeded with my original plan, which was to make it a two-drive
RAID filesystem, using -dconvert =raid0 -mconvert=raid1. This
succeeded, but the data and metadata usage stats still look all out of
whack. After several rebalance attempts, my usage stats look like the
following:

btrfs fi show / shows a total usage of 1.76 GB, with 40 GB allocated
and 14.03 GB used on each device. btrfs fi df / shows total data of
2 GB allocated with 1.69 GB used and metadata of 13 GB total with
72.41 MB used.

Why is 13 GB needed for 72 MB of metadata? Is there any understandable
way to fix this? I am not a newbie, but am by no means an expert with
btrfs

Thank you,
Tim
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Two uncorrectable errors across RAID1 at same logical block?

2014-10-09 Thread Rich Rauenzahn


On 10/9/2014 12:13 AM, Liu Bo wrote:

sudo ./btrfs inspect-internal logical-resolve -v 58464632832  /


$ sudo ./btrfs inspect-internal logical-resolve -v 58464632832  /
ioctl ret=0, total_size=4096, bytes_left=4080, bytes_missing=0, cnt=0, 
missed=0


I also tried -P and -s 1 

Also did this:

$ sudo ./btrfs-map-logical  -l 58464632832   -o /tmp/58464632832 /dev/sdf3
mirror 1 logical 58464632832 physical 1536393216 device /dev/sdg3
mirror 2 logical 58464632832 physical 58464632832 device /dev/sdf3

And looked at the 4k block.  strings doesn't show anything useful: +V0T
File doesn't recognize it as anything particular.

Weird.

I have one other clue which I think is irrelevant.  I had another error 
on a different drive/different fs and it turned out to be the vmem file 
for a virtual machine under vmware workstation.  I deleted the file 
since it was just the memory image and the error went away.  It was easy 
to map the bad block to the file from dmesg and the inode.   I may have 
also created a vm at some point on this drive we're looking at now and 
then moved it.  So I think that information is not relevant... but maybe 
you've seen this before.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Fwd: Confusion with newly converted filesystem

2014-10-09 Thread Tim Cuthbertson

Never mind. I have stumbled my way into a solution.

I ran btrfs subv delete /ext2_saved. Then I ran btrfs balance start
/. That relocated 15 of 15 chunks. Now fi show shows 2.03 GB used on
each device and fi df shows 1 GB of metadata total.

Apparently, that saved ext4 subvolume was a real mess.

Tim

-- Forwarded message --
From: Tim Cuthbertson ratch...@gmail.com
Date: Thu, Oct 9, 2014 at 11:41 AM
Subject: Confusion with newly converted filesystem
To: linux-btrfs@vger.kernel.org


I will try to explain what I have done, but also try to keep it fairly
short. I installed a Linux distro that does not support installing to
btrfs to an ext4. I ran dist-upgrade to ensure that I have the latest
btrfs-tools. I upgraded the Debian kernel from 3.13 to 3.16.3. When
all this was completed, there was something like 900 MB used on a 40
GB partition.

Next, I booted to another distro (Arch Linux) which also has the
latest kernel and btrfs-progs. I ran btrfs-convert /dev/sda6. When I
rebooted to the new Debian system, the btrfs was mounted read-only.
btrfs fi show / showed all 40 GB as used. I did some internet
research, then I remounted the filesystem as rw and added another 40
GB partition on a separate disk drive. Then I ran btrfs balance start
-dusage=30. This seemed to stabilize the filesystem to the point that
it is usable.

I proceeded with my original plan, which was to make it a two-drive
RAID filesystem, using -dconvert =raid0 -mconvert=raid1. This
succeeded, but the data and metadata usage stats still look all out of
whack. After several rebalance attempts, my usage stats look like the
following:

btrfs fi show / shows a total usage of 1.76 GB, with 40 GB allocated
and 14.03 GB used on each device. btrfs fi df / shows total data of
2 GB allocated with 1.69 GB used and metadata of 13 GB total with
72.41 MB used.

Why is 13 GB needed for 72 MB of metadata? Is there any understandable
way to fix this? I am not a newbie, but am by no means an expert with
btrfs

Thank you,
Tim
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] Btrfs: make inode.c:compress_file_range() return void

Its return value is useless, its single caller ignores it and can't do
anything with it anyway, since it's a workqueue task and not the task
calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
Failure is communicated to such functions via start and end of writeback
with the respective pages tagged with an error and AS_EIO flag set in the
inode's imapping.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/inode.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b91a171..aef0fa3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -382,7 +382,7 @@ static inline int inode_need_compress(struct inode *inode)
  * are written in the same order that the flusher thread sent them
  * down.
  */
-static noinline int compress_file_range(struct inode *inode,
+static noinline void compress_file_range(struct inode *inode,
struct page *locked_page,
u64 start, u64 end,
struct async_cow *async_cow,
@@ -621,8 +621,7 @@ cleanup_and_bail_uncompressed:
*num_added += 1;
}
 
-out:
-   return ret;
+   return;
 
 free_pages_out:
for (i = 0; i  nr_pages_ret; i++) {
@@ -630,8 +629,6 @@ free_pages_out:
page_cache_release(pages[i]);
}
kfree(pages);
-
-   goto out;
 }
 
 static void free_async_extent_pages(struct async_extent *async_extent)
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] Btrfs: report error after failure inlining extent in compressed write path

If cow_file_range_inline() failed, when called from compress_file_range(),
we were tagging the locked page for writeback, end its writeback and unlock it,
but not marking it with an error nor setting AS_EIO in inode's mapping flags.

This made it impossible for a caller of filemap_fdatawrite_range (writepages)
or filemap_fdatawait_range() to know that an error happened. And the return
value of compress_file_range() is useless because it's returned to a workqueue
task and not to the task calling filemap_fdatawrite_range (writepages).

This change applies on top of the previous patchset starting at the patch
titled:

[1/5] Btrfs: set page and mapping error on compressed write failure

Which changed extent_clear_unlock_delalloc() to use SetPageError and
mapping_set_error().

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7635b1d..b91a171 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -538,6 +538,7 @@ cont:
 clear_flags, PAGE_UNLOCK |
 PAGE_CLEAR_DIRTY |
 PAGE_SET_WRITEBACK |
+PAGE_SET_ERROR |
 PAGE_END_WRITEBACK);
goto free_pages_out;
}
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] Btrfs: correctly flush compressed data before/after direct IO

For compressed writes, after doing the first filemap_fdatawrite_range() we
don't get the pages tagged for writeback immediately. Instead we create
a workqueue task, which is run by other kthread, and keep the pages locked.
That other kthread compresses data, creates the respective ordered extent/s,
tags the pages for writeback and unlocks them. Therefore we need a second
call to filemap_fdatawrite_range() if we have compressed writes, as this
second call will wait for the pages to become unlocked, then see they became
tagged for writeback and finally wait for the writeback to finish.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/file.c  | 12 +++-
 fs/btrfs/inode.c | 16 +---
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 29b147d..82c7229 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1692,8 +1692,18 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb,
err = written_buffered;
goto out;
}
+   /*
+* Ensure all data is persisted. We want the next direct IO read to be
+* able to read what was just written.
+*/
endbyte = pos + written_buffered - 1;
-   err = filemap_write_and_wait_range(file-f_mapping, pos, endbyte);
+   err = filemap_fdatawrite_range(file-f_mapping, pos, endbyte);
+   if (!err  test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+BTRFS_I(file_inode(file))-runtime_flags))
+   err = filemap_fdatawrite_range(file-f_mapping, pos, endbyte);
+   if (err)
+   goto out;
+   err = filemap_fdatawait_range(file-f_mapping, pos, endbyte);
if (err)
goto out;
written += written_buffered;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index aef0fa3..752ff18 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7052,9 +7052,19 @@ static int lock_extent_direct(struct inode *inode, u64 
lockstart, u64 lockend,
btrfs_put_ordered_extent(ordered);
} else {
/* Screw you mmap */
-   ret = filemap_write_and_wait_range(inode-i_mapping,
-  lockstart,
-  lockend);
+   ret = filemap_fdatawrite_range(inode-i_mapping,
+  lockstart,
+  lockend);
+   if (!ret  test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+BTRFS_I(inode)-runtime_flags))
+   ret = filemap_fdatawrite_range(inode-i_mapping,
+  lockstart,
+  lockend);
+   if (ret)
+   break;
+   ret = filemap_fdatawait_range(inode-i_mapping,
+ lockstart,
+ lockend);
if (ret)
break;
 
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] Btrfs: add helper btrfs_fdatawrite_range

To avoid duplicating this double filemap_fdatawrite_range() call for
inodes with async extents (compressed writes) so often.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/ctree.h|  1 +
 fs/btrfs/file.c | 36 
 fs/btrfs/inode.c|  9 +
 fs/btrfs/ordered-data.c | 24 ++--
 4 files changed, 32 insertions(+), 38 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 089f6da..4e0ad8c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3896,6 +3896,7 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct 
inode *inode,
  struct page **pages, size_t num_pages,
  loff_t pos, size_t write_bytes,
  struct extent_state **cached);
+int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 82c7229..2df1dce 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1697,10 +1697,7 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb,
 * able to read what was just written.
 */
endbyte = pos + written_buffered - 1;
-   err = filemap_fdatawrite_range(file-f_mapping, pos, endbyte);
-   if (!err  test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-BTRFS_I(file_inode(file))-runtime_flags))
-   err = filemap_fdatawrite_range(file-f_mapping, pos, endbyte);
+   err = btrfs_fdatawrite_range(file-f_mapping, pos, endbyte);
if (err)
goto out;
err = filemap_fdatawait_range(file-f_mapping, pos, endbyte);
@@ -1864,10 +1861,7 @@ static int start_ordered_ops(struct inode *inode, loff_t 
start, loff_t end)
int ret;
 
atomic_inc(BTRFS_I(inode)-sync_writers);
-   ret = filemap_fdatawrite_range(inode-i_mapping, start, end);
-   if (!ret  test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-BTRFS_I(inode)-runtime_flags))
-   ret = filemap_fdatawrite_range(inode-i_mapping, start, end);
+   ret = btrfs_fdatawrite_range(inode-i_mapping, start, end);
atomic_dec(BTRFS_I(inode)-sync_writers);
 
return ret;
@@ -2820,3 +2814,29 @@ int btrfs_auto_defrag_init(void)
 
return 0;
 }
+
+int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end)
+{
+   int ret;
+
+   /*
+* So with compression we will find and lock a dirty page and clear the
+* first one as dirty, setup an async extent, and immediately return
+* with the entire range locked but with nobody actually marked with
+* writeback.  So we can't just filemap_write_and_wait_range() and
+* expect it to work since it will just kick off a thread to do the
+* actual work.  So we need to call filemap_fdatawrite_range _again_
+* since it will wait on the page lock, which won't be unlocked until
+* after the pages have been marked as writeback and so we're good to go
+* from there.  We have to do this otherwise we'll miss the ordered
+* extents and that results in badness.  Please Josef, do not think you
+* know better and pull this out at some point in the future, it is
+* right and you are wrong.
+*/
+   ret = filemap_fdatawrite_range(inode-i_mapping, start, end);
+   if (!ret  test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+BTRFS_I(inode)-runtime_flags))
+   ret = filemap_fdatawrite_range(inode-i_mapping, start, end);
+
+   return ret;
+}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 752ff18..be955481 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7052,14 +7052,7 @@ static int lock_extent_direct(struct inode *inode, u64 
lockstart, u64 lockend,
btrfs_put_ordered_extent(ordered);
} else {
/* Screw you mmap */
-   ret = filemap_fdatawrite_range(inode-i_mapping,
-  lockstart,
-  lockend);
-   if (!ret  test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-BTRFS_I(inode)-runtime_flags))
-   ret = filemap_fdatawrite_range(inode-i_mapping,
-  lockstart,
-  lockend);
+   ret = btrfs_fdatawrite_range(inode, lockstart, lockend);
if (ret)
break;
ret = filemap_fdatawait_range(inode-i_mapping,
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index ac734ec..1401b1a 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -725,30

Re: What is the vision for btrfs fs repair?

2014-10-09 Thread Chris Murphy

On Oct 8, 2014, at 3:11 PM, Eric Sandeen sand...@redhat.com wrote:

I was looking at Marc's post:

http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html

and it feels like there isn't exactly a cohesive, overarching vision for
repair of a corrupted btrfs filesystem.

It's definitely confusing compared to any other filesystem I've used on four
different platforms. And that's when excluding scraping and the functions
unique to any multiple device volume: scrubs, degraded mount.

To be fair, mdadm doesn't even have a scrub command, it's done via 'echo check
/sys/block/mdX/md/sync_action'. And meanwhile LVM has pvck, vgck, and for
scrubs it's lvchange --syncaction {check|repair}. These are also completely
non-obvious.

* mount -o recovery
Enable autorecovery attempts if a bad tree root is found at mount
time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a
pace that suggests something could sneak in that makes things worse? It is
almost an oxymoron in that I'm manually enabling an autorecovery

If true, maybe the closest indication we'd get of btrfs stablity is the
default enabling of autorecovery.

* btrfs-zero-log
remove the log tree if log tree is corrupt
* btrfs rescue
Recover a damaged btrfs filesystem
chunk-recover
super-recover
How does this relate to btrfs check?
* btrfs check
repair a btrfs filesystem
--repair
--init-csum-tree
--init-extent-tree
How does this relate to btrfs rescue?

These three translate into eight combinations of repairs, adding -o recovery
there are 9 combinations. I think this is the main source of confusion, there
are just too many options, but also it's completely non-obvious which one to
use in which situation.

My expectation is that eventually these get consolidated into just check and
check --repair. As the repair code matures, it'd go into kernel autorecovery
code. That's a guess on my part, but it's consistent with design goals.

It feels like recovery tools have been badly splintered, and if there's an
overarching design or vision for btrfs fs repair, I can't tell what it is.
Can anyone help me?

I suspect it's unintended splintering, and is an artifact that will go away.
I'd rather the convoluted, fractured nature of repair go away before the scary
experimental warnings do.

Chris Murphy--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: Fwd: Confusion with newly converted filesystem

Tim Cuthbertson posted on Thu, 09 Oct 2014 13:58:58 -0500 as excerpted:

 I ran btrfs subv delete /ext2_saved. Then I ran btrfs balance start
 /.
 That relocated 15 of 15 chunks. Now fi show shows 2.03 GB used on each
 device and fi df shows 1 GB of metadata total.
 
 Apparently, that saved ext4 subvolume was a real mess.

Yes and no.

The problem is that ext4 and btrfs work rather differently from each 
other, and btrfs can't manage the saved ext4 subvolume as it would normal 
btrfs subvolumes because doing so would break the ext4 side, killing the 
ability to roll-back to ext4 that's the whole point of keeping that 
dedicated subvolume.

So once you are sure you aren't going to roll-back, deleting the ext4 
saved subvolume, thus allowing btrfs to manage the entire filesystem 
without the previously ext4 stuff getting in the way, is high priority.

IOW the conversion, like many conversions, is a compromise.  It serves a 
certain purpose, but until the legacy stuff is gone, the new stuff is 
hobbled and can't be used to full effect.

So yes, any btrfs converted from ext4 is going to be a real mess, in 
btrfs terms, until that ext4 saved subvolume is deleted, because it 
simply can't manage it like it can native btrfs since doing so would 
break the ability to roll back to the ext4.  But it's an expected mess, 
and it's only a mess because the native formats differ.  The ext4 image 
can be just fine in ext4, and when it is removed, btrfs is normally just 
fine as well.  It's just the btrfs with the ext4 image still there that's 
a problem, and that only because the ext4 image isn't really playing by 
btrfs native rules, so btrfs can't properly manage it.


BTW, if it was letting you balance without an error than you probably 
didn't run into this particular problem that often happens with ext* 
conversions, likely because the filesystem was new and basically all 
relatively small (under 1 GiB) distro files, but it's worth knowing about 
and doing the one additional step, just to be sure, plus for possible 
future conversions.

With ext4, extent size is effectively unlimited.  A full 4.7 GiB DVD ISO 
image file, for instance, properly defragged, can appear as a single 4.7 
GiB extent.  No problem on ext4 and in fact that'd be the ideal.

On btrfs, by contrast, data chunk size, and thus largest possible extent 
size, is 1 GiB.  That 4.7 GiB DVD ISO image would have to be broken up 
into at least five extents, four of a full GiB each plus the sub-GiB 
remainder of the file.  In practice it'd likely be six extents, the 
beginning of the file using what was left of the current data chunk, four 
complete 1 GiB data chunks, and whatever was left beginning a sixth data 
chunk, that would eventually be filled with other file data as well.

Of course the same thing applies to other large files, whatever their 
content and size.  Big VM images are one example, big database files 
another, big multi-gig archive files yet another, big non-ISO media files 
again another.

As a result, people with these sorts of large files on their originating 
ext4 filesystem tend to run into problems with btrfs balance, etc, after 
the conversion, because btrfs balance expects to see extents no larger 
than the btrfs-native 1 GiB data chunk, and doesn't know what to do with 
these  1 GiB super-extents.

On converted btrfs with this sort of file, balance will simply error out 
while the ext4 saved subvolume remains, and normally even after it is 
gone, until a btrfs filesystem defrag is run on the former ext4 content 
to break up these super-extents into 1 GiB maximum native btrfs data 
chunks that btrfs in general and btrfs balance in particular can actually 
handle.

Since you didn't run into this problem, you evidently either didn't have 
any of these  1 GiB files, not surprising on a fresh install, or if you 
did, they were already fragmented enough for btrfs balance to handle.

However, I'd still recommend doing a proper btrfs filesystem defrag and 
then another balance, the combination of which should ensure that every 
last bit of what remains of the ext4 formatting is properly converted to 
btrfs native.  Given that you already completed a balance the defrag and 
rebalance may not matter, but better to do it unnecessarily now and be 
sure, than to run into problems and /wish/ you had done so later.

Additionally, doing it now, before you add too many additional files to 
the filesystem, will be easier and take less time than doing it later.


One more tip while we're talking about defrag:  If you don't have any big 
( half a GiB) files to deal with, or if you do but they're all 
essentially static files (like already written media files that aren't 
going to be edited in-place), I'd strongly recommend using btrfs' 
autodefrag mount option, which I use on all my btrfs here.

OTOH, for large internal rewrite pattern files such as active VM image 
files, big database files, even big torrented files until they're fully 
downloaded

Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map

2014-10-09 Thread Qu Wenruo

 Original Message 
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to 
insert best fitted extent map

From: Filipe David Manana fdman...@gmail.com
To: Qu Wenruo quwen...@cn.fujitsu.com
Date: 2014年10月09日 18:27

On Thu, Oct 9, 2014 at 1:28 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

 Original Message 
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert
best fitted extent map
From: Filipe David Manana fdman...@gmail.com
To: Qu Wenruo quwen...@cn.fujitsu.com
Date: 2014年10月08日 20:08

On Fri, Sep 19, 2014 at 1:31 AM, Qu Wenruo quwen...@cn.fujitsu.com
wrote:

 Original Message 
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to
insert
best fitted extent map
From: Filipe David Manana fdman...@gmail.com
To: Qu Wenruo quwen...@cn.fujitsu.com
Date: 2014年09月18日 21:16

On Wed, Sep 17, 2014 at 4:53 AM, Qu Wenruo quwen...@cn.fujitsu.com
wrote:

The following commit enhanced the merge_extent_mapping() to reduce
fragment in extent map tree, but it can't handle case which existing
lies before map_start:
51f39 btrfs: Use right extent length when inserting overlap extent map.

[BUG]
When existing extent map's start is before map_start,
the em-len will be minus, which will corrupt the extent map and fail
to
insert the new extent map.
This will happen when someone get a large extent map, but when it is
going to insert it into extent map tree, some one has already commit
some write and split the huge extent into small parts.

This sounds like very deterministic to me.
Any reason to not add tests to the sanity tests that exercise
this/these case/cases?

Yes, thanks for the informing.
Will add the test case for it soon.

Hi Qu,

Any progress on the test?

This is a very important one IMHO, not only because of the bad
consequences of the bug (extent map corruption, leading to all sorts
of chaos), but also because this problem was not found by the full
xfstests suite on several developer machines.

thanks

Still trying to reproduce it under xfstest framework.

That's the problem, wasn't apparently reproducible (or detectable at
least) by anyone with xfstests.
I'll try to build a C program to behave the same of filebench and to see 
if it works.
At least with filebench, it can be triggered in 60s with 100% 
possibility to reproduce.

But even followiiing the FileBench randomrw behavior(1 thread random read 1
thread random write on preallocated space),
I still failed to reproduce it.

Still investigating how to reproduce it.
Worst case may be add a new C program into src of xfstests?

How about the sanity tests (fs/btrfs/tests/*.c)? Create an empty map
tree, add some extent maps, then try to merge some new extent maps
that used to fail before this fix. Seems simple, no?

thanks Qu
It needs concurrency read and write(commit) to trigger it, I am not sure 
it can be reproduced in sanity tests

since it seems not commit things and lacks multithread facility.

I'll give it a try on the filebench-behavior C program first, and then 
sanity tests if former doesn't work at all

Thanks,
Qu

Thanks,
Qu

Thanks,
Qu

Thanks

[REPRODUCER]
It is very easy to tiger using filebench with randomrw personality.
It is about 100% to reproduce when using 8G preallocated file in 60s
randonrw test.

[FIX]
This patch can now handle any existing extent position.
Since it does not directly use existing-start, now it will find the
previous and next extent around map_start.
So the old existing-start  map_start bug will never happen again.

[ENHANCE]
This patch will insert the best fitted extent map into extent map tree,
other than the oldest [map_start, map_start + sectorsize) or the
relatively newer but not perfect [map_start, existing-start).

The patch will first search existing extent that does not intersects
with
the desired map range [map_start, map_start + len).
The existing extent will be either before or behind map_start, and
based
on the existing extent, we can find out the previous and next extent
around map_start.

So the best fitted extent would be [prev-end, next-start).
For prev or next is not found, em-start would be prev-end and em-end
wold be next-start.

With this patch, the fragment in extent map tree should be reduced much
more than the 51f39 commit and reduce an unneeded extent map tree
search.

Reported-by: Tsutomu Itoh t-i...@jp.fujitsu.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
fs/btrfs/inode.c | 79

1 file changed, 57 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 016c403..8039021 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6191,21 +6191,60 @@ out_fail_inode:
   goto out_fail;
}

+/* Find next extent map of a given extent map, caller needs to ensure
locks */
+static struct extent_map *next_extent_map(struct extent_map *em)
+{
+   struct rb_node *next;
+
+   next =

Re: What is the vision for btrfs fs repair?