Re: btrfs random filesystem corruption in kernel 3.17

2014-10-14 Thread admin

Summarizing what I've seen on the threads...


First of all, many thanks for summarizing the info.


1) The bug seems to be read-only snapshot related.  The connection to
send is that send creates read-only snapshots, but people creating
read-only snapshots for other purposes are now reporting the same
problem, so it's not send, it's the read-only snapshots.


In fact, send does not create a read-only snapshot; the snapshots are
created manually prior to calling send.



2) Writable snapshots haven't been implicated yet, and the working set
from which the snapshots are taken doesn't seem to be affected, either.
So in that sense it's not affecting ordinary usage, only the read-only
snapshots themselves.

3) More problematic, however, is the fact that these apparently
corrupted read-only snapshots often are not listed properly and can't be
deleted, tho I'm not sure if that's /all/ the corrupted snapshots or
only part of them. So while it may not affect ordinary operation in the
short term, over time until there's a fix, people routinely doing
read-only snapshots are going to be getting more and more of these
undeletable snapshots, and depending on whether the eventual patch only
prevents more or can actually fix the bad ones (possibly via btrfs check
or the like), affected filesystems may ultimately have to be blown away
and recreated with a fresh mkfs, in order to kill the currently
undeletable snapshots.


So the first thing to do would be to shut off whatever's making
read-only snapshots, so you don't make the problem worse while it's
being investigated.  For those who can do that without too big an
interruption to their normal routine (who don't depend on send/receive,
for instance), just keep it off for the time being.  For those who
depend on read-only snapshots (send-receive for backup and the data is
too valuable to not do the backups for a few days), consider switching
back to 3.16-stable -- from 3.16.3 at least, the patch for the compress
bug is there, so that shouldn't be a problem.

And if you're affected, be aware that until we have a fix, we don't know
if it'll be possible to remove the affected and currently undeletable
snapshots.  If it's not, at some point you'll need to do a fresh
mkfs.btrfs, to get rid of the damage.  Since the bug doesn't appear to
affect writable snapshots or the head from which snapshots are made,
it's not urgent, and a full fix is likely to include a patch to detect
and fix the problem as well, but until we know what the problem is we
can't be sure of that, so be prepared to do that mkfs at some point, as
at this point it's possible that's the only way you'll be able to kill
the corrupted snapshots.


I don't agree with you concerning the not urgent part. In my opinion, 
any problem leading to filesystem or other data corruption should be 
considered as urgent, at least as long as it isn't known what exactly is 
affected and whether there is a simple way to salvage the corruption 
without going the backup/restore route.



4) Total speculation on my part, but given the wanted transid (aka
generation, in different contexts) is significantly lower than the found
transid, and the fact that the problem appears to be limited to
/read-only/ snapshots, my first suspicion is that something's getting
updated that would normally apply to all snapshots, but the read-only
nature of the snapshots is preventing the full update there.  The
transid of the block is updated, but the snapshot being read-only is
preventing update of the pointer in that snapshot accordingly.

What I do /not/ know is whether the bug is that something's getting
updated that should NOT be, and it's simply the read-only snapshots
letting us know about it since the writable snapshots are fully updated,
even if that breaks the snapshot (breaking writable snapshots in a
different and currently undetected way), or if instead, it's a
legitimate update, like a balance simply moving the snapshot around but
not affecting it otherwise, and the bug is that the read-only snapshots
aren't allowing the legitimate update.

Either way, this more or less developed over the weekend, and it's
Monday now, so the devs should be on it.  If it's anything like the
3.15/3.16 compression bug, it'll take some time for them to properly
trace it, and then to figure out an appropriate fix, but they will.
Chances are we'll have at least some decent progress on a trace by
Friday, and maybe even a good-to-go patch. =:^)

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs random filesystem corruption in kernel 3.17

2014-10-14 Thread David Arendt
The corruption seems to be worse than expected. In kernel 3.16.5 I
cannot mount this filesystem read/write.


I'm in the process of doing a tar - mkfs.btrfs - untar recovery and am
staying on 3.16.5 for now.


[   55.465584] parent transid verify failed on 51150848 wanted 272368 found 276401
[   55.468415] parent transid verify failed on 918274048 wanted 273135 found 274590
[   55.470915] parent transid verify failed on 508444672 wanted 274054 found 276617
[   55.473758] parent transid verify failed on 18317623296 wanted 275876 found 278431
[   55.476240] parent transid verify failed on 127254528 wanted 276488 found 276490
[   55.479494] [ cut here ]
[   55.479499] WARNING: CPU: 1 PID: 1723 at fs/btrfs/extent-tree.c:876 btrfs_lookup_extent_info+0x44c/0x490()
[   55.479500] Modules linked in:
[   55.479502] CPU: 1 PID: 1723 Comm: ls Not tainted 3.16.5 #1
[   55.479502] Hardware name: ASUS All Series/H87M-PRO, BIOS 2101 07/21/2014
[   55.479503]   0009 816ff873 
[   55.479504]  81078261 8807f7084770 8807ed8ca000 3dcf4000
[   55.479506]  8807f7133de0  812be9bc 4000
[   55.479507] Call Trace:
[   55.479511]  [816ff873] ? dump_stack+0x41/0x51
[   55.479514]  [81078261] ? warn_slowpath_common+0x81/0xb0
[   55.479515]  [812be9bc] ? btrfs_lookup_extent_info+0x44c/0x490
[   55.479516]  [812c4998] ? btrfs_alloc_free_block+0x2c8/0x450
[   55.479519]  [812af7df] ? update_ref_for_cow+0x1ff/0x3f0
[   55.479520]  [812afc0a] ? __btrfs_cow_block+0x23a/0x5a0
[   55.479522]  [812d14fd] ? btrfs_buffer_uptodate+0x6d/0x80
[   55.479524]  [812b0136] ? btrfs_cow_block+0x126/0x190
[   55.479525]  [812b43bd] ? btrfs_search_slot+0x1fd/0xaa0
[   55.479527]  [812e07a3] ? btrfs_truncate_inode_items+0x123/0x8e0
[   55.479529]  [812e204a] ? btrfs_evict_inode+0x32a/0x490
[   55.479532]  [8112e02a] ? unlock_new_inode+0x3a/0x60
[   55.479533]  [8113abb5] ? __inode_wait_for_writeback+0x65/0xb0
[   55.479536]  [810a8f70] ? wake_atomic_t_function+0x30/0x30
[   55.479537]  [8112f276] ? evict+0xa6/0x160
[   55.479539]  [812e2c2d] ? btrfs_orphan_cleanup+0x1ed/0x430
[   55.479540]  [812e31c8] ? btrfs_lookup_dentry+0x358/0x4c0
[   55.479542]  [812e3339] ? btrfs_lookup+0x9/0x30
[   55.479543]  [8111f6c4] ? lookup_real+0x14/0x50
[   55.479545]  [81120292] ? __lookup_hash+0x32/0x50
[   55.479546]  [81120938] ? lookup_slow+0x48/0xc0
[   55.479547]  [811227bc] ? path_lookupat+0x73c/0x770
[   55.479550]  [81164860] ? posix_acl_xattr_get+0x40/0xb0
[   55.479551]  [81137a80] ? generic_getxattr+0x50/0x80
[   55.479552]  [8112281e] ? filename_lookup.isra.51+0x2e/0x90
[   55.479554]  [8112553f] ? user_path_at_empty+0x5f/0xb0
[   55.479555]  [81125549] ? user_path_at_empty+0x69/0xb0
[   55.479556]  [8111b690] ? vfs_fstatat+0x40/0x90
[   55.479557]  [8111b862] ? SyS_newlstat+0x12/0x30
[   55.479559]  [8111f89d] ? path_put+0xd/0x20
[   55.479560]  [81138ab7] ? SyS_getxattr+0x57/0x80
[   55.479562]  [817053d2] ? system_call_fastpath+0x16/0x1b
[   55.479563] ---[ end trace a8ad56fd476f7474 ]---
[   55.479564] BTRFS: error (device sda2) in update_ref_for_cow:1018: errno=-30 Readonly filesystem
[   55.479565] BTRFS info (device sda2): forced readonly
[   55.479565] [ cut here ]
[   55.479567] WARNING: CPU: 1 PID: 1723 at fs/btrfs/super.c:259 __btrfs_abort_transaction+0x5a/0x140()
[   55.479567] BTRFS: Transaction aborted (error -30)
[   55.479568] Modules linked in:
[   55.479569] CPU: 1 PID: 1723 Comm: ls Tainted: GW 3.16.5 #1
[   55.479569] Hardware name: ASUS All Series/H87M-PRO, BIOS 2101 07/21/2014
[   55.479570]   0009 816ff873 8807f2dcf788
[   55.479571]  81078261 ffe2 8807ed8ca000 8807f7133de0
[   55.479572]  8184d800 0488 81078345 8197afd8
[   55.479573] Call Trace:
[   55.479574]  [816ff873] ? dump_stack+0x41/0x51
[   55.479576]  [81078261] ? warn_slowpath_common+0x81/0xb0
[   55.479578]  [81078345] ? warn_slowpath_fmt+0x45/0x50
[   55.479579]  [812aa41a] ? __btrfs_abort_transaction+0x5a/0x140
[   55.479580]  [812afe02] ? __btrfs_cow_block+0x432/0x5a0
[   55.479582]  [812d14fd] ? btrfs_buffer_uptodate+0x6d/0x80
[   55.479583]  [812b0136] ? btrfs_cow_block+0x126/0x190
[   55.479584]  [812b43bd] ? btrfs_search_slot+0x1fd/0xaa0
[   55.479586]  [812e07a3] ? btrfs_truncate_inode_items+0x123/0x8e0
[   55.479587]  [812e204a] ? btrfs_evict_inode+0x32a/0x490
[   55.479588]  [8112e02a] ? unlock_new_inode+0x3a/0x60
[   55.479590]  

Re: btrfs random filesystem corruption in kernel 3.17

2014-10-14 Thread Duncan
admin posted on Tue, 14 Oct 2014 13:17:41 +0200 as excerpted:

 And if you're affected, be aware that until we have a fix, we don't
 know if it'll be possible to remove the affected and currently
 undeletable snapshots.  If it's not, at some point you'll need to do a
 fresh mkfs.btrfs, to get rid of the damage.  Since the bug doesn't
 appear to affect writable snapshots or the head from which snapshots
 are made, it's not urgent, and a full fix is likely to include a patch
 to detect and fix the problem as well, but until we know what the
 problem is we can't be sure of that, so be prepared to do that mkfs at
 some point, as at this point it's possible that's the only way you'll
 be able to kill the corrupted snapshots.
 
 I don't agree with you concerning the not urgent part. In my opinion,
 any problem leading to filesystem or other data corruption should be
 considered as urgent, at least as long as it isn't known what exactly is
 affected and whether there is a simple way to salvage the corruption
 without going the backup/restore route.

I shouldn't have used a pronoun there as it wasn't clear.

By it, I didn't mean the bug, which I agree is urgent for the reasons 
you state, but the mkfs.  Since there's currently no fix for the bug but 
it (the bug) seems to be limited to read-only snapshots at this point, 
_doing_the_mkfs_ isn't urgent.  With the damage limited to the read-only 
snapshots, you don't have to drop everything and do a mkfs _right_now_ to 
be rid of it.

But at some point, presumably after a fix is in place, since the damaged 
snapshots aren't currently always deletable, if the fix only prevents new 
damage from occurring and doesn't provide a way to fix the damaged ones, 
then mkfs would be the only way to do so.  With the damage limited to 
those snapshots and not spreading to normal writable snapshots or the 
working copy, dropping everything to do that mkfs isn't urgent, but it 
(the mkfs) will need to be done at some point to clear the undeletable 
snapshots, again, assuming the fix doesn't provide a way to get rid of 
them (the currently undeletable snapshots).

That's what I meant.  Yes the bug is urgent.  Doing a mkfs _right_now_ to 
get rid of the damage, not so much, because by all accounts so far the 
damage is limited to those read-only snapshots and isn't affecting 
ordinary writable snapshots or the working copies.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: btrfs random filesystem corruption in kernel 3.17

2014-10-14 Thread Robert White

On 10/14/2014 02:35 PM, Duncan wrote:

But at some point, presumably after a fix is in place, since the damaged
snapshots aren't currently always deletable, if the fix only prevents new
damage from occurring and doesn't provide a way to fix the damaged ones,
then mkfs would be the only way to do so.  With the damage limited to
those snapshots and not spreading to normal writable snapshots or the
working copy, dropping everything to do that mkfs isn't urgent, but it
(the mkfs) will need to be done at some point to clear the undeletable
snapshots, again, assuming the fix doesn't provide a way to get rid of
them (the currently undeletable snapshots).



What happens if btrfs property set is used to (attempt to) promote the 
snapshot from read-only to read-write? Can the damaged snapshot then be 
subjected to scrub or btrfsck?


e.g.

btrfs property set /path/to/snapshot ro false
(maintenance here)



Re: btrfs random filesystem corruption in kernel 3.17

2014-10-14 Thread Duncan
Robert White posted on Tue, 14 Oct 2014 15:03:21 -0700 as excerpted:

 What happens if btrfs property set is used to (attempt to) promote the
 snapshot from read-only to read-write? Can the damaged snapshot then be
 subjected to scrub or btrfsck?
 
 e.g.
 
 btrfs property set /path/to/snapshot ro false (maintenance here)

Very good question not yet answered. =:^)

But it's one I can't answer as my use-case doesn't call for such 
snapshots in the first place and I don't have any to be personally 
affected by this bug, so my interest is academic.

I simply saw the big hairy thread and tried to summarize what I could get 
out of it to that point, with a bit of my own speculation as to what the 
reversed transid complaints meant.

(Since transids are normally sequential, in most corruption cases, the 
filesystem has moved on and has a higher transid that's wanted, but can 
only find an older/lower transid for something or other.  Or at least 
that's what I've seen here and what seems common in the other reports 
I've seen posted.  This bug reverses that, with an older/lower wanted 
transid, but finding a newer/higher one.  That's the strange point that 
leapt out to me and I'd guess it's a strong hint at the problem, thus my 
definitely admin-not-coder speculation on that point.)
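
The reversed-transid observation above can be checked mechanically against a
saved dmesg log. What follows is only a minimal Python sketch, assuming the
message format quoted in this thread; the function name
classify_transid_errors and the labels "found-newer" / "found-older" are made
up for illustration and are not part of any btrfs tool.

```python
import re

# Message format as quoted in this thread, e.g.
# "parent transid verify failed on 902316032 wanted 2484 found 4101".
# \s+ is used between tokens so that wrapped log lines still match.
TRANSID_RE = re.compile(
    r"parent\s+transid\s+verify\s+failed\s+on\s+(\d+)\s+wanted\s+(\d+)\s+found\s+(\d+)"
)


def classify_transid_errors(log_text):
    """Return a (bytenr, wanted, found, kind) tuple for each failure line."""
    results = []
    for match in TRANSID_RE.finditer(log_text):
        bytenr, wanted, found = (int(group) for group in match.groups())
        # "found-newer" is the unusual pattern discussed above: the block on
        # disk carries a higher transid than the pointer expects.
        kind = "found-newer" if found > wanted else "found-older"
        results.append((bytenr, wanted, found, kind))
    return results


sample = """\
[   55.465584] parent transid verify failed on 51150848 wanted 272368 found 276401
[   55.476240] parent transid verify failed on 127254528 wanted 276488 found 276490
"""

for entry in classify_transid_errors(sample):
    print(entry)
```

Every failure quoted so far in this thread would come out as "found-newer";
a log with a mix of both kinds would suggest something other than this bug
is also going on, but that is a guess, not something the thread establishes.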

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread David Arendt
From my own experience and based on what other people are saying, I
think there is a random btrfs filesystem corruption problem in kernel
3.17 at least related to snapshots, therefore I decided to post using
another subject to draw attention from people not concerned about btrfs
send to it. More information can be found in the btrfs send posts.

Did the filesystem you tried to balance contain snapshots? Read-only ones?

On 10/13/2014 07:22 PM, Rich Freeman wrote:
 On Sun, Oct 12, 2014 at 7:11 AM, David Arendt ad...@prnet.org wrote:
 This weekend I finally had time to try btrfs send again on the newly
 created fs. Now I am running into another problem:

 btrfs send returns: ERROR: send ioctl failed with -12: Cannot allocate
 memory

 In dmesg I see only the following output:

 parent transid verify failed on 21325004800 wanted 2620 found 8325

 I'm not using send at all, but I've been running into parent transid
 verify failed messages where the wanted is way smaller than the found
 when trying to balance a raid1 after adding a new drive.  Originally I
 had gotten a BUG, and after reboot the drive finished balancing
 (interestingly enough without moving any chunks to the new drive -
 just consolidating everything on the old drives), and then when I try
 to do another balance I get:
 [ 4426.987177] BTRFS info (device sdc2): relocating block group
 10367073779712 flags 17
 [ 4446.287998] BTRFS info (device sdc2): found 13 extents
 [ 4451.330887] parent transid verify failed on 10063286579200 wanted
 987432 found 993678
 [ 4451.350663] parent transid verify failed on 10063286579200 wanted
 987432 found 993678

 The btrfs program itself outputs:
 btrfs balance start -v /data
 Dumping filters: flags 0x7, state 0x0, force is off
   DATA (flags 0x0): balancing
   METADATA (flags 0x0): balancing
   SYSTEM (flags 0x0): balancing
 ERROR: error during balancing '/data' - Cannot allocate memory
 There may be more info in syslog - try dmesg | tail

 This is also on 3.17.  This may be completely unrelated, but it seemed
 similar enough to be worth mentioning.

 The filesystem otherwise seems to work fine, other than the new drive
 not having any data on it:
 Label: 'datafs'  uuid: cd074207-9bc3-402d-bee8-6a8c77d56959
     Total devices 6 FS bytes used 2.16TiB
     devid    1 size 2.73TiB used 2.40TiB path /dev/sdc2
     devid    2 size 931.32GiB used 695.03GiB path /dev/sda2
     devid    3 size 931.32GiB used 700.00GiB path /dev/sdb2
     devid    4 size 931.32GiB used 700.00GiB path /dev/sdd2
     devid    5 size 931.32GiB used 699.00GiB path /dev/sde2
     devid    6 size 2.73TiB used 0.00 path /dev/sdf2

 This is btrfs-progs-3.16.2.

 --
 Rich



Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread Rich Freeman
On Mon, Oct 13, 2014 at 4:27 PM, David Arendt ad...@prnet.org wrote:
 From my own experience and based on what other people are saying, I
 think there is a random btrfs filesystem corruption problem in kernel
 3.17 at least related to snapshots, therefore I decided to post using
 another subject to draw attention from people not concerned about btrfs
 send to it. More information can be found in the btrfs send posts.

 Did the filesystem you tried to balance contain snapshots? Read-only ones?

The filesystem contains numerous subvolumes and snapshots, many of
which are read-only.  I'm managing many with snapper.

The similarity of the transid verify errors made me think this issue
is related, and the root cause may have nothing to do with btrfs send.

As far as I can tell these errors aren't having any effect on my data
- hopefully the system is catching the problems before there are
actual disk writes/etc.

--
Rich


Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread john terragon
I think I just found a consistent simple way to trigger the problem
(at least on my system). And, as I guessed before, it seems to be
related just to readonly snapshots:

1) I create a readonly snapshot
2) I do some changes on the source subvolume for the snapshot (I'm not
sure changes are strictly needed)
3) reboot (or probably just unmount and remount. I reboot because the
fs I've problems with contains my root subvolume)

After the rebooting (or the remount) I consistently have the corruption
with the usual multitude of these in dmesg
parent transid verify failed on 902316032 wanted 2484 found 4101
and the characteristic ls -la output

drwxr-xr-x 1 root root  250 Oct 10 15:37 root
d? ? ??   ?? root-b2
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
d? ? ??   ?? root-backup

root-backup and root-b2 are both readonly whereas root-b3 is rw (and
it didn't get corrupted).
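
For anyone who wants to scan a saved listing for the same symptom, here is a
minimal Python sketch. It assumes only what the listing above shows: ls
prints '?' placeholders in the mode field when it cannot read an entry's
metadata. The function name unreadable_entries is made up for illustration,
and file names containing spaces are not handled.

```python
def unreadable_entries(ls_output):
    """Return the names from `ls -la` output that ls could not stat."""
    names = []
    for line in ls_output.splitlines():
        fields = line.split()
        # A healthy entry starts with a mode string such as 'drwxr-xr-x';
        # an unreadable one has '?' in that first field instead.
        if len(fields) >= 2 and "?" in fields[0]:
            names.append(fields[-1])
    return names


sample = """\
drwxr-xr-x 1 root root  250 Oct 10 15:37 root
d? ? ??   ?? root-b2
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
d? ? ??   ?? root-backup
"""

print(unreadable_entries(sample))
```

On the listing above this reports root-b2 and root-backup, the two read-only
snapshots, and skips the writable root-b3.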

David, maybe you can try the same steps on one of your machines?

John


Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread Rich Freeman
On Mon, Oct 13, 2014 at 4:48 PM, john terragon jterra...@gmail.com wrote:
 I think I just found a consistent simple way to trigger the problem
 (at least on my system). And, as I guessed before, it seems to be
 related just to readonly snapshots:

 1) I create a readonly snapshot
 2) I do some changes on the source subvolume for the snapshot (I'm not
 sure changes are strictly needed)
 3) reboot (or probably just unmount and remount. I reboot because the
 fs I've problems with contains my root subvolume)

 After the rebooting (or the remount) I consistently have the corruption
 with the usual multitude of these in dmesg
 parent transid verify failed on 902316032 wanted 2484 found 4101
 and the characteristic ls -la output

 drwxr-xr-x 1 root root  250 Oct 10 15:37 root
 d? ? ??   ?? root-b2
 drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
 d? ? ??   ?? root-backup

 root-backup and root-b2 are both readonly whereas root-b3 is rw (and
 it didn't get corrupted).

 David, maybe you can try the same steps on one of your machines?


Look at that.  I didn't realize it, but indeed I have a corrupted snapshot:
/data/.snapshots/5338/:
ls: cannot access /data/.snapshots/5338/snapshot: Cannot allocate memory
total 4
drwxr-xr-x 1 root root  32 Oct 11 06:09 .
drwxr-x--- 1 root root  32 Oct 11 07:42 ..
-rw--- 1 root root 135 Oct 11 06:09 info.xml
d? ? ??  ?? snapshot

Several older snapshots are fine, and those predate my 3.17 upgrade.

I noticed that this corrupted snapshot isn't even listed in my snapper lists.

btrfs su delete /data/.snapshots/5338/snapshot
Transaction commit: none (default)
ERROR: error accessing '/data/.snapshots/5338/snapshot'

Removing them appears to be problematic as well.  I might just disable
compress=lzo and go back to 3.16 to see how that goes.
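
Since the broken snapshot did not show up in the snapper lists, it may be
worth checking the snapshot directories directly. A minimal Python sketch,
assuming the directory itself can still be listed (as the ls output above
suggests) and that only the stat of the damaged entry fails; the function
name find_unreadable and its stat parameter are made up for illustration.

```python
import os


def find_unreadable(directory, stat=os.lstat):
    """Return (path, error message) pairs for entries that cannot be stat'ed.

    The stat parameter exists only so the check can be exercised without a
    damaged filesystem; by default the real os.lstat is used.
    """
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        try:
            stat(path)
        except OSError as exc:
            # On the affected snapshot above, ls reported
            # "Cannot allocate memory" at this point.
            broken.append((path, exc.strerror))
    return broken


# Harmless demonstration on the current directory; on an affected system one
# would pass something like '/data/.snapshots/5338' instead.
print(find_unreadable("."))
```

This only reads metadata and does not try to delete or repair anything.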

--
Rich


Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread Rich Freeman
On Mon, Oct 13, 2014 at 4:55 PM, Rich Freeman
r-bt...@thefreemanclan.net wrote:
 On Mon, Oct 13, 2014 at 4:48 PM, john terragon jterra...@gmail.com wrote:

 After the rebooting (or the remount) I consistently have the corruption
 with the usual multitude of these in dmesg
 parent transid verify failed on 902316032 wanted 2484 found 4101
 and the characteristic ls -la output

Sorry to double-reply, but I left this out.  I have a long string of
these early in boot as well that I never noticed before.

--
Rich


Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread john terragon
I'm using compress=no so compression doesn't seem to be related, at
least in my case. Just read-only snapshots on 3.17 (although I haven't
tried 3.16).

John


Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread David Arendt
As these two machines are running as servers for different purposes (yes,
I know that btrfs is unstable and any corruption or data loss is at my
own risk, therefore I have good backups), I want to reboot them no more
than necessary.

However, I tried to correlate my reboot times with the corruptions:

machine 1:

d? ? ?  ? ?? root.20141009.000503.backup

reboot   system boot  3.17.0   Thu Oct  9 23:20   still running
reboot   system boot  3.17.0   Tue Oct  7 21:25 - 23:18 (2+01:53)
reboot   system boot  3.17.0   Mon Oct  6 22:47 - 23:18 (3+00:31)

For this machine, corruption seems to have occurred for a snapshot
created after a reboot.


machine 2:

d? ? ??  ?? root.20141006.003239.backup
d? ? ??  ?? root.20141007.001616.backup
d? ? ??  ?? root.20141008.000501.backup
d? ? ??  ?? root.20141009.052436.backup

reboot   system boot  3.17.0   Thu Oct  9 21:31   still running
reboot   system boot  3.17.0   Tue Oct  7 21:27 - 21:30 (2+00:03)
reboot   system boot  3.17.0   Tue Oct  7 17:51 - 21:26  (03:34)
reboot   system boot  3.17.0   Sun Oct  5 23:50 - 17:50 (1+17:59)
reboot   system boot  3.17.0   Sun Oct  5 23:47 - 23:49  (00:01)
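
The comparison of snapshot names against reboot times can be automated. A
minimal Python sketch, assuming the snapshot names encode their creation time
as root.YYYYMMDD.HHMMSS.backup (as in the listings above) and that the reboot
times from last are all in 2014; the function names snapshot_time and
first_boot_after are made up for illustration.

```python
from bisect import bisect_right
from datetime import datetime


def snapshot_time(name):
    """Parse a name such as 'root.20141006.003239.backup' into a datetime."""
    _prefix, date, clock, _suffix = name.split(".")
    return datetime.strptime(date + clock, "%Y%m%d%H%M%S")


def first_boot_after(snapshot_names, boot_times):
    """Map each snapshot name to the first boot after its creation, or None."""
    boots = sorted(boot_times)
    result = {}
    for name in snapshot_names:
        index = bisect_right(boots, snapshot_time(name))
        result[name] = boots[index] if index < len(boots) else None
    return result


# Machine 2, copied from the listing above (year assumed to be 2014).
boots = [
    datetime(2014, 10, 9, 21, 31),
    datetime(2014, 10, 7, 21, 27),
    datetime(2014, 10, 7, 17, 51),
    datetime(2014, 10, 5, 23, 50),
    datetime(2014, 10, 5, 23, 47),
]
corrupted = [
    "root.20141006.003239.backup",
    "root.20141007.001616.backup",
    "root.20141008.000501.backup",
    "root.20141009.052436.backup",
]

for name, boot in first_boot_after(corrupted, boots).items():
    print(name, "-> next boot:", boot)
```

For machine 2 every corrupted snapshot has at least one later reboot, which
fits the snapshot-then-reboot sequence John described, although it does not
prove that the reboot is what triggers the damage.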

Over the next few days, I will set up a virtual machine to do more tests.

On 10/13/2014 10:48 PM, john terragon wrote:
 I think I just found a consistent simple way to trigger the problem
 (at least on my system). And, as I guessed before, it seems to be
 related just to readonly snapshots:

 1) I create a readonly snapshot
 2) I do some changes on the source subvolume for the snapshot (I'm not
 sure changes are strictly needed)
 3) reboot (or probably just unmount and remount. I reboot because the
 fs I've problems with contains my root subvolume)

 After the rebooting (or the remount) I consistently have the corruption
 with the usual multitude of these in dmesg
 parent transid verify failed on 902316032 wanted 2484 found 4101
 and the characteristic ls -la output

 drwxr-xr-x 1 root root  250 Oct 10 15:37 root
 d? ? ??   ?? root-b2
 drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
 d? ? ??   ?? root-backup

 root-backup and root-b2 are both readonly whereas root-b3 is rw (and
 it didn't get corrupted).

 David, maybe you can try the same steps on one of your machines?

 John



Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread David Arendt
I'm also using no compression.

On 10/13/2014 11:22 PM, john terragon wrote:
 I'm using compress=no so compression doesn't seem to be related, at
 least in my case. Just read-only snapshots on 3.17 (although I haven't
 tried 3.16).

 John



Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread Duncan
David Arendt posted on Mon, 13 Oct 2014 23:25:23 +0200 as excerpted:

 I'm also using no compression.
 
 On 10/13/2014 11:22 PM, john terragon wrote:
 I'm using compress=no so compression doesn't seem to be related, at
 least in my case. Just read-only snapshots on 3.17 (although I haven't
 tried 3.16).

While I'm not a mind-reader and thus don't know for sure, Rich's 
reference to 3.16 and compression might not be related to this bug at 
all.  In 3.15 and early 3.16, there was a different bug related to 
compression, tho IIRC it was patched in 3.16.2 and 3.17-rc2 (or maybe .3 
and rc3, it's patched in the latest 3.16.x anyway, and in 3.17).  So how 
I read his comment was that he was considering going back to 3.16 and 
disabling compression to deal with that bug (he may not know the patch 
was marked for stable and is in current 3.16.x), rather than stay on 
3.17, since this bug hasn't even been traced yet, let alone patched.

Meanwhile, this bug makes me glad my use-case doesn't involve snapshots, 
and I've seen nothing of it. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread Duncan
Rich Freeman posted on Mon, 13 Oct 2014 16:42:14 -0400 as excerpted:

 On Mon, Oct 13, 2014 at 4:27 PM, David Arendt ad...@prnet.org wrote:
 From my own experience and based on what other people are saying, I
 think there is a random btrfs filesystem corruption problem in kernel
 3.17 at least related to snapshots, therefore I decided to post using
 another subject to draw attention from people not concerned about btrfs
 send to it. More information can be found in the btrfs send posts.

 Did the filesystem you tried to balance contain snapshots? Read-only
 ones?
 
 The filesystem contains numerous subvolumes and snapshots, many of which
 are read-only.  I'm managing many with snapper.
 
 The similarity of the transid verify errors made me think this issue is
 related, and the root cause may have nothing to do with btrfs send.
 
 As far as I can tell these errors aren't having any effect on my data -
 hopefully the system is catching the problems before there are actual
 disk writes/etc.

Summarizing what I've seen on the threads...

1) The bug seems to be read-only snapshot related.  The connection to 
send is that send creates read-only snapshots, but people creating read-
only snapshots for other purposes are now reporting the same problem, so 
it's not send, it's the read-only snapshots.

2) Writable snapshots haven't been implicated yet, and the working set 
from which the snapshots are taken doesn't seem to be affected, either.  
So in that sense it's not affecting ordinary usage, only the read-only 
snapshots themselves.

3) More problematic, however, is the fact that these apparently corrupted 
read-only snapshots often are not listed properly and can't be deleted, 
tho I'm not sure if that's /all/ the corrupted snapshots or only part of 
them. So while it may not affect ordinary operation in the short term, 
over time until there's a fix, people routinely doing read-only snapshots 
are going to be getting more and more of these undeletable snapshots, and 
depending on whether the eventual patch only prevents more or can 
actually fix the bad ones (possibly via btrfs check or the like), 
affected filesystems may ultimately have to be blown away and recreated 
with a fresh mkfs, in order to kill the currently undeletable snapshots.

So the first thing to do would be to shut off whatever's making read-only 
snapshots, so you don't make the problem worse while it's being 
investigated.  For those who can do that without too big an interruption 
to their normal routine (who don't depend on send/receive, for instance), 
just keep it off for the time being.  For those who depend on read-only 
snapshots (send-receive for backup and the data is too valuable to not do 
the backups for a few days), consider switching back to 3.16-stable -- 
from 3.16.3 at least, the patch for the compress bug is there, so that 
shouldn't be a problem.
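
The kernel advice in the paragraph above can be written down as a small
check. This is only a sketch of what has been said in this thread so far, not
an authoritative list of affected versions: the function name
classify_kernel and the labels it returns are made up for illustration, and
the 3.16.3 cutoff is the "at least" figure given above.

```python
import platform
import re


def classify_kernel(release):
    """Classify a kernel release string according to this thread's reports."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        return "unknown"
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)
    if (major, minor) == (3, 17):
        # Read-only snapshot corruption has been reported on 3.17.
        return "3.17-reported-affected"
    if (major, minor) == (3, 16):
        # From 3.16.3 at least, the compression fix is said to be included.
        if patch >= 3:
            return "3.16-with-compress-fix"
        return "3.16-compress-fix-uncertain"
    return "not-discussed"


print(platform.release(), "->", classify_kernel(platform.release()))
```

A kernel classified as "3.16-with-compress-fix" is only the fallback
suggested here for people who must keep taking read-only snapshots; it says
nothing about snapshots that were already damaged under 3.17.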

And if you're affected, be aware that until we have a fix, we don't know 
if it'll be possible to remove the affected and currently undeletable 
snapshots.  If it's not, at some point you'll need to do a fresh 
mkfs.btrfs, to get rid of the damage.  Since the bug doesn't appear to 
affect writable snapshots or the head from which snapshots are made, 
it's not urgent, and a full fix is likely to include a patch to detect 
and fix the problem as well, but until we know what the problem is we 
can't be sure of that, so be prepared to do that mkfs at some point, as 
at this point it's possible that's the only way you'll be able to kill 
the corrupted snapshots.

4) Total speculation on my part, but given the wanted transid (aka 
generation, in different contexts) is significantly lower than the found 
transid, and the fact that the problem appears to be limited to
/read-only/ snapshots, my first suspicion is that something's getting 
updated that would normally apply to all snapshots, but the read-only 
nature of the snapshots is preventing the full update there.  The transid 
of the block is updated, but the snapshot being read-only is preventing 
update of the pointer in that snapshot accordingly.
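
To put numbers on the wanted-vs-found gap, here is a rough Python sketch that pulls both transids out of kernel log text. The message shape in the regex ("parent transid verify failed on N wanted X found Y") is my assumption about what the affected systems log, and the sample line is made up for illustration, so adjust both to whatever dmesg actually shows.

```python
import re

# Assumed message shape; real kernel output may differ between versions.
TRANSID_RE = re.compile(
    r"parent transid verify failed on (?P<bytenr>\d+) "
    r"wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def transid_mismatches(log_text):
    """Yield (bytenr, wanted, found, found - wanted) for each matching line."""
    for m in TRANSID_RE.finditer(log_text):
        wanted = int(m.group("wanted"))
        found = int(m.group("found"))
        yield int(m.group("bytenr")), wanted, found, found - wanted


# Made-up sample line, for illustration only.
sample = "BTRFS: parent transid verify failed on 1000 wanted 10 found 25\n"
for bytenr, wanted, found, delta in transid_mismatches(sample):
    print(bytenr, wanted, found, delta)
```

A consistently positive delta (found newer than wanted) would at least fit the guess that the block moved on while the pointer in the read-only snapshot did not.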

What I do /not/ know is which side the bug is on.  It could be that 
something's getting updated that should NOT be, and the read-only 
snapshots are simply letting us know about it, since the writable 
snapshots get fully updated even if that breaks them (in a different 
and currently undetected way).  Or it could be a legitimate update, 
like a balance simply moving the snapshot around but not affecting it 
otherwise, and the bug is that the read-only snapshots aren't allowing 
the legitimate update.

Either way, this more or less developed over the weekend, and it's Monday 
now, so the devs should be on it.  If it's anything like the 3.15/3.16 
compression bug, it'll take some time for them to properly trace it, and 
then to figure out an appropriate fix, but they will.  Chances are we'll 
have at least some decent progress on a trace by Friday, and maybe 

Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread Rich Freeman
On Mon, Oct 13, 2014 at 5:22 PM, john terragon jterra...@gmail.com wrote:
> I'm using compress=no so compression doesn't seem to be related, at
> least in my case. Just read-only snapshots on 3.17 (although I haven't
> tried 3.16).

I was using lzo compression, and hence my comment about turning it off
before going back to 3.16 (not realizing that 3.16 has subsequently
been fixed).

Ironically enough I discovered this as I was about to migrate my ext4
backup drive into my btrfs raid1.  Maybe I'll go ahead and wait on
that and have an rsync backup of the filesystem handy (minus
snapshots) just in case.  :)

I'd switch to 3.16, but it sounds like there is no way to remove the
snapshots at the moment, and I can live for a while without the
ability to create new ones.

Interestingly enough it doesn't look like ALL snapshots are affected.
I checked and some of the snapshots I made last weekend while doing
system updates look accessible.  They are significantly smaller, and
the subvolumes they were made from are also fairly new - though I have
no idea if that is related.

The subvolumes do show up in btrfs su list.  They cannot be examined
using btrfs su show.
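
For anyone wanting to check a batch of snapshots the same way, a minimal Python sketch follows. It only relies on `btrfs subvolume show PATH` exiting non-zero when it cannot examine a subvolume; the function names and the `run` parameter are made up for illustration, and the paths have to be supplied by the caller.

```python
import subprocess


def show_fails(path, run=subprocess.run):
    """Return True when `btrfs subvolume show PATH` exits non-zero."""
    result = run(
        ["btrfs", "subvolume", "show", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode != 0


def unshowable(paths, run=subprocess.run):
    """Filter paths down to those that `btrfs subvolume show` rejects."""
    return [p for p in paths if show_fails(p, run)]
```

Note that a non-zero exit can also mean missing privileges or a path that is not a subvolume at all, so treat the result as a list of candidates rather than proof of corruption.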

It would be VERY nice to have a way of cleaning this up without
blowing away the entire filesystem...

--
Rich
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs random filesystem corruption in kernel 3.17

2014-10-13 Thread john terragon
And another worrying thing I didn't notice before. Two snapshots have
dates that do not make sense. root-b3 and root-b4 were created on
Oct 14th (and btw root's modification time was also on Oct the 14th).
So why do they show Oct 10th? And root-prov was actually created
on Oct 10 15:37, as it correctly shows, so it's like btrfs sub snap
picks up old stale data from who knows where or when or for what
reason. Moreover, root-b4 was created with 3.16.5. Not good.

drwxrwsr-x 1 root staff  30 Sep 11 16:15 home
d????????? ? ?    ?       ?            ? home-backup
drwxr-xr-x 1 root root  250 Oct 14 03:02 root
d????????? ? ?    ?       ?            ? root-b2
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b4
drwxr-xr-x 1 root root  250 Oct 14 03:02 root-b5
drwxr-xr-x 1 root root  250 Oct 14 03:02 root-b6
d????????? ? ?    ?       ?            ? root-backup
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-prov
drwxr-xr-x 1 root root   88 Sep 15 16:02 vms
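
The entries that ls prints with question marks are the ones whose metadata it could not read. A small Python sketch, assuming that an lstat failure is a fair proxy for what ls is reporting here, can list them without eyeballing the output. The function name is made up for illustration.

```python
import os


def unreadable_entries(parent):
    """Return names under parent whose metadata cannot be read with lstat."""
    broken = []
    for name in sorted(os.listdir(parent)):
        try:
            os.lstat(os.path.join(parent, name))
        except OSError:
            # Same situation in which ls falls back to printing '?' fields.
            broken.append(name)
    return broken
```

Calling it on the directory holding the snapshots should, if the assumption holds, return just the names shown with question marks in a listing like the one above.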
