Re: Random file system corruption in 3.17 (not BTRFS related...?)

2014-10-17 Thread Filipe David Manana
On Wed, Oct 15, 2014 at 9:20 PM, Josef Bacik jba...@fb.com wrote:
 On 10/15/2014 03:30 PM, Rich Freeman wrote:

 On Wed, Oct 15, 2014 at 10:30 AM, Josef Bacik jba...@fb.com wrote:

 We've found it, the Fedora guys are reverting the bad patch now, we'll get
 the fix sent back to stable shortly.  Sorry about that.


 After reverting this commit, can the bad snapshots be
 deleted/repaired/etc without wiping and restoring the entire
 filesystem?  Copying 2.3TB of data isn't a particularly fast
 operation...


 I would certainly like to make fsck repair this sort of problem, let me
 reproduce the corruption locally and then make fsck fix it and then you can
 use that.  Thanks,

I just sent out a patch for fsck to fix this issue - i.e. bad
read-only snapshots (inaccessible without errors, impossible to
delete, etc).
It fixes the snapshots if, and only if, you haven't run fsck in repair
mode (--repair) before, as that would have touched back references and
other metadata, since it didn't expect root items to be incorrect (which
is essentially what the snapshots bug caused).

The patch is this one:  https://patchwork.kernel.org/patch/5098331/

Also, if you have errors accessing files through a path that doesn't
contain any of the read-only snapshots, it's possible that it's the
corruption bug we had in 3.17 - bad extent map manipulation, that
manifests itself in several ways (e.g. reports:
http://www.spinics.net/lists/linux-btrfs/msg38045.html and
http://www.spinics.net/lists/linux-btrfs/msg37567.html).

Anyway, if you run into further issues, please report them.

thanks


 Josef


 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.


Re: Random file system corruption in 3.17 (not BTRFS related...?)

2014-10-15 Thread Juan Orti Alcaine

On 2014-10-14 18:54, Robert White wrote:

Howdy,

So I run several gentoo systems and I upgraded two of them to kernel 3.17.0


One using BTRFS for root.
One using ext3 for root (via the ext4 driver)

_Both_ systems exhibited strange behavior (long pauses and then hangs
requiring hard-power) within several hours. Both then had random
filesystem damage.

On the BTRFS system much of my browser settings for firefox were
trashed, particularly the cookies and saved configurations for
add-ons (like which sites had scripts enabled/disabled in no-script)
etc.

On the ext3/4 system there were several corruptions including a
pipe/special file with a large non-zero size that required I do a
fsck -fyD /dev/sda3 to repair. (one comment from fsck was that the
pipe/special file looked like a directory or some such)

So I can say that corruption is taking place, but I suspect it is
_not_ happening in the BTRFS specific code.

(ASIDE: both systems are older amd64 using built-in radeon display 
hardware.)




I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha). 
It has happened two times, each one after a clean reinstall and a wipe 
of the old fs. In less than a day, both installations got corrupted and 
the filesystems went readonly. When listing the contents, I saw many 
directories with question marks.


My system has 4 drives and 2 fs:
- 1 SSD in single
- 3 HDD in RAID1

I do readonly snapshots every hour of all the subvolumes, so I have 
hundreds of snapshots.


Now I'm back in 3.16.4 without any problems. I'm trying to reproduce my 
setup in a virtual machine. If the corruption happens again, I'll send 
you more data on this problem.


--
Juan Orti
https://miceliux.com



Re: Random file system corruption in 3.17 (not BTRFS related...?)

2014-10-15 Thread Duncan
Juan Orti Alcaine posted on Wed, 15 Oct 2014 09:08:14 +0200 as excerpted:

 I've also experienced Btrfs corruptions with 3.17.0

 I do readonly snapshots every hour of all the subvolumes, so I have
 hundreds of snapshots.

That's a known issue with read-only snapshots in 3.17.0.  There's quite a 
thread on the list about it.

So I'd suggest either turning off read-only snapshots on 3.17 (which I'm 
running here without snapshots, no problem), possibly switching to 
writable snapshots as they don't seem to trigger the problem, or as you 
mentioned doing already, going back to 3.16.x (x>=2 due to another bug, 
latest should be good), until the read-only snapshots issue with 3.17.0 
is traced down and fixed.

Given the approximately two kernel cycles it took for the widely 
reproduced but rather difficult to trace compression-related bug in 3.15 
to be reported in 3.15 and traced and fixed in 3.17-rc2 and 3.16.2, I'd 
guess a fix for this similarly widely reproduced read-only-snapshot-
related bug should be no later than 3.19-rc3 and 3.18.3, possibly rather 
earlier if it proves easier to trace, especially since this one seems to 
have been reported and recognized as widely occurring a bit faster than 
the compression-related bug.  But with testing, etc, it's still likely to 
be late in the 3.18-rc cycle before mainline commit, so it'll probably be 
rather late in the 3.17.x stable cycle, if it makes it at all.  Unless it 
gets picked as a long-term support kernel, the full 3.17 stable cycle 
might in fact be blacklisted for btrfs due to this bug, much like the 
full 3.15 stable cycle ended up being blacklisted due to the compression-
related bug.

So either switch your snapshots to writable if it's not going to 
interfere with your use-case, or stay on the 3.16.x, x>=2, stable series 
until the problem is fixed, hopefully with 3.18.0, tho it might be 3.18.2 
or so.  I seriously doubt it'll be longer than that, because it's a well 
reproduced bug which makes it both high priority and easy to test fixes 
for.
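
The version advice above can be sketched as a small check. This is only an illustration, assuming the thresholds mentioned in this thread (3.15.x affected by the compression bug, 3.16.x good from 3.16.2 on, 3.17.x affected by the read-only snapshot bug). The function names are made up for the example, and treating the whole 3.17 series as affected is a conservative assumption, since no fixed 3.17.x release is named here.

```python
import platform
import re


def parse_kernel_release(release):
    """Return (major, minor, patch) from a string such as '3.16.4-gentoo'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A release like '3.17' has no patch level; treat it as patch 0.
    return int(major), int(minor), int(patch or 0)


def snapshot_advice(release):
    """Map a kernel release to the advice given in this thread (assumed thresholds)."""
    major, minor, patch = parse_kernel_release(release)
    if (major, minor) == (3, 17):
        # Assumption: flag every 3.17.x until a fixed release is known.
        return "avoid read-only snapshots on this kernel"
    if (major, minor) == (3, 15):
        return "affected by the compression-related bug"
    if (major, minor) == (3, 16) and patch < 2:
        return "upgrade to 3.16.2 or later"
    return "no issue known from this thread"


if __name__ == "__main__":
    # platform.release() returns the running kernel release, e.g. '3.16.4-gentoo'.
    print(snapshot_advice(platform.release()))
```

Once a fixed 3.17.x or 3.18 release is known, the 3.17 branch would need a patch-level comparison like the one used for 3.16.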

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: Random file system corruption in 3.17 (not BTRFS related...?)

2014-10-15 Thread Josef Bacik

On 10/15/2014 03:08 AM, Juan Orti Alcaine wrote:

On 2014-10-14 18:54, Robert White wrote:

Howdy,

So I run several gentoo systems and I upgraded two of them to kernel
3.17.0

One using BTRFS for root.
One using ext3 for root (via the ext4 driver)

_Both_ systems exhibited strange behavior (long pauses and then hangs
requiring hard-power) within several hours. Both then had random
filesystem damage.

On the BTRFS system much of my browser settings for firefox were
trashed, particularly the cookies and saved configurations for
add-ons (like which sites had scripts enabled/disabled in no-script)
etc.

On the ext3/4 system there were several corruptions including a
pipe/special file with a large non-zero size that required I do a
fsck -fyD /dev/sda3 to repair. (one comment from fsck was that the
pipe/special file looked like a directory or some such)

So I can say that corruption is taking place, but I suspect it is
_not_ happening in the BTRFS specific code.

(ASIDE: both systems are older amd64 using built-in radeon display
hardware.)



I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
It has happened two times, each one after a clean reinstall and a wipe
of the old fs. In less than a day, both installations got corrupted and
the filesystems went readonly. When listing the contents, I saw many
directories with question marks.

My system has 4 drives and 2 fs:
- 1 SSD in single
- 3 HDD in RAID1


Did it happen on both fs'es or just one?  Thanks,

Josef



Re: Random file system corruption in 3.17 (not BTRFS related...?)

2014-10-15 Thread Juan Orti Alcaine

On 2014-10-15 15:46, Josef Bacik wrote:

On 10/15/2014 03:08 AM, Juan Orti Alcaine wrote:

I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
It has happened two times, each one after a clean reinstall and a wipe
of the old fs. In less than a day, both installations got corrupted and
the filesystems went readonly. When listing the contents, I saw many
directories with question marks.

My system has 4 drives and 2 fs:
- 1 SSD in single
- 3 HDD in RAID1


Did it happen on both fs'es or just one?  Thanks,

Josef


Both filesystems were corrupted. I have / in the SSD and /home in the HDDs.


I didn't notice anything while working with the system, I only 
discovered the problem when booting up after the second or third reboot 
and seeing the service failing to start. Could it be something related 
to the mount/umount logic?


--
Juan Orti
https://miceliux.com



Re: Random file system corruption in 3.17 (not BTRFS related...?)

2014-10-15 Thread Josef Bacik

On 10/15/2014 10:05 AM, Juan Orti Alcaine wrote:

On 2014-10-15 15:46, Josef Bacik wrote:

On 10/15/2014 03:08 AM, Juan Orti Alcaine wrote:

I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
It has happened two times, each one after a clean reinstall and a wipe
of the old fs. In less than a day, both installations got corrupted and
the filesystems went readonly. When listing the contents, I saw many
directories with question marks.

My system has 4 drives and 2 fs:
- 1 SSD in single
- 3 HDD in RAID1


Did it happen on both fs'es or just one?  Thanks,

Josef


Both filesystems were corrupted. I have / in the SSD and /home in the HDDs.

I didn't notice anything while working with the system, I only
discovered the problem when booting up after the second or third reboot
and seeing the service failing to start. Could it be something related
to the mount/umount logic?



We've found it, the Fedora guys are reverting the bad patch now, we'll 
get the fix sent back to stable shortly.  Sorry about that.


Josef


Re: Random file system corruption in 3.17 (not BTRFS related...?)

2014-10-15 Thread Juan Orti Alcaine

On 2014-10-15 16:30, Josef Bacik wrote:

On 10/15/2014 10:05 AM, Juan Orti Alcaine wrote:

On 2014-10-15 15:46, Josef Bacik wrote:

On 10/15/2014 03:08 AM, Juan Orti Alcaine wrote:
I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
It has happened two times, each one after a clean reinstall and a wipe
of the old fs. In less than a day, both installations got corrupted and
the filesystems went readonly. When listing the contents, I saw many
directories with question marks.

My system has 4 drives and 2 fs:
- 1 SSD in single
- 3 HDD in RAID1


Did it happen on both fs'es or just one?  Thanks,

Josef


Both filesystems were corrupted. I have / in the SSD and /home in the HDDs.


I didn't notice anything while working with the system, I only
discovered the problem when booting up after the second or third reboot
and seeing the service failing to start. Could it be something related
to the mount/umount logic?



We've found it, the Fedora guys are reverting the bad patch now, we'll
get the fix sent back to stable shortly.  Sorry about that.


Thanks to you. Fortunately I have good backups.

--
Juan Orti
https://miceliux.com



Re: Random file system corruption in 3.17 (not BTRFS related...?)

2014-10-15 Thread Rich Freeman
On Wed, Oct 15, 2014 at 10:30 AM, Josef Bacik jba...@fb.com wrote:
 We've found it, the Fedora guys are reverting the bad patch now, we'll get
 the fix sent back to stable shortly.  Sorry about that.

After reverting this commit, can the bad snapshots be
deleted/repaired/etc without wiping and restoring the entire
filesystem?  Copying 2.3TB of data isn't a particularly fast
operation...

--
Rich


Re: Random file system corruption in 3.17 (not BTRFS related...?)

2014-10-15 Thread Josef Bacik

On 10/15/2014 03:30 PM, Rich Freeman wrote:

On Wed, Oct 15, 2014 at 10:30 AM, Josef Bacik jba...@fb.com wrote:

We've found it, the Fedora guys are reverting the bad patch now, we'll get
the fix sent back to stable shortly.  Sorry about that.


After reverting this commit, can the bad snapshots be
deleted/repaired/etc without wiping and restoring the entire
filesystem?  Copying 2.3TB of data isn't a particularly fast
operation...



I would certainly like to make fsck repair this sort of problem, let me 
reproduce the corruption locally and then make fsck fix it and then you 
can use that.  Thanks,


Josef



Re: Random file system corruption in 3.17 (not BTRFS related...?)

2014-10-14 Thread David Arendt
I didn't notice a corruption on other filesystems with kernel 3.17.0. 
Also I didn't experience any hangs except when trying to mount a 
corrupted btrfs but this was causing a hang within less than 10 seconds. 
It could be that your problem is unrelated and that the corruption you 
are experiencing is due to an unrelated hang followed by a hard 
powerdown. Have you been able to capture any btrfs related kernel panics ?


On 10/14/14 6:54 PM, Robert White wrote:

Howdy,

So I run several gentoo systems and I upgraded two of them to kernel 3.17.0


One using BTRFS for root.
One using ext3 for root (via the ext4 driver)

_Both_ systems exhibited strange behavior (long pauses and then hangs 
requiring hard-power) within several hours. Both then had random 
filesystem damage.


On the BTRFS system much of my browser settings for firefox were 
trashed, particularly the cookies and saved configurations for 
add-ons (like which sites had scripts enabled/disabled in no-script) etc.


On the ext3/4 system there were several corruptions including a 
pipe/special file with a large non-zero size that required I do a 
fsck -fyD /dev/sda3 to repair. (one comment from fsck was that the 
pipe/special file looked like a directory or some such)


So I can say that corruption is taking place, but I suspect it is 
_not_ happening in the BTRFS specific code.


(ASIDE: both systems are older amd64 using built-in radeon display 
hardware.)




Re: Random file system corruption in 3.17 (not BTRFS related...?)

2014-10-14 Thread Robert White

On 10/14/2014 10:22 AM, David Arendt wrote:

I didn't notice a corruption on other filesystems with kernel 3.17.0.
Also I didn't experience any hangs except when trying to mount a
corrupted btrfs but this was causing a hang within less than 10 seconds.
It could be that your problem is unrelated and that the corruption you
are experiencing is due to an unrelated hang followed by a hard
powerdown. Have you been able to capture any btrfs related kernel panics ?


My installation is _not_ well suited for capturing panics.

I have not been able to capture any panics on either system and I had to 
just switch back to 3.16.3 as the two systems were my firewall (ext4) 
and my primary laptop (BTRFS). I didn't want to grind them up with 
repeated crashes and corruptions. I only let the firewall fault once 
before switching back.


The laptop faulted and hung twice under 3.17.0 before I switched it 
back, thinking it was a radeon graphics driver issue. Then I logged into 
the firewall via ssh to check something and three shell commands or so 
in, it went to lunch (but the firewall layer was still passing packets).


The only actual sign of filesystem corruption on the laptop was the 
sudden absence or corruption of the (sqlite3 format) history and 
settings files. But firefox was the only thing I'd been actively using.


Given the way the firewall jammed up and died, and the kind of 
corruption (special files don't get updated that much, let alone to link 
up a directory), and the fact that it ran fine as a firewall for 
several hours then died as soon as I touched the file system, I suspect 
that there is something fishy in dcache or the vnode layers.


It was too much too soon on two otherwise stable systems.

I offered this email here because I noticed that people were seeing 
BTRFS corruption with 3.17 and I'd seen both BTRFS and EXT4 corruption 
which suggests that BTRFS _isn't_ particularly culpable.





Re: Random file system corruption in 3.17 (not BTRFS related...?)

2014-10-14 Thread Duncan
Robert White posted on Tue, 14 Oct 2014 09:54:51 -0700 as excerpted:

 On the BTRFS system much of my browser settings for firefox were
 trashed, particularly the cookies and saved configurations for add-ons
 (like which sites had scripts enabled/disabled in no-script) etc.

FWIW, this reply is more toward the firefox corruption than the
why-particulars of the crash.

The prefs.js file in the profile dir holds addon settings and seems to be 
particularly sensitive to corruption.  At least here, firefox has created 
several backups, prefs-1.js thru prefs-7.js, I suppose at upgrade.  The 
first time I lost settings I restored prefs-7.js (the newest/largest of 
the backups) as prefs.js, and only lost a few settings that I had changed 
since the last upgrade, which had changed the firefox interface so I had 
to change my settings accordingly.  The time or two since then that I 
hard-crashed and lost my addons, I was able to replace the prefs.js file 
from a recent /home backup.

Anyway, it's the prefs.js file that you want to restore.  Whether it's 
from the last prefs-N.js backup that firefox did, or from your own 
backup, prefs.js is it.
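
For anyone wanting to script that restore, here is a minimal sketch. It assumes the prefs-N.js naming described above and that the highest N is the newest backup (that was the case here, but I haven't verified it in general). The function names and the prefs.js.corrupt filename are made up for the example, and firefox should not be running while this is done.

```python
import re
import shutil
from pathlib import Path


def newest_prefs_backup(profile_dir):
    """Return the highest-numbered prefs-N.js in profile_dir, or None."""
    pattern = re.compile(r"prefs-(\d+)\.js$")
    candidates = []
    for path in Path(profile_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Assumption: the highest number is the most recent backup.
    return max(candidates, key=lambda item: item[0])[1]


def restore_prefs(profile_dir):
    """Copy the newest prefs-N.js over prefs.js, keeping the old file aside."""
    profile = Path(profile_dir)
    backup = newest_prefs_backup(profile)
    if backup is None:
        raise FileNotFoundError("no prefs-N.js backup found in profile")
    target = profile / "prefs.js"
    if target.exists():
        # Keep the possibly corrupted file for later inspection
        # (hypothetical name, not something firefox creates).
        shutil.copy2(target, profile / "prefs.js.corrupt")
    shutil.copy2(backup, target)
    return backup
```

Calling restore_prefs() with the path to the profile directory returns the backup file that was used. If there is no prefs-N.js at all, the copy from a /home backup is the only option.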

As for cookies, history, etc.  I didn't notice them going corrupt.  I do 
run raid1 btrfs and after a crash, do a scrub, which may recover some 
files.  And I run tight enough security that most cookies are session-
only (and no third-party), so that file won't be written to much, which 
probably saves it.  I don't know about history.  Maybe it was corrupted 
and I simply didn't notice it, as I don't use history that often.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman
