Re: btrfs random filesystem corruption in kernel 3.17
Robert White posted on Tue, 14 Oct 2014 15:03:21 -0700 as excerpted:

> What happens if "btrfs property set" is used to (attempt to) promote
> the snapshot from read-only to read-write? Can the damaged snapshot
> then be subjected to scrub or btrfsck?
>
> e.g.
>
> btrfs property set /path/to/snapshot ro false
> (maintenance here)

Very good question, not yet answered. =:^)

But it's one I can't answer, as my use-case doesn't call for such snapshots in the first place and I don't have any to be personally affected by this bug, so my interest is academic. I simply saw the big hairy thread and tried to summarize what I could get out of it to that point, with a bit of my own speculation as to what the "reversed" transid complaints meant.

(Since transids are normally sequential, in most corruption cases the filesystem has moved on and has a higher transid that's "wanted", but can only find an older/lower transid for something or other. Or at least that's what I've seen here and what seems common in the other reports I've seen posted. This bug reverses that, with an older/lower "wanted" transid but finding a newer/higher one. That's the strange point that leapt out at me, and I'd guess it's a strong hint at the problem, thus my definitely admin-not-coder speculation on that point.)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
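[Editorial sketch: to make the "reversed" direction concrete, here is a toy illustration of the two ways a parent transid mismatch can go. This is not btrfs code, just the comparison logic Duncan describes; the sample numbers are taken from the dmesg output posted in this thread.]

```shell
#!/bin/sh
# Toy illustration of the two directions a "parent transid verify failed"
# mismatch can go.  NOT actual btrfs code -- just the comparison logic.
classify_transid_mismatch() {
    wanted=$1; found=$2
    if [ "$wanted" -eq "$found" ]; then
        echo "ok: block matches the generation its parent recorded"
    elif [ "$wanted" -gt "$found" ]; then
        # The common corruption pattern: the tree has moved on, but only
        # an old (stale) copy of the block can be found.
        echo "stale: wanted $wanted but found older $found"
    else
        # The pattern in this bug: the block is NEWER than the pointer
        # in the read-only snapshot expects.
        echo "reversed: wanted $wanted but found newer $found"
    fi
}

# A wanted/found pair from the dmesg output reported in this thread:
classify_transid_mismatch 272368 276401
```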
Re: btrfs random filesystem corruption in kernel 3.17
On 10/14/2014 02:35 PM, Duncan wrote:
> But at some point, presumably after a fix is in place, since the
> damaged snapshots aren't currently always deletable, if the fix only
> prevents new damage from occurring and doesn't provide a way to fix the
> damaged ones, then mkfs would be the only way to do so.
>
> With the damage limited to those snapshots and not spreading to normal
> writable snapshots or the working copy, dropping everything to do that
> mkfs isn't urgent, but it (the mkfs) will need to be done at some point
> to clear the undeletable snapshots, again, assuming the fix doesn't
> provide a way to get rid of them (the currently undeletable snapshots).

What happens if "btrfs property set" is used to (attempt to) promote the snapshot from read-only to read-write? Can the damaged snapshot then be subjected to scrub or btrfsck?

e.g.

btrfs property set /path/to/snapshot ro false
(maintenance here)
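[Editorial sketch: for anyone who wants to try Robert's suggestion, the sequence might look like the script below. It is entirely untested against a damaged snapshot -- the property set may itself fail on one -- and the paths are placeholders. `BTRFS` defaults to a dry-run `echo`, so the commands are only printed until you set `BTRFS=btrfs` and run it as root.]

```shell
#!/bin/sh
# Speculative recovery attempt for a damaged read-only snapshot:
# flip the ro property off, scrub, then try to delete the snapshot.
BTRFS="${BTRFS:-echo btrfs}"   # dry-run by default; set BTRFS=btrfs to run

snap="/path/to/snapshot"       # placeholder: the damaged read-only snapshot
mnt="/mnt"                     # placeholder: the filesystem's mount point

# 1) promote the snapshot from read-only to read-write
$BTRFS property set "$snap" ro false

# 2) scrub runs against the whole mounted filesystem, not one subvolume
$BTRFS scrub start -B "$mnt"

# 3) with the snapshot writable, deletion may now succeed
$BTRFS subvolume delete "$snap"
```

Note that btrfsck (btrfs check) is a separate step: it runs against the unmounted device, not a mounted path.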
Re: btrfs random filesystem corruption in kernel 3.17
admin posted on Tue, 14 Oct 2014 13:17:41 +0200 as excerpted:

>> And if you're affected, be aware that until we have a fix, we don't
>> know if it'll be possible to remove the affected and currently
>> undeletable snapshots. If it's not, at some point you'll need to do a
>> fresh mkfs.btrfs, to get rid of the damage. Since the bug doesn't
>> appear to affect writable snapshots or the "head" from which snapshots
>> are made, it's not urgent, and a full fix is likely to include a patch
>> to detect and fix the problem as well, but until we know what the
>> problem is we can't be sure of that, so be prepared to do that mkfs at
>> some point, as at this point it's possible that's the only way you'll
>> be able to kill the corrupted snapshots.
>
> I don't agree with you concerning the not urgent part. In my opinion,
> any problem leading to filesystem or other data corruption should be
> considered as urgent, at least as long as it isn't known what exactly
> is affected and whether there is a simple way to salvage the corruption
> without going the backup/restore route.

I shouldn't have used a pronoun there, as "it" wasn't clear. By "it" I didn't mean the bug, which I agree is urgent for the reasons you state, but the mkfs. Since there's currently no fix for the bug, but it (the bug) seems to be limited to read-only snapshots at this point, _doing_the_mkfs_ isn't urgent. With the damage limited to the read-only snapshots, you don't have to drop everything and do a mkfs _right_now_ to be rid of it.

But at some point, presumably after a fix is in place, since the damaged snapshots aren't currently always deletable, if the fix only prevents new damage from occurring and doesn't provide a way to fix the damaged ones, then mkfs would be the only way to do so. With the damage limited to those snapshots and not spreading to normal writable snapshots or the working copy, dropping everything to do that mkfs isn't urgent, but it (the mkfs) will need to be done at some point to clear the undeletable snapshots, again, assuming the fix doesn't provide a way to get rid of them (the currently undeletable snapshots).

That's what I meant. Yes, the bug is urgent. Doing a mkfs _right_now_ to get rid of the damage, not so much, because by all accounts so far the damage is limited to those read-only snapshots and isn't affecting ordinary writable snapshots or the working copies.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: btrfs random filesystem corruption in kernel 3.17
The corruption seems to be worse than expected. In kernel 3.16.5 I can not mount this filesystem read/write. I'm in progress of doing a tar - mkfs.btrfs - untar recovery and staying on 3.16.5 for now.

[ 55.465584] parent transid verify failed on 51150848 wanted 272368 found 276401
[ 55.468415] parent transid verify failed on 918274048 wanted 273135 found 274590
[ 55.470915] parent transid verify failed on 508444672 wanted 274054 found 276617
[ 55.473758] parent transid verify failed on 18317623296 wanted 275876 found 278431
[ 55.476240] parent transid verify failed on 127254528 wanted 276488 found 276490
[ 55.479494] ------------[ cut here ]------------
[ 55.479499] WARNING: CPU: 1 PID: 1723 at fs/btrfs/extent-tree.c:876 btrfs_lookup_extent_info+0x44c/0x490()
[ 55.479500] Modules linked in:
[ 55.479502] CPU: 1 PID: 1723 Comm: ls Not tainted 3.16.5 #1
[ 55.479502] Hardware name: ASUS All Series/H87M-PRO, BIOS 2101 07/21/2014
[ 55.479503] 0009 816ff873
[ 55.479504] 81078261 8807f7084770 8807ed8ca000 3dcf4000
[ 55.479506] 8807f7133de0 812be9bc 4000
[ 55.479507] Call Trace:
[ 55.479511] [] ? dump_stack+0x41/0x51
[ 55.479514] [] ? warn_slowpath_common+0x81/0xb0
[ 55.479515] [] ? btrfs_lookup_extent_info+0x44c/0x490
[ 55.479516] [] ? btrfs_alloc_free_block+0x2c8/0x450
[ 55.479519] [] ? update_ref_for_cow+0x1ff/0x3f0
[ 55.479520] [] ? __btrfs_cow_block+0x23a/0x5a0
[ 55.479522] [] ? btrfs_buffer_uptodate+0x6d/0x80
[ 55.479524] [] ? btrfs_cow_block+0x126/0x190
[ 55.479525] [] ? btrfs_search_slot+0x1fd/0xaa0
[ 55.479527] [] ? btrfs_truncate_inode_items+0x123/0x8e0
[ 55.479529] [] ? btrfs_evict_inode+0x32a/0x490
[ 55.479532] [] ? unlock_new_inode+0x3a/0x60
[ 55.479533] [] ? __inode_wait_for_writeback+0x65/0xb0
[ 55.479536] [] ? wake_atomic_t_function+0x30/0x30
[ 55.479537] [] ? evict+0xa6/0x160
[ 55.479539] [] ? btrfs_orphan_cleanup+0x1ed/0x430
[ 55.479540] [] ? btrfs_lookup_dentry+0x358/0x4c0
[ 55.479542] [] ? btrfs_lookup+0x9/0x30
[ 55.479543] [] ? lookup_real+0x14/0x50
[ 55.479545] [] ? __lookup_hash+0x32/0x50
[ 55.479546] [] ? lookup_slow+0x48/0xc0
[ 55.479547] [] ? path_lookupat+0x73c/0x770
[ 55.479550] [] ? posix_acl_xattr_get+0x40/0xb0
[ 55.479551] [] ? generic_getxattr+0x50/0x80
[ 55.479552] [] ? filename_lookup.isra.51+0x2e/0x90
[ 55.479554] [] ? user_path_at_empty+0x5f/0xb0
[ 55.479555] [] ? user_path_at_empty+0x69/0xb0
[ 55.479556] [] ? vfs_fstatat+0x40/0x90
[ 55.479557] [] ? SyS_newlstat+0x12/0x30
[ 55.479559] [] ? path_put+0xd/0x20
[ 55.479560] [] ? SyS_getxattr+0x57/0x80
[ 55.479562] [] ? system_call_fastpath+0x16/0x1b
[ 55.479563] ---[ end trace a8ad56fd476f7474 ]---
[ 55.479564] BTRFS: error (device sda2) in update_ref_for_cow:1018: errno=-30 Readonly filesystem
[ 55.479565] BTRFS info (device sda2): forced readonly
[ 55.479565] ------------[ cut here ]------------
[ 55.479567] WARNING: CPU: 1 PID: 1723 at fs/btrfs/super.c:259 __btrfs_abort_transaction+0x5a/0x140()
[ 55.479567] BTRFS: Transaction aborted (error -30)
[ 55.479568] Modules linked in:
[ 55.479569] CPU: 1 PID: 1723 Comm: ls Tainted: G W 3.16.5 #1
[ 55.479569] Hardware name: ASUS All Series/H87M-PRO, BIOS 2101 07/21/2014
[ 55.479570] 0009 816ff873 8807f2dcf788
[ 55.479571] 81078261 ffe2 8807ed8ca000 8807f7133de0
[ 55.479572] 8184d800 0488 81078345 8197afd8
[ 55.479573] Call Trace:
[ 55.479574] [] ? dump_stack+0x41/0x51
[ 55.479576] [] ? warn_slowpath_common+0x81/0xb0
[ 55.479578] [] ? warn_slowpath_fmt+0x45/0x50
[ 55.479579] [] ? __btrfs_abort_transaction+0x5a/0x140
[ 55.479580] [] ? __btrfs_cow_block+0x432/0x5a0
[ 55.479582] [] ? btrfs_buffer_uptodate+0x6d/0x80
[ 55.479583] [] ? btrfs_cow_block+0x126/0x190
[ 55.479584] [] ? btrfs_search_slot+0x1fd/0xaa0
[ 55.479586] [] ? btrfs_truncate_inode_items+0x123/0x8e0
[ 55.479587] [] ? btrfs_evict_inode+0x32a/0x490
[ 55.479588] [] ? unlock_new_inode+0x3a/0x60
[ 55.479590] [] ? __inode_wait_for_writeback+0x65/0xb0
[ 55.479591] [] ? wake_atomic_t_function+0x30/0x30
[ 55.479592] [] ? evict+0xa6/0x160
[ 55.479594] [] ? btrfs_orphan_cleanup+0x1ed/0x430
[ 55.479595] [] ? btrfs_lookup_dentry+0x358/0x4c0
[ 55.479596] [] ? btrfs_lookup+0x9/0x30
[ 55.479598] [] ? lookup_real+0x14/0x50
[ 55.479599] [] ? __lookup_hash+0x32/0x50
[ 55.479600] [] ? lookup_slow+0x48/0xc0
[ 55.479601] [] ? path_lookupat+0x73c/0x770
[ 55.479603] [] ? posix_acl_xattr_get+0x40/0xb0
[ 55.479605] [] ? generic_getxattr+0x50/0x80
[ 55.479606] [] ? filename_lookup.isra.51+0x2e/0x90
[ 55.479607] [] ? user_path_at_empty+0x5f/0xb0
[ 55.479608] [] ? user_pat
Re: btrfs random filesystem corruption in kernel 3.17
> Summarizing what I've seen on the threads...

First of all, many thanks for summarizing the info.

> 1) The bug seems to be read-only snapshot related. The connection to
> send is that send creates read-only snapshots, but people creating
> read-only snapshots for other purposes are now reporting the same
> problem, so it's not send, it's the read-only snapshots.

In fact send does not create a read-only snapshot; snapshots are created manually prior to calling send.

> 2) Writable snapshots haven't been implicated yet, and the working set
> from which the snapshots are taken doesn't seem to be affected, either.
> So in that sense it's not affecting ordinary usage, only the read-only
> snapshots themselves.
>
> 3) More problematic, however, is the fact that these apparently
> corrupted read-only snapshots often are not listed properly and can't
> be deleted, tho I'm not sure if that's /all/ the corrupted snapshots or
> only part of them.
>
> So while it may not affect ordinary operation in the short term, over
> time until there's a fix, people routinely doing read-only snapshots
> are going to be getting more and more of these undeletable snapshots,
> and depending on whether the eventual patch only prevents more or can
> actually fix the bad ones (possibly via btrfs check or the like),
> affected filesystems may ultimately have to be blown away and recreated
> with a fresh mkfs, in order to kill the currently undeletable
> snapshots.
>
> So the first thing to do would be to shut off whatever's making
> read-only snapshots, so you don't make the problem worse while it's
> being investigated. For those who can do that without too big an
> interruption to their normal routine (who don't depend on send/receive,
> for instance), just keep it off for the time being.
>
> For those who depend on read-only snapshots (send-receive for backup
> and the data is too valuable to not do the backups for a few days),
> consider switching back to 3.16-stable -- from 3.16.3 at least, the
> patch for the compress bug is there, so that shouldn't be a problem.
>
> And if you're affected, be aware that until we have a fix, we don't
> know if it'll be possible to remove the affected and currently
> undeletable snapshots. If it's not, at some point you'll need to do a
> fresh mkfs.btrfs, to get rid of the damage. Since the bug doesn't
> appear to affect writable snapshots or the "head" from which snapshots
> are made, it's not urgent, and a full fix is likely to include a patch
> to detect and fix the problem as well, but until we know what the
> problem is we can't be sure of that, so be prepared to do that mkfs at
> some point, as at this point it's possible that's the only way you'll
> be able to kill the corrupted snapshots.

I don't agree with you concerning the not urgent part. In my opinion, any problem leading to filesystem or other data corruption should be considered as urgent, at least as long as it isn't known what exactly is affected and whether there is a simple way to salvage the corruption without going the backup/restore route.

> 4) Total speculation on my part, but given the wanted transid (aka
> generation, in different contexts) is significantly lower than the
> found transid, and the fact that the problem appears to be limited to
> /read-only/ snapshots, my first suspicion is that something's getting
> updated that would normally apply to all snapshots, but the read-only
> nature of the snapshots is preventing the full update there. The
> transid of the block is updated, but the snapshot being read-only is
> preventing update of the pointer in that snapshot accordingly.
>
> What I do /not/ know is whether the bug is that something's getting
> updated that should NOT be, and it's simply the read-only snapshots
> letting us know about it since the writable snapshots are fully
> updated, even if that breaks the snapshot (breaking writable snapshots
> in a different and currently undetected way), or if instead, it's a
> legitimate update, like a balance simply moving the snapshot around but
> not affecting it otherwise, and the bug is that the read-only snapshots
> aren't allowing the legitimate update.
>
> Either way, this more or less developed over the weekend, and it's
> Monday now, so the devs should be on it. If it's anything like the
> 3.15/3.16 compression bug, it'll take some time for them to properly
> trace it, and then to figure out an appropriate fix, but they will.
> Chances are we'll have at least some decent progress on a trace by
> Friday, and maybe even a good-to-go patch. =:^)
Re: btrfs random filesystem corruption in kernel 3.17
And another worrying thing I didn't notice before: two snapshots have dates that do not make sense. root-b3 and root-b4 have been created Oct 14th (and btw, root's modification time was also on Oct 14th). So why do they show Oct 10th? And root-prov has actually been created on Oct 10 15:37, as it correctly shows, so it's like btrfs sub snap picks up old stale data from who knows where, or when, or for what reason. Moreover, root-b4 was created with 3.16.5... not good.

drwxrwsr-x 1 root staff  30 Sep 11 16:15 home
d????????? ? ?    ?       ?            ? home-backup
drwxr-xr-x 1 root root  250 Oct 14 03:02 root
d????????? ? ?    ?       ?            ? root-b2
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b4
drwxr-xr-x 1 root root  250 Oct 14 03:02 root-b5
drwxr-xr-x 1 root root  250 Oct 14 03:02 root-b6
d????????? ? ?    ?       ?            ? root-backup
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-prov
drwxr-xr-x 1 root root   88 Sep 15 16:02 vms

On Tue, Oct 14, 2014 at 1:18 AM, Rich Freeman wrote:
> On Mon, Oct 13, 2014 at 5:22 PM, john terragon wrote:
>> I'm using "compress=no" so compression doesn't seem to be related, at
>> least in my case. Just read-only snapshots on 3.17 (although I haven't
>> tried 3.16).
>
> I was using lzo compression, and hence my comment about turning it off
> before going back to 3.16 (not realizing that 3.16 has subsequently
> been fixed).
>
> Ironically enough, I discovered this as I was about to migrate my ext4
> backup drive into my btrfs raid1. Maybe I'll go ahead and wait on that
> and have an rsync backup of the filesystem handy (minus snapshots)
> just in case. :)
>
> I'd switch to 3.16, but it sounds like there is no way to remove the
> snapshots at the moment, and I can live for a while without the
> ability to create new ones.
>
> Interestingly enough, it doesn't look like ALL snapshots are affected.
> I checked, and some of the snapshots I made last weekend while doing
> system updates look accessible. They are significantly smaller, and
> the subvolumes they were made from are also fairly new - though I have
> no idea if that is related.
>
> The subvolumes do show up in btrfs su list. They cannot be examined
> using btrfs su show.
>
> It would be VERY nice to have a way of cleaning this up without
> blowing away the entire filesystem...
>
> --
> Rich
Re: btrfs random filesystem corruption in kernel 3.17
On Mon, Oct 13, 2014 at 5:22 PM, john terragon wrote:
> I'm using "compress=no" so compression doesn't seem to be related, at
> least in my case. Just read-only snapshots on 3.17 (although I haven't
> tried 3.16).

I was using lzo compression, and hence my comment about turning it off before going back to 3.16 (not realizing that 3.16 has subsequently been fixed).

Ironically enough, I discovered this as I was about to migrate my ext4 backup drive into my btrfs raid1. Maybe I'll go ahead and wait on that and have an rsync backup of the filesystem handy (minus snapshots) just in case. :)

I'd switch to 3.16, but it sounds like there is no way to remove the snapshots at the moment, and I can live for a while without the ability to create new ones.

Interestingly enough, it doesn't look like ALL snapshots are affected. I checked, and some of the snapshots I made last weekend while doing system updates look accessible. They are significantly smaller, and the subvolumes they were made from are also fairly new - though I have no idea if that is related.

The subvolumes do show up in btrfs su list. They cannot be examined using btrfs su show.

It would be VERY nice to have a way of cleaning this up without blowing away the entire filesystem...

--
Rich
Re: btrfs random filesystem corruption in kernel 3.17
Rich Freeman posted on Mon, 13 Oct 2014 16:42:14 -0400 as excerpted:

> On Mon, Oct 13, 2014 at 4:27 PM, David Arendt wrote:
>> From my own experience and based on what other people are saying, I
>> think there is a random btrfs filesystem corruption problem in kernel
>> 3.17, at least related to snapshots, therefore I decided to post using
>> another subject to draw attention from people not concerned about
>> btrfs send to it. More information can be found in the btrfs send
>> posts.
>>
>> Did the filesystem you tried to balance contain snapshots? Read-only
>> ones?
>
> The filesystem contains numerous subvolumes and snapshots, many of
> which are read-only. I'm managing many with snapper.
>
> The similarity of the transid verify errors made me think this issue
> is related, and the root cause may have nothing to do with btrfs send.
>
> As far as I can tell these errors aren't having any effect on my data
> - hopefully the system is catching the problems before there are
> actual disk writes/etc.

Summarizing what I've seen on the threads...

1) The bug seems to be read-only snapshot related. The connection to send is that send creates read-only snapshots, but people creating read-only snapshots for other purposes are now reporting the same problem, so it's not send, it's the read-only snapshots.

2) Writable snapshots haven't been implicated yet, and the working set from which the snapshots are taken doesn't seem to be affected, either. So in that sense it's not affecting ordinary usage, only the read-only snapshots themselves.

3) More problematic, however, is the fact that these apparently corrupted read-only snapshots often are not listed properly and can't be deleted, tho I'm not sure if that's /all/ the corrupted snapshots or only part of them.

So while it may not affect ordinary operation in the short term, over time until there's a fix, people routinely doing read-only snapshots are going to be getting more and more of these undeletable snapshots, and depending on whether the eventual patch only prevents more or can actually fix the bad ones (possibly via btrfs check or the like), affected filesystems may ultimately have to be blown away and recreated with a fresh mkfs, in order to kill the currently undeletable snapshots.

So the first thing to do would be to shut off whatever's making read-only snapshots, so you don't make the problem worse while it's being investigated. For those who can do that without too big an interruption to their normal routine (who don't depend on send/receive, for instance), just keep it off for the time being.

For those who depend on read-only snapshots (send-receive for backup and the data is too valuable to not do the backups for a few days), consider switching back to 3.16-stable -- from 3.16.3 at least, the patch for the compress bug is there, so that shouldn't be a problem.

And if you're affected, be aware that until we have a fix, we don't know if it'll be possible to remove the affected and currently undeletable snapshots. If it's not, at some point you'll need to do a fresh mkfs.btrfs, to get rid of the damage. Since the bug doesn't appear to affect writable snapshots or the "head" from which snapshots are made, it's not urgent, and a full fix is likely to include a patch to detect and fix the problem as well, but until we know what the problem is we can't be sure of that, so be prepared to do that mkfs at some point, as at this point it's possible that's the only way you'll be able to kill the corrupted snapshots.

4) Total speculation on my part, but given the wanted transid (aka generation, in different contexts) is significantly lower than the found transid, and the fact that the problem appears to be limited to /read-only/ snapshots, my first suspicion is that something's getting updated that would normally apply to all snapshots, but the read-only nature of the snapshots is preventing the full update there. The transid of the block is updated, but the snapshot being read-only is preventing update of the pointer in that snapshot accordingly.

What I do /not/ know is whether the bug is that something's getting updated that should NOT be, and it's simply the read-only snapshots letting us know about it since the writable snapshots are fully updated, even if that breaks the snapshot (breaking writable snapshots in a different and currently undetected way), or if instead, it's a legitimate update, like a balance simply moving the snapshot around but not affecting it otherwise, and the bug is that the read-only snapshots aren't allowing the legitimate update.

Either way, this more or less developed over the weekend, and it's Monday now, so the devs should be on it. If it's anything like the 3.15/3.16 compression bug, it'll take some time for them to properly trace it, and then to figure out an appropriate fix, but they will. Chances are we'll have at least some decent progress on a trace by Friday, and maybe even a good-to-go patch. =:^)
Re: btrfs random filesystem corruption in kernel 3.17
David Arendt posted on Mon, 13 Oct 2014 23:25:23 +0200 as excerpted:

> I'm also using no compression.
>
> On 10/13/2014 11:22 PM, john terragon wrote:
>> I'm using "compress=no" so compression doesn't seem to be related, at
>> least in my case. Just read-only snapshots on 3.17 (although I haven't
>> tried 3.16).

While I'm not a mind-reader and thus don't know for sure, Rich's reference to 3.16 and compression might not be related to this bug at all. In 3.15 and early 3.16, there was a different bug related to compression, tho IIRC it was patched in 3.16.2 and 3.17-rc2 (or maybe .3 and rc3; it's patched in the latest 3.16.x anyway, and in 3.17).

So how I read his comment was that he was considering going back to 3.16 and disabling compression to deal with that bug (he may not know the patch was marked for stable and is in current 3.16.x), rather than stay on 3.17, since this bug hasn't even been traced yet, let alone patched.

Meanwhile, this bug makes me glad my use-case doesn't involve snapshots, and I've seen nothing of it. =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: btrfs random filesystem corruption in kernel 3.17
I'm also using no compression.

On 10/13/2014 11:22 PM, john terragon wrote:
> I'm using "compress=no" so compression doesn't seem to be related, at
> least in my case. Just read-only snapshots on 3.17 (although I haven't
> tried 3.16).
>
> John
Re: btrfs random filesystem corruption in kernel 3.17
As these two machines are running as servers for different purposes (yes, I know that btrfs is unstable and any corruption or data loss is at my own risk, therefore I have good backups), I want to reboot them no more than necessary. However, I tried to bring my reboot times in relation with the corruptions:

machine 1:

d????????? ? ? ? ? ? root.20141009.000503.backup

reboot   system boot  3.17.0   Thu Oct  9 23:20   still running
reboot   system boot  3.17.0   Tue Oct  7 21:25 - 23:18  (2+01:53)
reboot   system boot  3.17.0   Mon Oct  6 22:47 - 23:18  (3+00:31)

For this machine, corruption seems to have occurred for a snapshot created after a reboot.

machine 2:

d????????? ? ? ? ? ? root.20141006.003239.backup
d????????? ? ? ? ? ? root.20141007.001616.backup
d????????? ? ? ? ? ? root.20141008.000501.backup
d????????? ? ? ? ? ? root.20141009.052436.backup

reboot   system boot  3.17.0   Thu Oct  9 21:31   still running
reboot   system boot  3.17.0   Tue Oct  7 21:27 - 21:30  (2+00:03)
reboot   system boot  3.17.0   Tue Oct  7 17:51 - 21:26  (03:34)
reboot   system boot  3.17.0   Sun Oct  5 23:50 - 17:50  (1+17:59)
reboot   system boot  3.17.0   Sun Oct  5 23:47 - 23:49  (00:01)

During the next days, I will set up a virtual machine to do more tests.

On 10/13/2014 10:48 PM, john terragon wrote:
> I think I just found a consistent simple way to trigger the problem
> (at least on my system). And, as I guessed before, it seems to be
> related just to readonly snapshots:
>
> 1) I create a readonly snapshot
> 2) I do some changes on the source subvolume for the snapshot (I'm not
>    sure changes are strictly needed)
> 3) reboot (or probably just unmount and remount. I reboot because the
>    fs I've problems with contains my root subvolume)
>
> After the rebooting (or the remount) I consistently have the corruption
> with the usual multitude of these in dmesg
> "parent transid verify failed on 902316032 wanted 2484 found 4101"
> and the characteristic ls -la output
>
> drwxr-xr-x 1 root root 250 Oct 10 15:37 root
> d????????? ? ? ? ? ? root-b2
> drwxr-xr-x 1 root root 250 Oct 10 15:37 root-b3
> d????????? ? ? ? ? ? root-backup
>
> root-backup and root-b2 are both readonly whereas root-b3 is rw (and
> it didn't get corrupted).
>
> David, maybe you can try the same steps on one of your machines?
>
> John
Re: btrfs random filesystem corruption in kernel 3.17
I'm using "compress=no" so compression doesn't seem to be related, at least in my case. Just read-only snapshots on 3.17 (although I haven't tried 3.16).

John
Re: btrfs random filesystem corruption in kernel 3.17
On Mon, Oct 13, 2014 at 4:55 PM, Rich Freeman wrote:
> On Mon, Oct 13, 2014 at 4:48 PM, john terragon wrote:
>>
>> After the rebooting (or the remount) I consistently have the
>> corruption with the usual multitude of these in dmesg
>> "parent transid verify failed on 902316032 wanted 2484 found 4101"
>> and the characteristic ls -la output

Sorry to double-reply, but I left this out. I have a long string of these early in boot as well that I never noticed before.

--
Rich
Re: btrfs random filesystem corruption in kernel 3.17
On Mon, Oct 13, 2014 at 4:48 PM, john terragon wrote:
> I think I just found a consistent simple way to trigger the problem
> (at least on my system). And, as I guessed before, it seems to be
> related just to readonly snapshots:
>
> 1) I create a readonly snapshot
> 2) I do some changes on the source subvolume for the snapshot (I'm not
>    sure changes are strictly needed)
> 3) reboot (or probably just unmount and remount. I reboot because the
>    fs I've problems with contains my root subvolume)
>
> After the rebooting (or the remount) I consistently have the corruption
> with the usual multitude of these in dmesg
> "parent transid verify failed on 902316032 wanted 2484 found 4101"
> and the characteristic ls -la output
>
> drwxr-xr-x 1 root root 250 Oct 10 15:37 root
> d????????? ? ? ? ? ? root-b2
> drwxr-xr-x 1 root root 250 Oct 10 15:37 root-b3
> d????????? ? ? ? ? ? root-backup
>
> root-backup and root-b2 are both readonly whereas root-b3 is rw (and
> it didn't get corrupted).
>
> David, maybe you can try the same steps on one of your machines?

Look at that. I didn't realize it, but indeed I have a corrupted snapshot:

/data/.snapshots/5338/:
ls: cannot access /data/.snapshots/5338/snapshot: Cannot allocate memory
total 4
drwxr-xr-x 1 root root  32 Oct 11 06:09 .
drwxr-x--- 1 root root  32 Oct 11 07:42 ..
-rw------- 1 root root 135 Oct 11 06:09 info.xml
d????????? ? ?    ?      ?            ? snapshot

Several older snapshots are fine, and those predate my 3.17 upgrade. I noticed that this corrupted snapshot isn't even listed in my snapper lists.

btrfs su delete /data/.snapshots/5338/snapshot
Transaction commit: none (default)
ERROR: error accessing '/data/.snapshots/5338/snapshot'

Removing them appears to be problematic as well. I might just disable compress=lzo and go back to 3.16 to see how that goes.

--
Rich
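[Editorial sketch: since the symptom is that stat() on the snapshot entry fails (hence the "Cannot allocate memory" from ls and the "d?????????" listing), a plain shell test can flag which snapshots are in this state. The snapper-style layout below is a placeholder; adjust the path to your setup.]

```shell
#!/bin/sh
# List snapshots whose 'snapshot' entry can no longer be stat'ed --
# these show up as "d?????????" in ls -la, as seen in this thread.
# The default directory is a placeholder (snapper's layout).
check_snapshots() {
    snapdir="$1"
    for d in "$snapdir"/*/snapshot; do
        if [ -e "$d" ]; then
            echo "ok: $d"
        else
            # stat failed ("Cannot allocate memory" on the damaged ones)
            echo "BROKEN: $d"
        fi
    done
}

check_snapshots "${1:-/data/.snapshots}"
```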
Re: btrfs random filesystem corruption in kernel 3.17
I think I just found a consistent simple way to trigger the problem (at least on my system). And, as I guessed before, it seems to be related just to read-only snapshots:

1) I create a read-only snapshot
2) I do some changes on the source subvolume for the snapshot (I'm not sure changes are strictly needed)
3) reboot (or probably just unmount and remount; I reboot because the fs I have problems with contains my root subvolume)

After the rebooting (or the remount) I consistently have the corruption, with the usual multitude of these in dmesg:

parent transid verify failed on 902316032 wanted 2484 found 4101

and the characteristic ls -la output:

drwxr-xr-x 1 root root 250 Oct 10 15:37 root
d????????? ? ? ? ? ? root-b2
drwxr-xr-x 1 root root 250 Oct 10 15:37 root-b3
d????????? ? ? ? ? ? root-backup

root-backup and root-b2 are both read-only, whereas root-b3 is rw (and it didn't get corrupted).

David, maybe you can try the same steps on one of your machines?

John
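[Editorial sketch: John's three steps can be written down as a script. The paths are placeholders, and the script dry-runs by default (`BTRFS` and `RUN` expand to `echo`, so commands are printed, not executed); set `BTRFS=btrfs` and `RUN=""` to run it for real, on a scratch filesystem only, never on data you care about.]

```shell
#!/bin/sh
# Reproduction recipe from this thread.  Dry-run by default: commands
# are printed rather than executed.  Set BTRFS=btrfs and RUN="" to run
# for real (as root, on a SCRATCH btrfs filesystem).
BTRFS="${BTRFS-echo btrfs}"
RUN="${RUN-echo}"

mnt="/mnt/scratch"       # placeholder: a scratch btrfs mount point
src="$mnt/root"          # placeholder: source subvolume
snap="$mnt/root-backup"  # the read-only snapshot that ends up damaged

# 1) create a read-only snapshot (-r is the key detail)
$BTRFS subvolume snapshot -r "$src" "$snap"

# 2) change something in the source subvolume (possibly not required)
$RUN touch "$src/testfile"

# 3) unmount and remount (stands in for the reboot)
$RUN umount "$mnt"
$RUN mount "$mnt"

# Then check for the symptoms:
#   dmesg | grep "parent transid verify failed"
#   ls -la "$mnt"    (damaged snapshots show as "d?????????")
```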
Re: btrfs random filesystem corruption in kernel 3.17
On Mon, Oct 13, 2014 at 4:27 PM, David Arendt wrote:
> From my own experience and based on what other people are saying, I
> think there is a random btrfs filesystem corruption problem in kernel
> 3.17, at least related to snapshots, so I decided to post under another
> subject to draw attention from people not concerned with btrfs send.
> More information can be found in the btrfs send posts.
>
> Did the filesystem you tried to balance contain snapshots? Read-only
> ones?

The filesystem contains numerous subvolumes and snapshots, many of which
are read-only. I'm managing many of them with snapper.

The similarity of the transid verify errors made me think this issue is
related, and that the root cause may have nothing to do with btrfs send.
As far as I can tell these errors aren't having any effect on my data -
hopefully the system is catching the problems before there are actual
disk writes, etc.

--
Rich
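[The "wanted smaller than found" pattern that makes these reports look alike can be pulled out of dmesg mechanically. A small sketch - the message format is the one quoted in this thread; field positions are assumed from those examples and could differ across kernel versions:]

```shell
# transid_check: read kernel log lines on stdin, find "parent transid
# verify failed on B wanted W found F" messages, and classify each as
# NORMAL (found < wanted: the fs moved on and an old block turned up)
# or REVERSED (found > wanted: the pattern seen in these 3.17 reports).
# Usage: dmesg | transid_check
transid_check() {
    awk '/parent transid verify failed/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "wanted") wanted = $(i + 1)
            if ($i == "found")  found  = $(i + 1)
        }
        print (found + 0 > wanted + 0 ? "REVERSED:" : "NORMAL:"), $0
    }'
}
```

[Run against the dmesg lines quoted in this thread, "wanted 2484 found 4101" classifies as REVERSED.]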
Re: btrfs random filesystem corruption in kernel 3.17
From my own experience and based on what other people are saying, I
think there is a random btrfs filesystem corruption problem in kernel
3.17, at least related to snapshots, so I decided to post under another
subject to draw attention from people not concerned with btrfs send.
More information can be found in the btrfs send posts.

Did the filesystem you tried to balance contain snapshots? Read-only
ones?

On 10/13/2014 07:22 PM, Rich Freeman wrote:
> On Sun, Oct 12, 2014 at 7:11 AM, David Arendt wrote:
>> This weekend I finally had time to try btrfs send again on the newly
>> created fs. Now I am running into another problem:
>>
>> btrfs send returns: ERROR: send ioctl failed with -12: Cannot allocate
>> memory
>>
>> In dmesg I see only the following output:
>>
>> parent transid verify failed on 21325004800 wanted 2620 found 8325
>>
> I'm not using send at all, but I've been running into "parent transid
> verify failed" messages, where the wanted transid is way smaller than
> the found one, when trying to balance a raid1 after adding a new drive.
> Originally I had gotten a BUG, and after reboot the drive finished
> balancing (interestingly enough without moving any chunks to the new
> drive - just consolidating everything on the old drives), and then when
> I try to do another balance I get:
>
> [ 4426.987177] BTRFS info (device sdc2): relocating block group
> 10367073779712 flags 17
> [ 4446.287998] BTRFS info (device sdc2): found 13 extents
> [ 4451.330887] parent transid verify failed on 10063286579200 wanted
> 987432 found 993678
> [ 4451.350663] parent transid verify failed on 10063286579200 wanted
> 987432 found 993678
>
> The btrfs program itself outputs:
>
> btrfs balance start -v /data
> Dumping filters: flags 0x7, state 0x0, force is off
>   DATA (flags 0x0): balancing
>   METADATA (flags 0x0): balancing
>   SYSTEM (flags 0x0): balancing
> ERROR: error during balancing '/data' - Cannot allocate memory
> There may be more info in syslog - try dmesg | tail
>
> This is also on 3.17. This may be completely unrelated, but it seemed
> similar enough to be worth mentioning.
>
> The filesystem otherwise seems to work fine, other than the new drive
> not having any data on it:
>
> Label: 'datafs'  uuid: cd074207-9bc3-402d-bee8-6a8c77d56959
>         Total devices 6 FS bytes used 2.16TiB
>         devid 1 size 2.73TiB used 2.40TiB path /dev/sdc2
>         devid 2 size 931.32GiB used 695.03GiB path /dev/sda2
>         devid 3 size 931.32GiB used 700.00GiB path /dev/sdb2
>         devid 4 size 931.32GiB used 700.00GiB path /dev/sdd2
>         devid 5 size 931.32GiB used 699.00GiB path /dev/sde2
>         devid 6 size 2.73TiB used 0.00 path /dev/sdf2
>
> This is btrfs-progs-3.16.2.
>
> --
> Rich
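[Rich's observation that the new devid 6 got no data after the balance can be checked from "btrfs filesystem show" output. A sketch, assuming the field layout of the btrfs-progs 3.16-era output quoted above; newer versions format the numbers slightly differently, so treat this as an illustration rather than a robust parser:]

```shell
# empty_devices: read "btrfs filesystem show" output on stdin and print
# devices whose "used" figure is zero, i.e. devices a balance never
# populated. Relies on lines of the form
#   devid N size SIZE used USED path DEV
# Usage: btrfs filesystem show | empty_devices
empty_devices() {
    # awk's numeric coercion turns "0.00" (or "0.00B") into 0 and
    # "2.40TiB" into 2.4, so "$6 + 0 == 0" flags only unused devices.
    awk '$1 == "devid" && $6 + 0 == 0 { print "unused device:", $NF }'
}
```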