Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds
On Wed, Mar 19, 2014 at 10:53:33AM -0600, Chris Murphy wrote:
> On Mar 19, 2014, at 9:40 AM, Marc MERLIN m...@merlins.org wrote:
>> After adding a drive, I couldn't quite tell if it was striping over 11
>> drives or 10, but it felt that at least at times it was striping over 11
>> drives, with write failures on the missing drive. I can't prove it, but
>> I'm thinking the new data I was writing was being striped in degraded mode.
>
> Well, it does sound fragile after all to add a drive to a degraded array,
> especially when it's not expressly treating the faulty drive as faulty. I
> think iotop will show what block devices are being written to. And in a VM
> it's easy (albeit rudimentary) with sparse files, as you can see them grow.
>
>> Yes, although it's limited: you apparently only lose new data that was
>> added after you went into degraded mode, and only if you add another drive
>> where you write more data. In real life this shouldn't be too common, even
>> if it is indeed a bug.
>
> It's entirely plausible that a drive power/data cable becomes loose and the
> array runs for hours degraded before the wayward device is reseated. It'll
> be common enough. It's definitely not OK for all of the data written in the
> interim to vanish just because the volume has resumed from degraded to
> normal. Two states of data, normal vs degraded, is scary. It sounds like
> totally silent data loss. So yeah, if it's reproducible it's worthy of a
> separate bug.

I just got around to filing that bug:
https://bugzilla.kernel.org/show_bug.cgi?id=72811

In other news, I was able to:
1) remove a drive
2) mount degraded
3) add a new drive
4) rebalance (that took 2 days with little data, plus 4 deadlocks and reboots, though)
5) remove the missing drive from the filesystem
6) remount the array without -o degraded

Now I'm testing:
1) add a new drive
2) remove a working drive
3) the automatic rebalance triggered by #2 should rebuild onto the new drive automatically

Marc
--
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
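(In case it helps anyone following along, here is a rough sketch of that second test as commands. The device name crypt_new1 and the exact member being removed are placeholders for illustration, not taken from the log, and whether the forced relocation in step 2 really lands everything on the new drive is exactly what is being tested.)

    btrfs device add -f /dev/mapper/crypt_new1 /mnt/btrfs_backupcopy   # 1) add a new drive
    btrfs device delete /dev/mapper/crypt_sdk1 /mnt/btrfs_backupcopy   # 2) remove a working drive;
                                                                       #    forces its chunks to be relocated
    btrfs filesystem show backupcopy                                   # 3) check where the relocated
                                                                       #    chunks ended up, per device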
Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds
Marc MERLIN posted on Wed, 19 Mar 2014 08:40:31 -0700 as excerpted:

> That's the thing though. If the bad device hadn't been forcibly removed
> (and apparently the only way to do this was to unmount, make the device
> node disappear, and remount in degraded mode), it looked to me like btrfs
> was still considering that the drive was part of the array and trying to
> write to it.
> After adding a drive, I couldn't quite tell if it was striping over 11
> drives or 10, but it felt that at least at times it was striping over 11
> drives, with write failures on the missing drive. I can't prove it, but
> I'm thinking the new data I was writing was being striped in degraded mode.

FWIW, there are at least two problems here: one a bug (or perhaps it'd more
accurately be described as an as-yet incomplete feature) unrelated to btrfs
raid5/6 mode, the other the incomplete raid5/6 support itself. Both are
known issues, however.

The incomplete raid5/6 support is discussed well enough elsewhere, including
in this thread as a whole, which leaves the other issue.

The other issue, not specifically raid5/6 related, is that currently
in-kernel btrfs is basically oblivious to disappearing drives, which
explains some of the more complex bits of the behavior you described. Yes,
the kernel has the device data and other layers know when a device goes
missing, but it's basically a case of the right hand not knowing what the
left hand is doing -- once set up on a set of devices, in-kernel btrfs
basically doesn't do anything with the device information available to it,
at least in terms of removing a device from its listing when it goes
missing. (It does seem to transparently handle a missing btrfs component
device reappearing, arguably /too/ transparently!)

Basically, all btrfs does is log errors when a component device disappears.
It doesn't do anything with the disappeared device, and really doesn't know
it has disappeared at all, until an unmount and (possibly degraded) remount,
at which point it re-enumerates the devices and again knows what's actually
there... until a device disappears again.

There are actually patches being worked on to fix that situation as we
speak, and it's possible they're already in btrfs-next. (I've seen the
patches and discussion go by on the list but haven't tracked them closely
enough to know their current status, other than that they're not in
mainline yet.)

Meanwhile, counter-intuitively, btrfs-userspace is sometimes more aware of
current device status than btrfs-kernel is ATM, since parts of userspace
actually either get current status from the kernel or trigger a rescan in
order to get it. But even after a rescan updates what userspace knows, and
thus what the kernel as a whole knows, btrfs-kernel still doesn't actually
use that new information, even though it's available in the very kernel
that btrfs-userspace got it from!

Knowing that rather counterintuitive little inconsistency, which isn't
actually so little, goes quite a way toward explaining what otherwise looks
like illogical btrfs behavior -- how could kernel-btrfs not know the status
of its own devices?

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master. Richard Stallman
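(A rough way to see the split Duncan describes between what userspace learns and what the mounted filesystem acts on. The label is the one from Marc's log; nothing else here is specific to his setup, and this is a sketch rather than output from a real session.)

    btrfs device scan                  # have userspace (re)register btrfs member devices with the kernel
    btrfs filesystem show backupcopy   # userspace view of the members, after the rescan
    dmesg | tail -n 50                 # the mounted fs keeps logging write/IO errors against the lost member
    grep btrfs /proc/mounts            # ...while staying mounted as if nothing had changed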
Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds
I think after the balance it was a fine, non-degraded RAID again... as far
as I remember.

Tobby

2014-03-20 1:46 GMT+01:00 Marc MERLIN m...@merlins.org:
> On Thu, Mar 20, 2014 at 01:44:20AM +0100, Tobias Holst wrote:
>> I tried the RAID6 implementation of btrfs and it looks like I had the
>> same problem. Rebuild with balance worked, but when a drive was removed
>> while mounted and then re-added, the chaos began. I tried it a few times.
>> So when a drive fails (and this is just because of a lost connection or a
>> similar non-severe problem), it is necessary to wipe the disk first
>> before re-adding it, so btrfs will add it as a new disk and not try to
>> re-add the old one.
>
> Good to know you got this too.
> Just to confirm: did you get it to rebuild, or once a drive is lost/gets
> behind, are you in degraded mode forever for those blocks? Or were you
> able to balance?
>
> Marc
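(A rough sketch of Tobias's wipe-before-re-add workaround as commands. The device name and mount point are placeholders, I haven't verified this exact sequence, and wipefs will of course destroy the stale btrfs signature on that device.)

    umount /mnt/btrfs_backupcopy                                    # if the array is still mounted
    wipefs -a /dev/mapper/crypt_sdX1                                # erase the stale btrfs superblock signature
    mount -o degraded -L backupcopy /mnt/btrfs_backupcopy           # mount with the member missing
    btrfs device add /dev/mapper/crypt_sdX1 /mnt/btrfs_backupcopy   # now it comes back as a new device
    btrfs balance start /mnt/btrfs_backupcopy                       # restripe the data onto it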
Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds
On Tue, Mar 18, 2014 at 09:02:07AM +0000, Duncan wrote:
> First just a note that you hijacked Mr Manana's patch thread. Replying
> (...)

I did. I use mutt, I know about In-Reply-To, I was tired, I screwed up,
sorry, and there was no undo :)

> Since you don't have to worry about the data I'd suggest blowing it away
> and starting over. Btrfs raid5/6 code is known to be incomplete at this
> point, to work in normal mode and write everything out, but with
> incomplete recovery code. So I'd treat it like the raid-0 mode it
> effectively is, and consider it lost if a device drops. Which I haven't.
> My use-case wouldn't be looking at raid5/6 (or raid0) anyway, but even if
> it were, I'd not touch the current code unless it /was/ just for
> something I'd consider risking on a raid0. Other than

Thank you for the warning, and yes, I know the risk; the data I'm putting
on it is OK with that risk :)

So, I was a bit quiet because I was diagnosing problems with the underlying
hardware. My disk array was generating disk faults because it wasn't
getting enough power. Now that I've fixed that, and made sure the drives
work with a full run of hdrecover on all the drives in parallel (it
exercises the drives while checking that all their blocks work), I ran
tests again.

Summary:
1) You can grow and shrink a raid5 volume while it's mounted => very cool
2) Shrinking causes a rebalance
3) Growing requires you to run a rebalance yourself
4) btrfs cannot replace a drive in raid5, whether the old drive is still
   present or not. That's the biggest thing missing: just no rebuilds in
   any way.
5) You can mount a raid5 with a missing device with -o degraded
6) Adding a drive to a degraded array will grow the array, not rebuild the
   missing bits
7) You can remove a drive from an array, add files, and then if you plug
   the drive back in, it apparently gets automatically pulled back into the
   array. There is no rebuild: you now have an inconsistent array where one
   drive is not at the same level as the other ones (I lost all the files I
   had added after the drive was removed, once I added the drive back).

In other words, everything seems to work except there is no rebuild that I
could see anywhere.

Here are all the details:

Creation:
polgara:/dev/disk/by-id# mkfs.btrfs -f -d raid5 -m raid5 -L backupcopy /dev/mapper/crypt_sd[bdfghijkl]1

WARNING! - Btrfs v3.12 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
Turning ON incompat feature 'raid56': raid56 extended format
adding device /dev/mapper/crypt_sdd1 id 2
adding device /dev/mapper/crypt_sdf1 id 3
adding device /dev/mapper/crypt_sdg1 id 4
adding device /dev/mapper/crypt_sdh1 id 5
adding device /dev/mapper/crypt_sdi1 id 6
adding device /dev/mapper/crypt_sdj1 id 7
adding device /dev/mapper/crypt_sdk1 id 8
adding device /dev/mapper/crypt_sdl1 id 9
fs created label backupcopy on /dev/mapper/crypt_sdb1
        nodesize 16384 leafsize 16384 sectorsize 4096 size 4.09TiB

polgara:/dev/disk/by-id# mount -L backupcopy /mnt/btrfs_backupcopy/
polgara:/mnt/btrfs_backupcopy# df -h .
Filesystem              Size  Used Avail Use% Mounted on
/dev/mapper/crypt_sdb1  4.1T  3.0M  4.1T   1% /mnt/btrfs_backupcopy

Let's add one drive:
polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/
polgara:/mnt/btrfs_backupcopy# df -h .
Filesystem              Size  Used Avail Use% Mounted on
/dev/mapper/crypt_sdb1  4.6T  3.0M  4.6T   1% /mnt/btrfs_backupcopy

Oh look, it's bigger now. We need to manually rebalance to use the new drive:
polgara:/mnt/btrfs_backupcopy# btrfs balance start .
Done, had to relocate 6 out of 6 chunks

polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sdm1 .
BTRFS info (device dm-9): relocating block group 23314563072 flags 130
BTRFS info (device dm-9): relocating block group 22106603520 flags 132
BTRFS info (device dm-9): found 6 extents
BTRFS info (device dm-9): relocating block group 12442927104 flags 129
BTRFS info (device dm-9): found 1 extents
polgara:/mnt/btrfs_backupcopy# df -h .
Filesystem              Size  Used Avail Use% Mounted on
/dev/mapper/crypt_sdb1  4.1T  4.7M  4.1T   1% /mnt/btrfs_backupcopy

Ah, it's smaller again. Note that it's not degraded: you can just keep
removing drives and it will do a forced rebalance to fit the data on the
remaining drives.

OK, I've unmounted the filesystem and will manually remove a device:
polgara:~# dmsetup remove crypt_sdl1
polgara:~# mount -L backupcopy /mnt/btrfs_backupcopy/
mount: wrong fs type, bad option, bad superblock on /dev/mapper/crypt_sdk1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try dmesg | tail or so
BTRFS: open /dev/dm-9 failed
BTRFS info (device dm-7): disk space caching is enabled
BTRFS: failed to read chunk tree on dm-7
BTRFS: open_ctree failed

So a normal mount fails. You have to mount with -o degraded instead.
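(For completeness, a rough sketch of the degraded-mount path that the summary above refers to. These are reconstructed commands with placeholder device names, not part of the captured log, and per point 4 of the summary none of this performs a real rebuild of the missing drive's data.)

    mount -o degraded -L backupcopy /mnt/btrfs_backupcopy/              # mount despite the missing member
    btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/   # grow the array with a fresh drive
    btrfs balance start /mnt/btrfs_backupcopy/                          # restripe existing data over the new layout
    btrfs device delete missing /mnt/btrfs_backupcopy/                  # then drop the absent member from the fs
    umount /mnt/btrfs_backupcopy/
    mount -L backupcopy /mnt/btrfs_backupcopy/                          # remount without -o degraded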
Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds
On Mar 19, 2014, at 12:09 AM, Marc MERLIN m...@merlins.org wrote:

> 7) You can remove a drive from an array, add files, and then if you plug
> the drive back in, it apparently gets automatically pulled back into the
> array. There is no rebuild: you now have an inconsistent array where one
> drive is not at the same level as the other ones (I lost all the files I
> had added after the drive was removed, once I added the drive back).

Seems worthy of a dedicated bug report and keeping an eye on in the future.
Not good.

> polgara:/mnt/btrfs_backupcopy# df -h .
> Filesystem              Size  Used Avail Use% Mounted on
> /dev/mapper/crypt_sdb1  4.1T  3.0M  4.1T   1% /mnt/btrfs_backupcopy
>
> Let's add one drive:
> polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/
> polgara:/mnt/btrfs_backupcopy# df -h .
> Filesystem              Size  Used Avail Use% Mounted on
> /dev/mapper/crypt_sdb1  4.6T  3.0M  4.6T   1% /mnt/btrfs_backupcopy
>
> Oh look, it's bigger now. We need to manually rebalance to use the new drive:

You don't have to. As soon as you add the additional drive, newly allocated
chunks will stripe across all available drives. e.g. with 1 GB allocations
striped across 3 drives, if I add a 4th drive, initially any additional
writes go only to the first three drives, but once a new data chunk is
allocated it gets striped across all 4 drives.

> In other words, btrfs happily added my device that was way behind and gave
> me an incomplete filesystem, instead of noticing that sdj1 was behind and
> giving me a degraded filesystem.
> Moral of the story: do not ever re-add a device that got kicked out if you
> wrote new data after that, or you will end up with an older version of
> your filesystem. (On the plus side, it's consistent and apparently without
> data corruption. That said, btrfs scrub complained loudly of many errors
> it didn't know how to fix.)

Sure, the whole thing isn't corrupt. But if anything written while degraded
vanishes once the missing device is reattached and you remount normally
(non-degraded), that's data loss. Yikes!

> There you go, hope this helps.

Yes. Thanks!

Chris Murphy
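(A rough way to watch the chunk-level behavior Chris describes after a device add. The label and mount point are from Marc's log; the commands are generic and not taken from a real session.)

    btrfs filesystem df /mnt/btrfs_backupcopy   # per-profile totals (Data/Metadata, RAID5) as chunks get allocated
    btrfs filesystem show backupcopy            # per-device allocated bytes; the new member stays near zero
                                                # until the next chunk allocation stripes across it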
Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds
On Wed, Mar 19, 2014 at 12:32:55AM -0600, Chris Murphy wrote:
> On Mar 19, 2014, at 12:09 AM, Marc MERLIN m...@merlins.org wrote:
>> 7) You can remove a drive from an array, add files, and then if you plug
>> the drive back in, it apparently gets automatically pulled back into the
>> array. There is no rebuild: you now have an inconsistent array where one
>> drive is not at the same level as the other ones (I lost all the files I
>> had added after the drive was removed, once I added the drive back).
>
> Seems worthy of a dedicated bug report and keeping an eye on in the
> future. Not good.

Since it's not supposed to be working, I didn't file a bug, but I figured
it'd be good for people to know about it in the meantime.

>> polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/
>> polgara:/mnt/btrfs_backupcopy# df -h .
>> Filesystem              Size  Used Avail Use% Mounted on
>> /dev/mapper/crypt_sdb1  4.6T  3.0M  4.6T   1% /mnt/btrfs_backupcopy
>>
>> Oh look, it's bigger now. We need to manually rebalance to use the new drive:
>
> You don't have to. As soon as you add the additional drive, newly
> allocated chunks will stripe across all available drives. e.g. with 1 GB
> allocations striped across 3 drives, if I add a 4th drive, initially any
> additional writes go only to the first three drives, but once a new data
> chunk is allocated it gets striped across all 4 drives.

That's the thing though. If the bad device hadn't been forcibly removed
(and apparently the only way to do this was to unmount, make the device
node disappear, and remount in degraded mode), it looked to me like btrfs
was still considering that the drive was part of the array and trying to
write to it.
After adding a drive, I couldn't quite tell if it was striping over 11
drives or 10, but it felt that at least at times it was striping over 11
drives, with write failures on the missing drive. I can't prove it, but I'm
thinking the new data I was writing was being striped in degraded mode.

> Sure, the whole thing isn't corrupt. But if anything written while
> degraded vanishes once the missing device is reattached and you remount
> normally (non-degraded), that's data loss. Yikes!

Yes, although it's limited: you apparently only lose new data that was
added after you went into degraded mode, and only if you add another drive
where you write more data. In real life this shouldn't be too common, even
if it is indeed a bug.

Cheers,
Marc
Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds
On Mar 19, 2014, at 9:40 AM, Marc MERLIN m...@merlins.org wrote:

> After adding a drive, I couldn't quite tell if it was striping over 11
> drives or 10, but it felt that at least at times it was striping over 11
> drives, with write failures on the missing drive. I can't prove it, but
> I'm thinking the new data I was writing was being striped in degraded mode.

Well, it does sound fragile after all to add a drive to a degraded array,
especially when it's not expressly treating the faulty drive as faulty. I
think iotop will show what block devices are being written to. And in a VM
it's easy (albeit rudimentary) with sparse files, as you can see them grow.

> Yes, although it's limited: you apparently only lose new data that was
> added after you went into degraded mode, and only if you add another drive
> where you write more data. In real life this shouldn't be too common, even
> if it is indeed a bug.

It's entirely plausible that a drive power/data cable becomes loose and the
array runs for hours degraded before the wayward device is reseated. It'll
be common enough. It's definitely not OK for all of the data written in the
interim to vanish just because the volume has resumed from degraded to
normal. Two states of data, normal vs degraded, is scary. It sounds like
totally silent data loss. So yeah, if it's reproducible it's worthy of a
separate bug.

Chris Murphy
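(A rough sketch of the checks Chris suggests; nothing here is specific to Marc's setup, and the sparse-file trick assumes a VM whose virtual disks are backed by sparse image files named like disk*.img.)

    iotop -o -d 2              # only show tasks actually doing I/O, refreshed every 2 seconds
    iostat -x -d 2             # per-device write rates; a truly absent member should show no writes
    watch -n 5 'du -h --apparent-size disk*.img; du -h disk*.img'
                               # apparent vs allocated size of the sparse backing files: allocated
                               # size only grows for devices that are actually being written to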
Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds
On Wed, Mar 19, 2014 at 10:53:33AM -0600, Chris Murphy wrote:
>> Yes, although it's limited: you apparently only lose new data that was
>> added after you went into degraded mode, and only if you add another
>> drive where you write more data. In real life this shouldn't be too
>> common, even if it is indeed a bug.
>
> It's entirely plausible that a drive power/data cable becomes loose and
> the array runs for hours degraded before the wayward device is reseated.
> It'll be common enough. It's definitely not OK for all of the data written
> in the interim to vanish just because the volume has resumed from degraded
> to normal. Two states of data, normal vs degraded, is scary. It sounds
> like totally silent data loss. So yeah, if it's reproducible it's worthy
> of a separate bug.

Actually, what I did is more complex: I first added a drive to a degraded
array, and then re-added the drive that had been removed. I don't know if
re-adding just the same drive that was removed would cause the bug I saw.

For now, my array is back to actually trying to store the backup I had
meant it for, and the drives seem stable now that I fixed the power issue.

Does someone else want to try? :)

Marc
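(For anyone who wants to try reproducing this without risking real disks, a rough sketch using sparse files and loop devices. File names, sizes, and the mount point are arbitrary, 'somefiles' and 'morefiles' are placeholders, and whether the last mount silently hides the data written while degraded is exactly the open question.)

    truncate -s 2G d1.img d2.img d3.img d4.img
    for f in d1.img d2.img d3.img d4.img; do losetup -f --show "$f"; done   # note the /dev/loopN names
    mkfs.btrfs -f -d raid5 -m raid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
    mkdir -p /mnt/test
    btrfs device scan && mount /dev/loop0 /mnt/test
    cp -a somefiles /mnt/test && umount /mnt/test
    losetup -d /dev/loop3                                # simulate the drive disappearing
    btrfs device scan && mount -o degraded /dev/loop0 /mnt/test
    cp -a morefiles /mnt/test && umount /mnt/test        # data written while degraded
    losetup -f --show d4.img                             # "reseat" the wayward device
    btrfs device scan && mount /dev/loop0 /mnt/test      # then check whether morefiles survived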