Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds

2014-03-23 Thread Marc MERLIN
On Wed, Mar 19, 2014 at 10:53:33AM -0600, Chris Murphy wrote:
 
 On Mar 19, 2014, at 9:40 AM, Marc MERLIN m...@merlins.org wrote:
  
  After adding a drive, I couldn't quite tell if it was striping over 11
  drives or 10, but it felt that at least at times, it was striping over 11
  drives with write failures on the missing drive.
  I can't prove it, but I'm thinking the new data I was writing was being
  striped in degraded mode.
 
 Well it does sound fragile after all to add a drive to a degraded array, 
 especially when it's not expressly treating the faulty drive as faulty. I 
 think iotop will show what block devices are being written to. And in a VM 
 it's easy (albeit rudimentary) with sparse files, as you can see them grow.
 
  
  Yes, although it's limited, you apparently only lose new data that was added
  after you went into degraded mode and only if you add another drive where
  you write more data.
  In real life this shouldn't be too common, even if it is indeed a bug.
 
 It's entirely plausible that a drive's power/data cable comes loose and the array 
 runs for hours degraded before the wayward device is reseated. It'll be common 
 enough. It's definitely not OK for all of the data written in the interim to vanish 
 just because the volume has resumed from degraded to normal. Having two states of 
 data, normal vs degraded, is scary. It sounds like totally silent data loss. So 
 yeah, if it's reproducible it's worthy of a separate bug.

I just got around to filing that bug:
https://bugzilla.kernel.org/show_bug.cgi?id=72811

In other news, I was able to (rough command sketch below):
1) remove a drive
2) mount degraded
3) add a new drive
4) rebalance (that took 2 days with little data, plus 4 deadlocks and reboots)
5) remove the missing drive from the filesystem
6) remount the array without -o degraded
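
For reference, a rough sketch of the commands for that sequence (device names
are just examples from my setup, adjust to yours):

  umount /mnt/btrfs_backupcopy
  dmsetup remove crypt_sdl1                                 # 1) make a drive disappear
  mount -o degraded -L backupcopy /mnt/btrfs_backupcopy     # 2) mount degraded
  btrfs device add /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy   # 3) add a new drive
  btrfs balance start /mnt/btrfs_backupcopy                 # 4) rebalance
  btrfs device delete missing /mnt/btrfs_backupcopy         # 5) drop the missing drive
  umount /mnt/btrfs_backupcopy                              # 6) remount without -o degraded
  mount -L backupcopy /mnt/btrfs_backupcopy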

Now, I'm testing (rough sketch below):
1) add a new drive
2) remove a working drive
3) the automatic rebalance triggered by #2 should rebuild onto the new drive
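
Roughly (again, example device names):

  btrfs device add /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy   # 1) add a new drive
  # 2) deleting a working drive forces btrfs to migrate its data off,
  #    which should restripe onto the remaining drives, including the new one
  btrfs device delete /dev/mapper/crypt_sdl1 /mnt/btrfs_backupcopy
  btrfs filesystem show backupcopy                                # 3) check the result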

Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds

2014-03-20 Thread Duncan
Marc MERLIN posted on Wed, 19 Mar 2014 08:40:31 -0700 as excerpted:

 That's the thing though. If the bad device hadn't been forcibly removed,
 and apparently the only way to do this was to unmount, make the device
 node disappear, and remount in degraded mode, it looked to me like btrfs
 was still considering that the drive was part of the array and trying to
 write to it.
 After adding a drive, I couldn't quite tell if it was striping over 11
 drives or 10, but it felt that at least at times, it was striping over
 11 drives with write failures on the missing drive.
 I can't prove it, but I'm thinking the new data I was writing was being
 striped in degraded mode.

FWIW, there are at least two problems here, one a bug (or perhaps it'd more 
accurately be described as an as yet incomplete feature) unrelated to 
btrfs raid5/6 mode, the other the incomplete raid5/6 support.  Both are 
known issues, however.

The incomplete raid5/6 is discussed well enough elsewhere including in 
this thread as a whole, which leaves the other issue.

The other issue, not specifically raid5/6 mode related, is that 
currently, in-kernel btrfs is basically oblivious to disappearing drives, 
thus explaining some of the more complex bits of the behavior you 
described.  Yes, the kernel has the device data and other layers know 
when a device goes missing, but it's basically a case of the right hand 
not knowing what the left hand is doing -- once set up on a set of 
devices, in-kernel btrfs basically doesn't do anything with the device 
information available to it, at least in terms of removing a device from 
its listing when it goes missing.  (It does seem to transparently handle 
a missing btrfs component device reappearing, arguably /too/ 
transparently!)

Basically all btrfs does is log errors when a component device 
disappears.  It doesn't do anything with the disappeared device, and 
really doesn't know it has disappeared at all, until an unmount and 
(possibly degraded) remount, at which point it re-enumerates the devices 
and again knows what's actually there... until a device disappears again.

There are actually patches being worked on to fix that situation as we 
speak, and it's possible they're actually in btrfs-next already.  (I've 
seen the patches and discussion go by on the list but haven't tracked 
them to the extent that I know current status, other than that they're 
not in mainline yet.)

Meanwhile, counter-intuitively, btrfs-userspace is sometimes more aware 
of current device status than btrfs-kernel is ATM, since parts of 
userspace actually either get current status from the kernel, or trigger 
a rescan in order to get it.  But even after a rescan updates what 
userspace knows and thus what the kernel as a whole knows, btrfs-kernel 
still doesn't actually use that new information available to it in the 
same kernel that btrfs-userspace used to get it from!
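
As a rough illustration (this only refreshes what userspace reports; it
doesn't fix the in-kernel tracking described above):

  btrfs device scan          # (re)scan block devices for btrfs filesystems
  btrfs filesystem show      # list filesystems and member devices as now known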

Knowing about that rather counterintuitive little inconsistency (which isn't 
actually so little) goes quite a way toward explaining what otherwise 
looks like illogical btrfs behavior -- how could kernel-btrfs not know 
the status of its own devices?

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds

2014-03-20 Thread Tobias Holst
I think after the balance it was a fine, non-degraded RAID again... As
far as I remember.

Tobby


2014-03-20 1:46 GMT+01:00 Marc MERLIN m...@merlins.org:

 On Thu, Mar 20, 2014 at 01:44:20AM +0100, Tobias Holst wrote:
  I tried the RAID6 implementation of btrfs and it looks like I had the
  same problem. Rebuild with balance worked, but when a drive was
  removed while mounted and then re-added, the chaos began. I tried it a
  few times. So when a drive fails (and this is just because of a lost
  connection or similar non-severe problem), it is necessary to wipe the
  disk first before re-adding it, so btrfs will add it as a new disk and
  not try to re-add the old one.

 Good to know you got this too.

 Just to confirm: did you get it to rebuild, or once a drive is lost/gets
 behind, you're in degraded mode forever for those blocks?

 Or were you able to balance?

 Marc
 --
 A mouse is a device used to point at the xterm you want to type in - A.S.R.
 Microsoft is to operating systems 
    what McDonalds is to gourmet cooking
 Home page: http://marc.merlins.org/


Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds

2014-03-19 Thread Marc MERLIN
On Tue, Mar 18, 2014 at 09:02:07AM +, Duncan wrote:
 First just a note that you hijacked Mr Manana's patch thread.  Replying 
(...)
I did, I use mutt, I know about In-Reply-To, I was tired, I screwed up,
sorry, and there was no undo :)

 Since you don't have to worry about the data I'd suggest blowing it away 
 and starting over.  Btrfs raid5/6 code is known to be incomplete at this 
 point, to work in normal mode and write everything out, but with 
 incomplete recovery code.  So I'd treat it like the raid-0 mode it 
 effectively is, and consider it lost if a device drops.

 Which I haven't.  My use-case wouldn't be looking at raid5/6 (or raid0) 
 anyway, but even if it were, I'd not touch the current code unless it 
 /was/ just for something I'd consider risking on a raid0.  Other than 

Thank you for the warning, and yes I know the risk and the data I'm putting
on it is ok with that risk :)

So, I was a bit quiet because I diagnosed problems with the underlying
hardware.
My disk array was creating disk faults due to insufficient power coming in.

Now that I've fixed that, I made sure the drives work with a full run of
hdrecover on all the drives in parallel (this exercises the drives while
verifying that all their blocks are readable), and then I ran my tests again.
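The parallel check was along these lines (the exact hdrecover invocation may
differ from what I actually ran):

  for d in /dev/sd[bdfghijkl]; do hdrecover "$d" & done; wait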

Summary:
1) You can grow and shrink a raid5 volume while it's mounted = very cool
2) shrinking causes a rebalance
3) growing requires you to run rebalance
4) btrfs cannot replace a drive in raid5, whether it's there or not
   that's the biggest thing missing: just no rebuilds in any way
5) you can mount a raid5 with a missing device with -o degraded
6) adding a drive to a degraded array will grow the array, not rebuild
   the missing bits
7) you can remove a drive from an array, add files, and then if you plug
   the drive back in, it apparently gets automatically sucked back into the array.
   No rebuild happens; you now have an inconsistent array where one drive is
   not at the same level as the other ones (I lost all the files I had added
   after the drive was removed, once I added the drive back).

In other words, everything seems to work except there is no rebuild that I 
could 
see anywhere.

Here are all the details:

Creation
 polgara:/dev/disk/by-id# mkfs.btrfs -f -d raid5 -m raid5 -L backupcopy 
 /dev/mapper/crypt_sd[bdfghijkl]1
 
 WARNING! - Btrfs v3.12 IS EXPERIMENTAL
 WARNING! - see http://btrfs.wiki.kernel.org before using
 
 Turning ON incompat feature 'extref': increased hardlink limit per file to 
 65536
 Turning ON incompat feature 'raid56': raid56 extended format
 adding device /dev/mapper/crypt_sdd1 id 2
 adding device /dev/mapper/crypt_sdf1 id 3
 adding device /dev/mapper/crypt_sdg1 id 4
 adding device /dev/mapper/crypt_sdh1 id 5
 adding device /dev/mapper/crypt_sdi1 id 6
 adding device /dev/mapper/crypt_sdj1 id 7
 adding device /dev/mapper/crypt_sdk1 id 8
 adding device /dev/mapper/crypt_sdl1 id 9
 fs created label backupcopy on /dev/mapper/crypt_sdb1
 nodesize 16384 leafsize 16384 sectorsize 4096 size 4.09TiB
 polgara:/dev/disk/by-id# mount -L backupcopy /mnt/btrfs_backupcopy/
 polgara:/mnt/btrfs_backupcopy# df -h .
 Filesystem  Size  Used Avail Use% Mounted on
 /dev/mapper/crypt_sdb1  4.1T  3.0M  4.1T   1% /mnt/btrfs_backupcopy

Let's add one drive
 polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 
 /mnt/btrfs_backupcopy/
 polgara:/mnt/btrfs_backupcopy# df -h .
 Filesystem  Size  Used Avail Use% Mounted on
 /dev/mapper/crypt_sdb1  4.6T  3.0M  4.6T   1% /mnt/btrfs_backupcopy

Oh look, it's bigger now. We need to manually rebalance to use the new drive:
 polgara:/mnt/btrfs_backupcopy# btrfs balance start . 
 Done, had to relocate 6 out of 6 chunks
 
 polgara:/mnt/btrfs_backupcopy#  btrfs device delete /dev/mapper/crypt_sdm1 .
 BTRFS info (device dm-9): relocating block group 23314563072 flags 130
 BTRFS info (device dm-9): relocating block group 22106603520 flags 132
 BTRFS info (device dm-9): found 6 extents
 BTRFS info (device dm-9): relocating block group 12442927104 flags 129
 BTRFS info (device dm-9): found 1 extents
 polgara:/mnt/btrfs_backupcopy# df -h .
 Filesystem  Size  Used Avail Use% Mounted on
 /dev/mapper/crypt_sdb1  4.1T  4.7M  4.1T   1% /mnt/btrfs_backupcopy

Ah, it's smaller again. Note that it's not degraded: you can just keep removing
drives and it'll force a rebalance to fit the data on the remaining drives.

Ok, I've unmounted the filesystem, and will manually remove a device:
 polgara:~# dmsetup remove crypt_sdl1
 polgara:~# mount -L backupcopy /mnt/btrfs_backupcopy/
 mount: wrong fs type, bad option, bad superblock on /dev/mapper/crypt_sdk1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail  or so
 BTRFS: open /dev/dm-9 failed
 BTRFS info (device dm-7): disk space caching is enabled
 BTRFS: failed to read chunk tree on dm-7
 BTRFS: open_ctree failed

So a normal mount fails. You need to mount with -o degraded instead.
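
A degraded mount would look something like this (reusing the filesystem label
from the mkfs above):

  mount -o degraded -L backupcopy /mnt/btrfs_backupcopy/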

Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds

2014-03-19 Thread Chris Murphy

On Mar 19, 2014, at 12:09 AM, Marc MERLIN m...@merlins.org wrote:
 
 7) you can remove a drive from an array, add files, and then if you plug
    the drive back in, it apparently gets automatically sucked back into the array.
    No rebuild happens; you now have an inconsistent array where one drive is
    not at the same level as the other ones (I lost all the files I had added
    after the drive was removed, once I added the drive back).

Seems worthy of a dedicated bug report and keeping an eye on in the future, not 
good.

 
 polgara:/mnt/btrfs_backupcopy# df -h .
 Filesystem  Size  Used Avail Use% Mounted on
 /dev/mapper/crypt_sdb1  4.1T  3.0M  4.1T   1% /mnt/btrfs_backupcopy
 
 Let's add one drive
 polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 
 /mnt/btrfs_backupcopy/
 polgara:/mnt/btrfs_backupcopy# df -h .
 Filesystem  Size  Used Avail Use% Mounted on
 /dev/mapper/crypt_sdb1  4.6T  3.0M  4.6T   1% /mnt/btrfs_backupcopy
 
  Oh look, it's bigger now. We need to manually rebalance to use the new drive:

You don't have to. As soon as you add the additional drive, newly allocated 
chunks will stripe across all available drives. E.g. with 1 GB chunk allocations 
striped across 3 drives, if I add a 4th drive, any additional writes initially 
go only to the first three drives, but once a new data chunk is allocated it 
gets striped across all 4 drives.
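
A rough way to watch this (just an illustration, reusing Marc's label and
mount point) is to compare per-device allocation before and after new chunks
get allocated:

  btrfs filesystem show backupcopy           # per-device size/used; a freshly added device starts near 0 used
  btrfs filesystem df /mnt/btrfs_backupcopy  # allocation by chunk type (Data/Metadata/System)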


 
 In other words, btrfs happily added my device that was way behind and gave me 
 an incomplete filesystem, instead of noticing that sdj1 was behind and giving 
 me a degraded filesystem.
 Moral of the story: do not ever re-add a device that got kicked out if you 
 wrote new data after that, or you will end up with an older version of your 
 filesystem (on the plus side, it's consistent and apparently without data 
 corruption). That said, btrfs scrub complained loudly about many errors it 
 didn't know how to fix.

Sure the whole thing isn't corrupt. But if anything written while degraded 
vanishes once the missing device is reattached, and you remount normally 
(non-degraded), that's data loss. Yikes!
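
(For what it's worth, a rough way to guard against the wayward-device case,
assuming you notice it before remounting: wipe the stale device so btrfs treats
it as brand new, re-add it, and scrub to verify. The device name below is just
an example.)

  wipefs -a /dev/mapper/crypt_sdl1               # clear the stale btrfs superblock
  btrfs device add /dev/mapper/crypt_sdl1 /mnt/btrfs_backupcopy
  btrfs scrub start -B /mnt/btrfs_backupcopy     # -B: run in the foreground and report errors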


 There you go, hope this helps.

Yes. Thanks!

Chris Murphy


Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds

2014-03-19 Thread Marc MERLIN
On Wed, Mar 19, 2014 at 12:32:55AM -0600, Chris Murphy wrote:
 
 On Mar 19, 2014, at 12:09 AM, Marc MERLIN m...@merlins.org wrote:
  
  7) you can remove a drive from an array, add files, and then if you plug
     the drive back in, it apparently gets automatically sucked back into the array.
     No rebuild happens; you now have an inconsistent array where one drive is
     not at the same level as the other ones (I lost all the files I had added
     after the drive was removed, once I added the drive back).
 
 Seems worthy of a dedicated bug report and keeping an eye on in the future, 
 not good.
 
Since it's not supposed to be working, I didn't file a bug, but I figured
it'd be good for people to know about it in the meantime.

  polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 
  /mnt/btrfs_backupcopy/
  polgara:/mnt/btrfs_backupcopy# df -h .
  Filesystem  Size  Used Avail Use% Mounted on
  /dev/mapper/crypt_sdb1  4.6T  3.0M  4.6T   1% /mnt/btrfs_backupcopy
  
   Oh look, it's bigger now. We need to manually rebalance to use the new drive:
 
 You don't have to. As soon as you add the additional drive, newly allocated 
 chunks will stripe across all available drives. E.g. with 1 GB chunk allocations 
 striped across 3 drives, if I add a 4th drive, any additional writes initially 
 go only to the first three drives, but once a new data chunk is allocated it 
 gets striped across all 4 drives.
 
That's the thing though. If the bad device hadn't been forcibly removed, and
apparently the only way to do this was to unmount, make the device node
disappear, and remount in degraded mode, it looked to me like btrfs was
still considering that the drive was part of the array and trying to write to
it.
After adding a drive, I couldn't quite tell if it was striping over 11
drives or 10, but it felt that at least at times, it was striping over 11
drives with write failures on the missing drive.
I can't prove it, but I'm thinking the new data I was writing was being
striped in degraded mode.

 Sure the whole thing isn't corrupt. But if anything written while degraded 
 vanishes once the missing device is reattached, and you remount normally 
 (non-degraded), that's data loss. Yikes!

Yes, although it's limited, you apparently only lose new data that was added
after you went into degraded mode and only if you add another drive where
you write more data.
In real life this shouldn't be too common, even if it is indeed a bug.

Cheers,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds

2014-03-19 Thread Chris Murphy

On Mar 19, 2014, at 9:40 AM, Marc MERLIN m...@merlins.org wrote:
 
 After adding a drive, I couldn't quite tell if it was striping over 11
 drives or 10, but it felt that at least at times, it was striping over 11
 drives with write failures on the missing drive.
 I can't prove it, but I'm thinking the new data I was writing was being
 striped in degraded mode.

Well it does sound fragile after all to add a drive to a degraded array, 
especially when it's not expressly treating the faulty drive as faulty. I think 
iotop will show what block devices are being written to. And in a VM it's easy 
(albeit rudimentary) with sparse files, as you can see them grow.
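
e.g. something like this on the VM host (the image path is just an example):

  # the first column (allocated blocks) grows as the sparse image fills in
  watch -n 5 'ls -ls /var/lib/libvirt/images/btrfs-test-*.img'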

 
 Yes, although it's limited, you apparently only lose new data that was added
 after you went into degraded mode and only if you add another drive where
 you write more data.
 In real life this shouldn't be too common, even if it is indeed a bug.

It's entirely plausible that a drive's power/data cable comes loose and the array 
runs for hours degraded before the wayward device is reseated. It'll be common 
enough. It's definitely not OK for all of the data written in the interim to vanish 
just because the volume has resumed from degraded to normal. Having two states of 
data, normal vs degraded, is scary. It sounds like totally silent data loss. So 
yeah, if it's reproducible it's worthy of a separate bug.


Chris Murphy



Re: How to handle a RAID5 array with a failing drive? - raid5 mostly works, just no rebuilds

2014-03-19 Thread Marc MERLIN
On Wed, Mar 19, 2014 at 10:53:33AM -0600, Chris Murphy wrote:
  Yes, although it's limited, you apparently only lose new data that was added
  after you went into degraded mode and only if you add another drive where
  you write more data.
  In real life this shouldn't be too common, even if it is indeed a bug.
 
 It's entirely plausible that a drive's power/data cable comes loose and the array 
 runs for hours degraded before the wayward device is reseated. It'll be common 
 enough. It's definitely not OK for all of the data written in the interim to vanish 
 just because the volume has resumed from degraded to normal. Having two states of 
 data, normal vs degraded, is scary. It sounds like totally silent data loss. So 
 yeah, if it's reproducible it's worthy of a separate bug.

Actually, what I did is more complex: I first added a drive to a degraded
array, and then re-added the drive that had been removed.
I don't know if re-adding the same drive that was removed would cause the
bug I saw.

For now, my array is back to actually trying to store the backup I had meant
for it, and the drives seem stable now that I fixed the power issue.

Does someone else want to try? :)
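
The rough sequence I used was (device names are from my setup, adjust to yours):

  umount /mnt/btrfs_backupcopy
  dmsetup remove crypt_sdl1                      # make the device node disappear
  mount -o degraded -L backupcopy /mnt/btrfs_backupcopy
  cp -a /some/new/files /mnt/btrfs_backupcopy/   # write data while degraded (any files will do)
  btrfs device add /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy
  # then recreate the old device node, remount without -o degraded,
  # and check whether the data written while degraded is still there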

Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  