Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]

2014-12-12 Thread David Taylor

On Thu, 11 Dec 2014, Robert White wrote:


On 12/11/2014 07:56 PM, Zygo Blaxell wrote:


RAID5 with even parity and two devices should be exactly the same as
RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
is irrelevant because there is no difference in disk contents so the
disks are interchangeable), except with different behavior when more
devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
should start writing new chunks with N stripes instead of two).


That's not correct. A RAID5 with three elements presents two 
_different_ sectors in each stripe. When one element is lost, it would 
still present two different sectors, but the safety is gone.


The above quote is discussing two device RAID5, you are discussing
three device RAID5.

I understand that the XOR collapses into a mirror if only two data 
elements are involved, but that's a mathematical fact that is irrelevant to the 
definition of a RAID5 layout. When you take a wheel off of a tricycle 
it doesn't just become a bike. And you can't make a bicycle into a 
trike by just welding on a wheel somewhere. The infrastructure of the 
two is completely different.


True.  A two-device RAID5 is not the same as a degraded three-device 
RAID5.



So RAID5 with three media M is

MMM   MMM
D1   D2   P(a)
D3   P(b) D4
P(c) D5   D6

If MMM is lost D1, D2, D3, and D5 are intact
D4 and D6 can be recreated via D3^P(b) and P(c)^D5

MMM   X
D1   D2   .
D3   P(b) .
P(c) D5   .
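
As a toy check of that recovery, using one byte per stripe element and
invented values, plain shell arithmetic is enough:

D3=$(( 0x5A )); D4=$(( 0xC3 ))
Pb=$(( D3 ^ D4 ))                              # even parity over the stripe
printf 'recovered D4 = %#x\n' $(( D3 ^ Pb ))   # prints 0xc3, i.e. D4 again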


So under _no_ circumstances would a two-disk RAID5 be the same as a 
RAID1 since a two disk RAID5 functionally implies disk three because 
the _minimum_ arity of a RAID5 is 3. A two-disk RAID5 has _zero_ data 
protection because the minimum third element is a computational 
phantom.


You again seem to be treating a two disk RAID5 as synonymous with your 
degraded three disk RAID5 above.  It is not.


RAID5 with two media M would be:

MMM
D1   P(a)
P(b) D2
D3   P(c)

[and each P would be identical to its corresponding D]

In short it is irrational to have a two disk RAID5 that is not 
degraded, in the same way that you cannot have a two-wheeled tricycle 
without scraping some part of something along the asphalt.


There is nothing irrational about it at all, except that it is
exactly equivalent to two disk RAID1.


A RAID1 with two elements presents one sector along the stripe.


A RAID5 with N elements presents N-1 sectors along the stripe,
so I'm not sure what the problem is with setting N=2.

I realize that what has been implemented is what you call a two drive 
RAID5, and done so by really implementing a RAID1, but it's nonsense.


It's not, really; whether you define it as nonsense is merely an
argument of semantics.

I mean I understand what you are saying you've done, but it makes no 
sense according to the definitions of RAID5. There is no circumstance 
where RAID5 falls back to mirroring. Trying to implement RAID5 as an 
extension of a mirroring paradigm would involve a fundamental conflict 
in definitions, especially when you reach a failure mode.


I have no idea what you mean by a fundamental conflict in definition.

This is so fundamental to the design that the fast way to assemble a 
RAID5 of N-arity (minimum N being 3) is to just connect the first N-1 
elements, declare the raid valid-but-degraded using (N-1) of the 
media, and then replace the Nth phantom/missing/failed element 
with the real disk and trigger a rebuild. This only works if you 
don't need the initial contents of the array to have a specific value 
like zero. (This involves the fewest reads and the array is instantly 
available while it builds.)


There is no reason you could not do exactly this with N=2.

As soon as you start writing to the array, the stripes you write 
repair the extents if the repair process hadn't gotten to them yet.
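
As a rough mdadm illustration of that assemble-degraded-then-add trick
(device names are placeholders, not taken from this thread):

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 missing
mdadm --add /dev/md0 /dev/sdc1   # joins as a spare and is immediately adopted; the rebuild starts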


It's basically impossible to turn a mirror into a RAID5 if you _ever_ 
expect the code base to be able to recover an array that's lost an 
element.


Again, I'm not really sure what you mean.

Uh, no. A raid 6 with three drives, or even two drives, is also 
degraded because the minimum is four.


You're doing your weird semantic dance again.  Just because you
define the minimum to be four does not mean that someone talking
about a three device RAID6 is talking about a degraded four device
RAID6; they're not.

As above, a non-degraded three-device RAID6 can be perfectly
sensibly defined.  Once again, it has exactly the same failure
properties as a three device RAID1 (any two of the devices can
fail), so it's a bit pointless.  But not impossible...



A   B   C   D
D1  D2  Pa  Qa
D3  Pb  Qb  D4
Pc  Qc  D5  D6
Qd  D7  D8  Pd


You can lose one or two media but the minimum stripe is again [X1,X2] 
for any read (ABCD)(ABC.)(AB..)(A..D) etc.


Minimum arity for RAID6 is 4, maximum lost-but-functional 
configuration is arity-minus-two.


A   B   C
D1  Pa  Qa
Pb  Qb  D2
Qc  D3  Pc
D4  Pd  Qd


They're only missing if you believe the minimum number of RAID5 disks
is not two and the minimum number of RAID6 disks is not three.

Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]

2014-12-12 Thread Robert White

On 12/12/2014 01:06 AM, David Taylor wrote:

The above quote is discussing two device RAID5, you are discussing
three device RAID5.


Heresy! (yes, some humor is required here.)

There is no such thing as a two device RAID5. That's what RAID1 is for.

Saying "The above quote is discussing a two device RAID5" is exactly 
like saying "The above quote is discussing a two wheeled tricycle".


You might as well be talking about three-octet IP addresses. That is, you 
could make a network address out of three octets, but it wouldn't be an 
IP address. It would be something else with the wrong name attached.


I challenge you... nay I _defy_ you... to find a single authority on 
disk storage anywhere on this planet (except, apparently, this list and 
its directly attached people and materials) that discusses, describes, 
or acknowledges the existence of a two device RAID5 while not 
discussing a system with an arity of 3 degraded by the absence of one medium.


All these words have standardized definitions.

[That's not hyperbole. I searched for several hours and could not find 
_any_ reference anywhere to construction of a RAID5 array using only two 
devices that did not involve arity-3 and a dummy/missing/failed pseudo 
target. So if you can find any reference to doing this _anywhere_ 
outside of BTRFS I'd like to see it. Genuinely.]


THAT SAID...

I really can find no reason the math wouldn't work using only two 
drives. It would be a terrific waste of CPU cycles and storage space to 
construct the stripe buffers and do the XORs instead of just copying the 
data, but the math would work.


So, um, well I'll be damned.

Perhaps it's just a tautological belief that someone here didn't buy into. 
Like how people keep partitioning drives into little slices for things 
because that's the preserved wisdom from the early eighties.


I think constructing a non-degraded-mode two device thing and calling it 
RAID5 will surprise virtually _everyone_ on the planet.


In every other system, and I do mean _every_ other system, if I had two 
media and put them under RAID-5 I'd be required to specify the third 
drive as some sort of failed device (the block device equivalent of 
/dev/null, but one that returns error results for all operations instead of 
successes). See the reserved keyword "missing" in the mdadm 
documentation, etc.


That is, if I put two 1TiB disks into a RAID-5 I'd expect to get a 2TiB 
array with no actual redundancy. As in


mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda missing /dev/sdc

the resulting array would be the same effective size as a stripe of the 
two drives, but when the third was added later it would just slot in as 
a replacement for the missing device and the arity-3 thing would 
reestablish its redundancy. (this is actually what mdadm does 
internally with a normal build, it blesses the first N-1 drives into an 
array with a missing member, and adds the Nth drive as a spare and 
then the spare is immediately adopted as a replacement for the missing 
drive.)
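
Illustrative follow-up to the command above (exact wording varies by mdadm
version; device names are placeholders):

mdadm --detail /dev/md0          # the empty slot is typically listed as "removed"
mdadm --add /dev/md0 /dev/sdb    # the late third disk slots in and the rebuild begins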


The parity computation on a single value is just a nutty waste of time 
though. Backing it out when the array is degraded is double-nuts.


Maybe everybody just decided it was too crazy to consider for the CPU 
time penalty...?


So yea, semantics... apparently...


Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]

2014-12-12 Thread Hugo Mills
On Fri, Dec 12, 2014 at 03:16:03AM -0800, Robert White wrote:
 On 12/12/2014 01:06 AM, David Taylor wrote:
 The above quote is discussing two device RAID5, you are discussing
 three device RAID5.
 
 Heresy! (yes, some humor is required here.)
 
 There is no such thing as a two device RAID5. That's what RAID1 is for.
 
 Saying "The above quote is discussing a two device RAID5" is exactly
 like saying "The above quote is discussing a two wheeled tricycle".
 
 You might as well be talking about three-octet IP addresses. That is,
 you could make a network address out of three octets, but it
 wouldn't be an IP address. It would be something else with the
 wrong name attached.

   OK. Sounds like I need to dust off the change-of-nomenclature patch
again.

   The argument here is about the 1c1s1p configuration. Is there a
problem with that?

   Hugo.

 I challenge you... nay I _defy_ you... to find a single authority on
 disk storage anywhere on this planet (except, apparently, this list
 and its directly attached people and materials) that discusses,
 describes, or acknowledges the existence of a two device RAID5
 while not discussing a system with an arity of 3 degraded by the
 absence of one media.
 
 All these words have standardized definitions.
 
 [That's not hyperbole. I searched for several hours and could not
 find _any_ reference anywhere to construction of a RAID5 array using
 only two devices that did not involve arity-3 and a
 dummy/missing/failed pseudo target. So if you can find any reference
 to doing this _anywhere_ outside of BTRFS I'd like to see it.
 Genuinely.]
 
 THAT SAID...
 
 I really can find no reason the math wouldn't work using only two
 drives. It would be a terrific waste of CPU cycles and storage space
 to construct the stripe buffers and do the XORs instead of just
 copying the data, but the math would work.
 
 So, um, well I'll be damned.
 
 Perhaps it's just a tautological belief that someone here didn't buy
 into. Like how people keep partitioning drives into little slices
 for things because that's the preserved wisdom from the early eighties.
 
 I think constructing a non-degraded-mode two device thing and
 calling it RAID5 will surprise virtually _everyone_ on the planet.
 
 In every other system. And I do mean _every_ other system, if I had
 two media and I put them under RAID-5 I'd be required to specify the
 third drive as some sort failed device (the block device equivalent
 of /dev/null but that returns error results for all operations
 instead of successes.) See the reserved keyword missing in the
 mdadm documentation etc.
 
 That is, if I put two 1TiB disks into a RAID-5 I'd expect to get a
 2TiB array with no actual redundancy. As in
 
 mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda missing /dev/sdc
 
 the resulting array would be the same effective size as a stripe of
 the two drives, but when the third was added later it would just
 slot in as a replacement for the missing device and the arity-3
 thing would reestablish its redundancy. (this is actually what
 mdadm does internally with a normal build, it blesses the first N-1
 drives into an array with a missing member, and adds the Nth drive
 as a spare and then the spare is immediately adopted as a
 replacement for the missing drive.)
 
 The parity computation on a single value is just a nutty waste of time
 though. Backing it out when the array is degraded is double-nuts.
 
 Maybe everybody just decided it was too crazy to consider for the
 CPU time penalty...?
 
 So yea, semantics... apparently...

-- 
Hugo Mills | There's an infinite number of monkeys outside who
hugo@... carfax.org.uk | want to talk to us about this new script for Hamlet
http://carfax.org.uk/  | they've worked out!
PGP: 65E74AC0  |   Arthur Dent




Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]

2014-12-12 Thread Zygo Blaxell
On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
 So RAID5 with three media M is
 
 MMM   MMM
 D1   D2   P(a)
 D3   P(b) D4
 P(c) D5   D6

RAID5 with two media is well defined, and looks like this:

MMM
D1   P(a)
P(b) D2
D3   P(c)

With even parity and N disks

P(a) ^ D1 [^ D2 ^ ... ^ DN] = 0

Simplifying for one data disk and one parity stripe:

P(a) ^ D1 = 0

therefore

P(a) = D1

which is effectively (and, in practice, literally) mirroring.
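
A quick numeric check of the N=2 case, with single-byte "stripes" and an
invented value, in plain shell arithmetic:

D1=$(( 0xA5 ))
Pa=$D1                       # XOR over the single data element is the element itself
echo $(( Pa ^ D1 ))          # prints 0: even parity holds for the whole stripe
[ "$Pa" -eq "$D1" ] && echo "parity element equals data element, i.e. a mirror"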





Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]

2014-12-12 Thread Robert White

On 12/12/2014 08:45 AM, Zygo Blaxell wrote:

On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:

So RAID5 with three media M is

MMM   MMM
D1   D2   P(a)
D3   P(b) D4
P(c) D5   D6


RAID5 with two media is well defined, and looks like this:

MMM
D1   P(a)
P(b) D2
D3   P(c)


Like I said in the other fork of this thread... I see (now) that the 
math works, but I can find no trace of anyone having ever implemented 
this arity-less-than-3, RAID-greater-than-one paradigm (outside btrfs 
and its associated materials).


It's like talking about a two-wheeled tricycle. 8-)

I would _genuinely_ like to see any third party discussion of this. It 
just isn't done (probably because, as you've shown, it's just a really 
complicated and CPU-intensive way to end up with a simple mirror). I 
spent several hours looking. I can see the math works, and I understand 
what you are doing (as I said at some length in the grandparent message) 
but it just isn't done.


The reason I use the tricycle example is that, while most people know 
this instinctively, few are aware of the fact that going from two wheels 
to three-or-more wheels reverses the steering paradigm. On a bike you 
push left, lean left, and go left. On higher-arity vehicles (including 
a bike with a side-car added) you push right to go left (you lean left too, 
but that's just to keep from nosing over 8-). I find that quite apt in 
the whole RAID1 vs RAID5 discussion since the former is about copying 
one-or-more times and the latter is about starting with a theoretically 
zeroed buffer and doing reversible checksumming into it.


I doubt that I will be the last person to be confused by BTRFS' 
implementation of a two-wheeled tricycle.


You're going to get a lot of mail over the years. 8-)


MEANWHILE

the system really needs to be able to explicitly express and support the 
missing media paradigm.


 MMM   x    MMM
 D1    .    P(a)
 D3    .    D4
 P(c)  .    D6

The correct logic here to remove (e.g. replace with nothing instead 
of delete) a medium just doesn't seem to exist. And it's already 
painfully missing in the RAID1 situation.


If I have a system with N SATA ports, and I have connected N drives, and 
device M is starting to fail... I need to be able to disconnect M and 
then connect M(new). Possibly with a non-trivial amount of time in 
there. For all RAID levels greater than zero this is a natural operation 
in a degraded mode. And for a nearly full filesystem the shrink 
operation that is "btrfs device delete" would not work. And for any 
nontrivially occupied filesystem it would be way slow, and need to be 
reversed for another way-slow interval.


So I need to be able to replace a drive with a nothing so that the 
number of active media becomes N-1 but the arity remains N.


mdadm has the "missing" keyword. The Device Mapper has the "zero" 
target. As near as I can tell btrfs has got nothing in this functional slot.


Imagine, if you will, a block device that is the anti-/dev/null. All 
operations on this block device return EFAULT. Let's call it 
/dev/nothing. And let's say I have a /dev/sdc that has to come out 
immediately (and all my stuff is RAID1/5/6).  The operational chain would be


btrfs replace start /dev/sdc /dev/nothing /
(time passes, physical device is removed and replaced)
btrfs replace start /dev/nothing /dev/sdc /

Now that's good-ish, but really the first replace is pernicious. The 
internal state for the filesystem should just be able to record that 
device id 3 (assuming /dev/sda is devid1 and b is 2 etc for this 
example) is just gone. The replace-with-nothing becomes more-or-less 
instant.


The first replace is also pernicious if it's the second media failure on 
a fully-RAID6 array, since that would be trying to put the same kernel-level 
device in the array twice.


The restore operation, the replace of the nothing with the something, 
remains fully elaborate.


The nothing devices need to show up in the device id tables for a 
running array in their geographically correct positions and all that.


Without this "missing" status as a first-class part of the system, 
dealing with failures and communicating about those failures with the 
operator will become vexatious.
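
For what it's worth, something close to the hypothetical /dev/nothing can be
approximated today with the device mapper's "error" target, which fails every
I/O (with EIO rather than EFAULT); the names and sizes below are illustrative:

SECTORS=$(blockdev --getsz /dev/sdc)            # size of the outgoing disk, in 512-byte sectors
dmsetup create nothing --table "0 $SECTORS error"
# /dev/mapper/nothing now errors on every read and write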



[The use of "device delete" and "device add" as changes in arity and 
size, and its inapplicability to cases where failure is being dealt with 
absent a change of arity, could be clearer in the documentation.]



Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]

2014-12-12 Thread Duncan
Robert White posted on Fri, 12 Dec 2014 03:16:03 -0800 as excerpted:

 Perhaps it's just a tautological belief that someone here didn't buy into.
 Like how people keep partitioning drives into little slices for things
 because that's the preserved wisdom from the early eighties.

While I absolutely agree with your raid5 sentiments (which is exactly 
what I suppose they might be; I'm getting a bit of an education in that 
regard myself, here)...

In the context of the 80s, or even the 90s, nothing about multi-gigabyte 
could be considered little! =:^)

In fact, while it most assuredly dates me, it /still/ feels a bit odd 
referring to the 1 GiB btrfs default threshold for mixed-bg-mode as 
small, given that I distinctly remember wondering how long it might 
take me to fill my first 1 GB (not GiB, unfortunately) drive, tho by that 
time I did have enough experience to know I'd eventually be dealing with 
multi-gig as at the time I was dealing with multi-meg.

More to the point, however...

Those partitions have saved my a** quite a few times over the years.  
Among other things, partitioning allows me to keep my (8 GiB) rootfs an 
entirely separate filesystem that's mounted read-only by default, which 
has kept it undamaged and the tools on it still available to help recover 
my other filesystems, when /var/log and /home were damaged due to a hard 
shutdown recently.

And some years ago I had an AC failure here in Phoenix in the middle of 
the summer, resulting in a physical head-crash and loss of the operating 
partitions on my disk in use at the time, while the backup partitions on 
the same device remained intact, such that after cooldown I actually 
continued to use that disk for some time, mounting the damaged partitions 
only to recover the most recent copies of what I could, updating the 
backups which were now promoted to operational.

Sure, technology such as LVM can do similar and is more flexible in some 
ways, but unfortunately it requires userspace and thus an initr* in 
order to handle a root on the same technology.  Otherwise, root must be 
treated differently, and then you have partitioning again.

Additionally, LVM is yet another layer of software that can and does go 
wrong and itself needs fixing.  Partitioning is too, to some extent, but in 
practice it has been pretty bullet-proof compared to technologies such as 
LVM and btrfs-subvolumes.  LVM has some way to go before it's as robust 
as partitioning, and of course btrfs with its subvolumes isn't really 
even completely stable yet.  Further, btrfs doesn't well limit damage of 
a subvolume to just that subvolume (that head-crash scenario would have 
almost certainly been a total loss on btrfs subvolumes), the way 
partitioning tends to do.  And LVM's very flexibility means it doesn't 
normally have that sort of damage limitation either.  It certainly can, 
but doing so severely reduces its flexibility, making going back to 
regular partitions to avoid the complexity and additional points of 
failure entirely a rather viable and often better choice.

Meanwhile, technology such as EFI and GPT is breathing new life into 
partitioning, making it more reliable (checksummed redundant partition 
tables), more useful/flexible (killing the primary/secondary/logical 
divisions and adding partition names/labels and a far larger typing 
space), and creating yet more uses for partitioning in the first place, 
due to separate reserved EFI and legacy-BIOS partition types.

Tho of course these days those partition slices are often tens or 
hundreds of gigs, and are now sometimes teras[1], bringing up my 
initial point once again; that's NOT actually so small!

But to each his own, of course, and I definitely do agree with you on 
raid5, the larger point.  FWIW, I still consider allowing a two-device 
raid5 or a three-device raid6 a bug, particularly given that a single-
device raid1 is /not/ allowed, nor is a 3-device raid10.

---
[1] Hmm, K, megs, gigs, ters, teras, simply T to match K ???

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]

2014-12-12 Thread Zygo Blaxell
On Fri, Dec 12, 2014 at 02:28:06PM -0800, Robert White wrote:
 On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
 On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
 So RAID5 with three media M is
 
 MMM   MMM
 D1   D2   P(a)
 D3   P(b) D4
 P(c) D5   D6
 
 RAID5 with two media is well defined, and looks like this:
 
 MMM
 D1   P(a)
 P(b) D2
 D3   P(c)
 
 Like I said in the other fork of this thread... I see (now) that the
 math works, but I can find no trace of anyone having ever implemented
 this arity-less-than-3, RAID-greater-than-one paradigm (outside
 btrfs and its associated materials).

I've set up mdadm that way (though it does ask you to use '--force'
when you set it up).  mdadm will also ask for --force if you try to set
up RAID1 with one disk.

I don't know of a RAID implementation that _doesn't_ do these modes,
excluding a few ancient proprietary implementations which have no way to
change a layout once created (usually because they shoot themselves in
the foot with bad choices early on, e.g. by picking odd parity for RAID5).

The reason to allow it is future expansion:  below-3-disk RAID5 ensures
that you have the layout constraints *now* for stripe/chunk size so you
can add more disks later.  If RAID5 has a 512K chunk size, and you start
with a linear or RAID1 array and add another disk later, you might lose
part of the last 512K when you switch to RAID5.  So you start with RAID5
on one or two disks so you can scale up without losing any data.

Also, mdadm can grow a two-disk RAID5, but if you try to grow a two-disk
mdadm RAID1 you just get a three-disk RAID1 (i.e. two redundant copies
with no additional capacity).
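
A sketch of that expansion path with mdadm (device names are placeholders;
as noted above, the two-disk create wants --force, and older mdadm versions
may also ask for a --backup-file during the reshape):

mdadm --create /dev/md0 --force --level=5 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --add  /dev/md0 /dev/sdc1
mdadm --grow /dev/md0 --raid-devices=3    # reshape into a real three-disk RAID5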

btrfs doesn't really need this capability for expansion, since it can
just create new RAID5 profile chunks whenever it wants to; however, I'd
expect a complete btrfs RAID5 implementation to borrow some ideas from
ZFS, and dynamically change the number of disks per chunk to maintain
write integrity as drives are added/removed/missing.  That would imply
btrfs-RAID56 profile chunks would have to be able to exist on two or even
one disk, if that was all that was available for writing at the time.
Simply using btrfs-RAID1 chunks wouldn't work since they'd behave the
wrong way when more disks were added later.

 MEANWHILE
 
 the system really needs to be able to explicitly express and support
 the missing media paradigm.
 
 MMM   x    MMM
 D1    .    P(a)
 D3    .    D4
 P(c)  .    D6
 
 The correct logic here to remove (e.g. replace with nothing
 instead of delete) a media just doesn't seem to exist. And it's
 already painfully missing in the RAID1 situation.

There are a number of permanent mistakes a naive admin can make when
dealing with a broken array.  I've destroyed arrays (made them permanently
read-only beyond the ability of btrfs kernel or user tools to recover)
by getting add and replace confused, or by allowing an offline drive
to rejoin an array that had been mounted read-write,degraded for some time.

The basic functionality works.  btrfs does track missing devices and
can replace them relatively quickly (not as fast as mdadm, but less
than an order of magnitude slower) in RAID1.  The reporting is full
of out-of-date cached data, but when a disk is really failing,
there is usually little doubt which one needs to be replaced.

 If I have a system with N SATA ports, and I have connected N drives,
 and device M is starting to fail... I need to be able to disconnect
 M and then connect M(new). Possibly with a non-trivial amount of
 time in there. For all RAID levels greater than zero this is a
 natural operation in a degraded mode. And for a nearly full
 filesystem the shrink operation that is btrfs device delete would
 not work. And for any nontrivially occupied filesystem it would be
 way slow, and need to be reversed for another way-slow interval.
 
 So I need to be able to replace a drive with a nothing so that
 the number of active media becomes N-1 but the arity remains N.

btrfs already does that, but it sucks.  In a naive RAID5 implementation,
a write in degraded mode will corrupt your data if it is interrupted.
This is a general property of all RAID5 implementations that don't have
NVRAM journalling or some other way to solve the atomic update problem.

ZFS does this well:  when a device is missing, it leaves old data in
degraded mode, but writes new data striped across the existing disks
in non-degraded mode.  If you have 5 disks, and one dies, your writes
are then spread across 4 disks (3 data + parity) while your reads are
reconstructed from 4 disks (4 data + 1 parity - 1 missing).  This prevents
the degraded mode write data integrity problem.

When the dead disk is replaced you would have the 3 data + parity promoted
to 4 data + parity, or you can elect not to replace the dead disk and
get 3 data + parity everywhere (with a loss of capacity).  btrfs could
presumably do that by allocating chunks with different raid56 parameters,
although in this 

Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]

2014-12-11 Thread Zygo Blaxell
On Wed, Dec 10, 2014 at 02:18:55PM -0800, Robert White wrote:
 (3) why can I make a raid5 out of two devices? (I understand that we
 are currently just making mirrors, but the standard requires three
 devices in the geometry etc. So I would expect a two device RAID5 to
 be considered degraded with all that entails. It just looks like it's
 asking for trouble to allow this once the support is finalized, as
 suddenly a working RAID5 that's really a mirror would become
 something that can only be mounted with the degraded flag.)

RAID5 with even parity and two devices should be exactly the same as
RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
is irrelevant because there is no difference in disk contents so the
disks are interchangeable), except with different behavior when more
devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
should start writing new chunks with N stripes instead of two).

 (4) Same question for raid6 but with three drives instead of the
 mandated four.

RAID6 with three devices should behave more or less like three-way RAID1,
except maybe the two parity disks might be different (I forget how the
function used to calculate the two parity stripes works, and whether
it can be defined such that F(disk1, disk2, disk3) == disk1).

 (5) If I can make a RAID5 or RAID6 device with one missing element,
 why can't I make a RAID1 out of one drive, e.g. with one missing
 element?

They're only missing if you believe the minimum number of RAID5 disks
is not two and the minimum number of RAID6 disks is not three.

 (6) If I make a RAID1 out of three devices are there three copies of
 every extent or are there always two copies that are semi-randomly
 spread across three devices? (ibid for more than three).

There are always two copies.  RAID1 on 3x1TB disks gives you 1.5TB
of mirrored storage.

 ---
 
 It seems to me (very dangerous words in computer science, I know)
 that we need a failed device designator so that a device can be in
 the geometry (e.g. have a device ID) but not actually exist.
 Reads/writes to the failed device would always be treated as error
 returns.
 
 The failed device would be subject to replacement with btrfs dev
 replace, and could be the source of said replacement to drop a
 problematic device out of an array.
 
 EXAMPLE:
 Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
 Btrfs v3.17.1
 See http://btrfs.wiki.kernel.org for more information.
 
 Performing full device TRIM (2.00GiB) ...
 Turning ON incompat feature 'extref': increased hardlink limit per
 file to 65536
 Processing explicitly missing device
 adding device (failed) id 2 (phantom device)
 
 mount /dev/loop0 /mountpoint
 
 btrfs replace start 2 /dev/loop1 /mountpoint
 
 (and so on)
 
 Being able to replace a faulty device with a phantom failed
 device would nicely disambiguate the whole device add/remove versus
 replace mistake.

It is a little odd that an array of 3 disks with one missing looks
like this:

Label: none  uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
devid2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
devid3 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04

In the above, vgtester-d02 was a deleted LV and does not exist, but
you'd never know that from the output of 'btrfs fi show'...

 It would make the degraded status less mysterious.

The 'degraded' status currently protects against some significant data
corruption risks.  :-O

 A filesystem with an explicitly failed element would also make the
 future roll-out of full RAID5/6 less confusing.




Re: mkfs.btrfs limits odd [and maybe a failed phantom device?]

2014-12-10 Thread Duncan
Robert White posted on Wed, 10 Dec 2014 14:18:55 -0800 as excerpted:

 So I started looking at the mkfs.btrfs manual page with an eye towards
 documenting some of the tidbits like metadata automatically switching
 from dup to raid1 when more than one device is used.
 
 In experimenting I ended up with some questions...
 
 (1) why is the dup profile for data restricted to only one device and
 only if it's mixed mode?

 (2) why is metadata dup profile restricted to only one device on
 creation when it will run that way just fine after a device add?

1 and 2 together since they both deal with dup mode...

Dup mode was apparently originally considered purely an extra safeguard 
for metadata in the single-device case, where it was made the default 
(except for SSDs, which default to single mode metadata on a single-
device filesystem, because the FTL voids any guarantees on location 
anyway, and because firmware such as SandForce compresses and dedups 
anyway, in which case the hardware/firmware is subverting btrfs' efforts 
to do dup anyway).

In the single-device case, two copies of data was considered simply not 
worth the cost, due both to doubling the size (especially on SSD where 
size is money!) and to the speed penalties on spinning rust due to seeks 
between one 1-GiB data-chunk and its dup.

With multi-device, raid1 metadata, forcing one copy to each of two 
different devices, was considered enough superior to make that the 
default, since that provided device-loss resiliency for the all-important 
metadata, thus enabling recovery of at least /some/ files even with a 
device missing (single-mode data where the file's extents all happened to 
be on available devices, plus of course raid1, etc, data).  Further, dup-
mode metadata was considered a mistake that it was better not to even have 
available as an option, since loss of a single device would likely kill 
the filesystem, which made dup mode little better than single mode, 
without the doubled-size-cost.  Further, on spinning rust there'd again 
be the seek penalty, to little benefit since dup mode provides no 
guarantees in case of device loss.

So multi-device defaults to raid1 metadata for safety, but single mode 
metadata remains an option (along with raid0) if you really /don't/ care 
about losing everything due to loss of a single device.  Single-device 
simply makes dup-mode available (and the default) for metadata, as a poor-
man's substitute for the safety of raid1, but single-device-metadata is 
the only case where that poor-man's-raid1-substitute is worth the 
(considered extreme) cost, with usage of that option not even available 
on multi-device as it'd be a near-certain mistake, certainly at the mkfs 
level.  And dup mode isn't ordinarily available for data even on single-
device, because it's considered not worth the cost.
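
Spelled out explicitly (rather than relying on whatever a given btrfs-progs 
version defaults to), the profiles discussed above look like this; device 
paths are placeholders:

mkfs.btrfs -m dup -d single /dev/sdX               # single device: dup metadata, single data
mkfs.btrfs -m raid1 -d single /dev/sdX /dev/sdY    # two devices: raid1 metadata, single data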

As for dup-mode working after device-add, that's simply a necessary bit 
in order for device add to work from a default-dup-mode single-device 
at all.  And it's only the existing metadata chunks on the original 
device that will be dup-mode.  Once a second device is added, additional 
metadata chunks will be written in raid1 mode, forcing the two chunk 
copies to different devices since there's multiple devices available to 
allow that.  The clear intent and recommendation is to do a rebalance 
ASAP after a device add, to spread usage to the new device as 
appropriate.  And of course that rebalance will use the new raid1 
metadata defaults, unless told otherwise of course, and I don't believe 
dup mode is available to tell it otherwise there, either.
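
An illustrative add-then-rebalance sequence matching that recommendation 
(paths are placeholders; the convert filter forces the metadata profile 
explicitly rather than relying on defaults):

btrfs device add /dev/sdY /mnt
btrfs balance start -mconvert=raid1 /mnt   # rewrites the existing dup metadata chunks as raid1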


What all that original reasoning fails to account for, however, is the 
btrfs data/metadata checksumming and integrity features and the very high 
(which the original btrfs mode designers obviously considered extreme) 
value some users (including me) place on them.  While a multi-device dup-
mode-metadata choice at mkfs is arguably still a mistake (the cost of 
raid1 metadata without the benefit; near the risk of single metadata but 
at double the size), dup-mode data combined with btrfs checksumming and 
data integrity features on a single device has strong data integrity 
benefits that some would definitely consider worth it, even at the 
additional cost in speed on spinning rust due to seeking, and in size on 
expensive SSDs.

Meanwhile, mixed-bg-mode was an after-thought, added much later (after my 
own btrfs journey began) in order to make working with small 
filesystems reasonable.  Before mixed-bg-mode, people attempting to use 
btrfs on sub-GiB devices often found they couldn't use all available 
space (often 25-50% wasted!) as the separate data/metadata chunk 
allocation was simply too coarse-grained to properly deal with the small 
sizes involved.
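
For reference, mixed block groups are requested explicitly with the 
--mixed (-M) switch of mkfs.btrfs; the device path is a placeholder:

mkfs.btrfs --mixed /dev/sdX     # data and metadata share the same block groups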

And small filesystems really _was_ mixed-mode's _entire_ purpose.  That 
it could additionally be used to allow dup-data, using the ability to 
specify mixed-bg-mode even on >1 GiB filesystems where it wasn't