Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Austin S. Hemmelgarn

On 2018-07-20 14:41, Hugo Mills wrote:

On Fri, Jul 20, 2018 at 09:38:14PM +0300, Andrei Borzenkov wrote:

20.07.2018 20:16, Goffredo Baroncelli пишет:

[snip]

Limiting the number of disks per raid would be quite simple to implement in BTRFS
in the "chunk allocator"



You mean that currently the RAID5 stripe width is equal to the number of disks?
Well, I suppose nobody is using btrfs with disk pools of two- or three-digit
size.


But they are (even if not very many of them) -- we've seen at least
one person with something like 40 or 50 devices in the array. They'd
definitely got into /dev/sdac territory. I don't recall what RAID level
they were using. I think it was either RAID-1 or -10.

That's the largest I can recall seeing mention of, though.
I've talked to at least two people using it on 100+ disks in a SAN 
situation.  In both cases, however, BTRFS itself was only seeing about 20 
devices and running in raid0 mode on them, with each of those being a 
RAID6 volume configured on the SAN node holding the disks for it.  From 
what I understood when talking to them, they actually got rather good 
performance in this setup, though maintenance was a bit of a pain.



Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Hugo Mills
On Fri, Jul 20, 2018 at 09:38:14PM +0300, Andrei Borzenkov wrote:
> 20.07.2018 20:16, Goffredo Baroncelli пишет:
[snip]
> > Limiting the number of disks per raid would be quite simple to
> > implement in BTRFS in the "chunk allocator"
> > 
> 
> You mean that currently the RAID5 stripe width is equal to the number of disks?
> Well, I suppose nobody is using btrfs with disk pools of two- or three-digit
> size.

   But they are (even if not very many of them) -- we've seen at least
one person with something like 40 or 50 devices in the array. They'd
definitely got into /dev/sdac territory. I don't recall what RAID level
they were using. I think it was either RAID-1 or -10.

   That's the largest I can recall seeing mention of, though.

   Hugo.

-- 
Hugo Mills | Have found Lost City of Atlantis. High Priest is
hugo@... carfax.org.uk | winning at quoits.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Terry Pratchett




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Andrei Borzenkov
20.07.2018 20:16, Goffredo Baroncelli пишет:
> On 07/20/2018 07:17 AM, Andrei Borzenkov wrote:
>> 18.07.2018 22:42, Goffredo Baroncelli пишет:
>>> On 07/18/2018 09:20 AM, Duncan wrote:
 Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
 excerpted:

> On 07/17/2018 11:12 PM, Duncan wrote:
>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>> excerpted:
>>
>>> On 07/15/2018 04:37 PM, waxhead wrote:
>>
>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>> parity are mutually exclusive.
>>
>> I can't agree.  I don't know whether you meant that in the global
>> sense,
>> or purely in the btrfs context (which I suspect), but either way I
>> can't agree.
>>
>> In the pure btrfs context, while striping and mirroring/pairing are
>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>> flexible enough to allow both together and the feature may at some
>> point be added, so it makes sense to have a layout notation format
>> flexible enough to allow it as well.
>
> When I say orthogonal, it means that these can be combined, i.e. you can
> have:
> - striping (RAID0)
> - parity  (?)
> - striping + parity  (e.g. RAID5/6)
> - mirroring  (RAID1)
> - mirroring + striping  (RAID10)
>
> However, you can't have mirroring+parity; this means that a notation
> where both 'C' (= number of copies) and 'P' (= number of parities) appear
> is too verbose.

 Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
 top of mirroring or mirroring on top of raid5/6, much as raid10 is 
 conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
 on top of raid0.  
>>> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on 
>>> top of) ???
>>>
>>> Seriously, of course you can combine a lot of different profile; however 
>>> the only ones that make sense are the ones above.
>>
>> RAID50 (striping across RAID5) is common.
> 
> Yes, someone else reported that. But other than reducing the number of disks per 
> raid5 (increasing the ratio of the number of disks to the number of parity disks), 
> what other advantages does it have? 

It allows distributing I/O across a virtually unlimited number of disks
while confining the failure domain to a manageable size.
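
As a rough back-of-the-envelope sketch (illustrative only, not btrfs code, with
made-up numbers), compare one wide RAID5 with a RAID50 built from smaller groups:

/* Why striping across several small RAID5 groups ("RAID50/05") scales
 * better than one wide RAID5: a rebuild only reads the surviving members
 * of the failed group, and each group is an independent failure domain. */
#include <stdio.h>

static void describe(int total_disks, int group_size)
{
    int groups = total_disks / group_size;
    int usable = groups * (group_size - 1);   /* one parity disk per group */
    int rebuild_reads = group_size - 1;       /* surviving members of the failed group */

    printf("%2d group(s) of %2d disks: usable %d/%d, rebuild reads %d disks, "
           "tolerates one failure per group\n",
           groups, group_size, usable, total_disks, rebuild_reads);
}

int main(void)
{
    describe(60, 60);   /* one wide RAID5: a single huge failure domain */
    describe(60, 6);    /* RAID50: ten small, independent failure domains */
    return 0;
}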

> Limiting the number of disks per raid would be quite simple to 
> implement in BTRFS in the "chunk allocator"
> 

You mean that currently the RAID5 stripe width is equal to the number of disks?
Well, I suppose nobody is using btrfs with disk pools of two- or three-digit
size.


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Austin S. Hemmelgarn

On 2018-07-20 13:13, Goffredo Baroncelli wrote:

On 07/19/2018 09:10 PM, Austin S. Hemmelgarn wrote:

On 2018-07-19 13:29, Goffredo Baroncelli wrote:

[...]


So far you are repeating what I said: the only useful raid profiles are
- striping
- mirroring
- striping+parity (even limiting the number of disks involved)
- striping+mirroring


No, not quite.  At least, not in the combinations you're saying make sense if 
you are using standard terminology.  RAID05 and RAID06 are not the same thing 
as 'striping+parity' as BTRFS implements that case, and can be significantly 
more optimized than the trivial implementation of just limiting the number of 
disks involved in each chunk (by, you know, actually striping just like what we 
currently call raid10 mode in BTRFS does).


Could you provide more information?
Just parity by itself is functionally equivalent to a really stupid 
implementation of 2 or more copies of the data.  Setups with only one 
disk more than the number of parities in RAID5 and RAID6 are called 
degenerate for this very reason.  All sane RAID5/6 implementations do 
striping across multiple devices internally, and that's almost always 
what people mean when talking about striping plus parity.


What I'm referring to is different though.  Just like RAID10 used to be 
implemented as RAID1 on top of RAID0, RAID05 is RAID0 on top of RAID5. 
That is, you're striping your data across multiple RAID5 arrays instead 
of using one big RAID5 array to store it all.  As I mentioned, this 
mitigates the scaling issues inherent in RAID5 when it comes to rebuilds 
(namely, the fact that device failure rates go up faster for larger 
arrays than rebuild times do).


Functionally, such a setup can be implemented in BTRFS by limiting 
RAID5/6 stripe width, but that will have all kinds of performance 
limitations compared to actually striping across all of the underlying 
RAID5 chunks.  In fact, it will have the exact same performance 
limitations you're calling out BTRFS single mode for below.
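
To illustrate the difference, here is a minimal sketch (not btrfs code; the
strip size and sub-array count are hypothetical): in RAID05 the top layer is
ordinary RAID0 striping across independent RAID5 sub-arrays, so consecutive
strips land on different sub-arrays instead of one narrow RAID5 chunk:

#include <stdio.h>
#include <stdint.h>

#define STRIPE_UNIT   (64 * 1024)   /* top-level RAID0 strip size */
#define NR_SUBARRAYS  4             /* each sub-array is an independent RAID5 */

struct location {
    int      subarray;              /* which RAID5 group holds the strip */
    uint64_t offset;                /* byte offset inside that group     */
};

static struct location raid05_map(uint64_t logical)
{
    uint64_t strip = logical / STRIPE_UNIT;
    struct location loc = {
        .subarray = (int)(strip % NR_SUBARRAYS),
        .offset   = (strip / NR_SUBARRAYS) * STRIPE_UNIT + logical % STRIPE_UNIT,
    };
    return loc;
}

int main(void)
{
    for (uint64_t logical = 0; logical < 6 * STRIPE_UNIT; logical += STRIPE_UNIT) {
        struct location loc = raid05_map(logical);
        printf("logical %8llu -> sub-array %d, offset %llu\n",
               (unsigned long long)logical, loc.subarray,
               (unsigned long long)loc.offset);
    }
    return 0;
}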






RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might 
actually make sense in BTRFS to provide a backup means of rebuilding blocks 
that fail checksum validation if both copies fail.

If you need further redundancy, it is easy to implement a parity3 and parity4 
raid profile instead of stacking a raid6+raid1

I think you're misunderstanding what I mean here.

RAID15/16 consist of two layers:
* The top layer is regular RAID1, usually limited to two copies.
* The lower layer is RAID5 or RAID6.

This means that the lower layer can validate which of the two copies in the 
upper layer is correct when they don't agree.


This happens only because there is redundancy greater than 1. Anyway, BTRFS 
has checksums, which help a lot in this area.
The checksum helps, but what do you do when all copies fail the 
checksum?  Or, worse yet, what do you do when both copies have the 
'right' checksum, but different data?  Yes, you could have one more 
copy, but that just reduces the chances of those cases happening, it 
doesn't eliminate them.


Note that I'm not necessarily saying it makes sense to have support for 
this in BTRFS, just that it's a real-world counter-example to your 
statement that only those combinations make sense.  In the case of 
BTRFS, these would make more sense than RAID51 and RAID61, but they 
still aren't particularly practical.  For classic RAID though, they're 
really important, because you don't have checksumming (unless you have 
T10 DIF capable hardware and a RAID implementation that understands how 
to work with it, but that's rare and expensive) and it makes it easier 
to resize an array than having three copies (you only need 2 new disks 
for RAID15 or RAID16 to increase the size of the array, but you need 3 
for 3-copy RAID1 or RAID10).



It doesn't really provide significantly better redundancy (they can technically 
sustain more disk failures without failing completely than simple two-copy 
RAID1 can, but just like BTRFS raid10, they can't reliably survive more than 
one (or two if you're using RAID6 as the lower layer) disk failure), so it does 
not do the same thing that higher-order parity does.




The fact that you can combine striping and mirroring (or pairing) makes sense 
because you could have a speed gain (see below).
[]


As someone else pointed out, md/lvm-raid10 already work like this.
What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
much works this way except with huge (gig size) chunks.


As implemented in BTRFS, raid1 doesn't have striping.


The argument is that because there's only two copies, on multi-device
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
alternate device pairs, it's effectively striped at the macro level, with
the 1 GiB device-level chunks effectively being huge individual device
strips of 1 GiB.


The striping concept is based on the fact that if the "stripe size" is small 
enough you have a speed benefit because the reads may be performed in parallel 
from different disks.

Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Goffredo Baroncelli
On 07/20/2018 07:17 AM, Andrei Borzenkov wrote:
> 18.07.2018 22:42, Goffredo Baroncelli пишет:
>> On 07/18/2018 09:20 AM, Duncan wrote:
>>> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
>>> excerpted:
>>>
 On 07/17/2018 11:12 PM, Duncan wrote:
> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
> excerpted:
>
>> On 07/15/2018 04:37 PM, waxhead wrote:
>
>> Striping and mirroring/pairing are orthogonal properties; mirror and
>> parity are mutually exclusive.
>
> I can't agree.  I don't know whether you meant that in the global
> sense,
> or purely in the btrfs context (which I suspect), but either way I
> can't agree.
>
> In the pure btrfs context, while striping and mirroring/pairing are
> orthogonal today, Hugo's whole point was that btrfs is theoretically
> flexible enough to allow both together and the feature may at some
> point be added, so it makes sense to have a layout notation format
> flexible enough to allow it as well.

 When I say orthogonal, it means that these can be combined, i.e. you can
 have:
 - striping (RAID0)
 - parity  (?)
 - striping + parity  (e.g. RAID5/6)
 - mirroring  (RAID1)
 - mirroring + striping  (RAID10)

 However, you can't have mirroring+parity; this means that a notation
 where both 'C' (= number of copies) and 'P' (= number of parities) appear
 is too verbose.
>>>
>>> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
>>> top of mirroring or mirroring on top of raid5/6, much as raid10 is 
>>> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
>>> on top of raid0.  
>> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on 
>> top of) ???
>>
>> Seriously, of course you can combine a lot of different profile; however the 
>> only ones that make sense are the ones above.
> 
> RAID50 (striping across RAID5) is common.

Yes, someone else reported that. But other than reducing the number of disks per 
raid5 (increasing the ratio of the number of disks to the number of parity disks), 
what other advantages does it have? 
Limiting the number of disks per raid would be quite simple to 
implement in BTRFS in the "chunk allocator"
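
For what it's worth, here is an illustrative sketch of that idea (not the actual
btrfs chunk allocator; all names and numbers are hypothetical): pick the devices
with the most free space, but cap the stripe width of a new RAID5 chunk at a
configured maximum:

#include <stdio.h>
#include <stdlib.h>

struct dev { int id; unsigned long long free_bytes; };

/* Sort devices by free space, largest first (id breaks ties). */
static int by_free_desc(const void *a, const void *b)
{
    const struct dev *da = a, *db = b;
    if (da->free_bytes == db->free_bytes)
        return da->id - db->id;
    return db->free_bytes > da->free_bytes ? 1 : -1;
}

/* Returns how many devices the new raid5 chunk will stripe across. */
static int pick_stripe_devices(struct dev *devs, int ndevs,
                               int min_stripes, int max_stripes)
{
    qsort(devs, ndevs, sizeof(*devs), by_free_desc);
    if (ndevs < min_stripes)
        return 0;                       /* not enough devices for raid5 */
    return ndevs > max_stripes ? max_stripes : ndevs;
}

int main(void)
{
    struct dev devs[] = {
        { 1, 900 }, { 2, 400 }, { 3, 800 }, { 4, 700 },
        { 5, 600 }, { 6, 500 }, { 7, 300 }, { 8, 200 },
    };
    int n = pick_stripe_devices(devs, 8, /*min*/ 3, /*max*/ 5);

    printf("raid5 chunk will stripe across %d devices:", n);
    for (int i = 0; i < n; i++)
        printf(" %d", devs[i].id);
    printf("\n");
    return 0;
}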



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Goffredo Baroncelli
On 07/19/2018 09:10 PM, Austin S. Hemmelgarn wrote:
> On 2018-07-19 13:29, Goffredo Baroncelli wrote:
[...]
>>
>> So far you are repeating what I said: the only useful raid profiles are
>> - striping
>> - mirroring
>> - striping+parity (even limiting the number of disks involved)
>> - striping+mirroring
> 
> No, not quite.  At least, not in the combinations you're saying make sense if 
> you are using standard terminology.  RAID05 and RAID06 are not the same thing 
> as 'striping+parity' as BTRFS implements that case, and can be significantly 
> more optimized than the trivial implementation of just limiting the number of 
> disks involved in each chunk (by, you know, actually striping just like what 
> we currently call raid10 mode in BTRFS does).

Could you provide more information?

>>
>>>
>>> RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they 
>>> might actually make sense in BTRFS to provide a backup means of rebuilding 
>>> blocks that fail checksum validation if both copies fail.
>> If you need further redundancy, it is easy to implement a parity3 and 
>> parity4 raid profile instead of stacking a raid6+raid1
> I think you're misunderstanding what I mean here.
> 
> RAID15/16 consist of two layers:
> * The top layer is regular RAID1, usually limited to two copies.
> * The lower layer is RAID5 or RAID6.
> 
> This means that the lower layer can validate which of the two copies in the 
> upper layer is correct when they don't agree.  

This happens only because there is redundancy greater than 1. Anyway, BTRFS 
has checksums, which help a lot in this area.

> It doesn't really provide significantly better redundancy (they can 
> technically sustain more disk failures without failing completely than simple 
> two-copy RAID1 can, but just like BTRFS raid10, they can't reliably survive 
> more than one (or two if you're using RAID6 as the lower layer) disk 
> failure), so it does not do the same thing that higher-order parity does.
>>

 The fact that you can combine striping and mirroring (or pairing) makes 
 sense because you could have a speed gain (see below).
 []
>>>
>>> As someone else pointed out, md/lvm-raid10 already work like this.
>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>> much works this way except with huge (gig size) chunks.
>>
>> As implemented in BTRFS, raid1 doesn't have striping.
>
> The argument is that because there's only two copies, on multi-device
> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
> alternate device pairs, it's effectively striped at the macro level, with
> the 1 GiB device-level chunks effectively being huge individual device
> strips of 1 GiB.

 The striping concept is based on the fact that if the "stripe size" is 
 small enough you have a speed benefit because the reads may be performed 
 in parallel from different disks.
>>> That's not the only benefit of striping though.  The other big one is that 
>>> you now have one volume that's the combined size of both of the original 
>>> devices.  Striping is arguably better for this even if you're using a large 
>>> stripe size because it better balances the wear across the devices than 
>>> simple concatenation.
>>
>> Striping means that the data is interleaved between the disks with a 
>> reasonable "block unit". Otherwise, what would be the difference between 
>> btrfs-raid0 and btrfs-single?
> Single mode guarantees that any file less than the chunk size in length will 
> either be completely present or completely absent if one of the devices 
> fails.  BTRFS raid0 mode does not provide any such guarantee, and in fact 
> guarantees that all files that are larger than the stripe unit size (however 
> much gets put on one disk before moving to the next) will all lose data if a 
> device fails.
> 
> Stupid as it sounds, this matters for some people.

I think that having different filesystems would be even better.
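
To make the quoted guarantee concrete, here is a toy calculation (not btrfs code;
it ignores metadata, inline extents and chunk boundaries): in single mode a small
file always sits on one device, while raid0 with a 64 KiB strip on 4 devices
spreads anything larger than one strip across several devices:

#include <stdio.h>

#define STRIP      (64ULL * 1024)
#define NR_DEVICES 4ULL

/* How many devices hold part of a file of the given size under raid0? */
static unsigned long long devices_touched_raid0(unsigned long long size)
{
    unsigned long long strips = (size + STRIP - 1) / STRIP;
    return strips < NR_DEVICES ? strips : NR_DEVICES;
}

int main(void)
{
    unsigned long long sizes[] = { 4ULL << 10, 64ULL << 10, 1ULL << 20, 10ULL << 20 };

    for (int i = 0; i < 4; i++)
        printf("%9llu bytes: raid0 touches %llu device(s), single touches 1\n",
               sizes[i], devices_touched_raid0(sizes[i]));
    return 0;
}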

>>
>>>
 With a "stripe size" of 1GB, it is very unlikely that this would happens.
>>> That's a pretty big assumption.  There are all kinds of access patterns 
>>> that will still distribute the load reasonably evenly across the 
>>> constituent devices, even if they don't parallelize things.
>>>
>>> If, for example, all your files are 64k or less, and you only read whole 
>>> files, there's no functional difference between RAID0 with 1GB blocks and 
>>> RAID0 with 64k blocks.  Such a workload is not unusual on a very busy 
>>> mail-server.
>>
>> I fully agree that 64K may be too much for some workloads; however, I have to 
>> point out that I still find it difficult to imagine that you can take advantage 
>> of parallel reads from multiple disks with a 1GB stripe unit for a *common 
>> workload*. Note that btrfs inlines small files in the metadata, 
>> so even if the file is smaller than 64k, a 64k read (or more) will be 
>> 

Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread David Sterba
On Thu, Jul 19, 2018 at 07:47:23AM -0400, Austin S. Hemmelgarn wrote:
> > So this special level will be used for RAID56 for now?
> > Or it will also be possible for metadata usage just like current RAID1?
> > 
> > If the latter, the metadata scrub problem will need to be considered more.
> > 
> > For RAID1 with more copies, there will be a higher possibility of one or
> > two devices going missing and then being scrubbed.
> > For metadata scrub, the inlined csum can't ensure a copy is the latest one.
> > 
> > So for such a RAID1 scrub, we need to read out all copies and compare
> > their generations to find the correct copy.
> > At least from the changeset, it doesn't look like this is addressed yet.
> > 
> > And this also reminds me that the current scrub is not as flexible as
> > balance. I'd really like to be able to filter block groups to scrub just
> > like balance, and do scrub on a block-group basis rather than a devid basis.
> > That is to say, for a block-group scrub we don't really care which device
> > we're scrubbing; we just need to ensure every device in this block group
> > is storing correct data.
> > 
> This would actually be rather useful for non-parity cases too.  Being 
> able to scrub only metadata when the data chunks are using a profile 
> that provides no rebuild support would be great for performance.
> 
> On the same note, it would be _really_ nice to be able to scrub a subset 
> of the volume's directory tree, even if it were only per-subvolume.

https://github.com/kdave/drafts/blob/master/btrfs/scrub-subvolume.txt
https://github.com/kdave/drafts/blob/master/btrfs/scrub-custom.txt

The idea is to build an in-memory tree of the block ranges that span the given
subvolume or files and run scrub only there.  The selective scrub on the
block groups of a given type would be a special case of the above.
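
A minimal sketch of that structure (not taken from the drafts; names are
hypothetical): a sorted, merged set of block ranges for the subvolume or files
of interest, consulted while scrubbing so only overlapping extents get read
back and verified:

#include <stdio.h>
#include <stdint.h>

struct range { uint64_t start, end; };          /* [start, end) */

/* Ranges must be sorted by start and non-overlapping (merged when built). */
static int range_set_overlaps(const struct range *set, int n,
                              uint64_t start, uint64_t end)
{
    int lo = 0, hi = n - 1;

    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (set[mid].end <= start)
            lo = mid + 1;               /* entirely before the query */
        else if (set[mid].start >= end)
            hi = mid - 1;               /* entirely after the query  */
        else
            return 1;                   /* extent intersects the set */
    }
    return 0;
}

int main(void)
{
    /* Hypothetical ranges covering one subvolume's extents. */
    struct range subvol[] = { { 0, 4096 }, { 1 << 20, 2 << 20 } };

    printf("scrub extent at 2048?  %s\n",
           range_set_overlaps(subvol, 2, 2048, 6144) ? "yes" : "no");
    printf("scrub extent at 512k?  %s\n",
           range_set_overlaps(subvol, 2, 512 << 10, 768 << 10) ? "yes" : "no");
    return 0;
}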


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread David Sterba
On Thu, Jul 19, 2018 at 03:27:17PM +0800, Qu Wenruo wrote:
> On 2018年07月14日 02:46, David Sterba wrote:
> > Hi,
> > 
> > I have some goodies that go into the RAID56 problem, although not
> > implementing all the remaining features, it can be useful independently.
> > 
> > This time my hackweek project
> > 
> > https://hackweek.suse.com/17/projects/do-something-about-btrfs-and-raid56
> > 
> > aimed to implement the fix for the write hole problem but I spent more
> > time with analysis and design of the solution and don't have a working
> > prototype for that yet.
> > 
> > This patchset brings a feature that will be used by the raid56 log, the
> > log has to be on the same redundancy level and thus we need a 3-copy
> > replication for raid6. As it was easy to extend to higher replication,
> > I've added a 4-copy replication, that would allow triple copy raid (that
> > does not have a standardized name).
> 
> So this special level will be used for RAID56 for now?
> Or it will also be possible for metadata usage just like current RAID1?

It's a new profile usable in the same way as raid1, i.e. for data
or metadata. The patch that adds support to btrfs-progs has an mkfs
example.

The raid56 will use that to store the log, essentially data forcibly
stored on the n-copy raid1 chunk and used only for logging.

> If the latter, the metadata scrub problem will need to be considered more.
> 
> For RAID1 with more copies, there will be a higher possibility of one or
> two devices going missing and then being scrubbed.
> For metadata scrub, the inlined csum can't ensure a copy is the latest one.
> 
> So for such a RAID1 scrub, we need to read out all copies and compare
> their generations to find the correct copy.
> At least from the changeset, it doesn't look like this is addressed yet.

Nothing like this is implemented in the patches, but I don't understand
how this differs from the current raid1 with one missing device. Sure, we
can't have 2 missing devices, so the existing copy is automatically
considered correct and up to date.

There are more corner-case recovery scenarios where there could be 3
copies slightly out of date due to device loss and a scrub attempt, so yes,
this would need to be addressed.
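
A sketch of the recovery rule being discussed (explicitly not implemented in
these patches): read every copy, drop the ones that fail the checksum, and treat
the surviving copy with the highest generation as authoritative, rewriting the
stale or corrupt copies from it:

#include <stdio.h>
#include <stdint.h>

struct copy {
    int      dev;           /* device the copy was read from      */
    int      csum_ok;       /* checksum of the block verified?    */
    uint64_t generation;    /* transid stored with the block      */
};

/* Index of the copy to trust, or -1 if no usable copy survived. */
static int pick_authoritative(const struct copy *copies, int n)
{
    int best = -1;

    for (int i = 0; i < n; i++) {
        if (!copies[i].csum_ok)
            continue;
        if (best < 0 || copies[i].generation > copies[best].generation)
            best = i;
    }
    return best;
}

int main(void)
{
    struct copy copies[] = {
        { 1, 1, 100042 },   /* up to date      */
        { 2, 1, 100040 },   /* stale but valid */
        { 3, 0, 100042 },   /* corrupted       */
    };
    int best = pick_authoritative(copies, 3);

    if (best >= 0)
        printf("rewrite stale/corrupt copies from device %d (gen %llu)\n",
               copies[best].dev,
               (unsigned long long)copies[best].generation);
    return 0;
}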

> And this also reminds me that the current scrub is not as flexible as
> balance. I'd really like to be able to filter block groups to scrub just
> like balance, and do scrub on a block-group basis rather than a devid basis.
> That is to say, for a block-group scrub we don't really care which device
> we're scrubbing; we just need to ensure every device in this block group
> is storing correct data.

Right, a subset of the balance filters would be nice.


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-19 Thread Andrei Borzenkov
18.07.2018 22:42, Goffredo Baroncelli пишет:
> On 07/18/2018 09:20 AM, Duncan wrote:
>> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
>> excerpted:
>>
>>> On 07/17/2018 11:12 PM, Duncan wrote:
 Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
 excerpted:

> On 07/15/2018 04:37 PM, waxhead wrote:

> Striping and mirroring/pairing are orthogonal properties; mirror and
> parity are mutually exclusive.

 I can't agree.  I don't know whether you meant that in the global
 sense,
 or purely in the btrfs context (which I suspect), but either way I
 can't agree.

 In the pure btrfs context, while striping and mirroring/pairing are
 orthogonal today, Hugo's whole point was that btrfs is theoretically
 flexible enough to allow both together and the feature may at some
 point be added, so it makes sense to have a layout notation format
 flexible enough to allow it as well.
>>>
>>> When I say orthogonal, it means that these can be combined, i.e. you can
>>> have:
>>> - striping (RAID0)
>>> - parity  (?)
>>> - striping + parity  (e.g. RAID5/6)
>>> - mirroring  (RAID1)
>>> - mirroring + striping  (RAID10)
>>>
>>> However, you can't have mirroring+parity; this means that a notation
>>> where both 'C' (= number of copies) and 'P' (= number of parities) appear
>>> is too verbose.
>>
>> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
>> top of mirroring or mirroring on top of raid5/6, much as raid10 is 
>> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
>> on top of raid0.  
> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on 
> top of) ???
> 
> Seriously, of course you can combine a lot of different profile; however the 
> only ones that make sense are the ones above.

RAID50 (striping across RAID5) is common.



Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-19 Thread waxhead




Hugo Mills wrote:

On Wed, Jul 18, 2018 at 08:39:48AM +, Duncan wrote:

Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:

Perhaps it's a case of coder's view (no code doing it that way, it's just
a coincidental oddity conditional on equal sizes), vs. sysadmin's view
(code or not, accidental or not, it's a reasonably accurate high-level
description of how it ends up working most of the time with equivalent
sized devices).)


Well, it's an *accurate* observation. It's just not a particularly
*useful* one. :)

Hugo.


A bit off topic perhaps - but I've got to give it a go:
Pretty please with sugar, nuts, a cherry and chocolate sprinkles dipped 
in syrup and coated with ice cream on top, would it not be about time to 
update your online btrfs-usage calculator (which is insanely useful in 
so many ways) to support the new modes!?

In fact it would be great - or even better - as a CLI tool.
And yes, a while ago I toyed with porting it to C, mostly for my own use, 
but never got that far.




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-19 Thread Austin S. Hemmelgarn

On 2018-07-19 13:29, Goffredo Baroncelli wrote:

On 07/19/2018 01:43 PM, Austin S. Hemmelgarn wrote:

On 2018-07-18 15:42, Goffredo Baroncelli wrote:

On 07/18/2018 09:20 AM, Duncan wrote:

Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:


On 07/17/2018 11:12 PM, Duncan wrote:

Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:


[...]


When I say orthogonal, it means that these can be combined, i.e. you can
have:
- striping (RAID0)
- parity  (?)
- striping + parity  (e.g. RAID5/6)
- mirroring  (RAID1)
- mirroring + striping  (RAID10)

However, you can't have mirroring+parity; this means that a notation
where both 'C' (= number of copies) and 'P' (= number of parities) appear
is too verbose.


Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on
top of mirroring or mirroring on top of raid5/6, much as raid10 is
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
on top of raid0.

And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top 
of) ???

Seriously, of course you can combine a lot of different profiles; however the 
only ones that make sense are the ones above.

No, there are cases where other configurations make sense.

RAID05 and RAID06 are very widely used, especially on NAS systems where you 
have lots of disks.  The RAID5/6 lower layer mitigates the data loss risk of 
RAID0, and the RAID0 upper-layer mitigates the rebuild scalability issues of 
RAID5/6.  In fact, this is pretty much the standard recommended configuration 
for large ZFS arrays that want to use parity RAID.  This could be reasonably 
easily supported to a rudimentary degree in BTRFS by providing the ability to 
limit the stripe width for the parity profiles.

Some people use RAID50 or RAID60, although they are strictly speaking inferior 
in almost all respects to RAID05 and RAID06.

RAID01 is also used on occasion, it ends up having the same storage capacity as 
RAID10, but for some RAID implementations it has a different performance 
envelope and different rebuild characteristics.  Usually, when it is used 
though, it's software RAID0 on top of hardware RAID1.

RAID51 and RAID61 used to be used, but aren't much now.  They provided an easy 
way to have proper data verification without always having the rebuild overhead 
of RAID5/6 and without needing to do checksumming. They are pretty much useless 
for BTRFS, as it can already tell which copy is correct.


So far you are repeating what I said: the only useful raid profiles are
- striping
- mirroring
- striping+parity (even limiting the number of disks involved)
- striping+mirroring


No, not quite.  At least, not in the combinations you're saying make 
sense if you are using standard terminology.  RAID05 and RAID06 are not 
the same thing as 'striping+parity' as BTRFS implements that case, and 
can be significantly more optimized than the trivial implementation of 
just limiting the number of disks involved in each chunk (by, you know, 
actually striping just like what we currently call raid10 mode in BTRFS 
does).




RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might 
actually make sense in BTRFS to provide a backup means of rebuilding blocks 
that fail checksum validation if both copies fail.

If you need further redundancy, it is easy to implement a parity3 and parity4 
raid profile instead of stacking a raid6+raid1

I think you're misunderstanding what I mean here.

RAID15/16 consist of two layers:
* The top layer is regular RAID1, usually limited to two copies.
* The lower layer is RAID5 or RAID6.

This means that the lower layer can validate which of the two copies in 
the upper layer is correct when they don't agree.  It doesn't really 
provide significantly better redundancy (they can technically sustain 
more disk failures without failing completely than simple two-copy RAID1 
can, but just like BTRFS raid10, they can't reliably survive more than 
one (or two if you're using RAID6 as the lower layer) disk failure), so 
it does not do the same thing that higher-order parity does.




The fact that you can combine striping and mirroring (or pairing) makes sense 
because you could have a speed gain (see below).
[]


As someone else pointed out, md/lvm-raid10 already work like this.
What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
much works this way except with huge (gig size) chunks.


As implemented in BTRFS, raid1 doesn't have striping.


The argument is that because there's only two copies, on multi-device
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
alternate device pairs, it's effectively striped at the macro level, with
the 1 GiB device-level chunks effectively being huge individual device
strips of 1 GiB.


The striping concept is based on the fact that if the "stripe size" is small 
enough you have a speed benefit because the reads may be performed in parallel 
from different disks.

Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-19 Thread Austin S. Hemmelgarn

On 2018-07-19 03:27, Qu Wenruo wrote:



On 2018年07月14日 02:46, David Sterba wrote:

Hi,

I have some goodies that go into the RAID56 problem, although not
implementing all the remaining features, it can be useful independently.

This time my hackweek project

https://hackweek.suse.com/17/projects/do-something-about-btrfs-and-raid56

aimed to implement the fix for the write hole problem but I spent more
time with analysis and design of the solution and don't have a working
prototype for that yet.

This patchset brings a feature that will be used by the raid56 log, the
log has to be on the same redundancy level and thus we need a 3-copy
replication for raid6. As it was easy to extend to higher replication,
I've added a 4-copy replication, that would allow triple copy raid (that
does not have a standardized name).


So this special level will be used for RAID56 for now?
Or it will also be possible for metadata usage just like current RAID1?

If the latter, the metadata scrub problem will need to be considered more.

For RAID1 with more copies, there will be a higher possibility of one or
two devices going missing and then being scrubbed.
For metadata scrub, the inlined csum can't ensure a copy is the latest one.

So for such a RAID1 scrub, we need to read out all copies and compare
their generations to find the correct copy.
At least from the changeset, it doesn't look like this is addressed yet.

And this also reminds me that the current scrub is not as flexible as
balance. I'd really like to be able to filter block groups to scrub just
like balance, and do scrub on a block-group basis rather than a devid basis.
That is to say, for a block-group scrub we don't really care which device
we're scrubbing; we just need to ensure every device in this block group
is storing correct data.

This would actually be rather useful for non-parity cases too.  Being 
able to scrub only metadata when the data chunks are using a profile 
that provides no rebuild support would be great for performance.


On the same note, it would be _really_ nice to be able to scrub a subset 
of the volume's directory tree, even if it were only per-subvolume.



Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-19 Thread Austin S. Hemmelgarn

On 2018-07-18 15:42, Goffredo Baroncelli wrote:

On 07/18/2018 09:20 AM, Duncan wrote:

Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:


On 07/17/2018 11:12 PM, Duncan wrote:

Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:


On 07/15/2018 04:37 PM, waxhead wrote:



Striping and mirroring/pairing are orthogonal properties; mirror and
parity are mutually exclusive.


I can't agree.  I don't know whether you meant that in the global
sense,
or purely in the btrfs context (which I suspect), but either way I
can't agree.

In the pure btrfs context, while striping and mirroring/pairing are
orthogonal today, Hugo's whole point was that btrfs is theoretically
flexible enough to allow both together and the feature may at some
point be added, so it makes sense to have a layout notation format
flexible enough to allow it as well.


When I say orthogonal, it means that these can be combined, i.e. you can
have:
- striping (RAID0)
- parity  (?)
- striping + parity  (e.g. RAID5/6)
- mirroring  (RAID1)
- mirroring + striping  (RAID10)

However, you can't have mirroring+parity; this means that a notation
where both 'C' (= number of copies) and 'P' (= number of parities) appear
is too verbose.


Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on
top of mirroring or mirroring on top of raid5/6, much as raid10 is
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
on top of raid0.

And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top 
of) ???

Seriously, of course you can combine a lot of different profiles; however the 
only ones that make sense are the ones above.

No, there are cases where other configurations make sense.

RAID05 and RAID06 are very widely used, especially on NAS systems where 
you have lots of disks.  The RAID5/6 lower layer mitigates the data loss 
risk of RAID0, and the RAID0 upper-layer mitigates the rebuild 
scalability issues of RAID5/6.  In fact, this is pretty much the 
standard recommended configuration for large ZFS arrays that want to use 
parity RAID.  This could be reasonably easily supported to a rudimentary 
degree in BTRFS by providing the ability to limit the stripe width for 
the parity profiles.


Some people use RAID50 or RAID60, although they are strictly speaking 
inferior in almost all respects to RAID05 and RAID06.


RAID01 is also used on occasion; it ends up having the same storage 
capacity as RAID10, but for some RAID implementations it has a different 
performance envelope and different rebuild characteristics.  Usually, 
when it is used though, it's software RAID0 on top of hardware RAID1.


RAID51 and RAID61 used to be used, but aren't much now.  They provided 
an easy way to have proper data verification without always having the 
rebuild overhead of RAID5/6 and without needing to do checksumming. 
They are pretty much useless for BTRFS, as it can already tell which 
copy is correct.


RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they 
might actually make sense in BTRFS to provide a backup means of 
rebuilding blocks that fail checksum validation if both copies fail.


The fact that you can combine striping and mirroring (or pairing) makes sense 
because you could have a speed gain (see below).
[]


As someone else pointed out, md/lvm-raid10 already work like this.
What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
much works this way except with huge (gig size) chunks.


As implemented in BTRFS, raid1 doesn't have striping.


The argument is that because there's only two copies, on multi-device
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
alternate device pairs, it's effectively striped at the macro level, with
the 1 GiB device-level chunks effectively being huge individual device
strips of 1 GiB.


The striping concept is based on the fact that if the "stripe size" is small 
enough you have a speed benefit because the reads may be performed in parallel from 
different disks.
That's not the only benefit of striping though.  The other big one is 
that you now have one volume that's the combined size of both of the 
original devices.  Striping is arguably better for this even if you're 
using a large stripe size because it better balances the wear across the 
devices than simple concatenation.



With a "stripe size" of 1GB, it is very unlikely that this would happens.
That's a pretty big assumption.  There are all kinds of access patterns 
that will still distribute the load reasonably evenly across the 
constituent devices, even if they don't parallelize things.


If, for example, all your files are 64k or less, and you only read whole 
files, there's no functional difference between RAID0 with 1GB blocks 
and RAID0 with 64k blocks.  Such a workload is not unusual on a very 
busy mail-server.
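
A quick simulation of that point (illustrative only; it assumes whole-file reads
of 64 KiB files placed at uniformly random, 64 KiB-aligned offsets within 1 TiB
on 4 devices): each request hits exactly one device regardless of strip size,
and the load still spreads about evenly whether the strip is 64 KiB or 1 GiB:

#include <stdio.h>
#include <stdlib.h>

#define NR_DEVICES 4ULL

static void simulate(unsigned long long strip, const char *label)
{
    unsigned long long hits[NR_DEVICES] = { 0 };

    srand(42);
    for (int i = 0; i < 1000000; i++) {
        /* Random 64 KiB-aligned file start within 1 TiB. */
        unsigned long long r = ((unsigned long long)rand() << 15) ^
                               (unsigned long long)rand();
        unsigned long long offset = (r % (1ULL << 24)) * 65536ULL;

        hits[(offset / strip) % NR_DEVICES]++;   /* device serving the read */
    }

    printf("%-14s", label);
    for (unsigned long long d = 0; d < NR_DEVICES; d++)
        printf(" dev%llu=%llu", d, hits[d]);
    printf("\n");
}

int main(void)
{
    simulate(64ULL << 10, "64 KiB strip:");
    simulate(1ULL << 30,  "1 GiB strip:");
    return 0;
}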


  

At 1 GiB strip size it doesn't have the typical performance advantage of
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
strips/chunks.

Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-19 Thread Qu Wenruo


On 2018年07月14日 02:46, David Sterba wrote:
> Hi,
> 
> I have some goodies that go into the RAID56 problem, although not
> implementing all the remaining features, it can be useful independently.
> 
> This time my hackweek project
> 
> https://hackweek.suse.com/17/projects/do-something-about-btrfs-and-raid56
> 
> aimed to implement the fix for the write hole problem but I spent more
> time with analysis and design of the solution and don't have a working
> prototype for that yet.
> 
> This patchset brings a feature that will be used by the raid56 log, the
> log has to be on the same redundancy level and thus we need a 3-copy
> replication for raid6. As it was easy to extend to higher replication,
> I've added a 4-copy replication, that would allow triple copy raid (that
> does not have a standardized name).

So this special level will be used for RAID56 for now?
Or it will also be possible for metadata usage just like current RAID1?

If the latter, the metadata scrub problem will need to be considered more.

For RAID1 with more copies, there will be a higher possibility of one or
two devices going missing and then being scrubbed.
For metadata scrub, the inlined csum can't ensure a copy is the latest one.

So for such a RAID1 scrub, we need to read out all copies and compare
their generations to find the correct copy.
At least from the changeset, it doesn't look like this is addressed yet.

And this also reminds me that the current scrub is not as flexible as
balance. I'd really like to be able to filter block groups to scrub just
like balance, and do scrub on a block-group basis rather than a devid basis.
That is to say, for a block-group scrub we don't really care which device
we're scrubbing; we just need to ensure every device in this block group
is storing correct data.

Thanks,
Qu

> 
> The number of copies is fixed, so it's not N-copy for an arbitrary N.
> This would complicate the implementation too much, though I'd be willing
> to add a 5-copy replication for a small bribe.
> 
> The new raid profiles are covered by an incompatibility bit, called
> extended_raid; the (idealistic) plan is to stuff in as many new
> raid-related features as possible. The patch 4/4 mentions the 3- and 4-copy
> raid1, configurable stripe length, write hole log and triple parity.
> If the plan turns out to be too ambitious, the ready and implemented
> features will be split and merged.
> 
> An interesting question is the naming of the extended profiles. I picked
> something that can be easily understood but it's not a final proposal.
> Years ago, Hugo proposed a naming scheme that described the
> non-standard raid varieties of the btrfs flavor:
> 
> https://marc.info/?l=linux-btrfs&m=136286324417767
> 
> Switching to this naming would be a good addition to the extended raid.
> 
> Regarding the missing raid56 features, I'll continue working on them as
> time permits in the following weeks/months, as I'm not aware of anybody
> working on that actively enough so to speak.
> 
> Anyway, git branches with the patches:
> 
> kernel: git://github.com/kdave/btrfs-devel dev/extended-raid-ncopies
> progs:  git://github.com/kdave/btrfs-progs dev/extended-raid-ncopies
> 
> David Sterba (4):
>   btrfs: refactor block group replication factor calculation to a helper
>   btrfs: add support for 3-copy replication (raid1c3)
>   btrfs: add support for 4-copy replication (raid1c4)
>   btrfs: add incompatibility bit for extended raid features
> 
>  fs/btrfs/ctree.h|  1 +
>  fs/btrfs/extent-tree.c  | 45 +++---
>  fs/btrfs/relocation.c   |  1 +
>  fs/btrfs/scrub.c|  4 +-
>  fs/btrfs/super.c| 17 +++
>  fs/btrfs/sysfs.c|  2 +
>  fs/btrfs/volumes.c  | 84 ++---
>  fs/btrfs/volumes.h  |  6 +++
>  include/uapi/linux/btrfs.h  | 12 -
>  include/uapi/linux/btrfs_tree.h |  6 +++
>  10 files changed, 134 insertions(+), 44 deletions(-)
> 





Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Goffredo Baroncelli
On 07/18/2018 09:20 AM, Duncan wrote:
> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
> excerpted:
> 
>> On 07/17/2018 11:12 PM, Duncan wrote:
>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>> excerpted:
>>>
 On 07/15/2018 04:37 PM, waxhead wrote:
>>>
 Striping and mirroring/pairing are orthogonal properties; mirror and
 parity are mutually exclusive.
>>>
>>> I can't agree.  I don't know whether you meant that in the global
>>> sense,
>>> or purely in the btrfs context (which I suspect), but either way I
>>> can't agree.
>>>
>>> In the pure btrfs context, while striping and mirroring/pairing are
>>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>>> flexible enough to allow both together and the feature may at some
>>> point be added, so it makes sense to have a layout notation format
>>> flexible enough to allow it as well.
>>
>> When I say orthogonal, it means that these can be combined, i.e. you can
>> have:
>> - striping (RAID0)
>> - parity  (?)
>> - striping + parity  (e.g. RAID5/6)
>> - mirroring  (RAID1)
>> - mirroring + striping  (RAID10)
>>
>> However, you can't have mirroring+parity; this means that a notation
>> where both 'C' (= number of copies) and 'P' (= number of parities) appear
>> is too verbose.
> 
> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
> top of mirroring or mirroring on top of raid5/6, much as raid10 is 
> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
> on top of raid0.  
And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top 
of) ???

Seriously, of course you can combine a lot of different profiles; however the 
only ones that make sense are the ones above.

The fact that you can combine striping and mirroring (or pairing) makes sense 
because you could have a speed gain (see below). 
[]
>>>
>>> As someone else pointed out, md/lvm-raid10 already work like this. 
>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>> much works this way except with huge (gig size) chunks.
>>
>> As implemented in BTRFS, raid1 doesn't have striping.
> 
> The argument is that because there's only two copies, on multi-device 
> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to 
> alternate device pairs, it's effectively striped at the macro level, with 
> the 1 GiB device-level chunks effectively being huge individual device 
> strips of 1 GiB.

The striping concept is based on the fact that if the "stripe size" is small 
enough you have a speed benefit because the reads may be performed in parallel 
from different disks.
With a "stripe size" of 1GB, it is very unlikely that this would happen.

 
> At 1 GiB strip size it doesn't have the typical performance advantage of 
> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB 
> strips/chunks.



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Hugo Mills
On Wed, Jul 18, 2018 at 08:39:48AM +, Duncan wrote:
> Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:
> 
> >> As implemented in BTRFS, raid1 doesn't have striping.
> > 
> > The argument is that because there's only two copies, on multi-device
> > btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
> > alternate device pairs, it's effectively striped at the macro level,
> > with the 1 GiB device-level chunks effectively being huge individual
> > device strips of 1 GiB.
> > 
> > At 1 GiB strip size it doesn't have the typical performance advantage of
> > striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
> > strips/chunks.
> 
> I forgot this bit...
> 
> Similarly, multi-device single is regarded by some to be conceptually 
> equivalent to raid0 with really huge GiB strips/chunks.
> 
> (As you may note, "the argument is" and "regarded by some" are distancing 
> phrases.  I've seen the argument made on-list, but while I understand the 
> argument and agree with it to some extent, I'm still a bit uncomfortable 
> with it and don't normally make it myself, this thread being a noted 
> exception tho originally I simply repeated what someone else already said 
> in-thread, because I too agree it's stretching things a bit.  But it does 
> appear to be a useful conceptual equivalency for some, and I do see the 
> similarity.
> 
> Perhaps it's a case of coder's view (no code doing it that way, it's just 
> a coincidental oddity conditional on equal sizes), vs. sysadmin's view 
> (code or not, accidental or not, it's a reasonably accurate high-level 
> description of how it ends up working most of the time with equivalent 
> sized devices).)

   Well, it's an *accurate* observation. It's just not a particularly
*useful* one. :)

   Hugo.

-- 
Hugo Mills | I gave up smoking, drinking and sex once. It was the
hugo@... carfax.org.uk | scariest 20 minutes of my life.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-18 03:20, Duncan wrote:

Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:


On 07/17/2018 11:12 PM, Duncan wrote:

Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:


On 07/15/2018 04:37 PM, waxhead wrote:



Striping and mirroring/pairing are orthogonal properties; mirror and
parity are mutually exclusive.


I can't agree.  I don't know whether you meant that in the global
sense,
or purely in the btrfs context (which I suspect), but either way I
can't agree.

In the pure btrfs context, while striping and mirroring/pairing are
orthogonal today, Hugo's whole point was that btrfs is theoretically
flexible enough to allow both together and the feature may at some
point be added, so it makes sense to have a layout notation format
flexible enough to allow it as well.


When I say orthogonal, it means that these can be combined, i.e. you can
have:
- striping (RAID0)
- parity  (?)
- striping + parity  (e.g. RAID5/6)
- mirroring  (RAID1)
- mirroring + striping  (RAID10)

However, you can't have mirroring+parity; this means that a notation
where both 'C' (= number of copies) and 'P' (= number of parities) appear
is too verbose.


Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on
top of mirroring or mirroring on top of raid5/6, much as raid10 is
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
on top of raid0.

While it's not possible today on (pure) btrfs (it's possible today with
md/dm-raid or hardware-raid handling one layer), it's theoretically
possible both for btrfs and in general, and it could be added to btrfs in
the future, so a notation with the flexibility to allow parity and
mirroring together does make sense, and having just that sort of
flexibility is exactly why Hugo made the notation proposal he did.

Tho a sensible use-case for mirroring+parity is a different question.  I
can see a case being made for it if one layer is hardware/firmware raid,
but I'm not entirely sure what the use-case for pure-btrfs raid16 or 61
(or 15 or 51) might be, where pure mirroring or pure parity wouldn't
arguably be at least as good a match to the use-case.  Perhaps one of
the other experts in such things here might help with that.


Question #2: historically, RAID10 requires 4 disks. However, I am
wondering if the striping could be done on a different number of disks:
what about RAID1+striping on 3 (or 5) disks? The key to striping is
that every 64k, the data are stored on a different disk


As someone else pointed out, md/lvm-raid10 already work like this.
What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
much works this way except with huge (gig size) chunks.


As implemented in BTRFS, raid1 doesn't have striping.


The argument is that because there's only two copies, on multi-device
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
alternate device pairs, it's effectively striped at the macro level, with
the 1 GiB device-level chunks effectively being huge individual device
strips of 1 GiB.
Actually, it also behaves like LVM and MD RAID10 for any number of 
devices greater than 2, though the exact placement may diverge because 
of BTRFS's concept of different chunk types.  In LVM and MD RAID10, each 
block is stored as two copies, and what disks it ends up on is dependent 
on the block number modulo the number of disks (so, for 3 disks A, B, 
and C, block 0 is on A and B, block 1 is on C and A, and block 2 is on B 
and C, with subsequent blocks following the same pattern).  In an 
idealized model of BTRFS with only one chunk type, you get exactly the 
same behavior (because BTRFS allocates chunks based on disk utilization, 
and prefers lower numbered disks to higher ones in the event of a tie).
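
A small sketch of that placement rule (the md "near" layout with 2 copies;
illustrative only): copy k of block b lands on disk (b * copies + k) mod ndisks,
which for 3 disks A, B and C reproduces exactly the A/B, C/A, B/C pattern above:

#include <stdio.h>

#define COPIES 2

/* Disks holding the two copies of a given block. */
static void place(int block, int ndisks, int *d0, int *d1)
{
    *d0 = (block * COPIES)     % ndisks;
    *d1 = (block * COPIES + 1) % ndisks;
}

int main(void)
{
    const char name[] = "ABCDE";
    int ndisks = 3;

    for (int block = 0; block < 6; block++) {
        int d0, d1;
        place(block, ndisks, &d0, &d1);
        printf("block %d -> disks %c and %c\n", block, name[d0], name[d1]);
    }
    return 0;
}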


At 1 GiB strip size it doesn't have the typical performance advantage of
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
strips/chunks.




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-18 04:39, Duncan wrote:

Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:


As implemented in BTRFS, raid1 doesn't have striping.


The argument is that because there's only two copies, on multi-device
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
alternate device pairs, it's effectively striped at the macro level,
with the 1 GiB device-level chunks effectively being huge individual
device strips of 1 GiB.

At 1 GiB strip size it doesn't have the typical performance advantage of
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
strips/chunks.


I forgot this bit...

Similarly, multi-device single is regarded by some to be conceptually
equivalent to raid0 with really huge GiB strips/chunks.

(As you may note, "the argument is" and "regarded by some" are distancing
phrases.  I've seen the argument made on-list, but while I understand the
argument and agree with it to some extent, I'm still a bit uncomfortable
with it and don't normally make it myself, this thread being a noted
exception tho originally I simply repeated what someone else already said
in-thread, because I too agree it's stretching things a bit.  But it does
appear to be a useful conceptual equivalency for some, and I do see the
similarity.
If the file is larger than the data chunk size, it _is_ striped, because 
it spans multiple chunks which are on separate devices.  Otherwise, it's 
more similar to what in GlusterFS is called a 'distributed volume'.  In 
such a Gluster volume, each file is entirely stored on one node (or you 
have a complete copy on N nodes where N is the number of replicas), with 
the selection of what node is used for the next file created being based 
on which node has the most free space.
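
As a toy model of that placement (illustrative only, not Gluster or btrfs code):
each new file is stored whole on one device, chosen as the one with the most
free space at creation time:

#include <stdio.h>

#define NR_DEVICES 3

/* Device with the most free space (lowest index wins ties). */
static int pick_device(const unsigned long long *free_bytes)
{
    int best = 0;
    for (int i = 1; i < NR_DEVICES; i++)
        if (free_bytes[i] > free_bytes[best])
            best = i;
    return best;
}

int main(void)
{
    unsigned long long free_bytes[NR_DEVICES] = { 500, 400, 450 }; /* say, GiB */
    unsigned long long files[] = { 100, 100, 100, 100 };           /* file sizes */

    for (int i = 0; i < 4; i++) {
        int dev = pick_device(free_bytes);
        free_bytes[dev] -= files[i];
        printf("file %d -> device %d (free now %llu)\n", i, dev, free_bytes[dev]);
    }
    return 0;
}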


That said, the main reason I explain single and raid1 the way I do is 
that I've found it's a much simpler way to explain generically how they 
work to people who already have storage background but may not care 
about the specifics.


Perhaps it's a case of coder's view (no code doing it that way, it's just
a coincidental oddity conditional on equal sizes), vs. sysadmin's view
(code or not, accidental or not, it's a reasonably accurate high-level
description of how it ends up working most of the time with equivalent
sized devices).)




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Duncan
Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:

>> As implemented in BTRFS, raid1 doesn't have striping.
> 
> The argument is that because there's only two copies, on multi-device
> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
> alternate device pairs, it's effectively striped at the macro level,
> with the 1 GiB device-level chunks effectively being huge individual
> device strips of 1 GiB.
> 
> At 1 GiB strip size it doesn't have the typical performance advantage of
> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
> strips/chunks.

I forgot this bit...

Similarly, multi-device single is regarded by some to be conceptually 
equivalent to raid0 with really huge GiB strips/chunks.

(As you may note, "the argument is" and "regarded by some" are distancing 
phrases.  I've seen the argument made on-list, but while I understand the 
argument and agree with it to some extent, I'm still a bit uncomfortable 
with it and don't normally make it myself, this thread being a noted 
exception tho originally I simply repeated what someone else already said 
in-thread, because I too agree it's stretching things a bit.  But it does 
appear to be a useful conceptual equivalency for some, and I do see the 
similarity.

Perhaps it's a case of coder's view (no code doing it that way, it's just 
a coincidental oddity conditional on equal sizes), vs. sysadmin's view 
(code or not, accidental or not, it's a reasonably accurate high-level 
description of how it ends up working most of the time with equivalent 
sized devices).)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Duncan
Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:

> On 07/17/2018 11:12 PM, Duncan wrote:
>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>> excerpted:
>> 
>>> On 07/15/2018 04:37 PM, waxhead wrote:
>> 
>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>> parity are mutually exclusive.
>> 
>> I can't agree.  I don't know whether you meant that in the global
>> sense,
>> or purely in the btrfs context (which I suspect), but either way I
>> can't agree.
>> 
>> In the pure btrfs context, while striping and mirroring/pairing are
>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>> flexible enough to allow both together and the feature may at some
>> point be added, so it makes sense to have a layout notation format
>> flexible enough to allow it as well.
> 
> When I say orthogonal, I mean that these can be combined: i.e. you can
> have - striping (RAID0)
> - parity  (?)
> - striping + parity  (e.g. RAID5/6)
> - mirroring  (RAID1)
> - mirroring + striping  (RAID10)
> 
> However you can't have mirroring+parity; this means that a notation
> where both 'C' (= number of copies) and 'P' (= number of parities)
> appear is too verbose.

Yes, you can have mirroring+parity: conceptually it's simply raid5/6 on 
top of mirroring, or mirroring on top of raid5/6, much as raid10 is 
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
on top of raid0.  

While it's not possible today on (pure) btrfs (it's possible today with 
md/dm-raid or hardware-raid handling one layer), it's theoretically 
possible both for btrfs and in general, and it could be added to btrfs in 
the future, so a notation with the flexibility to allow parity and 
mirroring together does make sense, and having just that sort of 
flexibility is exactly why Hugo made the notation proposal he did.
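
(Just to make the shape of such a notation concrete, here's a tiny Python 
sketch of how a copies/stripes/parity string along the lines of Hugo's 
proposal might be parsed.  The exact grammar is my own assumption for 
illustration; the notation is only a proposal in this thread, not 
something implemented in btrfs-progs.)

    import re

    # Hypothetical parser for a "<copies>c[<stripes>s][<parity>p]" profile
    # string, e.g. "2c", "1cMs2p", "3c2s1p"; 'M' means "as many stripes as
    # devices allow".  Grammar is assumed for illustration only.
    PROFILE_RE = re.compile(r'^(\d+)c(?:(\d+|M)s)?(?:(\d+)p)?$')

    def parse_profile(s):
        m = PROFILE_RE.match(s)
        if not m:
            raise ValueError(f"unrecognised profile: {s!r}")
        copies = int(m.group(1))
        stripes = m.group(2)                  # None, a digit string, or 'M'
        stripes = None if stripes == 'M' else int(stripes or 0)
        parity = int(m.group(3) or 0)
        return copies, stripes, parity        # stripes=None means maximum spread

    for s in ("2c", "1cMs2p", "3cMs1p"):      # raid1, raid6, mirroring+parity
        print(s, parse_profile(s))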

Tho a sensible use-case for mirroring+parity is a different question.  I 
can see a case being made for it if one layer is hardware/firmware raid, 
but I'm not entirely sure what the use-case for pure-btrfs raid16 or 61 
(or 15 or 51) might be, where pure mirroring or pure parity wouldn't 
arguably be at least as good a match to the use-case.  Perhaps one of 
the other experts in such things here might help with that.

>>> Question #2: historically RAID10 requires 4 disks. However I am
>>> wondering whether the stripe could be done on a different number of disks:
>>> What about RAID1+striping on 3 (or 5) disks? The key of striping is
>>> that every 64k, the data are stored on a different disk
>> 
>> As someone else pointed out, md/lvm-raid10 already work like this. 
>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>> much works this way except with huge (gig size) chunks.
> 
> As implemented in BTRFS, raid1 doesn't have striping.

The argument is that because there are only two copies, on multi-device 
btrfs raid1 with 4+ devices of equal size (so chunk allocations tend to 
alternate device pairs), it's effectively striped at the macro level, with 
the 1 GiB device-level chunks acting as huge individual device strips.

At 1 GiB strip size it doesn't have the typical performance advantage of 
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB 
strips/chunks.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Goffredo Baroncelli
On 07/17/2018 11:12 PM, Duncan wrote:
> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
> excerpted:
> 
>> On 07/15/2018 04:37 PM, waxhead wrote:
> 
>> Striping and mirroring/pairing are orthogonal properties; mirror and
>> parity are mutually exclusive.
> 
> I can't agree.  I don't know whether you meant that in the global sense, 
> or purely in the btrfs context (which I suspect), but either way I can't 
> agree.
> 
> In the pure btrfs context, while striping and mirroring/pairing are 
> orthogonal today, Hugo's whole point was that btrfs is theoretically 
> flexible enough to allow both together and the feature may at some point 
> be added, so it makes sense to have a layout notation format flexible 
> enough to allow it as well.

When I say orthogonal, I mean that these can be combined: i.e. you can have
- striping (RAID0)
- parity  (?)
- striping + parity  (e.g. RAID5/6)
- mirroring  (RAID1)
- mirroring + striping  (RAID10)

However you can't have mirroring+parity; this means that a notation where both 
'C' (= number of copies) and 'P' (= number of parities) appear is too verbose.

[...]
> 
>> Question #2: historically RAID10 requires 4 disks. However I am
>> wondering whether the stripe could be done on a different number of disks:
>> What about RAID1+striping on 3 (or 5) disks? The key of striping is
>> that every 64k, the data are stored on a different disk
> 
> As someone else pointed out, md/lvm-raid10 already work like this.  What 
> btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much 
> works this way except with huge (gig size) chunks.

As implemented in BTRFS, raid1 doesn't have striping.

 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-17 Thread Duncan
Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:

> On 07/15/2018 04:37 PM, waxhead wrote:

> Striping and mirroring/pairing are orthogonal properties; mirror and
> parity are mutually exclusive.

I can't agree.  I don't know whether you meant that in the global sense, 
or purely in the btrfs context (which I suspect), but either way I can't 
agree.

In the pure btrfs context, while striping and mirroring/pairing are 
orthogonal today, Hugo's whole point was that btrfs is theoretically 
flexible enough to allow both together and the feature may at some point 
be added, so it makes sense to have a layout notation format flexible 
enough to allow it as well.

In the global context, just to complete things and mostly for others 
reading (as I feel a bit like a simpleton explaining to the expert here): 
just as raid10 is shorthand for raid1+0, aka raid0 layered on top of 
raid1 (normally preferred to raid01, aka raid0+1, aka raid1 on top of 
raid0, due to rebuild characteristics, tho raid1-on-raid0 is sometimes 
recommended here in the form of btrfs raid1 on top of whatever raid0, due 
to btrfs' data integrity characteristics and less optimized performance), 
there are also raid51 and raid15, raid61 and raid16, etc, with or without 
the + symbols, involving mirroring and parity conceptually at two 
different levels, altho they can be combined in a single implementation 
just as raid10 and raid01 commonly are.  These additional layered-raid 
levels can be used for higher reliability, with differing rebuild and 
performance characteristics between the two forms depending on which is 
the top layer.

> Question #1: for "parity" profiles, does make sense to limit the maximum
> disks number where the data may be spread ? If the answer is not, we
> could omit the last S. IMHO it should.

As someone else already replied, btrfs doesn't currently have the ability 
to specify a spread limit, but the idea, if we're going to change the 
notation, is to build in enough flexibility that the feature can be added 
later without further notation changes.

Why might it make sense to specify spread?  At least two possible reasons:

a) (stealing an already posted example) Consider a multi-device layout 
with two or more device sizes.  Someone may want to limit the spread in 
order to keep performance and risk consistent as the smaller devices 
fill up, limiting further usage to a lower number of devices.  If that 
lower number is specified as the spread from the start, behavior stays 
more consistent between the case where all devices still have room and 
the case where only some do.

b) Limiting spread can change the risk and rebuild performance profiles.  
Stripes of full width mean all stripes have a strip on each device, so 
knock a device out and (assuming parity or mirroring) replace it, and all 
stripes are degraded and must be rebuilt.  With less than maximum spread, 
some stripes won't be striped onto the replaced device, and won't be 
degraded or need rebuilding, tho assuming the same overall fill, a larger 
percentage of the stripes that /do/ need rebuilding will be on the 
replaced device.  So the trade-off is between more "objects" 
(stripes/chunks/files) affected but less of each, and fewer objects 
affected in total but more of each affected object.
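
To put rough numbers on (b): under the simplifying assumption that each 
stripe lands on a uniformly random set of devices, the fraction of 
stripes touching any one failed device is just spread-width divided by 
device-count.  A quick sketch with my own toy numbers, nothing measured:

    from math import comb

    # Toy model for the spread-limit trade-off: n devices, each stripe is
    # placed on w of them chosen uniformly at random.  The chance that a
    # given stripe touches one particular (failed) device is
    #     C(n-1, w-1) / C(n, w)  ==  w / n
    # so narrower spread means fewer stripes need any rebuild at all,
    # though each affected stripe carries a larger share of the rebuild.
    def fraction_degraded(n, w):
        return comb(n - 1, w - 1) / comb(n, w)

    for w in (4, 6, 10):
        print(f"10 devices, spread {w}: {fraction_degraded(10, w):.0%} of stripes degraded")
    # spread 4 -> 40%, spread 6 -> 60%, spread 10 (full width) -> 100%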

> Question #2: historically RAID10 requires 4 disks. However I am
> wondering whether the stripe could be done on a different number of disks:
> What about RAID1+striping on 3 (or 5) disks? The key of striping is
> that every 64k, the data are stored on a different disk

As someone else pointed out, md/lvm-raid10 already work like this.  What 
btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much 
works this way except with huge (gig size) chunks.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-16 Thread waxhead

waxhead wrote:

David Sterba wrote:

An interesting question is the naming of the extended profiles. I picked
something that can be easily understood but it's not a final proposal.
Years ago, Hugo proposed a naming scheme that described the
non-standard raid varieties of the btrfs flavor:

https://marc.info/?l=linux-btrfs=136286324417767

Switching to this naming would be a good addition to the extended raid.

As just a humble BTRFS user I agree and really think it is about time to 
move far away from the RAID terminology. However adding some more 
descriptive profile names (or at least some aliases) would be much 
better for the commoners (such as myself). 
...snip... > Which would make the above table look like so:


Old format / My Format / My suggested alias
SINGLE  / R0.S0.P0 / SINGLE
DUP / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
RAID0   / R0.Sm.P0 / STRIPE
RAID1   / R1.S0.P0 / MIRROR1
RAID1c3 / R2.S0.P0 / MIRROR2
RAID1c4 / R3.S0.P0 / MIRROR3
RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
RAID5   / R1.Sm.P1 / STRIPE.PARITY1
RAID6   / R1.Sm.P2 / STRIPE.PARITY2

And i think this is much more readable, but others may disagree. And as 
a side note... from a (hobby) coders perspective this is probably 
simpler to parse as well. 
...snap...


...and before someone else points out that my suggestion has an 
ugly flaw, I got a bit copy/paste happy and messed up the RAID 5- and 
6-like profiles. The table below is corrected and hopefully it makes the 
point why using the word 'replicas' is easier to understand than 
'copies', even if I messed it up :)


Old format / My Format / My suggested alias
SINGLE  / R0.S0.P0 / SINGLE
DUP / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
RAID0   / R0.Sm.P0 / STRIPE
RAID1   / R1.S0.P0 / MIRROR1
RAID1c3 / R2.S0.P0 / MIRROR2
RAID1c4 / R3.S0.P0 / MIRROR3
RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
RAID5   / R0.Sm.P1 / STRIPE.PARITY1
RAID6   / R0.Sm.P2 / STRIPE.PARITY2
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-16 Thread Austin S. Hemmelgarn

On 2018-07-16 14:29, Goffredo Baroncelli wrote:

On 07/15/2018 04:37 PM, waxhead wrote:

David Sterba wrote:

An interesting question is the naming of the extended profiles. I picked
something that can be easily understood but it's not a final proposal.
Years ago, Hugo proposed a naming scheme that described the
non-standard raid varieties of the btrfs flavor:

https://marc.info/?l=linux-btrfs=136286324417767

Switching to this naming would be a good addition to the extended raid.


As just a humble BTRFS user I agree and really think it is about time to move 
far away from the RAID terminology. However adding some more descriptive 
profile names (or at least some aliases) would be much better for the commoners 
(such as myself).

For example:

Old format / New Format / My suggested alias
SINGLE  / 1C / SINGLE
DUP / 2CD    / DUP (or even MIRRORLOCAL1)
RAID0   / 1CmS   / STRIPE




RAID1   / 2C / MIRROR1
RAID1c3 / 3C / MIRROR2
RAID1c4 / 4C / MIRROR3
RAID10  / 2CmS   / STRIPE.MIRROR1


Striping and mirroring/pairing are orthogonal properties; mirror and parity are 
mutually exclusive. What about

RAID1 -> MIRROR1
RAID10 -> MIRROR1S
RAID1c3 -> MIRROR2
RAID1c3+striping -> MIRROR2S

and so on...


RAID5   / 1CmS1P / STRIPE.PARITY1
RAID6   / 1CmS2P / STRIPE.PARITY2


To me these should be called something like

RAID5 -> PARITY1S
RAID6 -> PARITY2S

The final S is due to the fact that usually RAID5/6 spread the data over all 
available disks

Question #1: for "parity" profiles, does make sense to limit the maximum disks 
number where the data may be spread ? If the answer is not, we could omit the last S. 
IMHO it should.
Currently, there is no ability to cap the number of disks that striping 
can happen across.  Ideally, that will change in the future, in which 
case not only will the S be needed, but also a number indicating how 
wide the stripe is.



Question #2: historically RAID10 requires 4 disks. However I am wondering whether 
the stripe could be done on a different number of disks: What about 
RAID1+striping on 3 (or 5) disks? The key of striping is that every 64k, the 
data are stored on a different disk
This is what MD and LVM RAID10 do.  They work somewhat differently from 
what BTRFS calls raid10 (actually, what we currently call raid1 works 
almost identically to MD and LVM RAID10 when more than 3 disks are 
involved, except that the chunk size is 1G or larger).  Short of drastic 
internal changes to how that profile works, this isn't likely to happen.


In spite of both of these, there is a practical need for indicating the 
stripe width.  Depending on the configuration of the underlying storage, 
it's fully possible (and sometimes even certain) that you will see 
chunks with differing stripe widths, so properly reporting the stripe 
width (in devices, not bytes) is useful for monitoring purposes.


Consider for example a 6-device array using what's currently called a 
raid10 profile where 2 of the disks are smaller than the other four.  On 
such an array, chunks will span all six disks (resulting in 2 copies 
striped across 3 disks each) until those two smaller disks are full, at 
which point new chunks will span only the remaining four disks 
(resulting in 2 copies striped across 2 disks each).
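
Here's a toy model of that example, just to show where the two widths 
come from.  It assumes each new chunk spans every device that still has 
unallocated space, rounded down to an even count for the two copies; it's 
an illustration, not the real allocator:

    # Toy model of the 6-device example above: 2 copies per chunk, striped
    # across every device that still has room (even device count required).
    def raid10_stripe_widths(free, chunk_per_dev=1):
        widths = []
        while True:
            avail = [d for d, f in free.items() if f >= chunk_per_dev]
            ndevs = len(avail) - (len(avail) % 2)   # need an even number of devices
            if ndevs < 4:                           # raid10 minimum
                break
            for d in avail[:ndevs]:
                free[d] -= chunk_per_dev
            widths.append(ndevs // 2)               # stripe width of each copy
        return widths

    # Four larger devices and two smaller ones:
    print(raid10_stripe_widths({'a': 4, 'b': 4, 'c': 4, 'd': 4, 'e': 2, 'f': 2}))
    # -> [3, 3, 2, 2]  (3-wide stripes until e/f fill up, then 2-wide)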

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-16 Thread Goffredo Baroncelli
On 07/15/2018 04:37 PM, waxhead wrote:
> David Sterba wrote:
>> An interesting question is the naming of the extended profiles. I picked
>> something that can be easily understood but it's not a final proposal.
>> Years ago, Hugo proposed a naming scheme that described the
>> non-standard raid varieties of the btrfs flavor:
>>
>> https://marc.info/?l=linux-btrfs=136286324417767
>>
>> Switching to this naming would be a good addition to the extended raid.
>>
> As just a humble BTRFS user I agree and really think it is about time to move 
> far away from the RAID terminology. However adding some more descriptive 
> profile names (or at least some aliases) would be much better for the 
> commoners (such as myself).
> 
> For example:
> 
> Old format / New Format / My suggested alias
> SINGLE  / 1C / SINGLE
> DUP / 2CD    / DUP (or even MIRRORLOCAL1)
> RAID0   / 1CmS   / STRIPE


> RAID1   / 2C / MIRROR1
> RAID1c3 / 3C / MIRROR2
> RAID1c4 / 4C / MIRROR3
> RAID10  / 2CmS   / STRIPE.MIRROR1

Striping and mirroring/pairing are orthogonal properties; mirror and parity are 
mutually exclusive. What about

RAID1 -> MIRROR1
RAID10 -> MIRROR1S
RAID1c3 -> MIRROR2
RAID1c3+striping -> MIRROR2S

and so on...

> RAID5   / 1CmS1P / STRIPE.PARITY1
> RAID6   / 1CmS2P / STRIPE.PARITY2

To me these should be called something like

RAID5 -> PARITY1S
RAID6 -> PARITY2S

The final S is due to the fact that usually RAID5/6 spread the data over all 
available disks

Question #1: for "parity" profiles, does make sense to limit the maximum disks 
number where the data may be spread ? If the answer is not, we could omit the 
last S. IMHO it should. 
Question #2: historically RAID10 requires 4 disks. However I am wondering whether 
the stripe could be done on a different number of disks: What about 
RAID1+striping on 3 (or 5) disks? The key of striping is that every 64k, the 
data are stored on a different disk







> 
> I find that writing something like "btrfs balance start 
> -dconvert=stripe5.parity2 /mnt" is far less confusing and therefore less 
> error prone than writing "-dconvert=1C5S2P".
> 
> While Hugo's suggestion is compact and to the point I would call for 
> expanding that so it is a bit more descriptive and human readable.
> 
> So for example : STRIPE<n> where <n> obviously is the same as Hugo 
> proposed - the number of storage devices for the stripe - and no <n> would be 
> best to mean 'use max devices'.
> For PARITY then <n> is obviously required
> 
> Keep in mind that most people (...and I am willing to bet even Duncan which 
> probably HAS backups ;) ) get a bit stressed when their storage system is 
> degraded. With that in mind I hope for more elaborate, descriptive and human 
> readable profile names to be used to avoid making mistakes using the 
> "compact" layout.
> 
> ...and yes, of course this could go both ways. A more compact (and dare I say 
> cryptic) variant can cause people to stop and think before doing something 
> and thus avoid errors,
> 
> Now that I made my point I can't help being a bit extra harsh, obnoxious and 
> possibly difficult so I would also suggest that Hugo's format could have been 
> changed (dare I say improved?) from
> 
> numCOPIESnumSTRIPESnumPARITY
> 
> to.
> 
> REPLICASnum.STRIPESnum.PARITYnum
> 
> Which would make the above table look like so:
> 
> Old format / My Format / My suggested alias
> SINGLE  / R0.S0.P0 / SINGLE
> DUP / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
> RAID0   / R0.Sm.P0 / STRIPE
> RAID1   / R1.S0.P0 / MIRROR1
> RAID1c3 / R2.S0.P0 / MIRROR2
> RAID1c4 / R3.S0.P0 / MIRROR3
> RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
> RAID5   / R1.Sm.P1 / STRIPE.PARITY1
> RAID6   / R1.Sm.P2 / STRIPE.PARITY2
> 
> And i think this is much more readable, but others may disagree. And as a 
> side note... from a (hobby) coders perspective this is probably simpler to 
> parse as well.
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-15 Thread Hugo Mills
On Fri, Jul 13, 2018 at 08:46:28PM +0200, David Sterba wrote:
[snip]
> An interesting question is the naming of the extended profiles. I picked
> something that can be easily understood but it's not a final proposal.
> Years ago, Hugo proposed a naming scheme that described the
> non-standard raid varieties of the btrfs flavor:
> 
> https://marc.info/?l=linux-btrfs=136286324417767
> 
> Switching to this naming would be a good addition to the extended raid.

   I'd suggest using lower-case letters for the c, s, p, rather than
upper, as it makes it much easier to read. The upper-case version
tends to make the letters and numbers merge into each other. With
lower-case c, s, p, the taller digits (or M) stand out:

  1c
  1cMs2p
  2c3s8p (OK, just kidding about this one)

   Hugo.

-- 
Hugo Mills | The English language has the mot juste for every
hugo@... carfax.org.uk | occasion.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-15 Thread waxhead

David Sterba wrote:

An interesting question is the naming of the extended profiles. I picked
something that can be easily understood but it's not a final proposal.
Years ago, Hugo proposed a naming scheme that described the
non-standard raid varieties of the btrfs flavor:

https://marc.info/?l=linux-btrfs=136286324417767

Switching to this naming would be a good addition to the extended raid.

As just a humble BTRFS user I agree and really think it is about time to 
move far away from the RAID terminology. However adding some more 
descriptive profile names (or at least some aliases) would be much 
better for the commoners (such as myself).


For example:

Old format / New Format / My suggested alias
SINGLE  / 1C / SINGLE
DUP / 2CD/ DUP (or even MIRRORLOCAL1)
RAID0   / 1CmS   / STRIPE
RAID1   / 2C / MIRROR1
RAID1c3 / 3C / MIRROR2
RAID1c4 / 4C / MIRROR3
RAID10  / 2CmS   / STRIPE.MIRROR1
RAID5   / 1CmS1P / STRIPE.PARITY1
RAID6   / 1CmS2P / STRIPE.PARITY2

I find that writing something like "btrfs balance start 
-dconvert=stripe5.parity2 /mnt" is far less confusing and therefore less 
error prone than writing "-dconvert=1C5S2P".


While Hugo's suggestion is compact and to the point I would call for 
expanding that so it is a bit more descriptive and human readable.


So for example : STRIPE<n> where <n> obviously is the same as Hugo 
proposed - the number of storage devices for the stripe - and no <n> 
would be best to mean 'use max devices'.

For PARITY then <n> is obviously required

Keep in mind that most people (...and I am willing to bet even Duncan 
which probably HAS backups ;) ) get a bit stressed when their storage 
system is degraded. With that in mind I hope for more elaborate, 
descriptive and human readable profile names to be used to avoid making 
mistakes using the "compact" layout.


...and yes, of course this could go both ways. A more compact (and dare 
I say cryptic) variant can cause people to stop and think before doing 
something and thus avoid errors,


Now that I made my point I can't help being a bit extra harsh, obnoxious 
and possibly difficult so I would also suggest that Hugo's format could 
have been changed (dare I say improved?) from


numCOPIESnumSTRIPESnumPARITY

to.

REPLICASnum.STRIPESnum.PARITYnum

Which would make the above table look like so:

Old format / My Format / My suggested alias
SINGLE  / R0.S0.P0 / SINGLE
DUP / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
RAID0   / R0.Sm.P0 / STRIPE
RAID1   / R1.S0.P0 / MIRROR1
RAID1c3 / R2.S0.P0 / MIRROR2
RAID1c4 / R3.S0.P0 / MIRROR3
RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
RAID5   / R1.Sm.P1 / STRIPE.PARITY1
RAID6   / R1.Sm.P2 / STRIPE.PARITY2

And i think this is much more readable, but others may disagree. And as 
a side note... from a (hobby) coders perspective this is probably 
simpler to parse as well.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/4] 3- and 4- copy RAID1

2018-07-13 Thread David Sterba
Hi,

I have some goodies that go toward the RAID56 problem; although they don't
implement all the remaining features, they can be useful independently.

This time my hackweek project

https://hackweek.suse.com/17/projects/do-something-about-btrfs-and-raid56

aimed to implement the fix for the write hole problem but I spent more
time with analysis and design of the solution and don't have a working
prototype for that yet.

This patchset brings a feature that will be used by the raid56 log: the
log has to be at the same redundancy level, and thus we need a 3-copy
replication for raid6. As it was easy to extend to higher replication,
I've added a 4-copy replication, which would allow triple-copy raid (which
does not have a standardized name).

The number of copies is fixed, so it's not N-copy for an arbitrary N.
This would complicate the implementation too much, though I'd be willing
to add a 5-copy replication for a small bribe.
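
For reference, the raw-space cost of the profiles discussed here works
out roughly as in the sketch below.  This is my own illustration of the
arithmetic, not the helper added in patch 1/4:

    # Rough illustration of the raw-space cost of the profiles discussed in
    # this thread: how many bytes of raw disk one byte of data consumes.
    # My own sketch of the arithmetic, not the kernel's helper.
    PROFILES = {
        # profile: (copies, parity devices per stripe)
        'single':  (1, 0),
        'dup':     (2, 0),
        'raid0':   (1, 0),
        'raid1':   (2, 0),
        'raid1c3': (3, 0),
        'raid1c4': (4, 0),
        'raid10':  (2, 0),
        'raid5':   (1, 1),
        'raid6':   (1, 2),
    }

    def raw_cost(profile, ndevs):
        copies, parity = PROFILES[profile]
        if parity:
            return ndevs / (ndevs - parity)    # stripe spans all ndevs devices
        return copies

    print(raw_cost('raid1c3', 6))              # 3 -> three full copies
    print(round(raw_cost('raid6', 6), 2))      # 1.5 -> 4 data + 2 parity per stripe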

The new raid profiles are covered by an incompatibility bit, called
extended_raid; the (idealistic) plan is to stuff as many new
raid-related features as possible behind it. Patch 4/4 mentions the 3- and 4-copy
raid1, configurable stripe length, write hole log and triple parity.
If the plan turns out to be too ambitious, the ready and implemented
features will be split and merged.

An interesting question is the naming of the extended profiles. I picked
something that can be easily understood but it's not a final proposal.
Years ago, Hugo proposed a naming scheme that described the
non-standard raid varieties of the btrfs flavor:

https://marc.info/?l=linux-btrfs=136286324417767

Switching to this naming would be a good addition to the extended raid.

Regarding the missing raid56 features, I'll continue working on them as
time permits in the following weeks/months, as I'm not aware of anybody
working on that actively enough so to speak.

Anyway, git branches with the patches:

kernel: git://github.com/kdave/btrfs-devel dev/extended-raid-ncopies
progs:  git://github.com/kdave/btrfs-progs dev/extended-raid-ncopies

David Sterba (4):
  btrfs: refactor block group replication factor calculation to a helper
  btrfs: add support for 3-copy replication (raid1c3)
  btrfs: add support for 4-copy replication (raid1c4)
  btrfs: add incompatibility bit for extended raid features

 fs/btrfs/ctree.h|  1 +
 fs/btrfs/extent-tree.c  | 45 +++---
 fs/btrfs/relocation.c   |  1 +
 fs/btrfs/scrub.c|  4 +-
 fs/btrfs/super.c| 17 +++
 fs/btrfs/sysfs.c|  2 +
 fs/btrfs/volumes.c  | 84 ++---
 fs/btrfs/volumes.h  |  6 +++
 include/uapi/linux/btrfs.h  | 12 -
 include/uapi/linux/btrfs_tree.h |  6 +++
 10 files changed, 134 insertions(+), 44 deletions(-)

-- 
2.18.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html