Re: Triple parity and beyond

Hi David,

On 11/21/2013 3:07 AM, David Brown wrote:
 On 21/11/13 02:28, Stan Hoeppner wrote:
...
 WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
 to mirror a drive at full streaming bandwidth, assuming 300MB/s
 average--and that is probably being kind to the drive makers.  With 6 or
 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
 minimum 72 hours or more, probably over 100, and probably more yet for
 3P.  And with larger drive count arrays the rebuild times approach a
 week.  Whose users can go a week with degraded performance?  This is
 simply unreasonable, at best.  I say it's completely unacceptable.

 With these gargantuan drives coming soon, the probability of multiple
 UREs during rebuild is pretty high.  Continuing to use ever more
 complex parity RAID schemes simply increases rebuild time further.  The
 longer the rebuild, the more likely a subsequent drive failure due to
 heat buildup, vibration, etc.  Thus, in our maniacal efforts to mitigate
 one failure mode we're increasing the probability of another.  TANSTAAFL.
  Worse yet, RAID10 isn't going to survive because UREs on a single drive
 are increasingly likely with these larger drives, and one URE during
 rebuild destroys the array.


 I don't think the chance of hitting a URE during rebuild is dependent
 on the rebuild time - merely on the amount of data read during rebuild.

Please read the above paragraph again, as you misread it the first time.

  URE rates are per byte read rather than per unit time, are they not?

These are specified by the drive manufacturer, and they are per *bit*
read, not per byte read.  Current consumer drives are typically rated
at 1 URE in 10^14 bits read, enterprise are 1 in 10^15.
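As a rough illustration of what those ratings imply, the sketch below estimates the chance of hitting at least one URE while reading a full 20TB drive. It assumes every bit fails independently at exactly the rated figure, which is a simplification; the function name and the decimal-terabyte convention are choices made for this example only.

```python
import math

def p_at_least_one_ure(bytes_read: float, bits_per_ure: float) -> float:
    """Probability of one or more UREs while reading bytes_read bytes.

    Assumes each bit fails independently with probability 1/bits_per_ure.
    """
    bits = bytes_read * 8
    # 1 - (1 - p)**bits, computed with log1p/expm1 to avoid rounding loss
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_ure))

TB = 1e12  # decimal terabyte, as drive makers count capacity

for label, rating in (("consumer, 1 in 10^14", 1e14),
                      ("enterprise, 1 in 10^15", 1e15)):
    p = p_at_least_one_ure(20 * TB, rating)
    print(f"{label}: {p:.1%} chance of a URE reading 20TB")
```

Under this simplified model, a full read of a 20TB drive comes out at roughly 80% for the consumer rating and roughly 15% for the enterprise rating. The result depends only on the amount of data read, not on how long the read takes.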

 I think you are overestimating the rebuild times a bit, but there is no

Which part?  A 20TB drive mirror taking 18 hours, or parity arrays
taking many times longer than 18 hours?

 arguing that rebuild on parity raids is a lot more work (for the cpu,
 the IO system, and the disks) than for mirror raids.

It's not so much a matter of work or interface bandwidth, but a matter
of serialization and rotational latency.

...
 Shouldn't we be talking about RAID 15 here, rather than RAID 51 ?  I
 interpret RAID 15 to be like RAID 10 - a raid5 set of raid1 mirrors,
 while RAID 51 would be a raid1 mirror of raid5 sets.  I am certain
 that you mean a raid5 set of raid1 pairs - I just think you've got the
 name wrong.

Now that you mention it, yes, RAID 15 would fit much better with
convention.  Not sure why I thought 51.  So it's RAID 15 from here.

 Potential Advantages:

 1.  Only +1 disk capacity overhead vs RAID 10, regardless of drive count
 
 +2 disks (the raid5 parity disk is a raid1 pair)

One drive of each mirror is already gone.  Make a RAID 5 of the
remaining disks and you lose 1 disk.  So you lose 1 additional disk vs
RAID 10, not 2.  As I stated previously, for RAID 15 you lose [1/2]+1 of
your disks to redundancy.

...
 [1]  The RAID1/5 code would need to be patched to properly handle a URE
 encountered by the RAID1 code during rebuild.  There are surely other
 modifications and/or optimizations that would be needed.  For large
 sequential reads, more deterministic read interleaving between mirror
 pairs would be a good candidate I think.  IIUC the RAID1 driver does
 read interleaving on a per thread basis or some such, which I don't
 believe is going to work for this RAID 51 scenario, at least not for
 single streaming reads.  If this can be done well, we double the read
 performance of RAID5, and thus we don't completely waste all the extra
 disks vs big_parity schemes.

 This proposed RAID level 51 should have drastically lower rebuild
 times vs traditional striped parity, should not suffer read/write
 performance degradation with most disk failure scenarios, and with a
 read interleaving optimization may have significantly greater streaming
 read throughput as well.

 This is far from a perfect solution and I am certainly not promoting it
 as such.  But I think it does have some serious advantages over
 traditional striped parity schemes, and at minimum is worth discussion
 as a counterpoint of sorts.
 
 I don't see that there needs to be any changes to the existing md code
 to make raid15 work - it is merely a raid 5 made from a set of raid1
 pairs.  

The sole purpose of the parity layer of the proposed RAID 15 is to
replace sectors lost due to UREs during rebuild.  AFAIK the current RAID
5 and RAID 1 drivers have no code to support each other in this manner.

 I can see that improved threading and interleaving could be a
 benefit here - but that's the case in general for md raid, and it is
 something that the developers are already working on (I haven't followed
 the details, but the topic comes up regularly on the list here).

What I'm talking about here is unrelated to the kernel thread starvation
issue, which is write centric, unrelated to reads.

What I'm suggesting is

Re: Triple parity and beyond

On 11/21/2013 3:07 AM, David Brown wrote:

 For example, with 20 disks at 1 TB each, you can have:

All correct, and these are maximum redundancies.

Maximum:

 raid5 = 19TB, 1 disk redundancy
 raid6 = 18TB, 2 disk redundancy
 raid6.3 = 17TB, 3 disk redundancy
 raid6.4 = 16TB, 4 disk redundancy
 raid6.5 = 15TB, 5 disk redundancy


These are not fully correct, because only the minimums are stated.  With
any mirror based array one can lose half the disks as long as no two are
in one mirror.  The probability of a pair failing together is very low,
and this probability decreases even further as the number of drives in
the array increases.  This is one of the many reasons RAID 10 has been
so popular for so many years.

Minimum:

 raid10 = 10TB, 1 disk redundancy
 raid15 = 8TB, 3 disk redundancy
 raid16 = 6TB, 5 disk redundancy

Maximum:

RAID 10 = 10 disk redundancy
RAID 15 = 11 disk redundancy
RAID 16 = 12 disk redundancy

Range:

RAID 10 = 1-10 disk redundancy
RAID 15 = 3-11 disk redundancy
RAID 16 = 5-12 disk redundancy
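The ranges above can be reproduced with a small sketch. It assumes two-way mirrors with a parity set striped across the mirror pairs; the function name and the `parity` parameter are illustrative only (0 for RAID 10, 1 for RAID 15, 2 for RAID 16).

```python
def redundancy_range(disks: int, parity: int) -> tuple:
    """(guaranteed, best_case) disk losses survived by a parity set of RAID 1 pairs.

    parity=0 models RAID 10, parity=1 RAID 15, parity=2 RAID 16.
    """
    pairs = disks // 2
    # The array is lost once parity+1 whole mirrors are gone, which takes
    # 2*(parity+1) disks in the worst placement; one fewer always survives.
    guaranteed = 2 * (parity + 1) - 1
    # Best case: `parity` mirrors lost entirely, plus one disk from each
    # of the remaining mirrors.
    best_case = 2 * parity + (pairs - parity)
    return guaranteed, best_case

for name, parity in (("RAID 10", 0), ("RAID 15", 1), ("RAID 16", 2)):
    low, high = redundancy_range(20, parity)
    print(f"{name} = {low}-{high} disk redundancy")
```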


-- 
Stan
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Triple parity and beyond

On 11/21/2013 5:38 PM, John Williams wrote:
 On Thu, Nov 21, 2013 at 2:57 PM, Stan Hoeppner s...@hardwarefreak.com wrote:
 He wrote that article in late 2009.  It seems pretty clear he wasn't
 looking 10 years forward to 20TB drives, where the minimum mirror
 rebuild time will be ~18 hours, and parity rebuild will be much greater.
 
 Actually, it is completely obvious that he WAS looking ten years
 ahead, seeing as several of his graphs have time scales going to
 2009+10 = 2019.

Only one graph goes to 2019, the rest are 2010 or less.  That being the
case, his 2019 graph deals with projected reliability of single, double,
and triple parity.

 And he specifically mentions longer rebuild times as one of the
 reasons why higher parity RAIDs are needed.

Yes, he certainly does.  But *only* in the context of the array
surviving for the duration of a rebuild.  He doesn't state that he cares
what the total duration is, he doesn't guess what it might be, nor does
he seem to care about the degraded performance before or during the
rebuild.  He is apparently of the mindset more parity will save us,
until we need more parity, until we need more parity, until we need
more

Following this path, parity will eventually eat more disks of capacity
than RAID10 does today for average array counts, and the only reason for
it being survival of ever increasing rebuild duration.

This is precisely why I proposed RAID 15.  It gives you the single
disk cloning rebuild speed of RAID 10.  When parity hits 5P then RAID 15
becomes very competitive for smaller arrays.  And since drives at that
point will be 40-50TB each, even small arrays will need lots of
protection against UREs and additional failures during massive rebuild
times.  Here I'd say RAID 15 will beat 5P hands down.

-- 
Stan

Re: Triple parity and beyond

2013-11-22 Thread John Williams

On Fri, Nov 22, 2013 at 1:35 AM, Stan Hoeppner s...@hardwarefreak.com wrote:

 Only one graph goes to 2019, the rest are 2010 or less.  That being the
 case, his 2019 graph deals with projected reliability of single, double,
 and triple parity.

The whole article goes to 2019 (or longer). He shows current trends
and discusses where they are going in the future. The whole point of
the article is looking ahead into the future.

 Following this path, parity will eventually eat more disks of capacity
 than RAID10 does today for average array counts, and the only reason for
 it being survival of ever increasing rebuild duration.

No, that is not what the article finds. In the near future (about 10
years), triple-parity will suffice. Beyond that, perhaps quad-parity
will be required, but predicting that far ahead is usually worthless
in the computer industry.

 When parity hits 5P then RAID 15
 becomes very competitive for smaller arrays.  And since drives at that
 point will be 40-50TB each, even small arrays will need lots of
 protection against UREs and additional failures during massive rebuild
 times.  Here I'd say RAID 15 will beat 5P hands down.

I'll take triple- or 4-parity every time over the disk-wasting and
less reliable RAID 15. There is no need for 5-parity in the near
future.  I see no advantage of RAID 15, and several disadvantages.

Re: Triple parity and beyond

On 11/22/2013 2:13 AM, Stan Hoeppner wrote:
 Hi David,
 
 On 11/21/2013 3:07 AM, David Brown wrote:
...
 I don't see that there needs to be any changes to the existing md code
 to make raid15 work - it is merely a raid 5 made from a set of raid1
 pairs.  
 
 The sole purpose of the parity layer of the proposed RAID 15 is to
 replace sectors lost due to UREs during rebuild.  AFAIK the current RAID
 5 and RAID 1 drivers have no code to support each other in this manner.

Minor self correction here-- obviously this isn't the 'sole' purpose of
the parity layer.  It also allows us to recover from losing an entire
mirror, which is a big upshot of the proposed RAID 15.  Thinking this
through a little further, more code modification would be needed for
this scenario.

In the event of a double drive failure in one mirror, the RAID 1 code
will need to be modified in such a way as to allow the RAID 5 code to
rebuild the first replacement disk, because the RAID 1 device is still
in a failed state.  Once this rebuild is complete, the RAID 1 code will
need to switch the state to degraded, and then do its standard rebuild
routine for the 2nd replacement drive.

Or, with some (likely major) hacking it should be possible to rebuild
both drives simultaneously for no loss of throughput or additional
elapsed time on the RAID 5 rebuild.  In the 20TB drive case, this would
shave 18 hours off the total rebuild operation elapsed time.  With
current 4TB drives it would still save 6.5 hours.  Losing both drives in
one mirror set of a striped array is rare, but given the rebuild time
saved it may be worth investigating during any development of this RAID
15 idea.
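The hours quoted here are just a full-drive streaming copy at a sustained rate. A minimal sketch of that arithmetic follows; the 300MB/s figure for 20TB drives comes from earlier in the thread, while 175MB/s for a 4TB drive is an assumption borrowed from the throughput quoted elsewhere in the thread for current drives.

```python
def copy_hours(capacity_tb: float, mb_per_s: float) -> float:
    """Hours to stream an entire drive once at a sustained rate (decimal units)."""
    seconds = capacity_tb * 1e12 / (mb_per_s * 1e6)
    return seconds / 3600

print(f"20TB at 300MB/s: {copy_hours(20, 300):.1f} h")
print(f" 4TB at 175MB/s: {copy_hours(4, 175):.1f} h")
```

Both results are lower bounds, since they assume the drive streams at its average rate for the whole copy with no competing user I/O.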

-- 
Stan

Re: Triple parity and beyond

2013-11-22 Thread Mark Knecht

On Fri, Nov 22, 2013 at 12:13 AM, Stan Hoeppner s...@hardwarefreak.com wrote:
 Hi David,

 On 11/21/2013 3:07 AM, David Brown wrote:
SNIP
 Shouldn't we be talking about RAID 15 here, rather than RAID 51 ?  I
 interpret RAID 15 to be like RAID 10 - a raid5 set of raid1 mirrors,
 while RAID 51 would be a raid1 mirror of raid5 sets.  I am certain
 that you mean a raid5 set of raid1 pairs - I just think you've got the
 name wrong.

 Now that you mention it, yes, RAID 15 would fit much better with
 convention.  Not sure why I thought 51.  So it's RAID 15 from here.
SNIP

For us casual readers and RAID users, could you clarify RAID15? Would
that be a bunch of RAID1's grouped together in what appears to be a
RAID5 to the system?

Thanks,
Mark

Re: Triple parity and beyond

2013-11-22 Thread Duncan

Mark Knecht posted on Fri, 22 Nov 2013 08:50:32 -0800 as excerpted:

 On Fri, Nov 22, 2013 at 12:13 AM, Stan Hoeppner s...@hardwarefreak.com
 wrote:
 Now that you mention it, yes, RAID 15 would fit much better with
 convention.  Not sure why I thought 51.  So it's RAID 15 from here.
 SNIP
 
 For us casual readers and RAID users, could you clarify RAID15? Would that
 be a bunch of RAID1's grouped together in what appears to be a RAID5 to
 the system?

Simplest definition, yes.

Admittedly part of this discussion is beyond me (as another casual reader 
with some raid experience, reading here via the btrfs list as that's my 
current interest), but I'm following enough of it to find it interesting, 
for SURE! =:^)

And perhaps my explanation of the basics will let the real experts 
continue the debate at their higher level...

At a concept level, because md/raid, etc (I'll use mdraid as my example 
from here, but there's dm-raid, hardware raid, etc; additionally, 
I'll omit the ALL CAPS RAID convention and use lowercase), devices are 
presented as normal block devices, RAID levels (among other things, LVM2, 
etc) are stackable.  So it's possible to, for instance, create a raid0 on 
top of a bunch of raid1s, or the reverse, a raid1 on top of a bunch of 
raid0s, either with the base level being hardware based and the software 
creating a raid level direct on the hardware raid, or with both/all 
levels in software.

Then we get into naming.  AFAIK the earliest convention was using the 
plus syntax, raid1+0, raid0+1, with the left-most number being the 
lowest, closest to hardware level, either the hardware level or closest 
to the individual hardware devices, so raid1+0 is implemented as striped 
raid (raid0) over top of mirrored raid (raid1), with raid0+1 the reverse, 
a mirror over stripes.

That quickly evolved into omitting the +, thus raid10 and raid01. (Tho 01 
has the leading zero problem with some people trying to omit it, and 
raid1 isn't the same thing AT ALL as raid01!  Between that and the fact 
that raid01 is less common than raid10 for technical reasons as noted 
below, you seldom see raid01 specified; it usually keeps the + and 
appears as raid0+1).

Also, less commonly seen but as more levels were stacked (raid105, etc), 
sometimes the + is still used to separate the hardware raid levels from 
software.  In this usage, raid105 would probably be an all software 
implementation, while raid1+05 would be raid1 in hardware, with software 
raid0 and raid5 stacked on top, in that order, and raid10+5 would be 
hardware raid10, with software raid5 on top.

Note that while raid10, aka raid1+0, should have similar non-degraded 
performance to raid0+1, there's a BIG difference when recovering from 
degraded.  A smart raid10 implementation (or a raid1+0 with hardware 
raid1) can rebuild a failed drive locally, that is, purely at the raid1 
level, using just the data on its raid1 mirror(s).  That means only a 
single device has to be read from in order to write the data to the 
rebuilding device.  Raid0+1, by contrast, fails the entire raid0 level at 
once, thus requiring reading from an unfailed entire raid1 (higher) level 
mirror set while writing out an entire new raid0 set!!  So while normal 
operation state is similar between raid10/raid1+0 and raid0+1, the 
recovery characteristics are **MUCH** different, with raid10 being 
markedly better than raid0+1.  As a result, raid0+1 doesn't tend to be 
used that often in practice, while raid10 (aka raid1+0) has become quite 
common, particularly so as its performance is quite high, only exceeded 
by raid0, but with redundancy and recovery characteristics that are good 
to very good, as well.  Its biggest negative at the low end is the number 
of devices required, normally a minimum of four (but see the Linux 
mdraid10 discussion below), a striped pair of mirrored pairs.
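The recovery difference described above can be put in numbers with a toy model. This is only a sketch of the description in the preceding paragraph, assuming two-way mirroring and a single failed device; the function and the layout labels are made up for the illustration and do not correspond to any md interface.

```python
def rebuild_io(total_devices: int, layout: str) -> dict:
    """Devices read and rewritten after one device fails (two-way mirroring)."""
    half = total_devices // 2
    if layout == "raid1+0":
        # Only the failed device's mirror partner is read, and only the
        # replacement device is written.
        return {"read": 1, "written": 1}
    if layout == "raid0+1":
        # The whole stripe set holding the failed device is failed, so the
        # surviving stripe set is read in full and the other rewritten in full.
        return {"read": half, "written": half}
    raise ValueError(f"unknown layout: {layout}")

for layout in ("raid1+0", "raid0+1"):
    print(layout, rebuild_io(8, layout))
```

For an 8-device array the model gives one device read and one written for raid1+0, against four read and four written for raid0+1.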

This 1+0/0+1 distinction confused me as an early raid user for quite some 
time even after I knew the technical difference, as I kept trying to 
reverse them in my head, and I guess it confuses a lot of people.  For 
some reason, my intuitive read of raid10 was the reverse of convention -- 
intuitively I /wanted/ to interpret it as a raid1 on top of raid0 instead 
of the raid0 on top of raid1 it is by convention, and even after I 
understood that there WAS a difference and in principle knew why and how, 
for years I actually had to look up the difference each time it came up, 
if it made a difference to the discussion, because I /wanted/ to read it 
backward, or more accurately, I thought the convention had it backward to 
the interpretation that made most sense to me.  It is only recently that 
I came to see it the other way, and even still, I have to pause and think 
every time I see it, to ensure I'm not again reversing things.

Which is the distinction that came up in the above discussion as well, 
only with raid5 and raid1 instead of raid0 and raid1.  Apparently I'm not 
the only one to

Re: Triple parity and beyond

2013-11-22 Thread Piergiorgio Sartor

Hi David,

On Fri, Nov 22, 2013 at 01:32:09AM +0100, David Brown wrote:
  One typical case is when many errors are
  found, belonging to the same disk.
  This case clearly shows the disk is to be
  replaced or the interface checked...
  But, again, the user is the master, not the
  machine... :-)
 
 I don't know what sort of interface you have for the user, but I guess
 that means you'll have to collect a number of failures before showing
 them so that the user can see the correlation on disk number.

As usual in Unix, one program will collect
data to a file, and another will analyze
that file.
Originally, one idea was even to check at
stripe level how many errors (and where)
are present. From that some statistics will
be presented to the user.
This would be integrated in the check tool,
of course.

  For most ECC schemes, you know that all your blocks are set
  synchronously - so any block that does not fit in, is an error.  With
  raid, it could also be that a stripe is only partly written - you can
  
  Could it be?
  I would consider this an error.
 
 It could occur as the result of a failure of some sort (kernel crash,
 power failure, temporary disk problem, etc.).  More generally, md raid
 doesn't have to be on local physical disks - maybe one of the disks is
 an iSCSI drive or something else over a network that could have failures
 or delays.  I haven't thought through all cases here - I am just
 throwing them out as possibilities that might cause trouble.

OK, I misunderstood you, I was thinking during
normal operation...
Again, the check can find that issue; it will
report that it cannot tell which disk has the
problem, but it will report where it is.
Possibly, another tool can check the FS at
that position.

bye, 

-- 

piergiorgio

Re: Triple parity and beyond

On 11/22/2013 9:01 AM, John Williams wrote:
snip

 I see no advantage of RAID 15, and several disadvantages.

Of course not, just as I stated previously.

On 11/22/2013 2:13 AM, Stan Hoeppner wrote:

 Parity users who currently shun RAID 10 for this reason will also
 shun this RAID 15.

With that I'll thank you for your input from the pure parity
perspective, and end our discussion.  Any further exchange would be
pointless.

-- 
Stan

Re: Triple parity and beyond

On Fri, 22 Nov 2013 10:07:09 -0600 Stan Hoeppner s...@hardwarefreak.com
wrote:

 On 11/22/2013 2:13 AM, Stan Hoeppner wrote:
  Hi David,
  
  On 11/21/2013 3:07 AM, David Brown wrote:
 ...
  I don't see that there needs to be any changes to the existing md code
  to make raid15 work - it is merely a raid 5 made from a set of raid1
  pairs.  
  
  The sole purpose of the parity layer of the proposed RAID 15 is to
  replace sectors lost due to UREs during rebuild.  AFAIK the current RAID
  5 and RAID 1 drivers have no code to support each other in this manner.
 
 Minor self correction here-- obviously this isn't the 'sole' purpose of
 the parity layer.  It also allows us to recover from losing an entire
 mirror, which is a big upshot of the proposed RAID 15.  Thinking this
 through a little further, more code modification would be needed for
 this scenario.
 
 In the event of a double drive failure in one mirror, the RAID 1 code
 will need to be modified in such a way as to allow the RAID 5 code to
 rebuild the first replacement disk, because the RAID 1 device is still
 in a failed state.  Once this rebuild is complete, the RAID 1 code will
 need to switch the state to degraded, and then do its standard rebuild
 routine for the 2nd replacement drive.
 
 Or, with some (likely major) hacking it should be possible to rebuild
 both drives simultaneously for no loss of throughput or additional
 elapsed time on the RAID 5 rebuild. 

Nah, that would be minor hacking.  Just recreate the RAID1 in a state that is
not-insync, but with automatic-resync disabled.
Then as continuous writes arrive, move the recovery_cp variable forward
towards the end of the array.  When it reaches the end we can safely mark the
whole array as 'in-sync' and forget about disabling auto-resync.

NeilBrown



 In the 20TB drive case, this would
 shave 18 hours off the total rebuild operation elapsed time.  With
 current 4TB drives it would still save 6.5 hours.  Losing both drives in
 one mirror set of a striped array is rare, but given the rebuild time
 saved it may be worth investigating during any development of this RAID
 15 idea.
 



signature.asc
Description: PGP signature

Re: Triple parity and beyond

On Thu, 21 Nov 2013 16:57:48 -0600 Stan Hoeppner s...@hardwarefreak.com
wrote:

 On 11/21/2013 1:05 AM, John Williams wrote:
  On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner s...@hardwarefreak.com 
  wrote:
  On 11/20/2013 8:46 PM, John Williams wrote:
  For myself or any machines I managed for work that do not need high
  IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
  similar schemes with arrays of 16 - 32 drives.
 
  You must see a week long rebuild as acceptable...
  
  It would not be a problem if it did take that long, since I would have
  extra parity units as backup in case of a failure during a rebuild.
  
  But of course it would not take that long. Take, for example, a 24 x
  3TB triple-parity array (21+3) that has had two drive failures
  (perhaps the rebuild started with one failure, but there was soon
  another failure). I would expect the rebuild to take about a day.
 
 You're looking at today.  We're discussing tomorrow's needs.  Today's
 6TB 3.5" drives have sustained average throughput of ~175MB/s.
 Tomorrow's 20TB drives will be lucky to do 300MB/s.  As I said
 previously, at that rate a straight disk-disk copy of a 20TB drive takes
 18.6 hours.  This is what you get with RAID1/10/51.  In the real world,
 rebuilding a failed drive in a 3P array of say 8 of these disks will
 likely take at least 3 times as long, 2 days 6 hours minimum, probably
 more.  This may be perfectly acceptable to some, but probably not to all.

Could you explain your logic here?  Why do you think rebuilding parity
will take 3 times as long as rebuilding a copy?  Can you measure that sort of
difference today?

Presumably when we have 20TB drives we will also have more cores and quite
possibly dedicated co-processors which will make the CPU load less
significant.

NeilBrown



Re: Triple parity and beyond

On 11/22/2013 5:07 PM, NeilBrown wrote:
 On Thu, 21 Nov 2013 16:57:48 -0600 Stan Hoeppner s...@hardwarefreak.com
 wrote:
 
 On 11/21/2013 1:05 AM, John Williams wrote:
 On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner s...@hardwarefreak.com 
 wrote:
 On 11/20/2013 8:46 PM, John Williams wrote:
 For myself or any machines I managed for work that do not need high
 IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
 similar schemes with arrays of 16 - 32 drives.

 You must see a week long rebuild as acceptable...

 It would not be a problem if it did take that long, since I would have
 extra parity units as backup in case of a failure during a rebuild.

 But of course it would not take that long. Take, for example, a 24 x
 3TB triple-parity array (21+3) that has had two drive failures
 (perhaps the rebuild started with one failure, but there was soon
 another failure). I would expect the rebuild to take about a day.

 You're looking at today.  We're discussing tomorrow's needs.  Today's
 6TB 3.5" drives have sustained average throughput of ~175MB/s.
 Tomorrow's 20TB drives will be lucky to do 300MB/s.  As I said
 previously, at that rate a straight disk-disk copy of a 20TB drive takes
 18.6 hours.  This is what you get with RAID1/10/51.  In the real world,
 rebuilding a failed drive in a 3P array of say 8 of these disks will
 likely take at least 3 times as long, 2 days 6 hours minimum, probably
 more.  This may be perfectly acceptable to some, but probably not to all.
 
 Could you explain your logic here?  Why do you think rebuilding parity
 will take 3 times as long as rebuilding a copy?  Can you measure that sort of
 difference today?

I've not performed head-to-head timed rebuild tests of mirror vs parity
RAIDs.  I'm making the elapsed guess for parity RAIDs based on posts
here over the past ~3 years, in which many users reported 16-24+ hour
rebuild times for their fairly wide (12-16 1-2TB drive) RAID6 arrays.

This is likely due to their chosen rebuild priority and concurrent user
load during rebuild.  Since this seems to be the norm, instead of giving
100% to the rebuild, I thought it prudent to take this into account,
instead of the theoretical minimum rebuild time.

 Presumably when we have 20TB drives we will also have more cores and quite
 possibly dedicated co-processors which will make the CPU load less
 significant.

But (when) will we have the code to fully take advantage of these?  It's
nearly 2014 and we still don't have a working threaded write model for
levels 5/6/10, though maybe soon.  Multi-core mainstream x86 CPUs have
been around for 8 years now, SMP and ccNUMA systems even longer.  So the
need has been there for a while.

I'm strictly making an observation (possibly not fully accurate) here.
I am not casting stones.  I'm not a programmer and am thus unable to
contribute code, only ideas and troubleshooting assistance for fellow
users.  Ergo I have no right/standing to complain about the rate of
feature progress.  I know that everyone hacking md is making the most of
the time they have available.  So again, not a complaint, just an
observation.

-- 
Stan

Re: Triple parity and beyond

On Fri, 22 Nov 2013 21:46:50 -0600 Stan Hoeppner s...@hardwarefreak.com
wrote:

 On 11/22/2013 5:07 PM, NeilBrown wrote:
  On Thu, 21 Nov 2013 16:57:48 -0600 Stan Hoeppner s...@hardwarefreak.com
  wrote:
  
  On 11/21/2013 1:05 AM, John Williams wrote:
  On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner s...@hardwarefreak.com 
  wrote:
  On 11/20/2013 8:46 PM, John Williams wrote:
  For myself or any machines I managed for work that do not need high
  IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
  similar schemes with arrays of 16 - 32 drives.
 
  You must see a week long rebuild as acceptable...
 
  It would not be a problem if it did take that long, since I would have
  extra parity units as backup in case of a failure during a rebuild.
 
  But of course it would not take that long. Take, for example, a 24 x
  3TB triple-parity array (21+3) that has had two drive failures
  (perhaps the rebuild started with one failure, but there was soon
  another failure). I would expect the rebuild to take about a day.
 
  You're looking at today.  We're discussing tomorrow's needs.  Today's
  6TB 3.5" drives have sustained average throughput of ~175MB/s.
  Tomorrow's 20TB drives will be lucky to do 300MB/s.  As I said
  previously, at that rate a straight disk-disk copy of a 20TB drive takes
  18.6 hours.  This is what you get with RAID1/10/51.  In the real world,
  rebuilding a failed drive in a 3P array of say 8 of these disks will
  likely take at least 3 times as long, 2 days 6 hours minimum, probably
  more.  This may be perfectly acceptable to some, but probably not to all.
  
  Could you explain your logic here?  Why do you think rebuilding parity
  will take 3 times as long as rebuilding a copy?  Can you measure that sort 
  of
  difference today?
 
 I've not performed head-to-head timed rebuild tests of mirror vs parity
 RAIDs.  I'm making the elapsed guess for parity RAIDs based on posts
 here over the past ~3 years, in which many users reported 16-24+ hour
 rebuild times for their fairly wide (12-16 1-2TB drive) RAID6 arrays.

I guess with that many drives you could hit PCI bus throughput limits.

A 16-lane PCIe 4.0 could just about give 100MB/s to each of 16 devices.  So
you would really need top-end hardware to keep all of 16 drives busy in a
recovery.
So yes: rebuilding a drive in a 16-drive RAID6+ would be slower than in e.g.
a 20 drive RAID10.

 
 This is likely due to their chosen rebuild priority and concurrent user
 load during rebuild.  Since this seems to be the norm, instead of giving
 100% to the rebuild, I thought it prudent to take this into account,
 instead of the theoretical minimum rebuild time.
 
  Presumably when we have 20TB drives we will also have more cores and quite
  possibly dedicated co-processors which will make the CPU load less
  significant.
 
 But (when) will we have the code to fully take advantage of these?  It's
 nearly 2014 and we still don't have a working threaded write model for
 levels 5/6/10, though maybe soon.  Multi-core mainstream x86 CPUs have
 been around for 8 years now, SMP and ccNUMA systems even longer.  So the
 need has been there for a while.

I think we might have that multi-threading now - not sure exactly what is
enabled by default though.

I think it requires more than need - it requires demand.  i.e. people
repeatedly expressing the need.  We certainly have had that for a while, but
not a very long while.


 
 I'm strictly making an observation (possibly not fully accurate) here.
 I am not casting stones.  I'm not a programmer and am thus unable to
 contribute code, only ideas and troubleshooting assistance for fellow
 users.  Ergo I have no right/standing to complain about the rate of
 feature progress.  I know that everyone hacking md is making the most of
 the time they have available.  So again, not a complaint, just an
 observation.

Understood - and thanks for your observation.

NeilBrown




Re: Triple parity and beyond

2013-11-22 Thread John Williams

On Fri, Nov 22, 2013 at 9:04 PM, NeilBrown ne...@suse.de wrote:

 I guess with that many drives you could hit PCI bus throughput limits.

 A 16-lane PCIe 4.0 could just about give 100MB/s to each of 16 devices.  So
 you would really need top-end hardware to keep all of 16 drives busy in a
 recovery.
 So yes: rebuilding a drive in a 16-drive RAID6+ would be slower than in e.g.
 a 20 drive RAID10.

Not really. A single 8x PCIe 2.0 card has 8 x 500MB/s = 4000MB/s of
potential bandwidth. That would be 250MB/s per drive for 16 drives.

But quite a few people running software RAID with many drives have
multiple PCIe cards. For example, in one machine I have three IBM
M1015 cards (which I got for $75/ea) that are 8x PCIe 2.0. That comes
to 3 x 500MB/s x 8 = 12GB/s of IO bandwidth.

Also, your math is wrong. PCIe 3.0 is 985 MB/s per lane. If we assume
PCIe 4.0 would double that, we would have 1970MB/s per lane. So one
lane of the hypothetical PCIe 4.0 would have enough IO bandwidth to
give about 120MB/s to each of 16 drives. A single 8x PCIe 4.0 card
would have 8 times that capability which is more than 15GB/s.

Even a single 8x PCIe 3.0 card has potentially over 7GB/s of bandwidth.
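
The bus arithmetic above can be checked with a few lines of Python. This is only a sketch: the per-lane figures are the ones quoted in this message (with PCIe 4.0 assumed to be double 3.0), and the function name per_drive_mb_s is made up for the illustration. It also assumes the bus is shared evenly and ignores protocol and controller overhead.

```python
# Approximate usable bandwidth per PCIe lane in MB/s. The 2.0 and 3.0
# figures are the ones quoted in the message; 4.0 is assumed to be 2x 3.0.
LANE_MB_S = {"2.0": 500, "3.0": 985, "4.0": 1970}


def per_drive_mb_s(gen, lanes, drives, cards=1):
    """Bus bandwidth available to each drive if it is shared evenly."""
    return LANE_MB_S[gen] * lanes * cards / drives


if __name__ == "__main__":
    for gen, lanes in (("2.0", 8), ("3.0", 8), ("4.0", 1), ("4.0", 16)):
        mb = per_drive_mb_s(gen, lanes, drives=16)
        print(f"PCIe {gen} x{lanes}: {mb:.0f} MB/s per drive (16 drives)")
```

With 16 drives, even a single PCIe 4.0 lane comes out above 100MB/s per drive under these assumptions.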

Bottom line is that IO bandwidth is not a problem for a system with
prudently chosen hardware.

More likely is that you would be CPU limited (rather than bus limited)
in a high-parity rebuild where more than one drive failed. But even
that is not likely to be too bad, since Andrea's single-threaded
recovery code can recover two drives at nearly 1GB/s on one of my
machines. I think the code could probably be threaded to achieve a
multiple of that running on multiple cores.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Triple parity and beyond

On Fri, 22 Nov 2013 21:34:41 -0800 John Williams jwilliams4...@gmail.com
wrote:

 On Fri, Nov 22, 2013 at 9:04 PM, NeilBrown ne...@suse.de wrote:
 
  I guess with that many drives you could hit PCI bus throughput limits.
 
  A 16-lane PCIe 4.0 could just about give 100MB/s to each of 16 devices.  So
  you would really need top-end hardware to keep all of 16 drives busy in a
  recovery.
  So yes: rebuilding a drive in a 16-drive RAID6+ would be slower than in e.g.
  a 20 drive RAID10.
 
 Not really. A single 8x PCIe 2.0 card has 8 x 500MB/s = 4000MB/s of
 potential bandwidth. That would be 250MB/s per drive for 16 drives.
 
 But quite a few people running software RAID with many drives have
 multiple PCIe cards. For example, in one machine I have three IBM
 M1015 cards (which I got for $75/ea) that are 8x PCIe 2.0. That comes
 to 3 x 500MB/s x 8 = 12GB/s of IO bandwidth.
 
 Also, your math is wrong. PCIe 3.0 is 985 MB/s per lane. If we assume
 PCIe 4.0 would double that, we would have 1970MB/s per lane. So one
 lane of the hypothetical PCIe 4.0 would have enough IO bandwidth to
 give about 120MB/s to each of 16 drives. A single 8x PCIe 4.0 card
 would have 8 times that capability which is more than 15GB/s.

It wasn't my math, it was my reading :-(
16-lane PCIe 4.0 is 31 GB/sec so 2GB/sec per drive.  I was reading the
1-lane number...

 
 Even a single 8x PCIe 3.0 card has potentially over 7GB/s of bandwidth.
 
 Bottom line is that IO bandwidth is not a problem for a system with
 prudently chosen hardware.
 
 More likely is that you would be CPU limited (rather than bus limited)
 in a high-parity rebuild where more than one drive failed. But even
 that is not likely to be too bad, since Andrea's single-threaded
 recovery code can recover two drives at nearly 1GB/s on one of my
 machines. I think the code could probably be threaded to achieve a
 multiple of that running on multiple cores.

Indeed.  It seems likely that with modern hardware, the linear write speed
would be the limiting factor for spinning-rust drives.
For SSDs the limit might end up being somewhere else ...

Thanks,
NeilBrown



Re: Triple parity and beyond

2013-11-22 Thread Andrea Mazzoleni

Hi Piergiorgio,

 How about par2? How does this work?
I checked the matrix they use, and sometimes it contains some singular
square submatrix.
It seems that in GF(2^16) these cases are just less common. Maybe they
were just unnoticed.

Anyway, this seems to be an already known problem for PAR2, with an
hypothetical PAR3 fixing it:

http://sourceforge.net/p/parchive/discussion/96282/thread/d3c6597b/

Ciao,
Andrea

Re: Triple parity and beyond

2013-11-21 Thread joystick


On 21/11/2013 02:28, Stan Hoeppner wrote:

On 11/20/2013 10:16 AM, James Plank wrote:

Hi all -- no real comments, except as I mentioned to Ric, my tutorial
in FAST last February presents Reed-Solomon coding with Cauchy
matrices, and then makes special note of the common pitfall of
assuming that you can append a Vandermonde matrix to an identity
matrix.  Please see
http://web.eecs.utk.edu/~plank/plank/papers/2013-02-11-FAST-Tutorial.pdf,
slides 48-52.

Andrea, does the matrix that you included in an earlier mail (the one
that has Linux RAID-6 in the first two rows) have a general form, or
did you develop it in an ad hoc manner so that it would include Linux
RAID-6 in the first two rows?

Hello Jim,

It's always perilous to follow a Ph.D., so I guess I'm feeling suicidal
today. ;)

I'm not attempting to marginalize Andrea's work here, but I can't help
but ponder what the real value of triple parity RAID is, or quad, or
beyond.  Some time ago parity RAID's primary mission ceased to be
surviving single drive failure, or a 2nd failure during rebuild, and
became mitigating UREs during a drive rebuild.  So we're now talking
about dedicating 3 drives of capacity to avoiding disaster due to
platter defects and secondary drive failure.  For small arrays this is
approaching half the array capacity.  So here parity RAID has lost the
battle with RAID10's capacity disadvantage, yet it still suffers the
vastly inferior performance in normal read/write IO, not to mention
rebuild times that are 3-10x longer.

WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
to mirror a drive at full streaming bandwidth, assuming 300MB/s
average--and that is probably being kind to the drive makers.  With 6 or
8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
minimum 72 hours or more, probably over 100, and probably more yet for
3P.  And with larger drive count arrays the rebuild times approach a
week.  Whose users can go a week with degraded performance?  This is
simply unreasonable, at best.  I say it's completely unacceptable.

With these gargantuan drives coming soon, the probability of multiple
UREs during rebuild are pretty high.


No, because if you are correct about the very high CPU overhead during
rebuild (which I don't see as so dramatic, since Andrea claims 500MB/sec for
triple-parity, probably parallelizable on multiple cores), the speed of
rebuild decreases proportionally and hence the stress and heating on the
drives proportionally reduces, approximating that of normal operation.
And how often have you seen a drive failure in a week during normal 
operation?


But in reality, consider that a non-naive implementation of 
multiple-parity would probably use just the single parity during 
reconstruction if just one disk fails, using the multiple parities only 
to read the stripes which are unreadable at single parity. So the speed 
and time of reconstruction and performance penalty would be that of 
raid5 except in exceptional situations of multiple failures.
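
As a minimal sketch of the single-parity path described above (not md code; the helper names are invented for the example), one missing block in a stripe is simply the XOR of the surviving data blocks and the P parity block. The extra parities would only be needed for stripes where this fails.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_data, parity):
    """Recover the one missing data block from the survivors and P."""
    return xor_blocks(list(surviving_data) + [parity])


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    p = xor_blocks(data)                      # single (P) parity
    print(rebuild_missing([data[0], data[2]], p))
```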




...
What I envision is an array type, something similar to RAID 51, i.e.
striped parity over mirror pairs. 


I don't like your approach of raid 51: it has the write overhead of 
raid5, with the waste of space of raid1.

So it can be used neither as a performance array nor as a capacity array.
In the scope of this discussion (we are talking about very large
arrays), the waste of space of your solution, higher than 50%, will make
your solution cost double the price.


A competitor for the multiple-parity scheme might be raid65 or 66, but 
this is a much dirtier approach than multiple parity if you think about
the kind of rmw and overhead that will occur during normal operation.




Re: Triple parity and beyond

On 20/11/13 19:09, John Williams wrote:
 On Wed, Nov 20, 2013 at 2:31 AM, David Brown david.br...@hesbynett.no wrote:
 That's certainly a reasonable way to look at it.  We should not limit
 the possibilities for high-end systems because of the limitations of
 low-end systems that are unlikely to use 3+ parity anyway.  I've also
 looked up a list of the processors that support SSE3 and PSHUFB - a lot
 of modern low-end x86 cpus support it.  And of course it is possible
 to implement general GF(2^8) multiplication without PSHUFB, using a
 lookup table - it is important that this can all work with any CPU, even
 if it is slow.
 
 Unfortunately, it is SSSE3 that is required for PSHUFB. The SSE3 set
 with only two-esses does not suffice. I made that same mistake when I
 first heard about Andrea's 6-parity work. SSSE3 vs. SSE3, confusing
 notation!
 
 SSSE3 is significantly less widely supported than SSE3. Particularly
 on AMD, only the very latest CPUs seem to support SSSE3. Intel support
 for SSSE3 goes back much further than AMD support.
 
 Maybe it is not such a big problem, since it may be possible to
 support two roads. Both roads would include the current md RAID-5
 and RAID-6. But one road, which those lacking CPUs supporting SSSE3
 might choose, would continue on to the non-SSSE3 triple-parity 2^-1
 technique, and then dead-end. The other road would continue with the
 Cauchy matrix technique through 3-parity all the way to 6-parity.
 
 It might even be feasible to allow someone stuck at the end of the
 non-SSSE3 road to convert to the Cauchy road. You would have to go
 through all the 2^-1 triple-parity and convert it to Cauchy
 triple-parity. But then you would be safely on the Cauchy road.
 

I would not like to see two alternative triple-parity solutions - I
think that would lead to confusion, and a non-Cauchy triple parity would
not be extendible without a rebuild (I've talked before about the idea
of temporarily adding an extra parity drive with an asymmetric layout.
I really like the idea, so I keep pushing for it!).

I think it is better to accept that 3+ parity will be slow on processors
that don't support PSHUFB.  We should try to find the best alternative
SIMD for other realistic processors (such as on AMD chips without
PSHUFB, ARMs with NEON, PPC with Altivec, etc.) - but a simple table
lookup will always work as a fallback.  Other than that I think it is
fair to say that if you want /fast/ 3+ parity, you need a reasonably
modern non-budget-class cpu.
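
To make the table-lookup fallback concrete, here is a rough sketch of GF(2^8) multiplication using log/exp tables. It assumes the polynomial 0x11D and generator 2 used by Linux RAID-6; the names build_tables and gf_mul are invented for the example, and a real kernel implementation would be laid out quite differently.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed: the Linux RAID-6 field)


def build_tables():
    """Build exp/log tables for generator 2; exp is doubled to skip a modulo."""
    exp = [0] * 510
    log = [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= POLY
    for i in range(255, 510):
        exp[i] = exp[i - 255]
    return exp, log


EXP, LOG = build_tables()


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with two lookups and one addition."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]
```

This works on any CPU, one byte at a time, which is why it is slow compared with a PSHUFB-style SIMD version.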



Re: Triple parity and beyond

On 20/11/13 19:34, Andrea Mazzoleni wrote:
 Hi David,
 
 The choice of ZFS to use powers of 4 was likely not optimal,
 because to multiply by 4, it has to do two multiplications by 2.
 I can agree with that.  I didn't copy ZFS's choice here
 David, it was not my intention to suggest that you copied from ZFS.
 Sorry to have expressed myself badly. I just mentioned ZFS because it's
 an implementation that I know uses powers of 4 to generate triple
 parity, and I saw in the code that it's implemented with two multiplications
 by 2.
 

Andrea, I didn't take your comment as an accusation of any kind - there
is no need for any kind of apology!  It was merely a statement of
fact - I picked powers of 4 as an obvious extension of the powers of 2
in raid6, and found it worked well.
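
The point about powers of 4 can be illustrated with a small sketch. It assumes the RAID-6 polynomial 0x11D and is not taken from the ZFS or md sources; the function names are made up. Multiplying by 2 is a shift plus a conditional XOR, and multiplying by 4 is that step applied twice.

```python
def gf_mul2(x):
    """Multiply one byte by 2 in GF(2^8), polynomial 0x11D (assumed)."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x


def gf_mul4(x):
    """Multiply by 4 as two successive multiplications by 2."""
    return gf_mul2(gf_mul2(x))


def gf_mul_slow(a, b):
    """Bitwise reference multiply, used here only to check gf_mul4."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = gf_mul2(a)
        b >>= 1
    return result
```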

And of course, in the open source world, copying of code and ideas is a
good thing - there is no point in re-inventing the wheel unless we can
invent a better one.  Really, I /should/ have read the ZFS
implementation and copied it!

mvh.,

David



Re: Triple parity and beyond

On 21/11/13 02:28, Stan Hoeppner wrote:
 On 11/20/2013 10:16 AM, James Plank wrote:
 Hi all -- no real comments, except as I mentioned to Ric, my tutorial
 in FAST last February presents Reed-Solomon coding with Cauchy
 matrices, and then makes special note of the common pitfall of
 assuming that you can append a Vandermonde matrix to an identity
 matrix.  Please see
 http://web.eecs.utk.edu/~plank/plank/papers/2013-02-11-FAST-Tutorial.pdf,
 slides 48-52.

 Andrea, does the matrix that you included in an earlier mail (the one
 that has Linux RAID-6 in the first two rows) have a general form, or
 did you develop it in an ad hoc manner so that it would include Linux
 RAID-6 in the first two rows?
 
 Hello Jim,
 
 It's always perilous to follow a Ph.D., so I guess I'm feeling suicidal
 today. ;)
 
 I'm not attempting to marginalize Andrea's work here, but I can't help
 but ponder what the real value of triple parity RAID is, or quad, or
 beyond.  Some time ago parity RAID's primary mission ceased to be
 surviving single drive failure, or a 2nd failure during rebuild, and
 became mitigating UREs during a drive rebuild.  So we're now talking
 about dedicating 3 drives of capacity to avoiding disaster due to
 platter defects and secondary drive failure.  For small arrays this is
 approaching half the array capacity.  So here parity RAID has lost the
 battle with RAID10's capacity disadvantage, yet it still suffers the
 vastly inferior performance in normal read/write IO, not to mention
 rebuild times that are 3-10x longer.
 
 WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
 to mirror a drive at full streaming bandwidth, assuming 300MB/s
 average--and that is probably being kind to the drive makers.  With 6 or
 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
 minimum 72 hours or more, probably over 100, and probably more yet for
 3P.  And with larger drive count arrays the rebuild times approach a
 week.  Whose users can go a week with degraded performance?  This is
 simply unreasonable, at best.  I say it's completely unacceptable.
 
 With these gargantuan drives coming soon, the probability of multiple
 UREs during rebuild are pretty high.  Continuing to use ever more
 complex parity RAID schemes simply increases rebuild time further.  The
 longer the rebuild, the more likely a subsequent drive failure due to
 heat buildup, vibration, etc.  Thus, in our maniacal efforts to mitigate
 one failure mode we're increasing the probability of another.  TANSTAFL.
  Worse yet, RAID10 isn't going to survive because UREs on a single drive
 are increasingly likely with these larger drives, and one URE during
 rebuild destroys the array.
 

I don't think the chances of hitting an URE during rebuild is dependent
on the rebuild time - merely on the amount of data read during rebuild.
 URE rates are per byte read rather than per unit time, are they not?
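
A back-of-the-envelope sketch of this point: if the rated URE figure is treated as an independent per-bit probability (a simplifying assumption, as is the function name), the chance of at least one URE depends only on how much data is read, not on how long the read takes. The 1 in 10^14 and 1 in 10^15 ratings are the ones mentioned elsewhere in this thread.

```python
import math


def p_at_least_one_ure(bytes_read, bits_per_ure=1e14):
    """P(>=1 URE) when reading bytes_read bytes, assuming independent bits."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_ure))


if __name__ == "__main__":
    twenty_tb = 20e12  # one full 20TB drive, decimal terabytes
    print(f"1 in 10^14: {p_at_least_one_ure(twenty_tb, 1e14):.2f}")
    print(f"1 in 10^15: {p_at_least_one_ure(twenty_tb, 1e15):.2f}")
```

Under these assumptions, reading a whole 20TB drive gives roughly an 80% chance of at least one URE at the 10^14 rating and roughly 15% at the 10^15 rating.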

I think you are overestimating the rebuild times a bit, but there is no
arguing that rebuild on parity raids is a lot more work (for the cpu,
the IO system, and the disks) than for mirror raids.

 I think people are going to have to come to grips with using more and
 more drives simply to brace the legs holding up their arrays; come to
 grips with these insane rebuild times; or bite the bullet they so
 steadfastly avoided with RAID10.  Lots more spindles solves problems,
 but at a greater cost--again, no free lunch.
 
 What I envision is an array type, something similar to RAID 51, i.e.
 striped parity over mirror pairs.  In the case of Linux, this would need
 to be a new distinct md/RAID level, as both the RAID5 and RAID1 code
 would need enhancement before being meshed together into this new level[1].

Shouldn't we be talking about RAID 15 here, rather than RAID 51 ?  I
interpret RAID 15 to be like RAID 10 - a raid5 set of raid1 mirrors,
while RAID 51 would be a raid1 mirror of raid5 sets.  I am certain
that you mean a raid5 set of raid1 pairs - I just think you've got the
name wrong.

 
 Potential Advantages:
 
 1.  Only +1 disk capacity overhead vs RAID 10, regardless of drive count

+2 disks (the raid5 parity disk is a raid1 pair)

 2.  Rebuild time is the same as RAID 10, unless a mirror pair is lost
 3.  Parity is only used during rebuild if/when a URE occurs, unless ^
 4.  Single drive failure doesn't degrade the parity array, multiple
 failures in different mirrors doesn't degrade the parity array
 5.  Can sustain a minimum of 3 simultaneous drive failures--both drives
 in one mirror and one drive in another mirror
 6.  Can lose a maximum of 1/2 of the drives plus 1 drive--one more than
 RAID 10.  Can lose half the drives and still not degrade parity,
 if no two comprise one mirror
 7.  Similar or possibly better read throughput vs triple parity RAID
 8.  Superior write performance with drives down
 9.  Vastly superior rebuild performance, as rebuilds will rarely, if
 ever, involve parity
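
Items 5 and 6 can be checked by brute force for a small array. The sketch below assumes the layout is a single-parity stripe over 2-disk mirrors (so the array survives as long as at most one whole mirror is lost); the function names and disk numbering are invented for the example.

```python
from itertools import combinations


def survives(failed, pairs):
    """Disks 2k and 2k+1 form mirror k; the parity layer tolerates one
    completely lost mirror."""
    dead = sum(1 for k in range(pairs)
               if 2 * k in failed and 2 * k + 1 in failed)
    return dead <= 1


def guaranteed_failures(pairs):
    """Largest n such that EVERY set of n failed disks is survivable."""
    disks = range(2 * pairs)
    n = 0
    while n < 2 * pairs and all(survives(set(c), pairs)
                                for c in combinations(disks, n + 1)):
        n += 1
    return n


def max_failures(pairs):
    """Largest n such that SOME set of n failed disks is survivable."""
    disks = range(2 * pairs)
    best = 0
    for n in range(2 * pairs + 1):
        if any(survives(set(c), pairs) for c in combinations(disks, n)):
            best = n
    return best
```

For 4 mirror pairs (8 drives) this gives 3 guaranteed failures and a best case of 5, i.e. half the drives plus one.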
 
 Potential

Re: Triple parity and beyond

On 21/11/13 10:54, Adam Goryachev wrote:
 On 21/11/13 20:07, David Brown wrote:
 I can see plenty of reasons why raid15 might be a good idea, and even
 raid16 for 5 disk redundancy, compared to multi-parity sets.  However,
 it costs a lot in disk space.  For example, with 20 disks at 1 TB each,
 you can have:

 raid5 = 19TB, 1 disk redundancy
 raid6 = 18TB, 2 disk redundancy
 raid6.3 = 17TB, 3 disk redundancy
 raid6.4 = 16TB, 4 disk redundancy
 raid6.5 = 15TB, 5 disk redundancy

 raid10 = 10TB, 1 disk redundancy
 raid15 = 8TB, 3 disk redundancy
 raid16 = 6TB, 5 disk redundancy


 That's a very significant difference.

 Implementing 3+ parity does not stop people using raid15, or similar
 schemes - it just adds more choice to let people optimise according to
 their needs.
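
 The capacity comparison can be reproduced with a short sketch. It
 assumes that raid15 and raid16 mean single or double parity striped
 over 2-disk mirrors; the function names are made up for the example.

```python
def parity_capacity(disks, parity, size_tb=1):
    """Usable TB for a flat parity array with `parity` parity disks."""
    return (disks - parity) * size_tb


def layered_capacity(disks, parity, size_tb=1):
    """Usable TB for a parity stripe over 2-disk mirrors (assumed layout)."""
    mirrors = disks // 2
    return (mirrors - parity) * size_tb


if __name__ == "__main__":
    for p in range(1, 6):
        print(f"{p} parity over 20 disks: {parity_capacity(20, p)} TB")
    print(f"raid10: {layered_capacity(20, 0)} TB")
    print(f"raid15: {layered_capacity(20, 1)} TB")
    print(f"raid16: {layered_capacity(20, 2)} TB")
```

 Under that assumption the sketch gives 9 TB for raid15 and 8 TB for
 raid16, so the 8 TB and 6 TB figures quoted above may rest on a
 different layout or a slip in arithmetic.  The gap to the flat parity
 arrays is large either way.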
 BTW, as far as strange RAID type options to try and get around problems
 with failed disks, before I learned about timeout mismatches, I was
 pretty worried when my 5 disk RAID5 kept falling apart and losing a
 random member, then adding the failed disk back would work perfectly. To
 help me feel better about this, I used 5 x 500GB drives in RAID5 and
 then used the RAID5 + 1 x 2TB drive in RAID1, meaning I could afford to
 lose any two disks without losing data. Of course, now I know RAID6
 might have been a better choice, or even simply 2 x 2TB drives in RAID1 :)
 
 In any case, I'm not sure I understand the concern with RAID 7.X (as it
 is being called, where X > 2). Certainly you will need to make 1
 computation for each stripe being written, for each value of X, so RAID
 7.5 with 5 disk redundancy means 5 calculations for each stripe being
 written. However, given that drives are getting bigger every year, did
 we forget that we are also getting faster CPU and also more cores in a
 single CPU package?
 

This is all true.  And md code is getting better at using more cores
under more circumstances, making the parity calculations more efficient.

The speed concern (which was Stan's, rather than mine) is more about
recovery and rebuild.  If you have a layered raid with raid1 pairs at
the bottom level, then recovery and rebuild (from a single failure) is
just a straight copy from one disk to another - you don't get faster
than that.  If you have a 20 + 3 parity raid, then rebuilding requires
reading a stripe from 20 disks and writing to 1 disk - that's far more
effort and is likely to take more time unless your IO system can handle
full bandwidth of all the disks simultaneously.

Similarly, performance of the array while rebuilding or degraded is much
worse for parity raids than for raids on top of raid1 pairs.

How that matters to you, and how it balances with the space costs, is up
to you and your application.

 On a pure storage server, the CPU would normally have nothing to do,
 except a little interrupt handling, it is just shuffling bytes around.
 Of course, if you need RAID7.5 then you probably have a dedicated
 storage server, so I don't see the problem with using the CPU to do all
 the calculations.
 
 Of course, if you are asking about carbon emissions, and cooling costs
 in the data center, this could (on a global scale) have a significant
 impact, so maybe it is a bad idea after all :)
 
 Regards,
 Adam
 


Re: Triple parity and beyond

2013-11-21 Thread Piergiorgio Sartor

Hi David,

On Thu, Nov 21, 2013 at 09:31:46PM +0100, David Brown wrote:
[...]
 If this can all be done to give the user an informed choice, then it
 sounds good.

that would be my target.
To _offer_ more options to the (advanced) user.
It _must_ always be under user control.

 One issue here is whether the check should be done with the filesystem
 mounted and in use, or only off-line.  If it is off-line then it will
 mean a long down-time while the array is checked - but if it is online,
 then there is the risk of confusing the filesystem and caches by
 changing the data.

Currently, raid6check can work with FS mounted.
I got the suggestion from Neil (of course).
It is possible to lock one stripe and check it.
This should be, at any given time, consistent
(that is, the parity should always match the data).
If an error is found, it is reported.
Again, the user can decide to fix it or not,
considering all the FS consequences and so on.

 Most disk errors /are/ detectable, and are reported by the underlying
 hardware - small surface errors are corrected by the disk's own error
 checking and correcting mechanisms, and larger errors are usually
 detected.  It is (or should be!) very rare that a read error goes
 undetected without there being a major problem with the disk controller.
  And if the error is detected, then the normal raid processing kicks in
 as there is no doubt about which block has problems.

That's clear. That case is an erasure (I think)
and it is perfectly in line with the usual operation.
I'm not trying to replace this mechanism.
 
 If you can be /sure/ about which data block is incorrect, then I agree -
 but you can't be /entirely/ sure.  But I agree that you can make a good
 enough guess to recommend a fix to the user - as long as it is not
 automatic.

One typical case is when many errors are
found, belonging to the same disk.
This case clearly shows the disk is to be
replaced or the interface checked...
But, again, the user is the master, not the
machine... :-)
 
 For most ECC schemes, you know that all your blocks are set
 synchronously - so any block that does not fit in, is an error.  With
 raid, it could also be that a stripe is only partly written - you can

Could it be?
I would consider this an error.
The stripe must always be consistent, there
should be a transactional mechanism to make
sure that, if read back, the data is always
matching the parity.
When I write "read back" I mean from whatever
the data is: physical disk or cache.
Otherwise, the check must run exclusively on
the array (no mounted FS, no other things
running on it).

 have two different valid sets of data mixed to give an inconsistent
 stripe, without any good way of telling what consistent data is the best
 choice.
  
 Perhaps a checking tool can take advantage of a write-intent bitmap (if
 there is one) so that it knows if an inconsistent stripe is partly
 updated or the result of a disk error.

Of course, this is an option, which should be
taken into consideration.

Any improvement idea is welcome!!!

bye,

-- 

piergiorgio

Re: Triple parity and beyond

2013-11-21 Thread Stan Hoeppner

On 11/21/2013 1:05 AM, John Williams wrote:
 On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner s...@hardwarefreak.com 
 wrote:
 On 11/20/2013 8:46 PM, John Williams wrote:
 For myself or any machines I managed for work that do not need high
 IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
 similar schemes with arrays of 16 - 32 drives.

 You must see a week long rebuild as acceptable...
 
 It would not be a problem if it did take that long, since I would have
 extra parity units as backup in case of a failure during a rebuild.
 
 But of course it would not take that long. Take, for example, a 24 x
 3TB triple-parity array (21+3) that has had two drive failures
 (perhaps the rebuild started with one failure, but there was soon
 another failure). I would expect the rebuild to take about a day.

You're looking at today.  We're discussing tomorrow's needs.  Today's
6TB 3.5" drives have sustained average throughput of ~175MB/s.
Tomorrow's 20TB drives will be lucky to do 300MB/s.  As I said
previously, at that rate a straight disk-disk copy of a 20TB drive takes
18.6 hours.  This is what you get with RAID1/10/51.  In the real world,
rebuilding a failed drive in a 3P array of say 8 of these disks will
likely take at least 3 times as long, 2 days 6 hours minimum, probably
more.  This may be perfectly acceptable to some, but probably not to all.
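
The copy-time arithmetic can be reproduced with a one-line function. This sketch assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and a constant average streaming rate; the name copy_hours is invented for the example.

```python
def copy_hours(capacity_tb, mb_per_s):
    """Hours to stream a whole drive at a constant average rate."""
    return capacity_tb * 1e12 / (mb_per_s * 1e6) / 3600


if __name__ == "__main__":
    mirror = copy_hours(20, 300)
    print(f"20TB at 300MB/s: {mirror:.1f} hours")
    print(f"three times that: {3 * mirror:.1f} hours")
```

With these units the mirror copy comes out at about 18.5 hours, close to the figure above, and three times that is a little over 55 hours.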

 on a subject Adam Leventhal has already
 covered in detail in an article "Triple-Parity RAID and Beyond" which
 seems to match the subject of this thread quite nicely:

 http://queue.acm.org/detail.cfm?id=1670144

 Mr. Leventhal did not address the overwhelming problem we face, which is
 (multiple) parity array reconstruction time.  He assumes the time to
 simply 'populate' one drive at its max throughput is the total
 reconstruction time for the array.
 
 Since Adam wrote the code for RAID-Z3 for ZFS, I'm sure he is aware of
 the time to restore data to failed drives. I do not see any flaw in
 his analysis related to the time needed to restore data to failed
 drives.

He wrote that article in late 2009.  It seems pretty clear he wasn't
looking 10 years forward to 20TB drives, where the minimum mirror
rebuild time will be ~18 hours, and parity rebuild will be much greater.

-- 
Stan

Re: Triple parity and beyond

2013-11-21 Thread Piergiorgio Sartor

On Thu, Nov 21, 2013 at 11:13:29AM +0100, David Brown wrote:
[...]
 Ah, you are trying to find which disk has incorrect data so that you can
 change just that one disk?  There are dangers with that...

Hi David,

 http://neil.brown.name/blog/20100211050355

I think we already did the exercise, here :-)

 If you disagree with this blog post (and I urge you to read it in full

We discussed the topic (with Neil) and, if I
recall correctly, he is against having an
_automatic_ error detection and correction _in_
kernel.
I fully agree with that: user space is better
and it should not be automatic, but it should
do things under user control.

The current check operation is pretty poor.
It just reports how many mismatches, it does
not even report where in the array.
The first step, independent from how many
parities one has, would be to tell the user
where the mismatches occurred, so it would
be possible to check the FS at that position.
Having a multi-parity RAID even allows
checking which disk.
This would provide the user with a more
comprehensive (I forgot the spelling)
information.
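
As an illustration of how two parities can point at a disk, here is a sketch based on the usual RAID-6 algebra (P is the XOR of the data, Q is the XOR of g^i times data block i, with g = 2 and polynomial 0x11D assumed). It works on one byte per disk, the function names are made up, and it is not the raid6check code.

```python
POLY = 0x11D  # assumed RAID-6 field polynomial


def _tables():
    exp, log, x = [0] * 255, [0] * 256, 1
    for i in range(255):
        exp[i], log[x] = x, i
        x <<= 1
        if x & 0x100:
            x ^= POLY
    return exp, log


EXP, LOG = _tables()


def syndromes(data):
    """Return (P, Q) for one byte per data disk."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        if d:
            q ^= EXP[(LOG[d] + i) % 255]  # g^i * d
    return p, q


def locate_bad_data_disk(data, p_stored, q_stored):
    """Index of the single inconsistent data disk, or None if consistent."""
    p, q = syndromes(data)
    dp, dq = p ^ p_stored, q ^ q_stored
    if dp == 0 and dq == 0:
        return None
    if dp == 0 or dq == 0:
        raise ValueError("mismatch is in P or Q itself, not in a data disk")
    z = (LOG[dq] - LOG[dp]) % 255  # dq = g^z * dp for an error on disk z
    if z >= len(data):
        raise ValueError("not explained by a single data-disk error")
    return z
```

A real tool would have to apply this to every byte of the stripe and only report a disk if all mismatching bytes agree on the same index.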

Of course, since we are there, we can
also give the option to fix it.
This would be much like an fsck.

 first), then this is how I would do a smart stripe recovery:
 
 First calculate the parities from the data blocks, and compare these
 with the existing parity blocks.
 
 If they all match, the stripe is consistent.
 
 Normal (detectable) disk errors and unrecoverable read errors get
 flagged by the disk and the IO system, and you /know/ there is a problem
 with that block.  Whether it is a data block or a parity block, you
 re-generate the correct data and store it - that's what your raid is for.

That's not always the case, otherwise
having the mismatch count would be useless.
The issue is that errors appear, whatever
the reason, without being reported by the
underlying hardware.
 
 If you have no detected read errors, and there is one parity
 inconsistency, then /probably/ that block has had an undetected read
 error, or it simply has not been written completely before a crash.
 Either way, just re-write the correct parity.

Why re-write the parity if I can get
the correct data there?
If I can be sure that one data block is
incorrect and I can re-create properly,
that's the thing to do.
 
 Remember, this is not a general error detection and correction scheme -

It is not, but it could be. For free.

bye,

-- 

piergiorgio

Re: Triple parity and beyond

2013-11-21 Thread Stan Hoeppner

On 11/21/2013 2:08 AM, joystick wrote:
 On 21/11/2013 02:28, Stan Hoeppner wrote:
...
 WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
 to mirror a drive at full streaming bandwidth, assuming 300MB/s
 average--and that is probably being kind to the drive makers.  With 6 or
 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
 minimum 72 hours or more, probably over 100, and probably more yet for
 3P.  And with larger drive count arrays the rebuild times approach a
 week.  Whose users can go a week with degraded performance?  This is
 simply unreasonable, at best.  I say it's completely unacceptable.

 With these gargantuan drives coming soon, the probability of multiple
 UREs during rebuild are pretty high.
 
 No because if you are correct about the very high CPU overhead during

I made no such claim.

 rebuild (which I don't see so dramatic as Andrea claims 500MB/sec for
 triple-parity, probably parallelizable on multiple cores), the speed of
 rebuild decreases proportionally 

The rebuild time of a parity array normally has little to do with CPU
overhead.  The bulk of the elapsed time is due to:

1.  The serial nature of the rebuild algorithm
2.  The random IO pattern of the reads
3.  The rotational latency of the drives

#3 is typically the largest portion of the elapsed time.

 and hence the stress and heating on the
 drives proportionally reduces, approximating that of normal operation.
 And how often have you seen a drive failure in a week during normal
 operation?

This depends greatly on one's normal operation.  In general, for most
users of parity arrays, any full array operation such as a rebuild or
reshape is far more taxing on the drives, in both power draw and heat
dissipation, than 'normal' operation.

 But in reality, consider that a non-naive implementation of
 multiple-parity would probably use just the single parity during
 reconstruction if just one disk fails, using the multiple parities only
 to read the stripes which are unreadable at single parity. So the speed
 and time of reconstruction and performance penalty would be that of
 raid5 except in exceptional situations of multiple failures.

That may very well be, but it doesn't change #2,3 above.

 What I envision is an array type, something similar to RAID 51, i.e.
 striped parity over mirror pairs. 
 
 I don't like your approach of raid 51: it has the write overhead of
 raid5, with the waste of space of raid1.
 So it cannot be used as either a performance array or a capacity array.

I don't like it either.  It's a compromise.  But as RAID1/10 will soon
be unusable due to URE probability during rebuild, I think it's a
relatively good compromise for some users, some workloads.
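
To put rough numbers on that, here is a back-of-the-envelope sketch in C.
It uses the 20TB, 300MB/s and 1-in-10^14 / 1-in-10^15 figures quoted
earlier in this thread. The function names are made up for illustration,
and it only computes the expected number of UREs for one full read of a
drive; it is not a complete failure model.

```c
#include <assert.h>
#include <stdio.h>

/* Hours needed to stream an entire drive once at a sustained rate. */
static double mirror_hours(double drive_bytes, double bytes_per_sec)
{
	assert(bytes_per_sec > 0.0);
	return drive_bytes / bytes_per_sec / 3600.0;
}

/* Expected number of UREs when reading bytes_read bytes from a drive
 * rated at one URE per bits_per_ure bits read. */
static double expected_ures(double bytes_read, double bits_per_ure)
{
	assert(bits_per_ure > 0.0);
	return bytes_read * 8.0 / bits_per_ure;
}

/* Print the estimates for one hypothetical 20 TB (decimal) drive. */
static void print_rebuild_estimate(void)
{
	const double drive = 20e12;

	printf("mirror copy: %.1f hours\n", mirror_hours(drive, 300e6));
	printf("expected UREs at 1 in 10^14: %.2f\n", expected_ures(drive, 1e14));
	printf("expected UREs at 1 in 10^15: %.2f\n", expected_ures(drive, 1e15));
}
```

With those inputs the copy comes out at roughly 18.5 hours, and one full
read of the drive is expected to hit about 1.6 UREs at the consumer
rating and about 0.16 at the enterprise rating.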

 In the scope of this discussion (we are talking about very large
 arrays), 

Capacity yes, drive count, no.  Drive capacities are increasing at a
much faster rate than our need for storage space.  As we move forward
the trend will be building larger capacity arrays with fewer disks.

 the waste of space of your solution, higher than 50%, will make
 your solution costing double the price.

This is the classic mirror vs parity argument.  Using 1 more disk to add
parity to striped mirrors doesn't change it.  Waste is in the eye of
the beholder.  Anyone currently using RAID10 will have no problem
dedicating one more disk for uptime, protection.

 A competitor for the multiple-parity scheme might be raid65 or 66, but
 this is a so much dirtier approach than multiple parity if you think at
 the kind of rmw and overhead that will occur during normal operation.

Neither of those has any advantage over multi-parity.  I suggested this
approach because it retains all of the advantages of RAID10 but one.  We
sacrifice fast random write performance for protection against UREs, the
same reason behind 3P.  That's what the single parity is for, and that
alone.

I suggest that anyone in the future needing fast random write IOPS is
going to move those workloads to SSD, which is steadily increasing in
capacity.  And I suggest anyone building arrays with 10-20TB drives
isn't in need of fast random write IOPS.  Whether this approach is
valuable to anyone depends on whether the remaining attributes of
RAID10, with the added URE protection, are worth the drive count.
Obviously proponents of traditional parity arrays will not think so.
Users of RAID10 may.  Even if md never supports such a scheme, I bet
we'll see something similar to this in enterprise gear, where rebuilds
need to be 'fast' and performance degradation due to a downed drive is
not acceptable.

-- 
Stan
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Triple parity and beyond

On 21/11/13 21:52, Piergiorgio Sartor wrote:
 Hi David,
 
 On Thu, Nov 21, 2013 at 09:31:46PM +0100, David Brown wrote:
 [...]
 If this can all be done to give the user an informed choice, then it
 sounds good.
 
 that would be my target.
 To _offer_ more options to the (advanced) user.
 It _must_ always be under user control.
 
 One issue here is whether the check should be done with the filesystem
 mounted and in use, or only off-line.  If it is off-line then it will
 mean a long down-time while the array is checked - but if it is online,
 then there is the risk of confusing the filesystem and caches by
 changing the data.
 
 Currently, raid6check can work with FS mounted.
 I got the suggestion from Neil (of course).
 It is possible to lock one stripe and check it.
 This should be, at any given time, consistent
 (that is, the parity should always match the data).
 If an error is found, it is reported.
 Again, the user can decide to fix it or not,
 considering all the FS consequences and so on.
 

If you can lock stripes, and make sure any old data from that stripe is
flushed from the caches (if you change it while locked), then that
sounds ideal.

 Most disk errors /are/ detectable, and are reported by the underlying
 hardware - small surface errors are corrected by the disk's own error
 checking and correcting mechanisms, and larger errors are usually
 detected.  It is (or should be!) very rare that a read error goes
 undetected without there being a major problem with the disk controller.
  And if the error is detected, then the normal raid processing kicks in
 as there is no doubt about which block has problems.
 
 That's clear. That case is an erasure (I think)
 and it is perfectly in line with the usual operation.
 I'm not trying to replace this mechanism.
  
 If you can be /sure/ about which data block is incorrect, then I agree -
 but you can't be /entirely/ sure.  But I agree that you can make a good
 enough guess to recommend a fix to the user - as long as it is not
 automatic.
 
 One typical case is when many errors are
 found, belonging to the same disk.
 This case clearly shows the disk is to be
 replaced or the interface checked...
 But, again, the user is the master, not the
 machine... :-)

I don't know what sort of interface you have for the user, but I guess
that means you'll have to collect a number of failures before showing
them so that the user can see the correlation on disk number.

  
 For most ECC schemes, you know that all your blocks are set
 synchronously - so any block that does not fit in, is an error.  With
 raid, it could also be that a stripe is only partly written - you can
 
 Could it be?
 I would consider this an error.

It could occur as the result of a failure of some sort (kernel crash,
power failure, temporary disk problem, etc.).  More generally, md raid
doesn't have to be on local physical disks - maybe one of the disks is
an iSCSI drive or something else over a network that could have failures
or delays.  I haven't thought through all cases here - I am just
throwing them out as possibilities that might cause trouble.

 The stripe must always be consistent, there
 should be a transactional mechanism to make
 sure that, if read back, the data is always
 matching the parity.
 When I write read back I mean from whatever
 the data is: physical disk or cache.
 Otherwise, the check must run exclusively on
 the array (no mounted FS, no other things
 running on it).
 
 have two different valid sets of data mixed to give an inconsistent
 stripe, without any good way of telling what consistent data is the best
 choice.
  
 Perhaps a checking tool can take advantage of a write-intent bitmap (if
 there is one) so that it knows if an inconsistent stripe is partly
 updated or the result of a disk error.
 
 Of course, this is an option, which should be
 taken into consideration.
 
 Any improvement idea is welcome!!!
 
 bye,
 


Re: Triple parity and beyond

2013-11-21 Thread H. Peter Anvin

On 11/21/2013 04:30 PM, Stan Hoeppner wrote:
 
 The rebuild time of a parity array normally has little to do with CPU
 overhead.

Unless you have to fall back to table driven code.

Anyway, this looks like a great concept.  Now we just need to implement
it ;)

-hpa


Re: Triple parity and beyond

On 22/11/13 01:30, Stan Hoeppner wrote:

 I don't like it either.  It's a compromise.  But as RAID1/10 will soon
 be unusable due to URE probability during rebuild, I think it's a
 relatively good compromise for some users, some workloads.

An alternative is to move to 3-way raid1 mirrors rather than 2-way
mirrors.  Obviously you take another hit in disk space efficiency, but
reads will be faster and you have extra redundancy.



Re: Triple parity and beyond

2013-11-21 Thread Piergiorgio Sartor

On Wed, Nov 20, 2013 at 07:28:37PM -0600, Stan Hoeppner wrote:
[...]
 It's always perilous to follow a Ph.D., so I guess I'm feeling suicidal
 today. ;)
 
 I'm not attempting to marginalize Andrea's work here, but I can't help
 but ponder what the real value of triple parity RAID is, or quad, or
 beyond.  Some time ago parity RAID's primary mission ceased to be

Hi Stan,

my opinion is that you have to think
in terms of storage devices which are
not always available.
Those are not simply directly connected
HDDs; they could be more exotic.
The example I have in mind is p2p network
storage, where the nodes are not very
reliable.
I guess there could be more.

bye,

-- 

piergiorgio

Re: Triple parity and beyond

2013-11-20 Thread David Brown

On 20/11/13 02:23, John Williams wrote:
 On Tue, Nov 19, 2013 at 4:54 PM, Chris Murphy li...@colorremedies.com
 wrote:
 If anything, I'd like to see two implementations of RAID 6 dual
 parity. The existing implementation in the md driver and btrfs could
 remain the default, but users could opt into Cauchy matrix based dual
 parity which would then enable them an easy (and live) migration path
 to triple parity and beyond.

Andrea's Cauchy matrix is compatible with the existing Raid6, so there
is no problem there.

I believe it would be a terrible idea to have an incompatible extension
- that would mean you could not have temporary extra parity drives with
asymmetrical layouts, which is something I see as a very useful feature.

 
 Actually, my understanding is that Andrea's Cauchy matrix technique
 (call it C) is compatible with existing md RAID5 and RAID6 (call these
 A). It is only the non-SSSE3 triple-parity algorithm 2^-1 (call it B)
 that is incompatible with his Cauchy matrix technique.
 
 So, you can have:
 
 1) A+B
 
 or
 
 2) A+C
 
 But you cannot have A+B+C

Yes, that's right.



Re: Triple parity and beyond

2013-11-20 Thread James Plank

Hi all -- no real comments, except as I mentioned to Ric, my tutorial in FAST
last February presents Reed-Solomon coding with Cauchy matrices, and then makes
special note of the common pitfall of assuming that you can append a
Vandermonde matrix to an identity matrix. Please see
http://web.eecs.utk.edu/~plank/plank/papers/2013-02-11-FAST-Tutorial.pdf,
slides 48-52.

Andrea, does the matrix that you included in an earlier mail (the one that has
Linux RAID-6 in the first two rows) have a general form, or did you develop it
in an ad hoc manner so that it would include Linux RAID-6 in the first two rows?

Best wishes -- Jim
--

On Nov 19, 2013, at 3:29 PM, Ric Wheeler wrote:

On 11/19/2013 12:28 PM, Andrea Mazzoleni wrote:
Hi Peter,

Yes, 251 disks for 6 parity.

To build a NxM Cauchy matrix you need to pick N+M distinct values
in the GF(2^8) and we have only 2^8 == 256 available.
This means that for every row we add for an extra parity level, we
have to remove one of the disk columns.

Note that in truth, I use an Extended Cauchy matrix that gives the
first row of 1 for free. This results in N+M = 256+1.
So, DISKS = 257 - PARITY -> 251 = 257 - 6

A brief introduction of Cauchy and Extended Cauchy matrix can be found in:

Vinocha, On Generator Cauchy Matrices of GDRS/GTRS Codes, 2012
http://www.m-hikari.com/ijcms/ijcms-2012/45-48-2012/brarIJCMS45-48-2012.pdf
(just check the Introduction, the rest is not related)

More details can be found in:

Roth, Introduction to Coding Theory, 2006
http://carlossicoli.free.fr/R/Roth_R.-Introduction_to_coding_theory-Cambridge_University_Press%282006%29.pdf
(search for Extended Cauchy)

Ciao,
Andrea

Great work - we have waited a long time for this. Adding in Jim Plank who did
some great talks and work in this area as well :)

Ric

On Tue, Nov 19, 2013 at 12:25 AM, H. Peter Anvin h...@zytor.com wrote:
On 11/18/2013 02:35 PM, Andrea Mazzoleni wrote:
Hi Peter,

The Cauchy matrix has the mathematical property to always have itself
and all submatrices not singular. So, we are sure that we can always
solve the equations to recover the data disks.

Besides the mathematical proof, I've also inverted all the
377,342,351,231 possible submatrices for up to 6 parities and 251 data
disks, and got an experimental confirmation of this.

Nice.

The only limit comes from GF(2^8). You have a maximum number
of disks = 2^8 + 1 - number_of_parities. For example, with 6 parities,
you can have no more than 251 data disks. Beyond this limit it's not
possible to build a Cauchy matrix.

251? Not 255?

Note that with a Vandermonde matrix, instead, you don't have the
guarantee that all the submatrices are non-singular. This is why,
using power coefficients, sooner or later you run into unsolvable
equations.

You can find the code that generates the Cauchy matrix, with some
explanation in the comments, at (see the set_cauchy() function):

http://sourceforge.net/p/snapraid/code/ci/master/tree/mktables.c
OK, need to read up on the theoretical aspects of this, but it sounds
promising.

-hpa


Re: Triple parity and beyond

2013-11-20 Thread John Williams

On Wed, Nov 20, 2013 at 2:31 AM, David Brown david.br...@hesbynett.no wrote:
 That's certainly a reasonable way to look at it.  We should not limit
 the possibilities for high-end systems because of the limitations of
 low-end systems that are unlikely to use 3+ parity anyway.  I've also
 looked up a list of the processors that support SSE3 and PSHUFB - a lot
 of modern low-end x86 cpus support it.  And of course it is possible
 to implement general GF(2^8) multiplication without PSHUFB, using a
 lookup table - it is important that this can all work with any CPU, even
 if it is slow.
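
As an illustration of the lookup-table fallback mentioned above, here is
a minimal sketch in C of GF(2^8) multiplication through log/exp tables.
It assumes the polynomial 0x11d (the one Linux RAID-6 uses) and the
generator 2. All names are invented for this example; this is not the md
or SnapRAID code.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t gf_exp[510];	/* gf_exp[i] = 2^i, stored twice to avoid a modulo */
static uint8_t gf_log[256];	/* gf_log[2^i] = i; gf_log[0] is never used */

/* Fill both tables by repeatedly multiplying by 2 modulo 0x11d. */
static void gf_init(void)
{
	unsigned v = 1;

	for (int i = 0; i < 255; i++) {
		gf_exp[i] = gf_exp[i + 255] = (uint8_t)v;
		gf_log[v] = (uint8_t)i;
		v <<= 1;
		if (v & 0x100)
			v ^= 0x11d;
	}
}

/* Table-driven multiply: a*b = 2^(log a + log b). Call gf_init() first. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	if (a == 0 || b == 0)
		return 0;
	return gf_exp[gf_log[a] + gf_log[b]];
}

/* Bit-by-bit reference multiply, used only to cross-check the tables. */
static uint8_t gf_mul_slow(uint8_t a, uint8_t b)
{
	unsigned r = 0, aa = a;

	for (int i = 0; i < 8; i++) {
		if (b & (1u << i))
			r ^= aa;
		aa <<= 1;
		if (aa & 0x100)
			aa ^= 0x11d;
	}
	return (uint8_t)r;
}
```

This works on any CPU, at the cost of two table lookups per byte, which
is why it is expected to be much slower than a PSHUFB version that
handles 16 bytes per instruction.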

Unfortunately, it is SSSE3 that is required for PSHUFB. The SSE3 set
with only two-esses does not suffice. I made that same mistake when I
first heard about Andrea's 6-parity work. SSSE3 vs. SSE3, confusing
notation!

SSSE3 is significantly less widely supported than SSE3. Particularly
on AMD, only the very latest CPUs seem to support SSSE3. Intel support
for SSSE3 goes back much further than AMD support.

Maybe it is not such a big problem, since it may be possible to
support two roads. Both roads would include the current md RAID-5
and RAID-6. But one road, which those lacking CPUs supporting SSSE3
might choose, would continue on to the non-SSSE3 triple-parity 2^-1
technique, and then dead-end. The other road would continue with the
Cauchy matrix technique through 3-parity all the way to 6-parity.

It might even be feasible to allow someone stuck at the end of the
non-SSSE3 road to convert to the Cauchy road. You would have to go
through all the 2^-1 triple-parity and convert it to Cauchy
triple-parity. But then you would be safely on the Cauchy road.

Re: Triple parity and beyond

Hi David,

 The choice of ZFS to use powers of 4 was likely not optimal,
 because to multiply by 4, it has to do two multiplications by 2.
 I can agree with that.  I didn't copy ZFS's choice here
David, it was not my intention to suggest that you copied from ZFS.
Sorry to have expressed myself badly. I just mentioned ZFS because it's
an implementation that I know uses powers of 4 to generate triple
parity, and I saw in the code that it's implemented with two multiplications
by 2.

Ciao,
Andrea

Re: Triple parity and beyond

It is also possible to quickly multiply by 2^-1 which makes for an interesting 
R parity.

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

Re: Triple parity and beyond

Hi John,

Yes. There are still AMD CPUs sold without SSSE3, most notably the Athlon.
Intel, by contrast, has provided SSSE3 since the Core 2 Duo.

A detailed list is available at: http://en.wikipedia.org/wiki/SSSE3

Ciao,
Andrea


Re: Triple parity and beyond

Hi,

Yep. At present, to multiply by 2^-1 I'm using in C:

static inline uint64_t d2_64(uint64_t v)
{
	/* pick bit 0 of every byte */
	uint64_t mask = v & 0x0101010101010101U;
	/* expand it to 0xFF or 0x00 over the whole byte */
	mask = (mask << 8) - mask;
	/* shift every byte right by one, dropping bits that crossed a byte */
	v = (v >> 1) & 0x7f7f7f7f7f7f7f7fU;
	/* fold the shifted-out bit back in */
	v ^= mask & 0x8e8e8e8e8e8e8e8eU;
	return v;
}

and for SSE2:

asm volatile("movdqa %xmm2,%xmm4");
asm volatile("pxor %xmm5,%xmm5");
asm volatile("psllw $7,%xmm4");
asm volatile("psrlw $1,%xmm2");
asm volatile("pcmpgtb %xmm4,%xmm5");
asm volatile("pand %xmm6,%xmm2"); /* with xmm6 == 7f7f7f7f7f7f... */
asm volatile("pand %xmm3,%xmm5"); /* with xmm3 == 8e8e8e8e8e... */
asm volatile("pxor %xmm5,%xmm2");

where xmm2 is the input/output
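
One way to sanity-check a routine like this is to pair it with the usual
multiply-by-2 step and confirm that each undoes the other. The sketch
below does that for every byte value. It assumes the polynomial 0x11d;
x2_64() and d2_roundtrip_ok() are names invented here, and x2_64() is
written in the style of the generic RAID-6 code rather than copied from
it.

```c
#include <assert.h>
#include <stdint.h>

/* Divide each of the 8 packed bytes by 2 in GF(2^8). */
static inline uint64_t d2_64(uint64_t v)
{
	uint64_t mask = v & 0x0101010101010101ULL;

	mask = (mask << 8) - mask;
	v = (v >> 1) & 0x7f7f7f7f7f7f7f7fULL;
	v ^= mask & 0x8e8e8e8e8e8e8e8eULL;
	return v;
}

/* Multiply each of the 8 packed bytes by 2 in GF(2^8). */
static inline uint64_t x2_64(uint64_t v)
{
	uint64_t mask = v & 0x8080808080808080ULL;

	mask = (mask << 1) - (mask >> 7);	/* top bit -> 0xFF or 0x00 per byte */
	v = (v << 1) & 0xfefefefefefefefeULL;
	v ^= mask & 0x1d1d1d1d1d1d1d1dULL;
	return v;
}

/* Return 1 if x2 and d2 are inverses of each other for every byte value. */
static int d2_roundtrip_ok(void)
{
	for (uint64_t b = 0; b < 256; b++) {
		uint64_t v = b * 0x0101010101010101ULL;	/* same byte in all 8 lanes */

		if (d2_64(x2_64(v)) != v || x2_64(d2_64(v)) != v)
			return 0;
	}
	return 1;
}
```

For a single byte, d2_64(0x01) gives 0x8e, which is the 2^-1 = 142 that
shows up as a coefficient elsewhere in this thread.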

Ciao,
Andrea


Re: Triple parity and beyond

On 11/20/2013 10:56 AM, Andrea Mazzoleni wrote:
 Hi,
 
 Yep. At present, to multiply by 2^-1 I'm using in C:
 
 static inline uint64_t d2_64(uint64_t v)
 {
 uint64_t mask = v & 0x0101010101010101U;
 mask = (mask << 8) - mask;
 v = (v >> 1) & 0x7f7f7f7f7f7f7f7fU;
 v ^= mask & 0x8e8e8e8e8e8e8e8eU;
 return v;
 }
 
 and for SSE2:
 
 asm volatile("movdqa %xmm2,%xmm4");
 asm volatile("pxor %xmm5,%xmm5");
 asm volatile("psllw $7,%xmm4");
 asm volatile("psrlw $1,%xmm2");
 asm volatile("pcmpgtb %xmm4,%xmm5");
 asm volatile("pand %xmm6,%xmm2"); /* with xmm6 == 7f7f7f7f7f7f... */
 asm volatile("pand %xmm3,%xmm5"); /* with xmm3 == 8e8e8e8e8e... */
 asm volatile("pxor %xmm5,%xmm2");
 
 where xmm2 is the input/output
 

Now, that doesn't sound like something that can get neatly meshed into
the Cauchy matrix scheme, I assume.  It is somewhat nice to have a
scheme which is arbitrarily expandable without having to fall back to
dual parity during the restripe operation.  It probably also reduces the
amount of code necessary.

-hpa



Re: Triple parity and beyond

Hi Jim,

I build the matrix in a way that results in coefficients matching
Linux RAID for the first two rows, and at the same time guarantees
that all the square submatrices are non-singular, resulting in an
MDS code.

I start by forming a Cauchy matrix, setting each element to 1/(xi+yj)
where all xi and yj are distinct elements. This is how a Cauchy
matrix is usually defined in textbooks.

For the first row with j=0, I use xi = 2^-i and y0 = 0, which results in:

row j=0 -> 1/(xi+y0) = 1/(2^-i + 0) = 2^i (RAID-6 coefficients)

For the next rows with j>0, I use yj = 2^j, resulting in:

rows j>0 -> 1/(xi+yj) = 1/(2^-i + 2^j)

with xi != yj for any i,j with i>=0, j>=1, i+j<255

Then I put at the top of the Cauchy matrix a row filled with 1,
transforming it into an Extended Cauchy Matrix.
This transformation maintains the property of having all the
square submatrices non-singular.
I found this property mentioned in some papers/textbooks, like
in the introduction of:

Vinocha, On Generator Cauchy Matrices of GDRS/GTRS Codes, 2012
http://www.m-hikari.com/ijcms/ijcms-2012/45-48-2012/brarIJCMS45-48-2012.pdf

Finally I adjust all the rows to have the first column filled with 1,
by multiplying each row by an adjusting factor.
This transformation also maintains the property of having all the
square submatrices non-singular, and so we have an MDS code.
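
The construction described above can be checked by brute force on a
small case. The following is only a sketch under stated assumptions: it
uses the polynomial 0x11d, a toy size of 6 data disks and 4 parities,
and helper names (gf_mul, gf_inv, build_matrix, nonsingular,
count_singular) invented for this example. It is not the set_cauchy()
code. It builds the matrix and runs Gaussian elimination on every square
submatrix (209 of them at this size).

```c
#include <assert.h>
#include <stdint.h>

#define K 6	/* data disks, one column each */
#define M 4	/* parity levels, one row each */

/* Multiply in GF(2^8), polynomial 0x11d, one bit of b at a time. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	unsigned r = 0, aa = a;
	for (int i = 0; i < 8; i++) {
		if (b & (1u << i))
			r ^= aa;
		aa <<= 1;
		if (aa & 0x100)
			aa ^= 0x11d;
	}
	return (uint8_t)r;
}

/* Inverse by exhaustive search; a must be nonzero. */
static uint8_t gf_inv(uint8_t a)
{
	assert(a != 0);
	for (unsigned b = 1; b < 256; b++)
		if (gf_mul(a, (uint8_t)b) == 1)
			return (uint8_t)b;
	return 0;	/* not reached */
}

/* Row 0 is all ones; the others are 1/(x_i + y) scaled so column 0 is 1. */
static void build_matrix(uint8_t g[M][K])
{
	uint8_t y = 0;	/* y takes the values 0, 2, 4, ... */
	for (int i = 0; i < K; i++)
		g[0][i] = 1;
	for (int j = 1; j < M; j++) {
		uint8_t x = 1;	/* x_0 = 2^0, then 2^-1, 2^-2, ... */
		uint8_t f = x ^ y;	/* inverse of the column-0 entry */
		for (int i = 0; i < K; i++) {
			g[j][i] = gf_mul(gf_inv(x ^ y), f);
			x = gf_mul(x, gf_inv(2));
		}
		y = (y == 0) ? 2 : gf_mul(y, 2);
	}
}

/* Gaussian elimination in place; returns 1 if the top-left n x n part
 * of the scratch matrix a is invertible. */
static int nonsingular(uint8_t a[M][K], int n)
{
	for (int c = 0; c < n; c++) {
		int p = c;
		while (p < n && a[p][c] == 0)
			p++;
		if (p == n)
			return 0;	/* no pivot left in this column */
		if (p != c)	/* add row p to row c to get a nonzero pivot */
			for (int i = 0; i < n; i++)
				a[c][i] ^= a[p][i];
		uint8_t inv = gf_inv(a[c][c]);
		for (int r = c + 1; r < n; r++) {
			uint8_t f = gf_mul(a[r][c], inv);
			for (int i = c; i < n; i++)
				a[r][i] ^= gf_mul(f, a[c][i]);
		}
	}
	return 1;
}

/* Count singular square submatrices over all row and column subsets. */
static long count_singular(uint8_t g[M][K])
{
	long bad = 0;
	for (unsigned rs = 1; rs < (1u << M); rs++)
		for (unsigned cs = 1; cs < (1u << K); cs++) {
			uint8_t a[M][K];
			int n = 0, c = 0;
			for (int j = 0; j < M; j++) {
				if (!(rs & (1u << j)))
					continue;
				c = 0;
				for (int i = 0; i < K; i++)
					if (cs & (1u << i))
						a[n][c++] = g[j][i];
				n++;
			}
			if (n == c)	/* only square selections */
				bad += !nonsingular(a, n);
		}
	return bad;
}
```

If the construction is MDS, count_singular() should return 0. The same
exhaustive approach gets expensive very quickly as the number of disks
and parities grows.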

Ciao,
Andrea

Re: Triple parity and beyond

On 11/20/2013 11:05 AM, Andrea Mazzoleni wrote:
 
 For the first row with j=0, I use xi = 2^-i and y0 = 0, that results in:
 

How can xi = 2^-i if x is supposed to be constant?

That doesn't mean that your approach isn't valid, of course, but it
might not be a Cauchy matrix and thus needs additional analysis.

 row j=0 -> 1/(xi+y0) = 1/(2^-i + 0) = 2^i (RAID-6 coefficients)
 
 For the next rows with j>0, I use yj = 2^j, resulting in:
 
 rows j>0 -> 1/(xi+yj) = 1/(2^-i + 2^j)

Even more so here... 2^-i and 2^j don't seem to be of the form xi and yj
respectively.

-hpa


Re: Triple parity and beyond

2013-11-20 Thread James Plank

Peter, I think I understand it differently.  Concrete example in GF(256) for 
k=6, m=4:

First, create a 3 by 6 Cauchy matrix, using x_i = 2^-i, and y_i = 0 for i=0, 
and y_i = 2^i for other i.  In this case:   x = { 1, 142, 71, 173, 216, 108 }  
y = { 0, 2, 4 }.  The Cauchy matrix is:

  1   2   4   8  16  32
244  83  78 183 118  47
167  39 213  59 153  82

Divide row 2 by 244 and row 3 by 167.  Then extend it with a row of ones on top 
and it's still MDS, and that's the code for m=4, with RAID-6 as a subset.  Very 
nice!  
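
The worked example above can be reproduced with a few lines of C. This
is a minimal sketch assuming GF(2^8) with the polynomial 0x11d (which is
consistent with the x values listed). The helper names are invented for
illustration. build_cauchy() fills the 3 by 6 matrix from x_i = 2^-i and
y = { 0, 2, 4 }, and normalize_rows() does the "divide each row by its
first element" step.

```c
#include <assert.h>
#include <stdint.h>

#define ROWS 3
#define COLS 6

/* Multiply in GF(2^8), polynomial 0x11d, one bit of b at a time. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	unsigned r = 0, aa = a;

	for (int i = 0; i < 8; i++) {
		if (b & (1u << i))
			r ^= aa;
		aa <<= 1;
		if (aa & 0x100)
			aa ^= 0x11d;
	}
	return (uint8_t)r;
}

/* Multiplicative inverse by exhaustive search; a must be nonzero. */
static uint8_t gf_inv(uint8_t a)
{
	assert(a != 0);
	for (unsigned b = 1; b < 256; b++)
		if (gf_mul(a, (uint8_t)b) == 1)
			return (uint8_t)b;
	return 0;	/* not reached for nonzero a */
}

/* m[j][i] = 1/(x_i + y_j) with x_i = 2^-i, y_0 = 0, y_j = 2^j for j > 0. */
static void build_cauchy(uint8_t m[ROWS][COLS])
{
	uint8_t x = 1;	/* x_0 = 2^0 */

	for (int i = 0; i < COLS; i++) {
		uint8_t y = 0;	/* y_0 */

		for (int j = 0; j < ROWS; j++) {
			m[j][i] = gf_inv(x ^ y);	/* addition in GF(2^8) is XOR */
			y = (j == 0) ? 2 : gf_mul(y, 2);	/* next y_j */
		}
		x = gf_mul(x, gf_inv(2));	/* next x_i */
	}
}

/* Scale every row so that its first column becomes 1. */
static void normalize_rows(uint8_t m[ROWS][COLS])
{
	for (int j = 0; j < ROWS; j++) {
		uint8_t f = gf_inv(m[j][0]);

		for (int i = 0; i < COLS; i++)
			m[j][i] = gf_mul(m[j][i], f);
	}
}
```

Before normalization the first row should come out as 1, 2, 4, 8, 16, 32
and the first column as 1, 244, 167, matching the matrix above. After
normalize_rows() the first column is all ones, and putting a row of ones
on top gives the m=4 code.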

Jim


Re: Triple parity and beyond

Hi Peter,

 static inline uint64_t d2_64(uint64_t v)
 {
 uint64_t mask = v & 0x0101010101010101U;
 mask = (mask << 8) - mask;

 (mask << 7) I assume...
No. It's (mask << 8) - mask. We want to expand the bit at position 0
(in each byte) to the full byte, resulting in 0xFF if the bit is 1,
and 0x00 if the bit is 0.

(0 << 8) - 0 = 0x00
(1 << 8) - 1 = 0x100 - 1 = 0xFF
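
The same arithmetic can be tried on a whole 64-bit word. The sketch
below isolates just the mask step; expand_bit0() and
expand_bit0_shift7() are names invented here, and the second function
exists only to show what a shift by 7 would produce instead.

```c
#include <assert.h>
#include <stdint.h>

/* Spread bit 0 of every byte over that whole byte:
 * a byte with bit 0 set becomes 0xFF, any other byte becomes 0x00. */
static uint64_t expand_bit0(uint64_t v)
{
	uint64_t mask = v & 0x0101010101010101ULL;

	return (mask << 8) - mask;
}

/* The same expression with a shift of 7, for comparison only. */
static uint64_t expand_bit0_shift7(uint64_t v)
{
	uint64_t mask = v & 0x0101010101010101ULL;

	return (mask << 7) - mask;
}
```

For a lone low byte of 0x01, expand_bit0() yields 0xFF while the
shift-by-7 variant yields only 0x7F. The bit shifted out of the top byte
is lost, but the 64-bit wraparound of the subtraction still leaves 0xFF
in that byte.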

Ciao,
Andrea

Re: Triple parity and beyond

On 11/20/2013 01:04 PM, Andrea Mazzoleni wrote:
 Hi Peter,
 
 static inline uint64_t d2_64(uint64_t v)
 {
 uint64_t mask = v & 0x0101010101010101U;
 mask = (mask << 8) - mask;

 (mask << 7) I assume...
 No. It's (mask << 8) - mask. We want to expand the bit at position 0
 (in each byte) to the full byte, resulting in 0xFF if the bit is 1,
 and 0x00 if the bit is 0.
 
 (0 << 8) - 0 = 0x00
 (1 << 8) - 1 = 0x100 - 1 = 0xFF
 

Oh, right... it is the same as (v << 1) - (v >> 7) except everything is
shifted over one.

-hpa



Re: Triple parity and beyond

Hi Peter,

 Now, that doesn't sound like something that can get neatly meshed into
 the Cauchy matrix scheme, I assume.
You are correct. Multiplication by 2^-1 cannot be used for the Cauchy method.

I used it to implement an alternate triple parity not requiring PSHUFB
that I used as reference for performance evaluation of the Cauchy way,
assuming that this implementation using powers of 1,2,2^-1 is the
fastest possible one.

Hopefully the difference is minimal, and the Cauchy method is
competitive even at triple parity.

Ciao,
Andrea

Re: Triple parity and beyond

Hi,

 First, create a 3 by 6 cauchy matrix, using x_i = 2^-i, and y_i = 0 for i=0, 
 and y_i = 2^i for other i.
 In this case:   x = { 1, 142, 71, 173, 216, 108 }  y = { 0, 2, 4 }.  The 
 cauchy matrix is:

   1   2   4   8  16  32
 244  83  78 183 118  47
 167  39 213  59 153  82

 Divide row 2 by 244 and row 3 by 167.  Then extend it with a row of ones on 
 top and it's still MDS,
 and  that's the code for m=4, with RAID-6 as a subset.  Very nice!

You got it Jim!

Thanks,
Andrea

Re: Triple parity and beyond

On 11/20/2013 12:30 PM, James Plank wrote:
 Peter, I think I understand it differently.  Concrete example in GF(256) for 
 k=6, m=4:
 
 First, create a 3 by 6 cauchy matrix, using x_i = 2^-i, and y_i = 0 for i=0, 
 and y_i = 2^i for other i.  In this case:   x = { 1, 142, 71, 173, 216, 108 } 
  y = { 0, 2, 4 }.  The cauchy matrix is:

Sorry, I took xi and yj to mean a constant x multiplied with i and a
constant y multiplied with j, rather than x_i and y_j.

-hpa


Re: Triple parity and beyond