Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-12 Thread Adam Leventhal
> In my case, it gives an error that I need at least 11 disks (which I don't) 
> but the point is that raidz parity does not seem to be limited to 3. Is this 
> not true?

RAID-Z is limited to 3 parity disks. The error message is giving you false hope 
and that's a bug. If you had plugged in 11 disks or more in the example you 
provided you would have simply gotten a different error.

- ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-12 Thread Arne Schwabe
On 11.08.10 00:40, Peter Taps wrote:
> Hi,
>
> I am going through understanding the fundamentals of raidz. From the man 
> pages, a raidz configuration of P disks and N parity provides (P-N)*X storage 
> space where X is the size of the disk. For example, if I have 3 disks of 10G 
> each and I configure it with raidz1, I will have 20G of usable storage. In 
> addition, I continue to work even if 1 disk fails.
>
> First, I don't understand why parity takes so much space. From what I know 
> about parity, there is typically one parity bit per byte. Therefore, the 
> parity should be taking 1/8 of storage, not 1/3 of storage. What am I missing?
>
> Second, if one disk fails, how is my lost data reconstructed? There is no 
> duplicate data as this is not a mirrored configuration. Somehow, there should 
> be enough information in the parity disk to reconstruct the lost data. How is 
> this possible?
>
> Thank you in advance for your help.
>
No, it is more like disk3 = disk2 XOR disk1. You can read about it under
RAID5 (raidz is more complicated, but the basic idea stays the same). The
parity you describe is only for error checking. It is more like a ZFS checksum,
which also takes very little additional space.

Arne





Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-11 Thread Peter Taps
Thank you, Eric. Your explanation is clear to understand.

Regards,
Peter
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-11 Thread Peter Taps
I am running ZFS file system version 5 on Nexenta.

Peter


Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-11 Thread Marty Scholes
Peter wrote:
> One question though. Marty mentioned that raidz
> parity is limited to 3. But in my experiment, it
> seems I can get parity to any level.
> 
> You create a raidz zpool as:
> 
> # zpool create mypool raidzx disk1 disk2 
> 
> Here, x in raidzx is a numeric value indicating the
> desired parity.
> 
> In my experiment, the following command seems to
> work:
> 
> # zpool create mypool raidz10 disk1 disk2 ...
> 
> In my case, it gives an error that I need at least 11
> disks (which I don't) but the point is that raidz
> parity does not seem to be limited to 3. Is this not
> true?

You have my curiosity.  I was asking for that feature in these forums last 
year.

What OS, version and ZFS version are you running?


Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-11 Thread Peter Taps
Thank you all for your help. It appears my understanding of parity was rather 
limited. I kept on thinking about parity in memory where the extra bit would be 
used to ensure that the total of all 9 bits is always even. 

In case of zfs, the above type of checking is actually moved into checksum. 
What zfs calls parity is much more than a simple check. No wonder it takes more 
space.

One question though. Marty mentioned that raidz parity is limited to 3. But in 
my experiment, it seems I can get parity to any level.

You create a raidz zpool as:

# zpool create mypool raidzx disk1 disk2 

Here, x in raidzx is a numeric value indicating the desired parity.

In my experiment, the following command seems to work:

# zpool create mypool raidz10 disk1 disk2 ...

In my case, it gives an error that I need at least 11 disks (which I don't) but 
the point is that raidz parity does not seem to be limited to 3. Is this not 
true?

Thank you once again for your help.

Regards,
Peter


Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-11 Thread Eric D. Mudama

On Tue, Aug 10 at 21:57, Peter Taps wrote:

> Hi Eric,
>
> Thank you for your help. At least one part is clear now.
>
> I still am confused about how the system is still functional after one disk
> fails.


The data for any given sector striped across all drives can be thought
of as:

A+B+C = P

where A..C represent the contents of sector N on devices a..c, and P
is the parity located on device p.

From that, you can do some simple algebra to convert it to:

A+B+C-P = 0

If any of A,B,C or P are unreadable (assume B), from simple algebra,
you can solve for any single unknown (x) to recreate it:

A+x+C = P
A+x+C-A-C = P-A-C
x = P-A-C

and voila, you now have your original B contents, since B=x.
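
As a rough illustration of the algebra above, here is a minimal Python sketch. It assumes byte-wise addition modulo 256 so that the parity stays the same size as the data; the function names are made up for this example, and real RAID-Z uses XOR rather than addition.

```python
def make_parity(blocks):
    """Return the byte-wise sum of equal-length blocks, modulo 256 (A+B+C = P)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            parity[i] = (parity[i] + value) % 256
    return bytes(parity)


def recover_missing(surviving_blocks, parity):
    """Solve x = P - (sum of the surviving blocks), byte by byte, modulo 256."""
    missing = bytearray(parity)
    for block in surviving_blocks:
        for i, value in enumerate(block):
            missing[i] = (missing[i] - value) % 256
    return bytes(missing)


# A, B and C live on devices a, b and c; P lives on device p.
a, b, c = b"ab", b"cd", b"ef"
p = make_parity([a, b, c])

# Device b is unreadable: rebuild its contents from A, C and P.
recovered_b = recover_missing([a, c], p)
print(recovered_b)
```

The same subtraction works whichever single block is missing, because every block enters the sum exactly once.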

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-11 Thread Marty Scholes
Erik Trimble wrote:
> On 8/10/2010 9:57 PM, Peter Taps wrote:
> > Hi Eric,
> >
> > Thank you for your help. At least one part is clear now.
> >
> > I still am confused about how the system is still functional after one
> > disk fails.
> >
> > Consider my earlier example of 3 disks zpool configured for raidz-1. To
> > keep it simple let's not consider block sizes.
> >
> > Let's say I send a write value "abcdef" to the zpool.
> >
> > As the data gets striped, we will have 2 characters per disk.
> >
> > disk1 = "ab" + some parity info
> > disk2 = "cd" + some parity info
> > disk3 = "ef" + some parity info
> >
> > Now, if disk2 fails, I lost "cd." How will I ever recover this? The
> > parity info may tell me that something is bad but I don't see how my
> > data will get recovered.
> >
> > The only good thing is that any newer data will now be striped over two
> > disks.
> >
> > Perhaps I am missing some fundamental concept about raidz.
> >
> > Regards,
> > Peter
>
> Parity is not intended to tell you *if* something is bad (well, it's not
> *designed* for that). It tells you how to RECONSTRUCT something should
> it be bad.  ZFS uses Checksums of the data (which are stored as data
> themselves) to tell if some data is bad, and thus needs to be re-written

To follow up Erik's post, parity is used both to detect and to correct errors 
in a string of equal-sized numbers; each parity is equal in size to each of the 
numbers.  In the old serial protocols, one bit was used to detect an error in a 
string of 7 bits, so each "number" in the string was one bit.  In the case of 
ZFS, each "number" in the string is a disk block.  The length of the string of 
numbers is completely arbitrary.

I am rusty on parity math, but Reed-Solomon is used (of which XOR is a 
degenerate case) such that each parity is independent of the other parities.  
RAIDZ can support up to three parities per stripe.

Generally, a single parity can either detect a single corrupt number in a 
string or if it is known which number is corrupt, a single parity can correct 
that number.  Traditional RAID5 makes the assumption that it knows which number 
(i.e. block) is bad because the disk failed and therefore can use the parity 
block to reconstruct it.  RAID5 cannot reconstruct a random bit-flip.

RAIDZ takes a different approach where the checksum for the number string (i.e. 
stripe) exists in a different, already validated stripe.  With that checksum in 
hand, ZFS knows when a stripe is corrupt but not which block.  ZFS will then 
reconstruct each data block in the stripe using the parity block, one data 
block at a time until the checksum matches.  At that point ZFS knows which 
block is bad and can rebuild it and write it to disk.  A scrub does this for 
all stripes and all parities in each stripe.
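
The trial-and-error repair described in the paragraph above can be sketched roughly as follows. This is only an illustration with a single XOR parity block: SHA-256 stands in for whatever checksum the pool uses, the helper names are hypothetical, and the real on-disk logic is considerably more involved.

```python
import hashlib


def xor_blocks(blocks):
    """XOR equal-length blocks together byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            result[i] ^= value
    return bytes(result)


def checksum(data_blocks):
    """Stand-in checksum over the stripe's data (SHA-256 is an assumption)."""
    return hashlib.sha256(b"".join(data_blocks)).digest()


def repair_stripe(data_blocks, parity, expected):
    """Return (bad_index, repaired_blocks), or (None, blocks) if already clean.

    Raises ValueError when no single-block repair matches the checksum.
    """
    if checksum(data_blocks) == expected:
        return None, list(data_blocks)
    for i in range(len(data_blocks)):
        # Assume block i is the bad one and rebuild it from the rest plus parity.
        others = data_blocks[:i] + data_blocks[i + 1:]
        candidate = xor_blocks(others + [parity])
        trial = data_blocks[:i] + [candidate] + data_blocks[i + 1:]
        if checksum(trial) == expected:
            return i, trial
    raise ValueError("stripe cannot be repaired with a single parity block")


good = [b"ab", b"cd", b"ef"]
parity = xor_blocks(good)
expected = checksum(good)  # stored elsewhere, in an already validated block

# Simulate silent corruption of the second block; the device stays online.
damaged = [b"ab", b"cX", b"ef"]
bad_index, repaired = repair_stripe(damaged, parity, expected)
print(bad_index, repaired)
```

The checksum is what turns "something in this stripe is wrong" into "this particular block is wrong"; parity alone cannot make that distinction.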

Using the example above, the disk layout would look more like the following for 
a single stripe, and as Erik mentioned, the location of the data and parity 
blocks will change from stripe to stripe:
disk1 = "ab"
disk2 = "cd"
disk3 = parity info

Again using the example above, if disk 2 fails, or even stays online but 
produces bad data, the information can be reconstructed from disk 3.

The beauty of ZFS is that it does not depend on parity to detect errors, your 
stripes can be as wide as you want (up to 100-ish devices) and you can choose 
1, 2 or 3 parity devices.

Hope that makes sense,
Marty


Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-11 Thread Thomas Burgess
On Wed, Aug 11, 2010 at 12:57 AM, Peter Taps wrote:

> Hi Eric,
>
> Thank you for your help. At least one part is clear now.
>
> I still am confused about how the system is still functional after one disk
> fails.
>
> Consider my earlier example of 3 disks zpool configured for raidz-1. To
> keep it simple let's not consider block sizes.
>
> Let's say I send a write value "abcdef" to the zpool.
>
> As the data gets striped, we will have 2 characters per disk.
>
> disk1 = "ab" + some parity info
> disk2 = "cd" + some parity info
> disk3 = "ef" + some parity info
>
> Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity
> info may tell me that something is bad but I don't see how my data will get
> recovered.
>
> The only good thing is that any newer data will now be striped over two
> disks.
>
> Perhaps I am missing some fundamental concept about raidz.
>
> Regards,
> Peter
>





I find the best way to understand how parity works is to think back to your
algebra class when you'd have something like

1x + 2 = 3

and you could solve for x. It's not EXACTLY like that, but solving the
parity stuff is similar to solving for x.









Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-10 Thread Haudy Kazemi

Peter Taps wrote:

> Hi Eric,
>
> Thank you for your help. At least one part is clear now.
>
> I still am confused about how the system is still functional after one disk
> fails.
>
> Consider my earlier example of 3 disks zpool configured for raidz-1. To keep
> it simple let's not consider block sizes.
>
> Let's say I send a write value "abcdef" to the zpool.
>
> As the data gets striped, we will have 2 characters per disk.
>
> disk1 = "ab" + some parity info
> disk2 = "cd" + some parity info
> disk3 = "ef" + some parity info
>
> Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity
> info may tell me that something is bad but I don't see how my data will get
> recovered.
>
> The only good thing is that any newer data will now be striped over two
> disks.
>
> Perhaps I am missing some fundamental concept about raidz.
>
> Regards,
> Peter
  


It's done via math and numbers.  :)  In a computer, everything is 
numbers, stored in base 2 (binary)...there are no letters or other 
symbols.  Your sample value of 'abcdef' will be represented as a 
sequence of numbers, probably using the ASCII equivalent numbers, which 
are in turn represented as a binary sequence.


A simplified view of how you can protect multiple independent pieces of 
information with one piece of parity is as follows.
(Note: this simplified view is not exactly how RAID5 or RAIDZ work, as 
they actually make use of XOR at a bitwise level).


Consider an equation with variables (unrelated to your sample value) A, 
B, and P, where A + B = P.  P is the parity value.
A and B are numbers representing your data; they were indirectly chosen 
by you when you created your data.  P is the generated parity value.


If A=97, and B=98, then P=97+98=195.

Each of the three variables is stored on a different disk.  If any one 
variable is lost (the disk failed), the missing variable can be 
recalculated by rearranging the formula and using the known values.


Assuming 'A' was lost, then A=P-B
P-B=195-98
195-98=97
A=97.  Data recovered.

In this simplified example, one piece of parity data P is generated for 
every pair of A and B values that are written.  Special cases handle 
things when only one value needs to be written (zero padding).  For more 
than 3 disks, the formula can expand to variations of A+B+C+D+E+F=P 
where P is the parity.  Additional levels of parity require using more 
complex techniques to generate the needed parity values.


There are lots of other explanations online that might help you out as 
well: http://www.google.com/#hl=en&q=how+raid+works




Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-10 Thread Erik Trimble

On 8/10/2010 9:57 PM, Peter Taps wrote:

> Hi Eric,
>
> Thank you for your help. At least one part is clear now.
>
> I still am confused about how the system is still functional after one disk
> fails.
>
> Consider my earlier example of 3 disks zpool configured for raidz-1. To keep
> it simple let's not consider block sizes.
>
> Let's say I send a write value "abcdef" to the zpool.
>
> As the data gets striped, we will have 2 characters per disk.
>
> disk1 = "ab" + some parity info
> disk2 = "cd" + some parity info
> disk3 = "ef" + some parity info
>
> Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity
> info may tell me that something is bad but I don't see how my data will get
> recovered.
>
> The only good thing is that any newer data will now be striped over two
> disks.
>
> Perhaps I am missing some fundamental concept about raidz.
>
> Regards,
> Peter


Parity is not intended to tell you *if* something is bad (well, it's not 
*designed* for that). It tells you how to RECONSTRUCT something should 
it be bad.  ZFS uses Checksums of the data (which are stored as data 
themselves) to tell if some data is bad, and thus needs to be re-written 
(which is what virtually no other filesystem does now). Parity is used 
at a lower level to reconstruct data on devices after a device failure. 
It is not directly used to determine if a device (or block of data) is bad.



To simplify, let's assume we're talking about raidz1  (the principles 
generally apply to raidz2 and raidz3, but the details differ slightly).



Parity is constructed using mathematical XOR, which has the following 
property:


if A XOR B = C
then
A XOR C = B    and also    B XOR C = A

(XOR is also fully commutative, so A XOR B = B XOR A)


So, in your case, what we have is some data "abcdef" and three disks. 
So, assuming we have a stripe set up so that 1 BYTE (i.e. character) 
gets stored on each device, then what you have is this:


Stripe   Device 1   Device 2   Device 3
1        A          B          A XOR B
2        C XOR D    C          D
3        E          E XOR F    F


(where X XOR Y means the binary value computed by XOR-ing X with Y)

In any case, if I lose one of the devices above, I simply XOR the 
corresponding values from the other two devices to reconstruct what I need.
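
To make the stripe table concrete, here is a small Python sketch that lays out "abcdef" one byte per device with a rotating XOR parity and then reads it back with any one device missing. The rotation order simply mirrors the table, and the function names are invented for this example; it is not how RAID-Z actually places blocks.

```python
def build_stripes(data, devices=3):
    """Lay out data one byte per device, rotating the parity position.

    Returns a list of stripes; each stripe is a list of `devices` integers.
    Parity sits on the last device for the first stripe, then the first
    device, then the second, as in the table.
    """
    per_stripe = devices - 1
    stripes = []
    for start in range(0, len(data), per_stripe):
        chunk = data[start:start + per_stripe]
        chunk = chunk + bytes(per_stripe - len(chunk))  # zero-pad a short tail
        parity = 0
        for value in chunk:
            parity ^= value
        parity_index = (devices - 1 + len(stripes)) % devices
        row = list(chunk)
        row.insert(parity_index, parity)
        stripes.append(row)
    return stripes


def read_with_failed_device(stripes, failed):
    """Rebuild the original data while ignoring one failed device."""
    devices = len(stripes[0])
    out = bytearray()
    for n, row in enumerate(stripes):
        parity_index = (devices - 1 + n) % devices
        # XOR of every surviving value in the stripe equals the lost value.
        rebuilt = 0
        for d, value in enumerate(row):
            if d != failed:
                rebuilt ^= value
        full = list(row)
        full[failed] = rebuilt
        del full[parity_index]  # drop parity, keep only the data bytes
        out.extend(full)
    return bytes(out)


stripes = build_stripes(b"abcdef")
for failed in range(3):
    print(failed, read_with_failed_device(stripes, failed))
```

Whether the lost value was data or parity makes no difference to the rebuild step: XOR-ing the survivors gives it back either way.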




For RaidZ[23], there are 2 or 3 parity calculations (it's not a 
straight XOR, I forget the algorithm), but the process is the same - you 
use the data from the remaining devices to recompute the lost device or 
devices. As the parity block for a stripe is stored in a balanced manner 
across all devices (there is no dedicated parity-only device), it 
becomes simpler to recover data while retaining performance.




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-10 Thread Peter Taps
Hi Eric,

Thank you for your help. At least one part is clear now.

I still am confused about how the system is still functional after one disk 
fails.

Consider my earlier example of 3 disks zpool configured for raidz-1. To keep it 
simple let's not consider block sizes.

Let's say I send a write value "abcdef" to the zpool.

As the data gets striped, we will have 2 characters per disk.

disk1 = "ab" + some parity info
disk2 = "cd" + some parity info
disk3 = "ef" + some parity info

Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity info 
may tell me that something is bad but I don't see how my data will get 
recovered.

The only good thing is that any newer data will now be striped over two disks.

Perhaps I am missing some fundamental concept about raidz.

Regards,
Peter


Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-10 Thread Eric D. Mudama

On Tue, Aug 10 at 15:40, Peter Taps wrote:

> Hi,
>
> First, I don't understand why parity takes so much space. From what
> I know about parity, there is typically one parity bit per
> byte. Therefore, the parity should be taking 1/8 of storage, not 1/3
> of storage. What am I missing?


Think of it as 1 bit of parity per N-wide RAID'd bit stored on your
data drives, which is why it occupies 1/N size.

With 3 disks it's 1/3, with 8 disks it's 1/8, and with 10983 disks it
would be 1/10983, because you're generating parity across the "width"
of your stripe, not as a tail to each stored byte on individual
devices.
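
The arithmetic can be checked with a few lines of Python. This is an idealized sketch based on the (P-N)*X rule quoted in the original question; the function names are made up here, and real pools lose some additional space to padding and metadata.

```python
from fractions import Fraction


def raidz_usable(disks, parity, disk_size_gb):
    """Usable space by the (P - N) * X rule quoted from the man page."""
    if not 1 <= parity <= 3:
        raise ValueError("RAID-Z supports 1, 2 or 3 parity devices")
    if disks <= parity:
        raise ValueError("need more disks than parity devices")
    return (disks - parity) * disk_size_gb


def parity_fraction(disks, parity=1):
    """Share of raw capacity spent on parity across the stripe width."""
    return Fraction(parity, disks)


print(raidz_usable(3, 1, 10))   # three 10G disks with raidz1
print(parity_fraction(3))       # overhead with 3 disks
print(parity_fraction(8))       # overhead with 8 disks
```

The overhead shrinks as the stripe gets wider because one parity block protects a whole row of data blocks, not a single byte.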

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



[zfs-discuss] Raidz - what is stored in parity?

2010-08-10 Thread Peter Taps
Hi,

I am going through understanding the fundamentals of raidz. From the man pages, 
a raidz configuration of P disks and N parity provides (P-N)*X storage space 
where X is the size of the disk. For example, if I have 3 disks of 10G each and 
I configure it with raidz1, I will have 20G of usable storage. In addition, I 
continue to work even if 1 disk fails.

First, I don't understand why parity takes so much space. From what I know 
about parity, there is typically one parity bit per byte. Therefore, the parity 
should be taking 1/8 of storage, not 1/3 of storage. What am I missing?

Second, if one disk fails, how is my lost data reconstructed? There is no 
duplicate data as this is not a mirrored configuration. Somehow, there should 
be enough information in the parity disk to reconstruct the lost data. How is 
this possible?

Thank you in advance for your help.

Regards,
Peter