Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-30 Thread Tuomas Leikola
On 9/20/07, Roch - PAE [EMAIL PROTECTED] wrote:

 Next application modifies D0 - D0' and also writes other
 data D3, D4. Now you have

 Disk0   Disk1   Disk2   Disk3

 D0  D1  D2  P0,1,2
 D0' D3  D4  P0',3,4

 But if D1 and D2 stay immutable for a long time then we can
 run out of pool blocks with D0 held down in a half-freed state.
 So as we near full pool capacity, a scrubber would have to walk
 the stripes and look for partially freed ones. Then it
 would need to do a scrubbing read/write on D1, D2 so that
 they become part of a new stripe with some other data,
 freeing the full initial stripe.


Or, given a list of partial stripes (and sufficient cache), the next write
of D5 could be combined with D1 and D2:

 Disk0   Disk1   Disk2   Disk3

 D0      D1      D2      P0,1,2
 D0'     D3      D4      P0',3,4
 D5      free    free    P5,1,2

therefore freeing D0 and P0,1,2:

 Disk0   Disk1   Disk2   Disk3

 free    D1      D2      free
 D0'     D3      D4      P0',3,4
 D5      free    free    P5,1,2

(I assumed no need for alignment.) Performance-wise, I'm guessing it
might be beneficial to quickly write mirrored blocks to disk and
later combine them, freeing the now-unneeded mirrors.
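
As a rough illustration of the combining scheme above - a toy Python model,
not anything from ZFS: parity is plain bytewise XOR, slots are fixed-size,
and stripe membership is something the FS would have to track itself:

from functools import reduce

def parity(*blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# disks[col][row]; None means the slot is free.  Three data columns plus a
# parity column, mirroring the tables above (alignment ignored, as assumed).
D0, D1, D2 = b"D0..", b"D1..", b"D2.."
D0n, D3, D4 = b"D0'.", b"D3..", b"D4.."
disks = [
    [D0,  D0n],                                   # Disk0
    [D1,  D3],                                    # Disk1
    [D2,  D4],                                    # Disk2
    [parity(D0, D1, D2), parity(D0n, D3, D4)],    # Disk3
]
# Row 0 is now partially freed: D0 was superseded by D0', but D1/D2 are live.

# Combine the next write (D5) with the survivors D1 and D2.
D5 = b"D5.."
for col in disks:
    col.append(None)                              # a fresh row of free slots
disks[0][2] = D5
disks[3][2] = parity(D5, disks[1][0], disks[2][0])  # P5,1,2 spans two rows

# No parity group references the old row any more, so free D0 and P0,1,2.
disks[0][0] = None
disks[3][0] = None

for row in range(3):
    print([col[row] for col in disks])            # matches the last table

The point is the same as in the tables: P5,1,2 covers blocks that live in
two different rows, which only works because the filesystem, not the disk
layout, records which blocks belong to which parity group.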


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-15 Thread Tuomas Leikola
On 9/10/07, Pawel Jakub Dawidek [EMAIL PROTECTED] wrote:
 The problem with RAID5 is that different blocks share the same parity,
 which is not the case for RAIDZ. When you write a block in RAIDZ, you
 write the data and the parity, and then you switch the pointer in the
 uberblock. For RAID5, you write the data and you need to update the parity,
 which also protects some other data. Now if you write the data, but
 don't update the parity before a crash, you have a hole. If you update
 the parity before the data and then crash, the parity is inconsistent with
 a different block in the same stripe.

This is why you should consider old data and parity as being live.
The old data (being overwritten) is live as it is needed for the
parity to be consistent - and the old parity is live because it
protects the other blocks.

What IMO should be done is object-level RAID: write the new parity and
the new data into blocks not yet used. As the new parity also protects
the neighbouring data, the old parity can be freed, and once it is no
longer live the overwritten data block can be freed as well.

Note that this is very different from traditional RAID5, as it requires
intimate knowledge of the FS structure. Traditional RAID also keeps
parity in line with the data blocks it protects - but that is not
necessary if the FS can store information about where the parity is
located.

Define live data well enough and you're safe if you never overwrite any of it.
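
To make the hole concrete, here is a toy model of a single RAID5 stripe
(plain XOR parity, illustrative Python only): D0 is overwritten in place,
the crash happens before the shared parity is rewritten, and a later
reconstruction of D1 silently returns garbage.

from functools import reduce

def parity(*blocks):
    """RAID5-style parity: bytewise XOR of the blocks in a stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# One stripe: D0, D1 and D2 share the single parity block P.
D0, D1, D2 = b"old0", b"old1", b"old2"
P = parity(D0, D1, D2)

# In-place update of D0 -> D0': the data write lands, then we crash before
# the shared parity is rewritten.
on_disk = {"D0": b"NEW0", "D1": D1, "D2": D2, "P": P}   # P is now stale

# Later Disk1 dies; rebuilding D1 from the stale parity gives the wrong
# answer.  That is the write hole.
rebuilt = parity(on_disk["P"], on_disk["D0"], on_disk["D2"])
print(rebuilt == D1)   # False

# As long as the old D0 is still on disk somewhere ("live"), P stays usable;
# only once a new parity covering D0' exists may old D0 and old P be freed.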

 My idea was to have one sector every 1GB on each disk for a journal to
 keep a list of blocks being updated.

This would be called a write-intent log or bitmap (as in Linux
software RAID). It speeds up recovery, but doesn't protect against
write-hole problems.
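
For comparison, a minimal sketch of such a bitmap (chunk size and names are
invented for illustration; this mimics the idea behind the Linux md
write-intent bitmap, not its actual code):

CHUNK = 1 << 20                      # 1 MiB per bit (hypothetical)

class IntentBitmap:
    def __init__(self):
        self.dirty = set()

    def mark(self, offset):          # persisted before the stripe writes
        self.dirty.add(offset // CHUNK)

    def clear(self, offset):         # persisted some time after they complete
        self.dirty.discard(offset // CHUNK)

    def regions_to_resync(self):
        """After a crash, only chunks whose bit is still set need their
        parity recomputed - a fast resync, but not write-hole protection:
        if a disk is also lost, data in those chunks may still be wrong."""
        return sorted(self.dirty)

bm = IntentBitmap()
bm.mark(5 * CHUNK + 4096)            # about to write data+parity here
# ... crash before bm.clear() ...
print(bm.regions_to_resync())        # [5] -> resync just this chunk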


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-13 Thread Pawel Jakub Dawidek
On Thu, Sep 13, 2007 at 04:58:10AM +, Marc Bevand wrote:
 Pawel Jakub Dawidek pjd at FreeBSD.org writes:
  
  This is how RAIDZ fills the disks (follow the numbers):
  
  Disk0   Disk1   Disk2   Disk3
  
  D0  D1  D2  P3
  D4  D5  D6  P7
  D8  D9  D10 P11
  D12 D13 D14 P15
  D16 D17 D18 P19
  D20 D21 D22 P23
  
  D is data, P is parity.
 
 This layout assumes of course that large stripes have been written to
 the RAIDZ vdev. As you know, the stripe width is dynamic, so it is
 possible for a single logical block to span only 2 disks (for those who
 don't know what I am talking about, see the red block occupying LBAs
 D3 and E3 on page 13 of these ZFS slides [1]).

Yes I'm aware of that.

 To read this logical block (and validate its checksum), only D_0 needs 
 to be read (LBA E3). So in this very specific case, a RAIDZ read
 operation is as cheap as a RAID5 read operation. [...]

If you do single-sector writes - yes, but this is very inefficient,
for two reasons:
1. Bandwidth - writing one sector at a time? Come on.
2. Space - when you write one sector and its parity you consume two
   sectors. You may have more than one parity column in that case, eg.
Disk0   Disk1   Disk2   Disk3   Disk4   Disk5
D0      P0      D1      P1      D2      P2
   In this case the space overhead is the same as for a mirror.

 [...] The existence of these
 small stripes could explain why RAIDZ doesn't fall as far behind RAID5
 as expected in Pawel's benchmark...

No, as I said, the smallest block I used was 2kB, which means four 512-byte
blocks plus one 512-byte parity block - each 2kB block uses all 5 disks.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-13 Thread James Blackburn
On 9/12/07, Pawel Jakub Dawidek [EMAIL PROTECTED] wrote:
 On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
  On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
  I'm a bit surprised by these results. Assuming relatively large blocks
  written, RAID-Z and RAID-5 should be laid out on disk very similarly
  resulting in similar read performance.

 Hmm, no. The data was organized very differently on the disks. The smallest
 block size used was 2kB, to ensure each block is written to all disks in
 the RAIDZ configuration. In the RAID5 configuration, however, a 128kB stripe
 size was used, which means each block was stored on one disk only.

 Now when you read the data, RAIDZ needs to read all disks for each block,
 and RAID5 needs to read only one disk for each block.

  Did you compare the I/O characteristic of both? Was the bottleneck in
  the software or the hardware?

 The bottleneck was definitely the disks. The CPU was about 96% idle.

 To be honest, like Jeff, I expected a much bigger win for the RAID5 case.

Well, it depends.  In both configurations the available read bandwidth
is the same.  Presumably you're expecting each disk to seek
independently and concurrently.  Is the spa aware that multiple,
offset-dependent reads can be issued concurrently to the RAID-5 vdev?

James


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Adam Leventhal
On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
 And here are the results:
 
 RAIDZ:
 
   Number of READ requests: 4.
   Number of WRITE requests: 0.
   Number of bytes to transmit: 695678976.
   Number of processes: 8.
   Bytes per second: 1305213
   Requests per second: 75
 
 RAID5:
 
   Number of READ requests: 4.
   Number of WRITE requests: 0.
   Number of bytes to transmit: 695678976.
   Number of processes: 8.
   Bytes per second: 2749719
   Requests per second: 158

I'm a bit surprised by these results. Assuming relatively large blocks
written, RAID-Z and RAID-5 should be laid out on disk very similarly
resulting in similar read performance.

Did you compare the I/O characteristic of both? Was the bottleneck in
the software or the hardware?

Very interesting experiment...

Adam

-- 
Adam Leventhal, FishWorks                    http://blogs.sun.com/ahl


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Pawel Jakub Dawidek
On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
 On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
  And here are the results:
  
  RAIDZ:
  
  Number of READ requests: 4.
  Number of WRITE requests: 0.
  Number of bytes to transmit: 695678976.
  Number of processes: 8.
  Bytes per second: 1305213
  Requests per second: 75
  
  RAID5:
  
  Number of READ requests: 4.
  Number of WRITE requests: 0.
  Number of bytes to transmit: 695678976.
  Number of processes: 8.
  Bytes per second: 2749719
  Requests per second: 158
 
 I'm a bit surprised by these results. Assuming relatively large blocks
 written, RAID-Z and RAID-5 should be laid out on disk very similarly
 resulting in similar read performance.

Hmm, no. The data was organized very differently on the disks. The smallest
block size used was 2kB, to ensure each block is written to all disks in
the RAIDZ configuration. In the RAID5 configuration, however, a 128kB stripe
size was used, which means each block was stored on one disk only.

Now when you read the data, RAIDZ needs to read all disks for each block,
and RAID5 needs to read only one disk for each block.

 Did you compare the I/O characteristic of both? Was the bottleneck in
 the software or the hardware?

The bottleneck was definitely the disks. The CPU was about 96% idle.

To be honest, like Jeff, I expected a much bigger win for the RAID5 case.
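
A back-of-the-envelope way to see it, counting only how many disks a single
random read touches (toy Python matching the 5-disk, 2-32kB setup; it
ignores seeks and queueing, which is presumably why the measured gap is
about 2x rather than the naive 4x):

import random

DISKS = 5                        # 4 data columns + 1 parity, as in the test
DATA_COLS = DISKS - 1
SECTOR = 512
STRIPE_UNIT = 128 * 1024         # RAID5 stripe size used in the test

def disks_touched_raidz(block_size):
    """Every RAIDZ block is its own full-width stripe, so reading one block
    touches all data columns (parity is skipped while healthy)."""
    return min(DATA_COLS, -(-block_size // SECTOR))      # 2kB -> 4 disks

def disks_touched_raid5(block_size):
    """With a 128kB stripe unit, a 2-32kB block sits on a single disk."""
    return -(-block_size // STRIPE_UNIT)                 # 2kB -> 1 disk

random.seed(0)
reads = [random.randrange(2048, 32768 + 1, 2048) for _ in range(10000)]
print(sum(map(disks_touched_raidz, reads)) / len(reads))  # ~4.0
print(sum(map(disks_touched_raid5, reads)) / len(reads))  # 1.0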

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Nicolas Williams
On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
 I'm a bit surprised by these results. Assuming relatively large blocks
 written, RAID-Z and RAID-5 should be laid out on disk very similarly
 resulting in similar read performance.
 
 Did you compare the I/O characteristic of both? Was the bottleneck in
 the software or the hardware?

Note that Pawel wrote:

Pawel I was using 8 processes, I/O size was a random value between 2kB
Pawel and 32kB (with 2kB step), offset was a random value between 0 and
Pawel 10GB (also with 2kB step).

If the dataset's record size was the default (Pawel didn't say, right?)
then the reason for the lousy read performance is clear: RAID-Z has to
read full blocks to verify the checksum, whereas RAID-5 need only read
as much as is requested (assuming aligned reads, which Pawel did seem to
indicate: 2KB steps).

Peter Tribble pointed out much the same thing already.

The crucial requirement is to match the dataset record size to the I/O
size done by the application.  If the app writes in bigger chunks than
it reads and you want to optimize for write performance then set the
record size to match the write size, else set the record size to match
the read size.

Where the dataset record size is not matched to the application's I/O
size, I guess we could say that RAID-Z trades off the RAID-5 write hole
for a read hole.
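
A quick model of that read amplification, under the assumption above that
whole records must be read and checksummed on one side while only the
requested range is read on the other (caching and read-ahead ignored):

def bytes_read(io_size, recordsize=128 * 1024, full_records=True):
    """If whole records must be read to verify the checksum, a small read
    drags in ceil(io/recordsize) full records; otherwise just the I/O."""
    if full_records:
        return -(-io_size // recordsize) * recordsize
    return io_size

for io in (2048, 8192, 32768):
    amp = bytes_read(io) / bytes_read(io, full_records=False)
    print(f"{io:>6} B read with 128kB records -> {amp:.0f}x amplification")

With the record size matched to the 2-32kB I/O sizes the factor drops to
1x, which is the tuning point being made here.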

Nico
-- 


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Pawel Jakub Dawidek
On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
 On 9/10/07, Pawel Jakub Dawidek [EMAIL PROTECTED] wrote:
  Hi.
 
  I've a prototype RAID5 implementation for ZFS. It only works in
  non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
  performance, as I suspected that RAIDZ, because of full-stripe
  operations, doesn't work well for random reads issued by many processes
  in parallel.
 
  There is of course write-hole problem, which can be mitigated by running
  scrub after a power failure or system crash.
 
 If I read your suggestion correctly, your implementation is much
 more like traditional raid-5, with a read-modify-write cycle?
 
 My understanding of the raid-z performance issue is that it requires
 full-stripe reads in order to validate the checksum. [...]

No, the checksum is an independent thing, and it is not the reason why RAIDZ
needs to do full-stripe reads - in non-degraded mode RAIDZ doesn't read
parity.

This is how RAIDZ fills the disks (follow the numbers):

Disk0   Disk1   Disk2   Disk3

D0  D1  D2  P3
D4  D5  D6  P7
D8  D9  D10 P11
D12 D13 D14 P15
D16 D17 D18 P19
D20 D21 D22 P23

D is data, P is parity.

And RAID5 does this:

Disk0   Disk1   Disk2   Disk3

D0  D3  D6  P0,3,6
D1  D4  D7  P1,4,7
D2  D5  D8  P2,5,8
D9  D12 D15 P9,12,15
D10 D13 D16 P10,13,16
D11 D14 D17 P11,14,17

As you can see, even a small block is stored on all disks in RAIDZ, whereas
in RAID5 a small block can be stored on one disk only.
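
Both layouts can be generated mechanically; a small Python sketch of the
placement rules (simplified - real RAID-Z uses variable stripe width - and
the rotate flag gives the distributed-parity variant shown elsewhere in the
thread):

def raidz_layout(n_disks, n_rows):
    """Each block is spread over all data columns and gets its own parity."""
    rows, i = [], 0
    for _ in range(n_rows):
        rows.append([f"D{i + c}" for c in range(n_disks - 1)]
                    + [f"P{i + n_disks - 1}"])
        i += n_disks
    return rows

def raid5_layout(n_disks, n_rows, rotate=False):
    """Large stripe unit: consecutive blocks fill one column, and one parity
    block protects a whole row of otherwise unrelated blocks."""
    dc = n_disks - 1                                  # data columns per row
    rows = []
    for r in range(n_rows):
        base = (r // dc) * dc * dc + r % dc
        data = [f"D{base + c * dc}" for c in range(dc)]
        pcol = (dc - r // dc) % n_disks if rotate else dc
        par = "P" + ",".join(d[1:] for d in data)
        rows.append(data[:pcol] + [par] + data[pcol:])
    return rows

for row in raidz_layout(4, 6) + [[]] + raid5_layout(4, 6):
    print("\t".join(row))            # reproduces the two tables above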

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Nicolas Williams
On Thu, Sep 13, 2007 at 12:56:44AM +0200, Pawel Jakub Dawidek wrote:
 On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
  My understanding of the raid-z performance issue is that it requires
  full-stripe reads in order to validate the checksum. [...]
 
 No, the checksum is an independent thing, and it is not the reason why RAIDZ
 needs to do full-stripe reads - in non-degraded mode RAIDZ doesn't read
 parity.

I doubt reading the parity could cost all that much (particularly if
there's enough I/O capacity).  What hurts is having to read the full
128KB, if a file's record size is 128KB, in order to satisfy a 2KB
read.

And ZFS has to read full blocks in order to verify the checksum.

Nico
-- 


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Al Hopper
On Thu, 13 Sep 2007, Pawel Jakub Dawidek wrote:

 On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
 On 9/10/07, Pawel Jakub Dawidek [EMAIL PROTECTED] wrote:
 Hi.

 I've a prototype RAID5 implementation for ZFS. It only works in
 non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
 performance, as I suspected that RAIDZ, because of full-stripe
 operations, doesn't work well for random reads issued by many processes
 in parallel.

 There is of course write-hole problem, which can be mitigated by running
 scrub after a power failure or system crash.

 If I read your suggestion correctly, your implementation is much
 more like traditional raid-5, with a read-modify-write cycle?

 My understanding of the raid-z performance issue is that it requires
 full-stripe reads in order to validate the checksum. [...]

 No, the checksum is an independent thing, and it is not the reason why RAIDZ
 needs to do full-stripe reads - in non-degraded mode RAIDZ doesn't read
 parity.

 This is how RAIDZ fills the disks (follow the numbers):

   Disk0   Disk1   Disk2   Disk3

   D0  D1  D2  P3
   D4  D5  D6  P7
   D8  D9  D10 P11
   D12 D13 D14 P15
   D16 D17 D18 P19
   D20 D21 D22 P23

 D is data, P is parity.

 And RAID5 does this:

   Disk0   Disk1   Disk2   Disk3

   D0  D3  D6  P0,3,6
   D1  D4  D7  P1,4,7
   D2  D5  D8  P2,5,8
   D9  D12 D15 P9,12,15
   D10 D13 D16 P10,13,16
   D11 D14 D17 P11,14,17

Surely the above is not accurate?  You're showing the parity data only
being written to Disk3.  In RAID5 the parity is distributed across all
disks in the RAID5 set.  What is illustrated above is RAID3.

 As you can see, even a small block is stored on all disks in RAIDZ, whereas
 in RAID5 a small block can be stored on one disk only.

 --

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Pawel Jakub Dawidek
On Wed, Sep 12, 2007 at 07:39:56PM -0500, Al Hopper wrote:
 This is how RAIDZ fills the disks (follow the numbers):
 
  Disk0   Disk1   Disk2   Disk3
 
  D0  D1  D2  P3
  D4  D5  D6  P7
  D8  D9  D10 P11
  D12 D13 D14 P15
  D16 D17 D18 P19
  D20 D21 D22 P23
 
 D is data, P is parity.
 
 And RAID5 does this:
 
  Disk0   Disk1   Disk2   Disk3
 
  D0  D3  D6  P0,3,6
  D1  D4  D7  P1,4,7
  D2  D5  D8  P2,5,8
  D9  D12 D15 P9,12,15
  D10 D13 D16 P10,13,16
  D11 D14 D17 P11,14,17
 
 Surely the above is not accurate?  You're showing the parity data only
 being written to Disk3.  In RAID5 the parity is distributed across all
 disks in the RAID5 set.  What is illustrated above is RAID3.

It's actually RAID4 (RAID3 would look the same as RAIDZ, but there are
differences in practice), but my point wasn't how the parity is
distributed :) OK, RAID5 once again:

Disk0   Disk1   Disk2   Disk3

D0  D3  D6  P0,3,6
D1  D4  D7  P1,4,7
D2  D5  D8  P2,5,8
D9  D12 P9,12,15D15
D10 D13 P10,13,16   D16
D11 D14 P11,14,17   D17

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Marc Bevand
Pawel Jakub Dawidek pjd at FreeBSD.org writes:
 
 This is how RAIDZ fills the disks (follow the numbers):
 
   Disk0   Disk1   Disk2   Disk3
 
   D0  D1  D2  P3
   D4  D5  D6  P7
   D8  D9  D10 P11
   D12 D13 D14 P15
   D16 D17 D18 P19
   D20 D21 D22 P23
 
 D is data, P is parity.

This layout assumes of course that large stripes have been written to
the RAIDZ vdev. As you know, the stripe width is dynamic, so it is
possible for a single logical block to span only 2 disks (for those who
don't know what I am talking about, see the red block occupying LBAs
D3 and E3 on page 13 of these ZFS slides [1]).

To read this logical block (and validate its checksum), only D_0 needs 
to be read (LBA E3). So in this very specific case, a RAIDZ read
operation is as cheap as a RAID5 read operation. The existence of these
small stripes could explain why RAIDZ doesn't fall as far behind RAID5
as expected in Pawel's benchmark...

[1] http://br.sun.com/sunnews/events/2007/techdaysbrazil/pdf/eric_zfs.pdf

-marc




Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-11 Thread Robert Milkowski
Hello Pawel,

Monday, September 10, 2007, 6:18:37 PM, you wrote:

PJD On Mon, Sep 10, 2007 at 04:31:32PM +0100, Robert Milkowski wrote:
 Hello Pawel,
 
 Excellent job!
 
 Now I guess it would be a good idea to get writes done properly,
 even if it means making them slow (like with SVM). The end result
 would be: if you want fast writes/slow reads, go with
 raid-z; if you need fast reads/slow writes, go with raid-5.

PJD Writes in non-degraded mode already work. Only degraded mode
PJD doesn't work. My implementation is based on RAIDZ, so I'm planning to
PJD support RAID6 as well.

 btw: I'm just thinking out loud - for raid-5 writes, couldn't you
 somehow utilize the ZIL to make writes safe? I'm asking because we've
 got the ability to put the ZIL somewhere else, like an NVRAM card...

PJD The problem with RAID5 is that different blocks share the same parity,
PJD which is not the case for RAIDZ. When you write a block in RAIDZ, you
PJD write the data and the parity, and then you switch the pointer in the
PJD uberblock. For RAID5, you write the data and you need to update the parity,
PJD which also protects some other data. Now if you write the data, but
PJD don't update the parity before a crash, you have a hole. If you update
PJD the parity before the data and then crash, the parity is inconsistent with
PJD a different block in the same stripe.

Are you overwriting old data? I hope you're not...
I don't think you should suffer from the above problem in ZFS, due to COW.
If you are not overwriting and you're just writing to new locations,
then from the pool perspective those changes (both the new data block and
the checksum block) won't be active until they are both flushed and the
uberblock is updated... right?


-- 
Best regards,
 Robert Milkowski  mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-11 Thread Pawel Jakub Dawidek
On Tue, Sep 11, 2007 at 08:16:02AM +0100, Robert Milkowski wrote:
 Are you overwriting old data? I hope you're not...

I am, I overwrite parity, this is the whole point. That's why ZFS
designers used RAIDZ instead of RAID5, I think.

 I don't think you should suffer from the above problem in ZFS, due to COW.

I do, because independent blocks share the same parity block.

 If you are not overwriting and you're just writing to new locations,
 then from the pool perspective those changes (both the new data block and
 the checksum block) won't be active until they are both flushed and the
 uberblock is updated... right?

Assume a 128kB stripe size in RAID5. You have three disks: A, B and C.
ZFS writes 128kB at offset 0. This makes RAID5 write the data to disk A
and the parity to disk C (both at offset 0). Then, ZFS writes 128kB at
offset 128kB. RAID5 writes the data to disk B (at offset 0) and updates
the parity on disk C (also at offset 0).

As you can see, two independent ZFS blocks share one parity block.
COW won't help you here; you would need to be sure that each ZFS
transaction goes to a different (and free) RAID5 row.

This is, I believe, the main reason why poor RAID5 wasn't used in the
first place.
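
The same example in a few lines of Python - a toy address mapping for the
3-disk, 128kB-stripe case, with parity fixed on disk C (rotation would not
change the conclusion):

STRIPE_UNIT = 128 * 1024
DISKS = ["A", "B", "C"]          # two data columns, parity on C in this row
N_DATA = len(DISKS) - 1

def raid5_address(offset):
    """Map a 128kB logical write to (data disk, disk offset) plus the parity
    block it must update."""
    row = offset // (N_DATA * STRIPE_UNIT)
    col = (offset // STRIPE_UNIT) % N_DATA
    return (DISKS[col], row * STRIPE_UNIT), ("C", row * STRIPE_UNIT)

first = raid5_address(0)             # ZFS block written in one transaction
second = raid5_address(128 * 1024)   # an unrelated block, written later
print(first)                         # (('A', 0), ('C', 0))
print(second)                        # (('B', 0), ('C', 0))
print(first[1] == second[1])         # True: two blocks, one parity block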

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-11 Thread Jeff Bonwick
 As you can see, two independent ZFS blocks share one parity block.
 COW won't help you here; you would need to be sure that each ZFS
 transaction goes to a different (and free) RAID5 row.

 This is, I believe, the main reason why poor RAID5 wasn't used in the
 first place.

Exactly right.  RAID-Z has different performance trade-offs than RAID-5,
but the deciding factor was correctness.

I'm really glad you're doing these experiments!  It's good to know what
the trade-offs are, performance-wise, between RAID-Z and classic RAID-5.
At a minimum, it tells us what's on the table, and what we're paying for
transactional semantics.  To be honest, I'm pleased that it's only 2x.
It wouldn't have surprised me if it were Nx for an N+1 configuration.
A factor of 2 is something we can work with.

Jeff


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-11 Thread MC
 My question is: Is there any interest in finishing RAID5/RAID6 for ZFS?
 If there is no chance it will be integrated into ZFS at some point, I
 won't bother finishing it.

Your work is as pure an example as any of what OpenSolaris should be about.  I
think there should be no problem having a new feature like that integrated...
as long as it meets the level of quality that the community wants.
 
 


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-10 Thread Robert Milkowski
Hello Pawel,

Excellent job!

Now I guess it would be a good idea to get writes done properly,
even if it means making them slow (like with SVM). The end result
would be: if you want fast writes/slow reads, go with
raid-z; if you need fast reads/slow writes, go with raid-5.

btw: I'm just thinking out loud - for raid-5 writes, couldn't you
somehow utilize the ZIL to make writes safe? I'm asking because we've
got the ability to put the ZIL somewhere else, like an NVRAM card...


-- 
Best regards,
 Robert Milkowski  mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-10 Thread Darren Dunham
 Now I guess it would be a good idea to get writes done properly,
 even if it means making them slow (like with SVM). The end result
 would be: if you want fast writes/slow reads, go with
 raid-z; if you need fast reads/slow writes, go with raid-5.

 btw: I'm just thinking out loud - for raid-5 writes, couldn't you
 somehow utilize the ZIL to make writes safe? I'm asking because we've
 got the ability to put the ZIL somewhere else, like an NVRAM card...

But the safety of raidz (and the overall on-disk consistency of the
pool) does not currently depend on the ZIL.

It instead depends on the fact that blocks are never modified in-place,
but written first, then activated atomically.  So I guess this depends
on how the R5 is implemented in ZFS.  As long as all writes cause a new
block to be written (which has a full R5 stripe?), then the activation
will be atomic and there is no write hole.  The only problem comes if
existing blocks are modified (and that would cause problems with
snapshots anyway, right?)
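
In other words, something along these lines (a toy model of copy-on-write
activation, not actual ZFS code):

storage = {}                          # block address -> payload
uberblock = {"root": None}            # the single atomically updated pointer

def flush():
    pass                              # stands in for a write-cache flush

def cow_write(new_blocks, new_root):
    for addr, data in new_blocks.items():
        assert addr not in storage    # never modify a live block in place
        storage[addr] = data
    flush()                           # data (and its parity) durable first
    uberblock["root"] = new_root      # the atomic activation step
    flush()

cow_write({100: b"data", 101: b"parity"}, new_root=100)
# A crash before the uberblock update leaves 100/101 as unreferenced garbage;
# the previous root still describes a consistent pool -- no write hole, as
# long as nothing live (including shared parity) was overwritten in place.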

-- 
Darren Dunham   [EMAIL PROTECTED]
Senior Technical Consultant TAOShttp://www.taos.com/
Got some Dr Pepper?   San Francisco, CA bay area
  This line left intentionally blank to confuse you. 


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-10 Thread Pawel Jakub Dawidek
On Mon, Sep 10, 2007 at 04:31:32PM +0100, Robert Milkowski wrote:
 Hello Pawel,
 
 Excellent job!
 
 Now I guess it would be a good idea to get writes done properly,
 even if it means making them slow (like with SVM). The end result
 would be: if you want fast writes/slow reads, go with
 raid-z; if you need fast reads/slow writes, go with raid-5.

Writes in non-degraded mode already work. Only degraded mode
doesn't work. My implementation is based on RAIDZ, so I'm planning to
support RAID6 as well.

 btw: I'm just thinking out loud - for raid-5 writes, couldn't you
 somehow utilize the ZIL to make writes safe? I'm asking because we've
 got the ability to put the ZIL somewhere else, like an NVRAM card...

The problem with RAID5 is that different blocks share the same parity,
which is not the case for RAIDZ. When you write a block in RAIDZ, you
write the data and the parity, and then you switch the pointer in the
uberblock. For RAID5, you write the data and you need to update the parity,
which also protects some other data. Now if you write the data, but
don't update the parity before a crash, you have a hole. If you update
the parity before the data and then crash, the parity is inconsistent with
a different block in the same stripe.

My idea was to have one sector every 1GB on each disk for a journal to
keep a list of blocks being updated. For example, say you want to write 2kB
of data at offset 1MB. You first store offset+size in this journal, then
write the data and update the parity, and then remove offset+size from the
journal.  Unfortunately, we would need to flush the write cache twice: after
the offset+size addition and before the offset+size removal.
We could optimize it by doing lazy removal, e.g. wait for ZFS to flush the
write cache as part of a transaction and then remove old offset+size
pairs.
But I still expect this to give too much overhead.
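
Roughly, the protocol would look like this (an illustrative sketch only; the
journal is just a list here and the flushes are placeholders):

journal = []                          # stands in for the reserved sectors

def flush_cache():
    pass                              # placeholder for a real cache flush

def raid5_write(offset, size, do_write):
    journal.append((offset, size))    # 1. log the region being updated
    flush_cache()                     #    ...and make the entry durable
    do_write(offset, size)            # 2. write the data, update the parity
    flush_cache()                     # 3. flush before the entry may go away
    journal.remove((offset, size))    #    (or drop it lazily, batched with
                                      #     the next transaction flush)

def recover():
    """After a crash, only regions still in the journal can have stale
    parity; recompute parity just for those."""
    return list(journal)

raid5_write(1 << 20, 2048, lambda off, sz: None)   # 2kB at offset 1MB
print(recover())                                   # [] -> nothing to repair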

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

