Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On 9/20/07, Roch - PAE [EMAIL PROTECTED] wrote:
> Next application modifies D0 -> D0' and also writes other data D3, D4.
> Now you have
>
>   Disk0   Disk1   Disk2   Disk3
>   D0      D1      D2      P0,1,2
>   D0'     D3      D4      P0',3,4
>
> But if D1 and D2 stay immutable for a long time, then we can run out of
> pool blocks with D0 held down in a half-freed state.

So as we near full pool capacity, a scrubber would have to walk the stripes and look for partially freed ones. Then it would need to do a scrubbing read/write on D1, D2 so that they become part of a new stripe with some other data, freeing the full initial stripe. Or, given a list of partial stripes (and sufficient cache), the next write of D5 could be combined with D1, D2:

  Disk0   Disk1   Disk2   Disk3
  D0      D1      D2      P0,1,2
  D0'     D3      D4      P0',3,4
  D5      free    free    P5,1,2

therefore freeing D0 and P0,1,2:

  Disk0   Disk1   Disk2   Disk3
  free    D1      D2      free
  D0'     D3      D4      P0',3,4
  D5      free    free    P5,1,2

(I assumed no need for alignment.) Performance-wise, I'm guessing it might be beneficial to quickly write mirrored blocks to disk and later combine them, freeing the then-unneeded mirrors.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
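As a toy illustration of the combining step described above (invented names, and simple XOR over small integer "block values" standing in for real sector contents):

```python
def combine_partial_stripe(stripe, new_block):
    """stripe: list of data block values, None = freed slot.
    Return a new stripe reusing the live blocks plus the new write,
    along with the new stripe's XOR parity."""
    live = [b for b in stripe if b is not None]
    new_stripe = [new_block] + live
    # The new parity covers the new block and the surviving old ones,
    # so the old stripe (old parity included) can be freed afterwards.
    parity = 0
    for b in new_stripe:
        parity ^= b
    return new_stripe, parity

# Old stripe was (D0, D1, D2) with parity P0,1,2; D0 was rewritten
# elsewhere, leaving D1 and D2 live. Combine the next write D5 with them:
old_stripe = [None, 0x11, 0x22]                           # free, D1, D2
new_stripe, p = combine_partial_stripe(old_stripe, 0x55)  # D5
assert new_stripe == [0x55, 0x11, 0x22]
assert p == 0x55 ^ 0x11 ^ 0x22
```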
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On 9/10/07, Pawel Jakub Dawidek [EMAIL PROTECTED] wrote:
> The problem with RAID5 is that different blocks share the same parity,
> which is not the case for RAIDZ. When you write a block in RAIDZ, you
> write the data and the parity, and then you switch the pointer in the
> uberblock. For RAID5, you write the data and you need to update the
> parity, which also protects some other data. Now if you write the data
> but don't update the parity before a crash, you have a hole. If you
> update the parity before the write and then crash, you have an
> inconsistency with a different block in the same stripe.

This is why you should consider the old data and parity as being live. The old data (being overwritten) is live because it is needed for the parity to be consistent, and the old parity is live because it protects the other blocks.

What IMO should be done is object-level RAID: write new parity and new data into blocks not yet used, and as the new parity also protects the neighbouring data, the old parity can be freed; once it is no longer live, the overwritten data block can also be freed. Note that this is very different from traditional RAID5, as it requires intimate knowledge of the FS structure. Traditional RAIDs also keep parity in line with the data blocks it protects, but that is not necessary if the FS can store information about where the parity is located. Define live data well enough and you're safe if you never overwrite any of it.

> My idea was to have one sector every 1GB on each disk for a journal to
> keep a list of blocks being updated.

This would be called a write intent log or bitmap (as in Linux software RAID). It speeds up recovery, but doesn't protect against write hole problems.
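The read-modify-write that makes the old data and old parity "live" can be sketched with plain XOR parity (a simplified model, not actual RAID or ZFS code):

```python
def raid5_small_write(stripe, parity, idx, new_data):
    """Small write to one column of a RAID5 row: the update must read
    the *old* data and combine it with the *old* parity -- both are live."""
    old_data = stripe[idx]                     # read old data...
    new_parity = parity ^ old_data ^ new_data  # ...and fold it out of parity
    stripe[idx] = new_data
    return new_parity

stripe = [0xAA, 0xBB, 0xCC]
parity = stripe[0] ^ stripe[1] ^ stripe[2]
parity = raid5_small_write(stripe, parity, 0, 0x11)
# The parity still protects the untouched neighbours:
assert parity == stripe[0] ^ stripe[1] ^ stripe[2]
```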
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Thu, Sep 13, 2007 at 04:58:10AM +, Marc Bevand wrote:
> Pawel Jakub Dawidek <pjd at FreeBSD.org> writes:
> > This is how RAIDZ fills the disks (follow the numbers):
> >
> >   Disk0  Disk1  Disk2  Disk3
> >   D0     D1     D2     P3
> >   D4     D5     D6     P7
> >   D8     D9     D10    P11
> >   D12    D13    D14    P15
> >   D16    D17    D18    P19
> >   D20    D21    D22    P23
> >
> > D is data, P is parity.
>
> This layout assumes of course that large stripes have been written to
> the RAIDZ vdev. As you know, the stripe width is dynamic, so it is
> possible for a single logical block to span only 2 disks (for those who
> don't know what I am talking about, see the red block occupying LBAs D3
> and E3 on page 13 of these ZFS slides [1]).

Yes, I'm aware of that.

> To read this logical block (and validate its checksum), only D_0 needs
> to be read (LBA E3). So in this very specific case, a RAIDZ read
> operation is as cheap as a RAID5 read operation. [...]

If you do single-sector writes - yes, but this is very inefficient, for two reasons:

1. Bandwidth - writing one sector at a time? Come on.
2. Space - when you write one sector and its parity, you consume two sectors. You may have more than one parity column in that case, eg.:

  Disk0  Disk1  Disk2  Disk3  Disk4  Disk5
  D0     P0     D1     P1     D2     P2

In this case the space overhead is the same as in a mirror.

> The existence of these small stripes could explain why RAIDZ doesn't
> perform as badly as RAID5 in Pawel's benchmark...

No, as I said, the smallest block I used was 2kB, which means four 512B sectors plus one 512B of parity - each 2kB block uses all 5 disks.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
[EMAIL PROTECTED]                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
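Pawel's space argument for small writes can be put in numbers with a simplified model of the allocator (assuming one parity sector per row of data sectors, which is not the exact RAIDZ allocation logic):

```python
def raidz_sectors(data_sectors, ndisks, nparity=1):
    """Sectors consumed by one logical block: the data sectors plus
    nparity parity sectors per row of up to (ndisks - nparity) data."""
    data_per_row = ndisks - nparity
    rows = -(-data_sectors // data_per_row)  # ceiling division
    return data_sectors + rows * nparity

# A single-sector write costs 1 data + 1 parity sector: mirror-like 2x overhead.
assert raidz_sectors(1, ndisks=6) == 2
# Pawel's smallest benchmark block: 2kB = four 512B sectors on 5 disks,
# i.e. four data sectors plus one parity sector, touching every disk.
assert raidz_sectors(4, ndisks=5) == 5
```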
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On 9/12/07, Pawel Jakub Dawidek [EMAIL PROTECTED] wrote:
> On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
> > On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
> > I'm a bit surprised by these results. Assuming relatively large
> > blocks written, RAID-Z and RAID-5 should be laid out on disk very
> > similarly, resulting in similar read performance.
>
> Hmm, no. The data was organized very differently on disks. The smallest
> block size used was 2kB, to ensure each block is written to all disks
> in the RAIDZ configuration. In the RAID5 configuration, however, a
> 128kB stripe size was used, which means each block was stored on one
> disk only. Now when you read the data, RAIDZ needs to read all disks
> for each block, while RAID5 needs to read only one disk for each block.
>
> > Did you compare the I/O characteristic of both? Was the bottleneck in
> > the software or the hardware?
>
> The bottleneck was definitely the disks. The CPU was like 96% idle. To
> be honest I expected, just like Jeff, a much bigger win for the RAID5
> case.

Well, it depends. In both configurations the available read bandwidth is the same. Presumably you're expecting each disk to seek independently and concurrently. Is the SPA aware that multiple, offset-dependent reads can be issued concurrently to the RAID-5 vdev?

James
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
> And here are the results:
>
> RAIDZ:
>   Number of READ requests: 4.
>   Number of WRITE requests: 0.
>   Number of bytes to transmit: 695678976.
>   Number of processes: 8.
>   Bytes per second: 1305213
>   Requests per second: 75
>
> RAID5:
>   Number of READ requests: 4.
>   Number of WRITE requests: 0.
>   Number of bytes to transmit: 695678976.
>   Number of processes: 8.
>   Bytes per second: 2749719
>   Requests per second: 158

I'm a bit surprised by these results. Assuming relatively large blocks written, RAID-Z and RAID-5 should be laid out on disk very similarly, resulting in similar read performance. Did you compare the I/O characteristic of both? Was the bottleneck in the software or the hardware?

Very interesting experiment...

Adam

--
Adam Leventhal, FishWorks                    http://blogs.sun.com/ahl
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
> On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
> > And here are the results:
> >
> > RAIDZ:
> >   Number of READ requests: 4.
> >   Number of WRITE requests: 0.
> >   Number of bytes to transmit: 695678976.
> >   Number of processes: 8.
> >   Bytes per second: 1305213
> >   Requests per second: 75
> >
> > RAID5:
> >   Number of READ requests: 4.
> >   Number of WRITE requests: 0.
> >   Number of bytes to transmit: 695678976.
> >   Number of processes: 8.
> >   Bytes per second: 2749719
> >   Requests per second: 158
>
> I'm a bit surprised by these results. Assuming relatively large blocks
> written, RAID-Z and RAID-5 should be laid out on disk very similarly,
> resulting in similar read performance.

Hmm, no. The data was organized very differently on disks. The smallest block size used was 2kB, to ensure each block is written to all disks in the RAIDZ configuration. In the RAID5 configuration, however, a 128kB stripe size was used, which means each block was stored on one disk only. Now when you read the data, RAIDZ needs to read all disks for each block, while RAID5 needs to read only one disk for each block.

> Did you compare the I/O characteristic of both? Was the bottleneck in
> the software or the hardware?

The bottleneck was definitely the disks. The CPU was like 96% idle. To be honest I expected, just like Jeff, a much bigger win for the RAID5 case.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
[EMAIL PROTECTED]                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
> I'm a bit surprised by these results. Assuming relatively large blocks
> written, RAID-Z and RAID-5 should be laid out on disk very similarly,
> resulting in similar read performance. Did you compare the I/O
> characteristic of both? Was the bottleneck in the software or the
> hardware?

Note that Pawel wrote:

Pawel> I was using 8 processes, I/O size was a random value between 2kB
Pawel> and 32kB (with 2kB step), offset was a random value between 0 and
Pawel> 10GB (also with 2kB step).

If the dataset's record size was the default (Pawel didn't say, right?) then the reason for the lousy read performance is clear: RAID-Z has to read full blocks to verify the checksum, whereas RAID-5 need only read as much as is requested (assuming aligned reads, which Pawel did seem to indicate: 2kB steps). Peter Tribble pointed out much the same thing already.

The crucial requirement is to match the dataset record size to the I/O size done by the application. If the app writes in bigger chunks than it reads and you want to optimize for write performance, then set the record size to match the write size; otherwise set the record size to match the read size.

Where the dataset record size is not matched to the application's I/O size, I guess we could say that RAID-Z trades off the RAID-5 write hole for a read hole.

Nico
--
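Nico's record-size point can be put in numbers with a back-of-the-envelope model (ignoring caching and prefetch):

```python
def read_amplification(recordsize, io_size):
    """Bytes actually read per byte requested, when checksum verification
    forces reading (and verifying) the whole record for each request."""
    return recordsize / io_size

# Default 128kB recordsize, 2kB application reads: 64x amplification.
assert read_amplification(128 * 1024, 2 * 1024) == 64.0
# Matching recordsize to the read size removes the amplification entirely.
assert read_amplification(2 * 1024, 2 * 1024) == 1.0
```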
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
> On 9/10/07, Pawel Jakub Dawidek [EMAIL PROTECTED] wrote:
> > Hi. I've a prototype RAID5 implementation for ZFS. It only works in
> > non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
> > performance, as I suspected that RAIDZ, because of full-stripe
> > operations, doesn't work well for random reads issued by many
> > processes in parallel. There is of course the write-hole problem,
> > which can be mitigated by running a scrub after a power failure or
> > system crash.
>
> If I read your suggestion correctly, your implementation is much more
> like traditional RAID5, with a read-modify-write cycle? My
> understanding of the RAID-Z performance issue is that it requires
> full-stripe reads in order to validate the checksum. [...]

No, the checksum is an independent thing, and this is not the reason why RAIDZ needs to do full-stripe reads - in non-degraded mode RAIDZ doesn't read parity.

This is how RAIDZ fills the disks (follow the numbers):

  Disk0  Disk1  Disk2  Disk3
  D0     D1     D2     P3
  D4     D5     D6     P7
  D8     D9     D10    P11
  D12    D13    D14    P15
  D16    D17    D18    P19
  D20    D21    D22    P23

D is data, P is parity. And RAID5 does this:

  Disk0  Disk1  Disk2  Disk3
  D0     D3     D6     P0,3,6
  D1     D4     D7     P1,4,7
  D2     D5     D8     P2,5,8
  D9     D12    D15    P9,12,15
  D10    D13    D16    P10,13,16
  D11    D14    D17    P11,14,17

As you can see, even a small block is stored on all disks in RAIDZ, whereas on RAID5 a small block can be stored on one disk only.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
[EMAIL PROTECTED]                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
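The two tables above can be generated mechanically. This toy sketch uses an assumed simplified model (fixed-width stripes, parity pinned to the last disk exactly as drawn, 3-row bands for the RAID5 column layout):

```python
def raidz_style(nblocks, ndisks):
    """Fill row by row; the last slot of each row is parity, as drawn."""
    rows, n = [], 0
    while n < nblocks:
        row = [f"D{n + i}" for i in range(ndisks - 1)]
        row.append(f"P{n + ndisks - 1}")
        rows.append(row)
        n += ndisks
    return rows

def raid5_style(nbands, ndisks, band=3):
    """Data units run column-major within each band of `band` rows;
    the parity on the last disk covers one unit from each data column."""
    ncols = ndisks - 1
    out, base = [], 0
    for _ in range(nbands):
        for r in range(band):
            nums = [base + c * band + r for c in range(ncols)]
            out.append([f"D{x}" for x in nums] +
                       ["P" + ",".join(map(str, nums))])
        base += ncols * band
    return out

# Reproduce the rows of both tables:
assert raidz_style(8, 4)[1] == ["D4", "D5", "D6", "P7"]
assert raid5_style(2, 4)[0] == ["D0", "D3", "D6", "P0,3,6"]
assert raid5_style(2, 4)[3] == ["D9", "D12", "D15", "P9,12,15"]
```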
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Thu, Sep 13, 2007 at 12:56:44AM +0200, Pawel Jakub Dawidek wrote:
> On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
> > My understanding of the RAID-Z performance issue is that it requires
> > full-stripe reads in order to validate the checksum. [...]
>
> No, the checksum is an independent thing, and this is not the reason
> why RAIDZ needs to do full-stripe reads - in non-degraded mode RAIDZ
> doesn't read parity.

I doubt reading the parity could cost all that much (particularly if there's enough I/O capacity). It's reading the full 128kB that you have to read, if a file's record size is 128kB, in order to satisfy a 2kB read. And ZFS has to read full blocks in order to verify the checksum.

Nico
--
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Thu, 13 Sep 2007, Pawel Jakub Dawidek wrote:
> On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
> > On 9/10/07, Pawel Jakub Dawidek [EMAIL PROTECTED] wrote:
> > > Hi. I've a prototype RAID5 implementation for ZFS. It only works
> > > in non-degraded state for now. The idea is to compare RAIDZ vs.
> > > RAID5 performance, as I suspected that RAIDZ, because of
> > > full-stripe operations, doesn't work well for random reads issued
> > > by many processes in parallel. There is of course the write-hole
> > > problem, which can be mitigated by running a scrub after a power
> > > failure or system crash.
> >
> > If I read your suggestion correctly, your implementation is much
> > more like traditional RAID5, with a read-modify-write cycle? My
> > understanding of the RAID-Z performance issue is that it requires
> > full-stripe reads in order to validate the checksum. [...]
>
> No, the checksum is an independent thing, and this is not the reason
> why RAIDZ needs to do full-stripe reads - in non-degraded mode RAIDZ
> doesn't read parity.
>
> This is how RAIDZ fills the disks (follow the numbers):
>
>   Disk0  Disk1  Disk2  Disk3
>   D0     D1     D2     P3
>   D4     D5     D6     P7
>   D8     D9     D10    P11
>   D12    D13    D14    P15
>   D16    D17    D18    P19
>   D20    D21    D22    P23
>
> D is data, P is parity. And RAID5 does this:
>
>   Disk0  Disk1  Disk2  Disk3
>   D0     D3     D6     P0,3,6
>   D1     D4     D7     P1,4,7
>   D2     D5     D8     P2,5,8
>   D9     D12    D15    P9,12,15
>   D10    D13    D16    P10,13,16
>   D11    D14    D17    P11,14,17

Surely the above is not accurate? You're showing the parity data only being written to Disk3. In RAID5 the parity is distributed across all disks in the RAID5 set. What is illustrated above is RAID3.

> As you can see, even a small block is stored on all disks in RAIDZ,
> whereas on RAID5 a small block can be stored on one disk only.

--
Regards,
Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Wed, Sep 12, 2007 at 07:39:56PM -0500, Al Hopper wrote:
> > This is how RAIDZ fills the disks (follow the numbers):
> >
> >   Disk0  Disk1  Disk2  Disk3
> >   D0     D1     D2     P3
> >   D4     D5     D6     P7
> >   D8     D9     D10    P11
> >   D12    D13    D14    P15
> >   D16    D17    D18    P19
> >   D20    D21    D22    P23
> >
> > D is data, P is parity. And RAID5 does this:
> >
> >   Disk0  Disk1  Disk2  Disk3
> >   D0     D3     D6     P0,3,6
> >   D1     D4     D7     P1,4,7
> >   D2     D5     D8     P2,5,8
> >   D9     D12    D15    P9,12,15
> >   D10    D13    D16    P10,13,16
> >   D11    D14    D17    P11,14,17
>
> Surely the above is not accurate? You're showing the parity data only
> being written to Disk3. In RAID5 the parity is distributed across all
> disks in the RAID5 set. What is illustrated above is RAID3.

It's actually RAID4 (RAID3 would look the same as RAIDZ, but there are differences in practice), but my point wasn't how the parity is distributed :)

Ok, RAID5 once again:

  Disk0  Disk1  Disk2      Disk3
  D0     D3     D6         P0,3,6
  D1     D4     D7         P1,4,7
  D2     D5     D8         P2,5,8
  D9     D12    P9,12,15   D15
  D10    D13    P10,13,16  D16
  D11    D14    P11,14,17  D17

--
Pawel Jakub Dawidek                       http://www.wheel.pl
[EMAIL PROTECTED]                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
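The parity rotation in the corrected table can be described by a one-line function. This is just a sketch of the scheme as drawn above; real RAID5 implementations differ in rotation order (left/right, symmetric/asymmetric):

```python
def parity_disk(band_index, ndisks):
    """Disk holding parity for the given band of rows, rotating one
    disk to the left with each successive band, as in the table above."""
    return (ndisks - 1 - band_index) % ndisks

assert parity_disk(0, 4) == 3  # first band: parity on Disk3
assert parity_disk(1, 4) == 2  # second band: parity on Disk2
assert parity_disk(4, 4) == 3  # after ndisks bands the pattern repeats
```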
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
Pawel Jakub Dawidek <pjd at FreeBSD.org> writes:
> This is how RAIDZ fills the disks (follow the numbers):
>
>   Disk0  Disk1  Disk2  Disk3
>   D0     D1     D2     P3
>   D4     D5     D6     P7
>   D8     D9     D10    P11
>   D12    D13    D14    P15
>   D16    D17    D18    P19
>   D20    D21    D22    P23
>
> D is data, P is parity.

This layout assumes of course that large stripes have been written to the RAIDZ vdev. As you know, the stripe width is dynamic, so it is possible for a single logical block to span only 2 disks (for those who don't know what I am talking about, see the red block occupying LBAs D3 and E3 on page 13 of these ZFS slides [1]). To read this logical block (and validate its checksum), only D_0 needs to be read (LBA E3). So in this very specific case, a RAIDZ read operation is as cheap as a RAID5 read operation.

The existence of these small stripes could explain why RAIDZ doesn't perform as badly as RAID5 in Pawel's benchmark...

[1] http://br.sun.com/sunnews/events/2007/techdaysbrazil/pdf/eric_zfs.pdf

-marc
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
Hello Pawel,

Monday, September 10, 2007, 6:18:37 PM, you wrote:

PJD> On Mon, Sep 10, 2007 at 04:31:32PM +0100, Robert Milkowski wrote:
PJD> > Hello Pawel,
PJD> >
PJD> > Excellent job! Now I guess it would be a good idea to get writes
PJD> > done properly, even if it means making them slow (like with SVM).
PJD> > The end result would be - if you want fast writes/slow reads, go
PJD> > ahead with raid-z; if you need fast reads/slow writes, go with
PJD> > raid-5.
PJD>
PJD> Writes in non-degraded mode already work. Only degraded mode
PJD> doesn't work. My implementation is based on RAIDZ, so I'm planning
PJD> to support RAID6 as well.
PJD>
PJD> > btw: I'm just thinking out loud - for raid-5 writes, couldn't
PJD> > you somehow utilize the ZIL to make writes safe? I'm asking
PJD> > because we've got the ability to put the ZIL somewhere else,
PJD> > like an NVRAM card...
PJD>
PJD> The problem with RAID5 is that different blocks share the same
PJD> parity, which is not the case for RAIDZ. When you write a block in
PJD> RAIDZ, you write the data and the parity, and then you switch the
PJD> pointer in the uberblock. For RAID5, you write the data and you
PJD> need to update the parity, which also protects some other data.
PJD> Now if you write the data but don't update the parity before a
PJD> crash, you have a hole. If you update the parity before the write
PJD> and then crash, you have an inconsistency with a different block
PJD> in the same stripe.

Are you overwriting old data? I hope you're not... I don't think you should suffer from the above problem in ZFS due to COW. If you are not overwriting and you're just writing to new locations, then from the pool's perspective those changes (both the new data block and the checksum block) won't be active until they are both flushed and the uberblock is updated... right?

--
Best regards,
Robert Milkowski             mailto:[EMAIL PROTECTED]
                             http://milek.blogspot.com
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Tue, Sep 11, 2007 at 08:16:02AM +0100, Robert Milkowski wrote:
> Are you overwriting old data? I hope you're not...

I am - I overwrite parity; this is the whole point. That's why the ZFS designers used RAIDZ instead of RAID5, I think.

> I don't think you should suffer from the above problem in ZFS due to
> COW.

I do, because autonomous blocks share the same parity block.

> If you are not overwriting and you're just writing to new locations,
> then from the pool's perspective those changes (both the new data
> block and the checksum block) won't be active until they are both
> flushed and the uberblock is updated... right?

Assume a 128kB stripe size in RAID5. You have three disks: A, B and C. ZFS writes 128kB at offset 0. This makes RAID5 write the data to disk A and the parity to disk C (both at offset 0). Then ZFS writes 128kB at offset 128kB. RAID5 writes the data to disk B (at offset 0) and updates the parity on disk C (also at offset 0). As you can see, two independent ZFS blocks share one parity block. COW won't help you here; you would need to be sure that each ZFS transaction goes to a different (and free) RAID5 row. This is, I believe, the main reason why poor RAID5 wasn't used in the first place.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
[EMAIL PROTECTED]                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
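Pawel's three-disk example can be played out with toy XOR parity to show the damage (made-up byte values; the "crash" is simply that the parity write on disk C never happens):

```python
A, B = 0x00, 0x00
C = A ^ B                # parity for the (initially empty) row

A = 0x11                 # ZFS block 1 written at offset 0 -> disk A
C = A ^ B                # parity on C updated; all is consistent

B = 0x22                 # ZFS block 2 written at offset 128kB -> disk B,
                         # then *crash* before the shared parity on C
                         # is rewritten

# Disk A now fails. Rebuild it from disk B and the stale parity:
reconstructed_A = B ^ C
# Block 1 -- untouched by the second write -- comes back as garbage.
# COW at the ZFS level cannot prevent this, because the parity block
# on C is shared between the two blocks and overwritten in place.
assert reconstructed_A != 0x11
```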
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
> As you can see, two independent ZFS blocks share one parity block.
> COW won't help you here; you would need to be sure that each ZFS
> transaction goes to a different (and free) RAID5 row. This is, I
> believe, the main reason why poor RAID5 wasn't used in the first
> place.

Exactly right. RAID-Z has different performance trade-offs than RAID-5, but the deciding factor was correctness.

I'm really glad you're doing these experiments! It's good to know what the trade-offs are, performance-wise, between RAID-Z and classic RAID-5. At a minimum, it tells us what's on the table, and what we're paying for transactional semantics. To be honest, I'm pleased that it's only 2x. It wouldn't have surprised me if it were Nx for an N+1 configuration. A factor of 2 is something we can work with.

Jeff
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
> My question is: Is there any interest in finishing RAID5/RAID6 for
> ZFS? If there is no chance it will be integrated into ZFS at some
> point, I won't bother finishing it.

Your work is as pure an example as any of what OpenSolaris should be about. I think there should be no problem having a new feature like that integrated!... as long as it is the level of quality that the community wants.

This message posted from opensolaris.org
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
Hello Pawel,

Excellent job! Now I guess it would be a good idea to get writes done properly, even if it means making them slow (like with SVM). The end result would be - if you want fast writes/slow reads, go ahead with raid-z; if you need fast reads/slow writes, go with raid-5.

btw: I'm just thinking out loud - for raid-5 writes, couldn't you somehow utilize the ZIL to make writes safe? I'm asking because we've got the ability to put the ZIL somewhere else, like an NVRAM card...

--
Best regards,
Robert Milkowski             mailto:[EMAIL PROTECTED]
                             http://milek.blogspot.com
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
> Now I guess it would be a good idea to get writes done properly, even
> if it means making them slow (like with SVM). The end result would be
> - if you want fast writes/slow reads, go ahead with raid-z; if you
> need fast reads/slow writes, go with raid-5.
>
> btw: I'm just thinking out loud - for raid-5 writes, couldn't you
> somehow utilize the ZIL to make writes safe? I'm asking because we've
> got the ability to put the ZIL somewhere else, like an NVRAM card...

But the safety of raidz (and the overall on-disk consistency of the pool) does not currently depend on the ZIL. It instead depends on the fact that blocks are never modified in place, but written first, then activated atomically.

So I guess this depends on how the RAID5 is implemented in ZFS. As long as all writes cause a new block to be written (one which has a full RAID5 stripe?), the activation will be atomic and there is no write hole. The only problem comes if existing blocks were modified (and that would cause problems with snapshots anyway, right?)

--
Darren Dunham                             [EMAIL PROTECTED]
Senior Technical Consultant       TAOS    http://www.taos.com/
Got some Dr Pepper?                       San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Mon, Sep 10, 2007 at 04:31:32PM +0100, Robert Milkowski wrote:
> Hello Pawel,
>
> Excellent job! Now I guess it would be a good idea to get writes done
> properly, even if it means making them slow (like with SVM). The end
> result would be - if you want fast writes/slow reads, go ahead with
> raid-z; if you need fast reads/slow writes, go with raid-5.

Writes in non-degraded mode already work. Only degraded mode doesn't work. My implementation is based on RAIDZ, so I'm planning to support RAID6 as well.

> btw: I'm just thinking out loud - for raid-5 writes, couldn't you
> somehow utilize the ZIL to make writes safe? I'm asking because we've
> got the ability to put the ZIL somewhere else, like an NVRAM card...

The problem with RAID5 is that different blocks share the same parity, which is not the case for RAIDZ. When you write a block in RAIDZ, you write the data and the parity, and then you switch the pointer in the uberblock. For RAID5, you write the data and you need to update the parity, which also protects some other data. Now if you write the data but don't update the parity before a crash, you have a hole. If you update the parity before the write and then crash, you have an inconsistency with a different block in the same stripe.

My idea was to have one sector every 1GB on each disk for a journal to keep a list of blocks being updated. For example, you want to write 2kB of data at offset 1MB. You first store offset+size in this journal, then write the data and update the parity, and then remove offset+size from the journal. Unfortunately, we would need to flush the write cache twice: after the offset+size addition and before the offset+size removal. We could optimize it by doing lazy removal, eg. wait for ZFS to flush the write cache as part of a transaction and then remove the old offset+size pairs. But I still expect this to give too much overhead.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
[EMAIL PROTECTED]                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
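A minimal sketch of the journal Pawel proposes (class and method names invented for illustration), with the two cache-flush points he identifies marked in comments:

```python
class WriteIntentJournal:
    """Toy model of a per-disk write intent journal: record offset+size
    before a write, clear it after the write completes."""

    def __init__(self):
        self.pending = set()
        self.flushes = 0

    def flush(self):
        self.flushes += 1  # stands in for a disk write-cache flush

    def begin(self, offset, size):
        self.pending.add((offset, size))
        self.flush()       # flush #1: the intent must be on stable
                           # storage before the data/parity writes start

    def end(self, offset, size):
        self.flush()       # flush #2: data and parity must be on stable
        self.pending.discard((offset, size))  # storage before clearing

j = WriteIntentJournal()
j.begin(1024 * 1024, 2048)  # "write 2kB of data at offset 1MB"
# ... write the data and update the parity here ...
j.end(1024 * 1024, 2048)
assert not j.pending and j.flushes == 2
```

This also makes the overhead Pawel worries about visible: two extra cache flushes per small write, unless removal is batched lazily with the transaction flush.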