Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Marion Hakanson wrote: However, given the default behavior of ZFS (as of Solaris-10U3) is to panic/halt when it encounters a corrupted block that it can't repair, I'm re-thinking our options, weighing against the possibility of a significant downtime caused by a single-block corruption.

Guess what happens when UFS finds an inconsistency it can't fix, either? The issue is that ZFS has the chance to fix the inconsistency if the zpool is a mirror or raidz, not that it finds the inconsistency in the first place. Given a set of errors, ZFS will simply find more of them than other filesystems do.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
[EMAIL PROTECTED] said: That is the part of your setup that puzzled me. You took the same 7-disk raid5 set and split it into 9 LUNS. The Hitachi likely splits the virtual disk into 9 contiguous partitions, so each LUN maps back to different parts of the 7 disks. I speculate that ZFS thinks it is talking to 9 different disks and so spreads out the writes accordingly. What ZFS thinks are sequential writes become well-spaced writes across the entire disk, which blows your seek time through the roof.

That's what I thought might happen before I even tried this, although it's also possible the Hitachi stripes each LUN across all 7 disks. Either way, one could be getting too many seeks. Note that I'm just trying to see if it was so bad that the self-healing capability wasn't worth the cost. I do realize these are 7200rpm SATA disks, so seeking isn't what they do best.

I'm interested in how it looks from the Hitachi end. If you can, repeat the test with the Hitachi presenting all 7 disks directly to ZFS as LUNs?

The array doesn't give us that capability.

Interesting... what you are suggesting is that %b is 100% when w/s and r/s are 0?

Correct. Sometimes all iostat -xn columns are 0 except %b; sometimes the asvc_t column stays at 4.0 for the duration of the quiet period. I've also observed times where all columns were 0, including %b. Sure is puzzling.

[EMAIL PROTECTED] said: IIRC, the calculation for %busy is the amount of time that an I/O is on the device. These symptoms would occur if an I/O is dropped somewhere along the way or at the array. Eventually, we'll timeout and retry, though by default that should be after 60 seconds. I think we need to figure out what is going on here before accepting the results. It could be that we're overrunning the queue on the Hitachi. By default, ZFS will send 35 concurrent commands per vdev and the ssd driver will send up to 256 to a target. 
IIRC, Hitachi has a formula for calculating ssd_max_throttle to avoid such overruns, but I'm not sure if that applies to this specific array.

Hmm, it's true that I have made no tuning changes on the T2000 side. It would make sense if the array just stopped responding. I'll have to poke at the array and see if it has any diagnostics logged somewhere. I recall that the Hitachi docs do have some recommendations on max-throttle settings, so I'll go dig those up and see what I can find out.

Thanks for the comments,

Marion
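(For reference, the per-target queue-depth cap for the ssd driver lives in /etc/system on Solaris. The sketch below is illustrative only: the value shown is a placeholder, not a recommendation, and the correct number must come from the array vendor's formula.)

```shell
# /etc/system fragment -- illustrative only. Derive the actual value from
# the Hitachi queue-depth formula for your array model before applying,
# and note that /etc/system changes take effect only after a reboot.
#
# set ssd:ssd_max_throttle = 8
```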
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
I wrote: Just thinking out loud here. Now I'm off to see what kind of performance cost there is, comparing (with 400GB disks):
  Simple ZFS stripe on one 2198GB LUN from a 6+1 HW RAID5 volume
  8+1 RAID-Z on 9 244.2GB LUN's from a 6+1 HW RAID5 volume

[EMAIL PROTECTED] said: Interesting idea. Please post back to let us know how the performance looks.

The short story is, performance is not bad with the raidz arrangement until you get to doing reads, at which point it looks much worse than the 1-LUN setup. Please bear in mind that I'm not a storage or benchmarking expert, though I'd say I'm not a neophyte either.

Some specifics: The array is a low-end Hitachi, 9520V. My two test subjects are a pair of RAID-5 groups in the same shelf, each consisting of 6D+1P 400GB SATA drives. The test host is a Sun T2000, 16GB RAM, connected via 2Gb FC links through a pair of switches (the array/mpxio combination does not support load-balancing, so only one 2Gb channel is in use at a time). It is running Solaris-10U3, patches current as of 12-Jan-2007. The array was mostly idle except for my tests, although some light I/O to other shelves may have come from another host on occasion. The test host wasn't doing anything else during these tests.

One RAID-5 group was configured as a single 2048GB LUN (with about 150GB left unallocated; the array has a max LUN size). The second RAID-5 group was set up as nine 244.3GB LUN's. 
Here are the zpool configurations I used for these tests:

# zpool status -v
  pool: bulk_sp1
 state: ONLINE
 scrub: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        bulk_sp1                                        ONLINE       0     0     0
          c6t4849544143484920443630303133323230303230d0 ONLINE       0     0     0

errors: No known data errors

  pool: bulk_zp2
 state: ONLINE
 scrub: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        bulk_zp2                                        ONLINE       0     0     0
          raidz1                                        ONLINE       0     0     0
            c6t4849544143484920443630303133323230303330d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303331d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303332d0 ONLINE     0     0     0
            c6t484954414348492044363030313332323030d0     ONLINE     0     0     0
            c6t4849544143484920443630303133323230303334d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303335d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303336d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303337d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303338d0 ONLINE     0     0     0

errors: No known data errors

# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
bulk_sp1    83K  1.95T  24.5K  /sp1
bulk_zp2  73.8K  1.87T  2.67K  /zp2

I used two benchmarks: One was a bunzip2 | tar extract of the Sun Studio-11 SPARC distribution tarball, extracting from the T2000's internal drives onto the test zpools. For this benchmark, both zpools gave similar results:

pool sp1 (single-LUN stripe):
    du -s -k:  1155141
    time -p:   real 713.67  user 614.42  sys 7.56
    1.6MB/sec overall

pool zp2 (8+1-LUN raidz1):
    du -s -k:  1169020
    time -p:   real 714.96  user 614.78  sys 7.56
    1.6MB/sec overall

The 2nd benchmark was bonnie++ v1.03, run single-threaded with default arguments, which means a 32GB dataset made up of 1GB files. 
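(The 1.6MB/sec figure for the tar-extract benchmark is just the du -s -k size divided by the time -p elapsed time; a quick check of the sp1 numbers:)

```shell
# Overall throughput of the tar-extract benchmark on pool sp1:
# extracted size (KB, from du -s -k) over elapsed seconds (from time -p).
awk 'BEGIN { kb = 1155141; secs = 713.67; printf "%.1f MB/sec\n", kb / secs / 1024 }'
# prints: 1.6 MB/sec
```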
Observations of vmstat and mpstat during the tests showed that bonnie++ is CPU-limited on the T2000, especially for the getc()/putc() tests, so I later ran 3x bonnie++'s simultaneously (13GB dataset each), and got the same results in total throughput for the block read/write tests on the single-LUN zpool (I was not patient enough to sit through the getc/putc tests again :-).

pool sp1 (single-LUN stripe):
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
filer1          32G 15497  99 66245  84 16652  30 15210  90 106600  59 322.3   3
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  5204 100 +++++ +++  8076 100  4551 100 +++++ +++  7509 100
filer1,32G,15497,99,66245,84,16652,30,15210,90,106600,59,322.3,3,16,5204,100,+++++,+++,8076,100,4551,100,+++++,+++,7509,100

pool zp2 (8+1-LUN
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On 2/1/07, Marion Hakanson [EMAIL PROTECTED] wrote: There's also the potential of too much seeking going on for the raidz pool, since there are 9 LUN's on top of 7 physical disk drives (though how Hitachi divides/stripes those LUN's is not clear to me).

Marion, That is the part of your setup that puzzled me. You took the same 7-disk raid5 set and split it into 9 LUNS. The Hitachi likely splits the virtual disk into 9 contiguous partitions, so each LUN maps back to different parts of the 7 disks. I speculate that ZFS thinks it is talking to 9 different disks and so spreads out the writes accordingly. What ZFS thinks are sequential writes become well-spaced writes across the entire disk, which blows your seek time through the roof.

I'm interested in how it looks from the Hitachi end. If you can, repeat the test with the Hitachi presenting all 7 disks directly to ZFS as LUNs?

One thing I noticed which puzzles me is that in both configurations, though more so in the divided-up raidz pool, there were long periods of time where the LUN's showed in iostat -xn output at 100% busy but with no I/O's happening at all. No paging, CPU 100% idle, no less than 2GB of free RAM, for as long as 20-30 seconds. Sure puts a dent in the throughput.

Interesting... what you are suggesting is that %b is 100% when w/s and r/s are 0?

-- Just me, Wire ...
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
fishy smell way below... Marion Hakanson wrote: I wrote: Just thinking out loud here. Now I'm off to see what kind of performance cost there is, comparing (with 400GB disks):
  Simple ZFS stripe on one 2198GB LUN from a 6+1 HW RAID5 volume
  8+1 RAID-Z on 9 244.2GB LUN's from a 6+1 HW RAID5 volume

[EMAIL PROTECTED] said: Interesting idea. Please post back to let us know how the performance looks.

The short story is, performance is not bad with the raidz arrangement until you get to doing reads, at which point it looks much worse than the 1-LUN setup. Please bear in mind that I'm not a storage or benchmarking expert, though I'd say I'm not a neophyte either.

Some specifics: The array is a low-end Hitachi, 9520V. My two test subjects are a pair of RAID-5 groups in the same shelf, each consisting of 6D+1P 400GB SATA drives. The test host is a Sun T2000, 16GB RAM, connected via 2Gb FC links through a pair of switches (the array/mpxio combination does not support load-balancing, so only one 2Gb channel is in use at a time). It is running Solaris-10U3, patches current as of 12-Jan-2007. The array was mostly idle except for my tests, although some light I/O to other shelves may have come from another host on occasion. The test host wasn't doing anything else during these tests.

One RAID-5 group was configured as a single 2048GB LUN (with about 150GB left unallocated; the array has a max LUN size). The second RAID-5 group was set up as nine 244.3GB LUN's. 
Here are the zpool configurations I used for these tests:

# zpool status -v
  pool: bulk_sp1
 state: ONLINE
 scrub: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        bulk_sp1                                        ONLINE       0     0     0
          c6t4849544143484920443630303133323230303230d0 ONLINE       0     0     0

errors: No known data errors

  pool: bulk_zp2
 state: ONLINE
 scrub: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        bulk_zp2                                        ONLINE       0     0     0
          raidz1                                        ONLINE       0     0     0
            c6t4849544143484920443630303133323230303330d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303331d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303332d0 ONLINE     0     0     0
            c6t484954414348492044363030313332323030d0     ONLINE     0     0     0
            c6t4849544143484920443630303133323230303334d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303335d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303336d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303337d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303338d0 ONLINE     0     0     0

errors: No known data errors

# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
bulk_sp1    83K  1.95T  24.5K  /sp1
bulk_zp2  73.8K  1.87T  2.67K  /zp2

I used two benchmarks: One was a bunzip2 | tar extract of the Sun Studio-11 SPARC distribution tarball, extracting from the T2000's internal drives onto the test zpools. For this benchmark, both zpools gave similar results:

pool sp1 (single-LUN stripe):
    du -s -k:  1155141
    time -p:   real 713.67  user 614.42  sys 7.56
    1.6MB/sec overall

pool zp2 (8+1-LUN raidz1):
    du -s -k:  1169020
    time -p:   real 714.96  user 614.78  sys 7.56
    1.6MB/sec overall

The 2nd benchmark was bonnie++ v1.03, run single-threaded with default arguments, which means a 32GB dataset made up of 1GB files. 
Observations of vmstat and mpstat during the tests showed that bonnie++ is CPU-limited on the T2000, especially for the getc()/putc() tests, so I later ran 3x bonnie++'s simultaneously (13GB dataset each), and got the same results in total throughput for the block read/write tests on the single-LUN zpool (I was not patient enough to sit through the getc/putc tests again :-).

pool sp1 (single-LUN stripe):
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
filer1          32G 15497  99 66245  84 16652  30 15210  90 106600  59 322.3   3
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  5204 100 +++++ +++  8076 100  4551 100 +++++ +++  7509 100
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Our Netapp does double-parity RAID. In fact, the filesystem design is remarkably similar to that of ZFS. Wouldn't that also detect the error? I suppose it depends if the `wrong sector without notice' error is repeated each time. Or is it random?

On most (all?) other systems the parity only comes into effect when a drive fails. When all the drives are reporting OK, most (all?) RAID systems don't use the parity data at all. ZFS is the first (only?) system that actively checks the data returned from disk, regardless of whether the drives are reporting they're okay or not. I'm sure I'll be corrected if I'm wrong. :)

Netapp/OnTAP does do read verification, but it does it outside the raid-4/raid-dp protection (just like ZFS does it outside the raidz protection). So it's correct that the parity data is not read at all in either OnTAP or ZFS, but both attempt to do verification of the data on all reads. See also: http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data for a few more specifics on it and the differences from the ZFS data check.

-- Darren Dunham [EMAIL PROTECTED] Senior Technical Consultant TAOS http://www.taos.com/ Got some Dr Pepper? San Francisco, CA bay area This line left intentionally blank to confuse you.
[zfs-discuss] Re: ZFS or UFS - what to do?
Hi Guys, SO... From what I can tell from this thread, ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect. So if you use the hardware that comes with a given solution (in my case an Engenio 6994) to manage failures, you risk a) bad writes that don't get picked up due to corruption from write cache to disk, b) failures due to data changes that ZFS is unaware of that the hardware imposes when it tries to fix itself. So now I have a $70K+ lump that's useless for what it was designed for. I should have spent $20K on a JBOD. But since I didn't do that, it sounds like a traditional model works best (ie. UFS et al) for the type of hardware I have. No sense paying for something and not using it. And by using ZFS just as a method for ease of file system growth and management, I risk much more corruption.

The other thing I haven't heard is why NOT to use ZFS. Or people who don't like it for some reason or another. Comments?

Thanks, Jeff

PS - the responses so far have been great and are much appreciated! Keep 'em coming...

This message posted from opensolaris.org
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Hi Jeff, Maybe I mis-read this thread, but I don't think anyone was saying that using ZFS on top of an intelligent array risks more corruption. Given my experience, I wouldn't run ZFS without some level of redundancy, since it will panic your kernel in a RAID-0 scenario where it detects a LUN is missing and can't fix it. That being said, I wouldn't run anything but ZFS anymore. When we had some database corruption issues awhile back, ZFS made it very simple to prove it was the DB. Just did a scrub and boom, verification that the data was laid down correctly.

RAID-5 will have better random read performance than RAID-Z for reasons Robert had to beat into my head. ;-) But if you really need that performance, perhaps RAID-10 is what you should be looking at? Someone smarter than I can probably give a better idea.

Regarding the failure detection, does anyone on the list have the ZFS/FMA traps fed into a network management app yet? I'm curious what the experience with it is.

Best Regards, Jason

On 1/29/07, Jeffery Malloch [EMAIL PROTECTED] wrote: Hi Guys, SO... From what I can tell from this thread, ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect. So if you use the hardware that comes with a given solution (in my case an Engenio 6994) to manage failures, you risk a) bad writes that don't get picked up due to corruption from write cache to disk, b) failures due to data changes that ZFS is unaware of that the hardware imposes when it tries to fix itself. So now I have a $70K+ lump that's useless for what it was designed for. I should have spent $20K on a JBOD. But since I didn't do that, it sounds like a traditional model works best (ie. UFS et al) for the type of hardware I have. No sense paying for something and not using it. And by using ZFS just as a method for ease of file system growth and management, I risk much more corruption. The other thing I haven't heard is why NOT to use ZFS. 
Or people who don't like it for some reason or another. Comments? Thanks, Jeff PS - the responses so far have been great and are much appreciated! Keep 'em coming...
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Jan 29, 2007, at 14:17, Jeffery Malloch wrote: Hi Guys, SO... From what I can tell from this thread, ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect. So if you use the hardware that comes with a given solution (in my case an Engenio 6994) to manage failures, you risk a) bad writes that don't get picked up due to corruption from write cache to disk, b) failures due to data changes that ZFS is unaware of that the hardware imposes when it tries to fix itself. So now I have a $70K+ lump that's useless for what it was designed for. I should have spent $20K on a JBOD. But since I didn't do that, it sounds like a traditional model works best (ie. UFS et al) for the type of hardware I have. No sense paying for something and not using it. And by using ZFS just as a method for ease of file system growth and management, I risk much more corruption. The other thing I haven't heard is why NOT to use ZFS. Or people who don't like it for some reason or another. Comments?

I put together this chart a while back .. i should probably update it for RAID6 and RAIDZ2

 #   ZFS  ARRAY HW     CAPACITY  COMMENTS
 --  ---  -----------  --------  --------------------------------------
 1   R0   R1           N/2       hw mirror - no zfs healing
 2   R0   R5           N-1       hw R5 - no zfs healing
 3   R1   2 x R0       N/2       flexible, redundant, good perf
 4   R1   2 x R5       (N/2)-1   flexible, more redundant, decent perf
 5   R1   1 x R5       (N-1)/2   parity and mirror on same drives (XXX)
 6   RZ   R0           N-1       standard RAID-Z, no mirroring
 7   RZ   R1 (tray)    (N/2)-1   RAIDZ+1
 8   RZ   R1 (drives)  (N/2)-1   RAID1+Z (highest redundancy)
 9   RZ   3 x R5       N-4       triple parity calculations (XXX)
 10  RZ   1 x R5       N-2       double parity calculations (XXX)

(note: I included the cases where you have multiple arrays with a single lun per vdisk (say) and where you only have a single array split into multiple LUNs.)

The way I see it, you're better off picking either controller parity or zfs parity .. there's no sense in computing parity multiple times unless you have cycles to spare and don't mind the performance hit .. 
so the questions you should really answer before you choose the hardware are: what level of redundancy-to-capacity balance do you want? and whether or not you want to compute RAID in ZFS host memory or out on a dedicated blackbox controller? I would say something about double caching too, but I think that's moot since you'll always cache in the ARC if you use ZFS the way it's currently written.

Other feasible filesystem options for Solaris - UFS, QFS, or vxfs with SVM or VxVM for volume mgmt if you're so inclined .. all depends on your budget and application. There's currently tradeoffs in each one, and contrary to some opinions, the death of any of these has been grossly exaggerated.

--- .je
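(A quick way to sanity-check the CAPACITY column in the chart above is to plug in a drive count; a small sketch, where N=14 is just an example total:)

```shell
# Usable drive counts for a few rows of the chart, with N total drives.
N=14  # example value, not from the thread
echo "R0 on HW R1 (N/2):       $((N / 2))"    # 7 drives usable
echo "R0 on HW R5 (N-1):       $((N - 1))"    # 13 drives usable
echo "R1 on 2xR5 ((N/2)-1):    $((N / 2 - 1))"  # 6 drives usable
echo "RZ on 3xR5 (N-4):        $((N - 4))"    # 10 drives usable
echo "RZ on 1xR5 (N-2):        $((N - 2))"    # 12 drives usable
```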
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Mon, Jan 29, 2007 at 11:17:05AM -0800, Jeffery Malloch wrote: From what I can tell from this thread, ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect. So if you use the hardware that comes with a given solution (in my case an Engenio 6994) to manage failures, you risk a) bad writes that don't get picked up due to corruption from write cache to disk, b) failures due to data changes that ZFS is unaware of that the hardware imposes when it tries to fix itself. So now I have a $70K+ lump that's useless for what it was designed for. I should have spent $20K on a JBOD. But since I didn't do that, it sounds like a traditional model works best (ie. UFS et al) for the type of hardware I have. No sense paying for something and not using it. And by using ZFS just as a method for ease of file system growth and management, I risk much more corruption.

Well, ZFS with HW RAID makes sense in some cases. However, it seems that if you are unwilling to lose 50% disk space to RAID 10 or two mirrored HW RAID arrays, you either use RAID 0 on the array with ZFS RAIDZ/RAIDZ2 on top of that, or a JBOD with ZFS RAIDZ/RAIDZ2 on top of that.

-- albert chin ([EMAIL PROTECTED])
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On January 29, 2007 11:17:05 AM -0800 Jeffery Malloch [EMAIL PROTECTED] wrote: Hi Guys, SO... From what I can tell from this thread, ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect.

It's funny to call that fussy. All filesystems WANT to be bit perfect; zfs actually does something to ensure it.

So if you use the hardware that comes with a given solution (in my case an Engenio 6994) to manage failures you risk a) bad writes that don't get picked up due to corruption from write cache to disk

You would always have that problem, JBOD or RAID. There are many places data can get corrupted, not just in the RAID write cache. zfs will correct it, or at least detect it, depending on your configuration.

b) failures due to data changes that ZFS is unaware of that the hardware imposes when it tries to fix itself.

If that happens, you will be lucky to have ZFS to fix it. If the array changes data, it is broken. This is not the same thing as correcting data.

The other thing I haven't heard is why NOT to use ZFS. Or people who don't like it for some reason or another.

If you need per-user quotas, zfs might not be a good fit. (In many cases per-filesystem quotas can be used effectively though.) If you need NFS clients to traverse mount points on the server (e.g. /home/foo), then this won't work yet. Then again, does this work with UFS either? Seems to me it wouldn't. The difference is that zfs encourages you to create more filesystems. But you don't have to. If you have an application that is very highly tuned for a specific filesystem (e.g. UFS with directio), you might not want to replace it with zfs. If you need incremental restore, you might need to stick with UFS. (snapshots might be enough for you though)

-frank
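(For what it's worth, the per-filesystem quota approach mentioned above amounts to one ZFS filesystem per user, each with its own quota; a sketch with hypothetical pool and user names:)

```shell
# Hypothetical names -- one filesystem per user in place of UFS per-user
# quotas. Each user gets a dataset with its own quota property:
# zfs create tank/home/alice
# zfs set quota=10G tank/home/alice
# zfs get quota tank/home/alice
```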
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Albert Chin said: Well, ZFS with HW RAID makes sense in some cases. However, it seems that if you are unwilling to lose 50% disk space to RAID 10 or two mirrored HW RAID arrays, you either use RAID 0 on the array with ZFS RAIDZ/RAIDZ2 on top of that or a JBOD with ZFS RAIDZ/RAIDZ2 on top of that.

I've been re-evaluating our local decision on this question (how to lay out ZFS on pre-existing RAID hardware). In our case, the array does not allow RAID-0 of any type, and we're unwilling to give up the expensive disk space to a mirrored configuration. In fact, in our last decision, we came to the conclusion that we didn't want to layer RAID-Z on top of HW RAID-5, thinking that the added loss of space is too high, given any of the XXX layouts in Jonathan Edwards' chart:

 #   ZFS  ARRAY HW  CAPACITY  COMMENTS
 --  ---  --------  --------  --------------------------------------
 . . .
 5   R1   1 x R5    (N-1)/2   parity and mirror on same drives (XXX)
 9   RZ   3 x R5    N-4       triple parity calculations (XXX)
 . . .
 10  RZ   1 x R5    N-2       double parity calculations (XXX)

So, we ended up (some months ago) deciding to go with only HW RAID-5, using ZFS to stripe together large-ish LUN's made up of independent HW RAID-5 groups. We'd have no ZFS redundancy, but at least ZFS would catch any corruption that may come along. We can restore individual corrupted files from tape backups (which we're already doing anyway), if necessary. However, given the default behavior of ZFS (as of Solaris-10U3) is to panic/halt when it encounters a corrupted block that it can't repair, I'm re-thinking our options, weighing against the possibility of a significant downtime caused by a single-block corruption.

Today I've been pondering a variant of #10 above, the variation being to slice a RAID-5 volume into more than N LUN's, i.e. LUN's smaller than the size of the individual disks that make up the HW R5 volume. A larger number of small LUN's results in less space given up to ZFS parity, which is nice when overall disk space is important to us. 
We're not expecting RAID-Z across these LUN's to make it possible to survive failure of a whole disk; rather, we only need RAID-Z to repair the occasional block corruption, in the hopes that this might head off the need to restore a whole multi-TB pool. We'll rely on the HW RAID-5 to protect against whole-disk failure. Just thinking out loud here. Now I'm off to see what kind of performance cost there is, comparing (with 400GB disks):
  Simple ZFS stripe on one 2198GB LUN from a 6+1 HW RAID5 volume
  8+1 RAID-Z on 9 244.2GB LUN's from a 6+1 HW RAID5 volume

Regards, Marion
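(For concreteness, the two layouts being compared would be built roughly like this; the device names are placeholders, not the real c6t...d0 GUIDs from the array:)

```shell
# Layout A: plain ZFS stripe on the single large LUN (no ZFS redundancy,
# checksum detection only):
# zpool create bulk_sp1 c6tBIGLUNd0
#
# Layout B: 8+1 raidz across the nine small LUN's carved from the second
# 6+1 HW RAID5 group (ZFS can repair single-block corruption):
# zpool create bulk_zp2 raidz c6tLUN0d0 c6tLUN1d0 c6tLUN2d0 c6tLUN3d0 \
#   c6tLUN4d0 c6tLUN5d0 c6tLUN6d0 c6tLUN7d0 c6tLUN8d0
```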
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Hello Anantha, Friday, January 26, 2007, 5:06:46 PM, you wrote:

ANS All my feedback is based on Solaris 10 Update 2 (aka 06/06) and ANS I've no comments on NFS. I strongly recommend that you use ZFS ANS data redundancy (z1, z2, or mirror) and simply delegate the ANS Engenio to stripe the data for performance.

Striping on an array and then doing redundancy with ZFS has at least one drawback - what if one of the disks fails? You've got to replace the bad disk, re-create the stripe on the array, and resilver on ZFS (or stay with a hotspare). Lots of hassle.

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Selim Daoud wrote: it would be good to have real data and not only guesses or anecdotes. This story about wrong blocks being written by RAID controllers sounds like the anti-terrorism propaganda we are living in: exaggerate the facts to catch everyone's attention. It's going to take more than that to prove RAID ctrls have been doing a bad job for the last 30 years. Let's come up with real stories with hard facts first.

I have actual hard data and bitter experience (from support calls) to back up the allegations that raid controllers can and do write bad blocks. No, I cannot and will not provide specifics - I signed an NDA which expressly deals with confidentiality of customer information. What I can say is that if we'd had ZFS to manage the filesystems in question, not only would we have detected the problem much earlier, but the flow-on effect to the end-users would have been much more easily managed.

James C. McPherson -- Solaris kernel software engineer, system admin and troubleshooter http://www.jmcp.homeunix.com/blog Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Jan 26, 2007, at 14:43, Gary Mills wrote: Our Netapp does double-parity RAID. In fact, the filesystem design is remarkably similar to that of ZFS. Wouldn't that also detect the error? I suppose it depends if the `wrong sector without notice' error is repeated each time. Or is it random? On most (all?) other systems the parity only comes into effect when a drive fails. When all the drives are reporting OK most (all?) RAID systems don't use the parity data at all. ZFS is the first (only?) system that actively checks the data returned from disk, regardless of whether the drives are reporting they're okay or not. I'm sure I'll be corrected if I'm wrong. :)
[zfs-discuss] Re: ZFS or UFS - what to do?
I've used ZFS since July/August 2006 when Sol 10 Update 2 came out (first release to integrate ZFS.) I've used it on three servers (E25K domain, and 2 E2900s) extensively; two of them are production. I've had over 3TB of storage from an EMC SAN under ZFS management for no less than 6 months. Like your configuration, we've deferred data redundancy to the SAN. My observations are:

1. ZFS is stable to a very large extent. There are two known issues that I'm aware of:

a. You can end up in an endless 'reboot' cycle when you've a corrupt zpool. I came across this when I had data corruption due to an HBA mismatch with the EMC SAN. This mismatch injected data corruption in transit and the EMC faithfully wrote bad data; upon reading this bad data, ZFS threw up all over the floor for that pool. There is a documented workaround to snap out of the 'reboot' cycle; I've not checked if this is fixed in 11/06 update 3.

b. Your server will hang when one of the underlying disks disappears. In our case we had a T2000 running 11/06 and had a mirrored zpool against two internal drives. When we pulled one of the drives abruptly, the server simply hung. I believe this is a known bug; workaround?

2. When you've I/O operations that either request fsync or open files with the O_DSYNC option, coupled with high I/O, ZFS will choke. It won't crash, but the filesystem I/O runs like molasses on a cold morning.

All my feedback is based on Solaris 10 Update 2 (aka 06/06) and I've no comments on NFS. I strongly recommend that you use ZFS data redundancy (z1, z2, or mirror) and simply delegate the Engenio to stripe the data for performance.
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Fri, Jan 26, 2007 at 08:06:46AM -0800, Anantha N. Srirama wrote: b. Your server will hang when one of the underlying disks disappears. In our case we had a T2000 running 11/06 and had a mirrored zpool against two internal drives. When we pulled one of the drives abruptly, the server simply hung. I believe this is a known bug; workaround?

This was just covered here, and it looks like the fix will make it into u4 (I think it's in snv_48?) The workaround is to do a 'zpool offline' whenever possible before removing a disk. Yes, this is not always possible (in the case of disk death), but will help in some situations. I can't wait for U4. :)

-brian -- The reason I don't use Gnome: every single other window manager I know of is very powerfully extensible, where you can switch actions to different mouse buttons. Guess which one is not, because it might confuse the poor users? Here's a hint: it's not the small and fast one. --Linus
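(For anyone following along, the workaround sequence looks like this; pool and device names are placeholders:)

```shell
# Before pulling a drive that is still alive, take it offline first:
# zpool offline tank c1t1d0
#
# ...physically swap the drive, then tell ZFS to resilver onto it:
# zpool replace tank c1t1d0
#
# Watch resilver progress:
# zpool status tank
```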
[zfs-discuss] Re: ZFS or UFS - what to do?
Oh yep, I know that churning feeling in the stomach that there's got to be a GOTCHA somewhere... it can't be *that* simple!
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Fri, Jan 26, 2007 at 09:33:40AM -0800, Akhilesh Mritunjai wrote: ZFS Rule #0: You gotta have redundancy. ZFS Rule #1: Redundancy shall be managed by zfs, and zfs alone. Whatever you have, junk it. Let ZFS manage mirroring and redundancy. ZFS doesn't forgive even single-bit errors! How does this work in an environment with storage that's centrally-managed and shared between many servers? I'm putting together a new IMAP server that will eventually use 3TB of space from our Netapp via an iSCSI SAN. The Netapp provides all of the disk management and redundancy that I'll ever need. The server will only see a virtual disk (a LUN). I want to use ZFS on that LUN because it's superior to UFS in this application, even without the redundancy. There's no way to get the Netapp to behave like a JBOD. Are you saying that this configuration isn't going to work? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Jan 26, 2007, at 9:42, Gary Mills wrote: How does this work in an environment with storage that's centrally-managed and shared between many servers? I'm putting together a new IMAP server that will eventually use 3TB of space from our Netapp via an iSCSI SAN. The Netapp provides all of the disk management and redundancy that I'll ever need. The server will only see a virtual disk (a LUN). I want to use ZFS on that LUN because it's superior to UFS in this application, even without the redundancy. There's no way to get the Netapp to behave like a JBOD. Are you saying that this configuration isn't going to work? It will work, but if the storage system corrupts the data, ZFS will be unable to correct it. It will detect the error. A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksums will not match). But it can only correct the error if it manages the redundancy. I would suggest exporting two LUNs from your central storage and letting ZFS mirror them. You can get a wider range of space/performance tradeoffs if you give ZFS a JBOD, but that doesn't sound like an option. --Ed
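To make Ed's suggestion concrete, a minimal sketch, assuming the array exports two equal-sized iSCSI LUNs that appear as the (hypothetical) devices c4t0d0 and c5t0d0:

```sh
# Mirror the two array-provided LUNs so ZFS itself holds the
# redundancy and can repair a block whose checksum fails by
# reading the other side of the mirror:
zpool create mailpool mirror c4t0d0 c5t0d0
zpool status mailpool
```

This costs twice the raw space on the array, which is exactly the tradeoff discussed downthread.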
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote: On Jan 26, 2007, at 9:42, Gary Mills wrote: How does this work in an environment with storage that's centrally-managed and shared between many servers? It will work, but if the storage system corrupts the data, ZFS will be unable to correct it. It will detect the error. A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksums will not match). But it can only correct the error if it manages the redundancy. Our Netapp does double-parity RAID. In fact, the filesystem design is remarkably similar to that of ZFS. Wouldn't that also detect the error? I suppose it depends on whether the `wrong sector without notice' error is repeated each time. Or is it random? I would suggest exporting two LUNs from your central storage and letting ZFS mirror them. You can get a wider range of space/performance tradeoffs if you give ZFS a JBOD, but that doesn't sound like an option. That would double the amount of disk that we'd require. I am actually planning on using two iSCSI LUNs and letting ZFS stripe across them. When we need to expand the ZFS pool, I'd like to just expand the two LUNs on the Netapp. If ZFS won't accommodate that, I can just add a couple more LUNs. This is all convenient and easily manageable. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Gary Mills wrote: On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote: On Jan 26, 2007, at 9:42, Gary Mills wrote: How does this work in an environment with storage that's centrally-managed and shared between many servers? It will work, but if the storage system corrupts the data, ZFS will be unable to correct it. It will detect the error. A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksums will not match). But it can only correct the error if it manages the redundancy. The quote from Jim seems to be related to the leaves of the tree (disks). Anecdotally, now that we have ZFS at the trunk, we're seeing that the branches are also corrupting data. We've speculated that it would occur, but now we can measure it, and it is non-zero. See Anantha's post for one such anecdote. Our Netapp does double-parity RAID. In fact, the filesystem design is remarkably similar to that of ZFS. Wouldn't that also detect the error? I suppose it depends on whether the `wrong sector without notice' error is repeated each time. Or is it random? We're having a debate related to this, data would be appreciated :-) Do you get small, random read performance equivalent to N-2 spindles for an N-way double-parity volume? I would suggest exporting two LUNs from your central storage and letting ZFS mirror them. You can get a wider range of space/performance tradeoffs if you give ZFS a JBOD, but that doesn't sound like an option. That would double the amount of disk that we'd require. I am actually planning on using two iSCSI LUNs and letting ZFS stripe across them. 
When we need to expand the ZFS pool, I'd like to just expand the two LUNs on the Netapp. If ZFS won't accommodate that, I can just add a couple more LUNs. This is all convenient and easily manageable. Sounds reasonable to me :-) -- richard
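The grow-by-adding-LUNs plan Gary describes would look something like the sketch below (pool and device names are hypothetical). Note that this adds a new top-level stripe across the new LUNs rather than growing the existing ones:

```sh
# After provisioning two more LUNs on the NetApp, stripe the pool
# across them (new writes will be spread over all four LUNs):
zpool add mailpool c4t1d0 c5t1d0

# Verify the added capacity:
zpool list mailpool
zpool status mailpool
```

Keep in mind that 'zpool add' is effectively one-way in this era of ZFS: top-level vdevs cannot be removed from a pool afterwards.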
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Gary Mills wrote: On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote: On Jan 26, 2007, at 9:42, Gary Mills wrote: How does this work in an environment with storage that's centrally-managed and shared between many servers? It will work, but if the storage system corrupts the data, ZFS will be unable to correct it. It will detect the error. A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksums will not match). But it can only correct the error if it manages the redundancy. Our Netapp does double-parity RAID. In fact, the filesystem design is remarkably similar to that of ZFS. Wouldn't that also detect the error? I suppose it depends on whether the `wrong sector without notice' error is repeated each time. If the wrong block is written by the controller then you're out of luck. The filesystem would read the incorrect block and ... who knows. That's why the ZFS checksums are important.
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
[EMAIL PROTECTED] wrote on 01/26/2007 01:43:35 PM: On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote: On Jan 26, 2007, at 9:42, Gary Mills wrote: How does this work in an environment with storage that's centrally-managed and shared between many servers? It will work, but if the storage system corrupts the data, ZFS will be unable to correct it. It will detect the error. A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksums will not match). But it can only correct the error if it manages the redundancy. Our Netapp does double-parity RAID. In fact, the filesystem design is remarkably similar to that of ZFS. Wouldn't that also detect the error? I suppose it depends on whether the `wrong sector without notice' error is repeated each time. Or is it random? I do not know; WAFL and other portions of NetApp backends are never really described in much technical detail -- even getting real IOPS numbers from them seems to be a hassle. Much magic, little meat. To me, ZFS is very well-defined behavior and methodology (you can even read the source to verify specifics), and this allows you to _know_ what the weak points are. NetApp, EMC and other disk vendors may have financial incentives to tolerate edge cases such as the write hole or bit rot (x errors per disk are acceptable losses; after x errors, a cost/benefit analysis decides whether to replace the disk -- will customers actually know a bit is flipped?). 
In EMC's case it is very common for a disk to have multiple read/write errors before EMC will swap out the disk; they even use a substantial portion of the disk for replacement sectors and parity bits (outside of RAID), so they can offset or postpone the replacement costs for the customer. The most detailed description of WAFL I was able to find last time I looked was: http://www.netapp.com/library/tr/3002.pdf I would suggest exporting two LUNs from your central storage and letting ZFS mirror them. You can get a wider range of space/performance tradeoffs if you give ZFS a JBOD, but that doesn't sound like an option. That would double the amount of disk that we'd require. I am actually planning on using two iSCSI LUNs and letting ZFS stripe across them. When we need to expand the ZFS pool, I'd like to just expand the two LUNs on the Netapp. If ZFS won't accommodate that, I can just add a couple more LUNs. This is all convenient and easily manageable. If you do get bit errors coming from the NetApp, ZFS will find them but will not be able to correct them in this case. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Jan 26, 2007, at 12:13, Richard Elling wrote: On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote: A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksum will not match). But it can only correct the error if it manages the redundancy. The quote from Jim seems to be related to the leaves of the tree (disks). Anecdotally, now that we have ZFS at the trunk, we're seeing that the branches are also corrupting data. We've speculated that it would occur, but now we can measure it, and it is non-zero. See Anantha's post for one such anecdote. Actually, Jim was referring to everything but the trunk. He didn't specify where from the HBA to the drive the error actually occurs. I don't think it really matters. I saw him give a talk a few years ago at the Usenix FAST conference; that's where I got this information. --Ed
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Ed Gould wrote: On Jan 26, 2007, at 12:13, Richard Elling wrote: On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote: A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksum will not match). But it can only correct the error if it manages the redundancy. Actually, Jim was referring to everything but the trunk. He didn't specify where from the HBA to the drive the error actually occurs. I don't think it really matters. I saw him give a talk a few years ago at the Usenix FAST conference; that's where I got this information. So this leaves me wondering how often the controller/drive subsystem reads data from the wrong sector of the drive without notice; is it symmetrical with respect to writing, and thus about once a drive/year, or are there factors which change this? Dana
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Dana H. Myers wrote: Ed Gould wrote: On Jan 26, 2007, at 12:13, Richard Elling wrote: On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote: A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksum will not match). But it can only correct the error if it manages the redundancy. Actually, Jim was referring to everything but the trunk. He didn't specify where from the HBA to the drive the error actually occurs. I don't think it really matters. I saw him give a talk a few years ago at the Usenix FAST conference; that's where I got this information. So this leaves me wondering how often the controller/drive subsystem reads data from the wrong sector of the drive without notice; is it symmetrical with respect to writing, and thus about once a drive/year, or are there factors which change this? It's not symmetrical. Oftentimes it's a firmware bug. Other times a spurious event causes one block to be read/written instead of another. (Alpha particles, anyone?)
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Jan 26, 2007, at 12:52, Dana H. Myers wrote: So this leaves me wondering how often the controller/drive subsystem reads data from the wrong sector of the drive without notice; is it symmetrical with respect to writing, and thus about once a drive/year, or are there factors which change this? My guess is that it would be symmetric, but I don't really know. --Ed
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Torrey McMahon wrote: Dana H. Myers wrote: Ed Gould wrote: On Jan 26, 2007, at 12:13, Richard Elling wrote: On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote: A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksum will not match). But it can only correct the error if it manages the redundancy. Actually, Jim was referring to everything but the trunk. He didn't specify where from the HBA to the drive the error actually occurs. I don't think it really matters. I saw him give a talk a few years ago at the Usenix FAST conference; that's where I got this information. So this leaves me wondering how often the controller/drive subsystem reads data from the wrong sector of the drive without notice; is it symmetrical with respect to writing, and thus about once a drive/year, or are there factors which change this? It's not symmetrical. Oftentimes it's a firmware bug. Other times a spurious event causes one block to be read/written instead of another. (Alpha particles, anyone?) I would tend to expect these spurious events to impact reads and writes equally; more specifically, the chance of any one read or write being mis-addressed is about the same. Since, AFAIK, there are typically many more reads from a disk than writes, this would seem to suggest that there would be more mis-addressed reads in a drive/year than mis-addressed writes. Is this the reason for the asymmetry? (I'm sure waving my hands here) Dana
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Jan 26, 2007, at 13:16, Dana H. Myers wrote: I would tend to expect these spurious events to impact read and write equally; more specifically, the chance of any one read or write being mis-addressed is about the same. Since, AFAIK, there are many more reads from a disk typically than writes, this would seem to suggest that there would be more mis-addressed reads in a drive/year than mis-addressed writes. Is this the reason for the asymmetry? Jim's once per drive per year number was not very precise. I took it to be just one significant digit. I don't recall if he distinguished reads from writes. --Ed
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
It would be good to have real data and not only guesses or anecdotes. This story about wrong blocks being written by RAID controllers sounds like the anti-terrorism propaganda we are living in: exaggerate the facts to catch everyone's attention. It's going to take more than that to prove RAID controllers have been doing a bad job for the last 30 years. Let's make real stories with hard facts first. On 1/26/07, Ed Gould [EMAIL PROTECTED] wrote: On Jan 26, 2007, at 12:52, Dana H. Myers wrote: So this leaves me wondering how often the controller/drive subsystem reads data from the wrong sector of the drive without notice; is it symmetrical with respect to writing, and thus about once a drive/year, or are there factors which change this? My guess is that it would be symmetric, but I don't really know. --Ed
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Jan 26, 2007, at 13:29, Selim Daoud wrote: it would be good to have real data and not only guesses or anecdotes Yes, I agree. I'm sorry I don't have the data that Jim presented at FAST, but he did present actual data. Richard Elling (I believe it was Richard) has also posted some related data from ZFS experience to this list. There is more than just anecdotal evidence for this. --Ed
RE: [zfs-discuss] Re: ZFS or UFS - what to do?
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ed Gould Sent: Friday, January 26, 2007 3:38 PM Yes, I agree. I'm sorry I don't have the data that Jim presented at FAST, but he did present actual data. Richard Elling (I believe it was Richard) has also posted some related data from ZFS experience to this list. This seems to be from Jim and on point: http://www.usenix.org/event/fast05/tech/gray.pdf paul
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Dana H. Myers wrote: Torrey McMahon wrote: Dana H. Myers wrote: Ed Gould wrote: On Jan 26, 2007, at 12:13, Richard Elling wrote: On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote: A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksum will not match). But it can only correct the error if it manages the redundancy. Actually, Jim was referring to everything but the trunk. He didn't specify where from the HBA to the drive the error actually occurs. I don't think it really matters. I saw him give a talk a few years ago at the Usenix FAST conference; that's where I got this information. So this leaves me wondering how often the controller/drive subsystem reads data from the wrong sector of the drive without notice; is it symmetrical with respect to writing, and thus about once a drive/year, or are there factors which change this? It's not symmetrical. Often times its a fw bug. Others a spurious event causes one block to be read/written instead of an other one. (Alpha particles anyone?) I would tend to expect these spurious events to impact read and write equally; more specifically, the chance of any one read or write being mis-addressed is about the same. Since, AFAIK, there are many more reads from a disk typically than writes, this would seem to suggest that there would be more mis-addressed reads in a drive/year than mis-addressed writes. Is this the reason for the asymmetry? 
(I'm sure waving my hands here) For the spurious events, yes, I would expect reads and writes to be impacted symmetrically. That is, if you could figure out which spurious event occurred. In most cases the spurious errors are caught only at read time and you're left wondering: Was it an incorrect read? Was the data written incorrectly? You end up throwing your hands up and saying, "Let's hope that doesn't happen again." It's much easier to unearth a firmware bug in a particular disk drive operating in certain conditions and fix it. Now that we're checksumming things I'd expect to find more errors, and hopefully be in a position to fix them, than we have in the past. We will also start getting customer complaints like, "We moved to ZFS and now we are seeing media errors more often. Why is ZFS broken?" This is similar to the StorADE issues we had in NWS - Ahhh, the good old days - when we started doing a much better job discovering issues and reporting them, where in the past we were blissfully silent. We used to have some data on that with nice graphs but I can't find them lying about.
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote: A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksum will not match). But it can only correct the error if it manages the redundancy. My only qualification to enter this discussion is that I once wrote a floppy disk format program for minix. I recollect, however, that each sector on the disk is accompanied by a block that contains the sector address and a CRC. In order to write to the wrong sector, both of these items would have to be read incorrectly. Otherwise, the controller would never find the wrong sector. Are we just talking about a CRC failure here? That would be random, but the frequency of CRC errors would depend on the signal quality. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
On 26-Jan-07, at 7:29 PM, Selim Daoud wrote: it would be good to have real data and not only guesses or anecdotes. this story about wrong blocks being written by RAID controllers sounds like the anti-terrorism propaganda we are living in: exaggerate the facts to catch everyone's attention. It's going to take more than that to prove RAID ctrls have been doing a bad job for the last 30 years. It does happen. Hard numbers are available if you look. This sounds a bit like the RAID expert I bumped into who just couldn't see the paradigm had shifted under him -- the implications of end-to-end. Let's make real stories with hard facts first. Related links: https://www.gelato.unsw.edu.au/archives/comp-arch/2006-September/003008.html http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf [A Fresh Look at the Reliability of Long-term Digital Storage, 2006] http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term Digital Archiving: A Survey, 2006] http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems, 2006] http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector Faults and Reliability of Disk Arrays, 1997] --T On 1/26/07, Ed Gould [EMAIL PROTECTED] wrote: On Jan 26, 2007, at 12:52, Dana H. Myers wrote: So this leaves me wondering how often the controller/drive subsystem reads data from the wrong sector of the drive without notice; is it symmetrical with respect to writing, and thus about once a drive/year, or are there factors which change this? My guess is that it would be symmetric, but I don't really know. --Ed
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
My only qualification to enter this discussion is that I once wrote a floppy disk format program for minix. I recollect, however, that each sector on the disk is accompanied by a block that contains the sector address and a CRC. You'd have to define the layer you're talking about. I presume something like this occurs between a dumb disk and an intelligent controller, or even within the encoding parameters of a disk, but I don't think it does between, say, a SCSI/FC controller and a disk. So if the drive itself put the head in the wrong sector, maybe it could figure that out. But perhaps the SCSI controller had a bug and sent the wrong address to the drive. I don't think there's anything at that layer that would notice (unless the application/file system is encoding intent into the data). Corrections about my assumption with SCSI/FC/ATA appreciated. -- Darren Dunham [EMAIL PROTECTED] Senior Technical Consultant TAOS http://www.taos.com/ Got some Dr Pepper? San Francisco, CA bay area This line left intentionally blank to confuse you.
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Toby Thain wrote: On 26-Jan-07, at 7:29 PM, Selim Daoud wrote: it would be good to have real data and not only guesses or anecdotes. this story about wrong blocks being written by RAID controllers sounds like the anti-terrorism propaganda we are living in: exaggerate the facts to catch everyone's attention. It's going to take more than that to prove RAID ctrls have been doing a bad job for the last 30 years. It does happen. Hard numbers are available if you look. This sounds a bit like the RAID expert I bumped into who just couldn't see the paradigm had shifted under him -- the implications of end to end. It happens. As long as we look at the numbers in context and don't run around going, "Hey... have you seen these numbers? What have we been doing for the last 35 years!?!?" we're ok.
[zfs-discuss] Re: ZFS or UFS - what to do?
1. How stable is ZFS? It's a new file system; there will be bugs. It appears to be well-tested, though. There are a few known issues; for instance, a write failure can panic the system under some circumstances. UFS has known issues too. 2. Recommended config. Above, I have a fairly simple setup. In many of the examples the granularity is home directory level, and when you have many, many users that could get to be a bit of a nightmare administratively. Do you need user quotas? If so, you need a file system per user with ZFS. That may be an argument against it in some environments, but in my experience that tends to be more important in academic settings than in corporations. 4. Since all data access is via NFS, we are concerned that 32-bit systems (mainly Linux and Windows via Samba) will not be able to access all the data areas of a 2TB+ zpool even if the zfs quota on a particular share is less than that. Can anyone comment? Not a problem. NFS doesn't really deal with volumes, just files, so the offsets are always file-relative and the volume can be as large as desired. Anton
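Since ZFS filesystems are cheap to create, the filesystem-per-user idiom Anton mentions is less painful than it sounds. A hypothetical sketch (pool, mountpoint, user names, and quota sizes are all made up for illustration):

```sh
# One lightweight filesystem per user, each with its own quota:
zfs create tank/home
for user in alice bob carol; do
    zfs create tank/home/$user
    zfs set quota=10G tank/home/$user
done

# Review the quotas across the whole tree:
zfs get -r quota tank/home
```

Child filesystems inherit properties (mountpoint, sharenfs, etc.) from tank/home, so per-user administration mostly reduces to the create and set-quota steps above.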