Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
I think the risk to data integrity, and of complete pool loss, is highest in the following order: 1. 1x RAIDZ (7+1), 2. 2x RAIDZ (3+1), 3. 1x RAIDZ2 (6+2). Plain RAIDZ is certainly an option with only 8 disks (8 is about the maximum I would go), but to be honest I would feel safer going RAIDZ2. The 2x RAIDZ (3+1) would probably perform the best, but I would prefer the third option (RAIDZ2) as it is better for redundancy. With RAIDZ2 any two disks can fail, and thanks to the dual parity, if you get some unrecoverable read errors during a scrub you have a much better chance of avoiding corruption, since there are two parity copies covering the same set of data. On 02/06/2011 06:45 PM, Matthew Angelo wrote: I require a new high-capacity 8-disk zpool. The disks I will be purchasing (Samsung or Hitachi) have an error rate (non-recoverable, bits read) of 1 in 10^14 and will be 2TB. I'm staying clear of WD because they have the new 4096-byte sectors, which don't play nicely with ZFS at the moment. My question is: how do I determine which of the following zpool and vdev configurations I should run to maximize space whilst mitigating rebuild failure risk? 1. 2x RAIDZ (3+1) vdevs 2. 1x RAIDZ (7+1) vdev 3. 1x RAIDZ2 (6+2) vdev I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x 2TB disks. Cheers ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
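For what it's worth, a rough back-of-the-envelope sketch of why single parity gets uncomfortable at this scale. The only inputs taken from the post are the 1-in-10^14 unrecoverable bit error rate and the 2TB drive size; the rest (reading every surviving disk in full during a resilver, the exponential approximation) is my own simplification:

/* Probability of hitting at least one unrecoverable read error (URE)
 * while resilvering a failed 2TB disk, given a 1-in-1e14 bit error rate
 * and a full read of every surviving disk. Illustrative only. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double ber = 1.0 / 1e14;        /* non-recoverable errors per bit read */
    const double disk_bits = 2e12 * 8.0;  /* one 2TB drive, in bits */

    double bits_7 = 7.0 * disk_bits;      /* RAIDZ 7+1: read 7 survivors */
    double bits_3 = 3.0 * disk_bits;      /* RAIDZ 3+1: read 3 survivors */

    double p7 = 1.0 - exp(-ber * bits_7); /* ~ 1 - (1 - ber)^bits */
    double p3 = 1.0 - exp(-ber * bits_3);

    printf("P(>=1 URE) rebuilding RAIDZ 7+1: %.0f%%\n", p7 * 100.0);
    printf("P(>=1 URE) rebuilding RAIDZ 3+1: %.0f%%\n", p3 * 100.0);
    return 0;
}

That works out to roughly a two-in-three chance of tripping over an unreadable sector somewhere while resilvering the 7+1 RAIDZ, which is exactly the case where the second parity of RAIDZ2 saves the pool.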
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 12:25 AM, Richard Elling richard.ell...@gmail.com wrote: On Feb 5, 2011, at 8:10 AM, Yi Zhang wrote: Hi all, I'm trying to achieve the same effect of UFS directio on ZFS and here is what I did: Solaris UFS directio has three functions: 1. improved async code path 2. multiple concurrent writers 3. no buffering Thanks for the comments, Richard. All I wanted is to achieve 3 on ZFS. But as I said, apprently 2.a) below didn't give me that. Do you have any suggestion? Of the three, #1 and #2 were designed into ZFS from day 1, so there is nothing to set or change to take advantage of the feature. 1. Set the primarycache of zfs to metadata and secondarycache to none, recordsize to 8K (to match the unit size of writes) 2. Run my test program (code below) with different options and measure the running time. a) open the file without O_DSYNC flag: 0.11s. This doesn't seem like directio is in effect, because I tried on UFS and time was 2s. So I went on with more experiments with the O_DSYNC flag set. I know that directio and O_DSYNC are two different things, but I thought the flag would force synchronous writes and achieve what directio does (and more). Directio and O_DSYNC are two different features. b) open the file with O_DSYNC flag: 147.26s ouch c) same as b) but also enabled zfs_nocacheflush: 5.87s Is your pool created from a single HDD? Yes, it is. Do you have an explanation for the b) case? I also tried O_DSYNC AND directio on UFS, the time is on the same order as directio but no O_DSYNC on UFS (see below). This dramatic difference between UFS and ZFS is puzzling me... UFS: directio=on,no O_DSYNC - 2s directio=on,O_DSYNC - 5s ZFS: no caching, no O_DSYNC - 0.11s no caching, O_DSYNC - 147s My questions are: 1. With my primarycache and secondarycache settings, the FS shouldn't buffer reads and writes anymore. Wouldn't that be equivalent to O_DSYNC? Why a) and b) are so different? No. O_DSYNC deals with when the I/O is committed to media. 2. My understanding is that zfs_nocacheflush essentially removes the sync command sent to the device, which cancels the O_DSYNC flag. Why b) and c) are so different? No. Disabling the cache flush means that the volatile write buffer in the disk is not flushed. In other words, disabling the cache flush is in direct conflict with the semantics of O_DSYNC. 3. Does ZIL have anything to do with these results? Yes. The ZIL is used for meeting the O_DSYNC requirements. This has nothing to do with buffering. More details are on the ZFS Best Practices Guide. -- richard Thanks in advance for any suggestion/insight! Yi #include fcntl.h #include sys/time.h int main(int argc, char **argv) { struct timeval tim; gettimeofday(tim, NULL); double t1 = tim.tv_sec + tim.tv_usec/100.0; char a[8192]; int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0660); //int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC|O_DSYNC, 0660); if (argv[2][0] == '1') directio(fd, DIRECTIO_ON); int i; for (i=0; i1; ++i) pwrite(fd, a, sizeof(a), i*8192); close(fd); gettimeofday(tim, NULL); double t2 = tim.tv_sec + tim.tv_usec/100.0; printf(%f\n, t2-t1); } ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
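For anyone trying to reproduce this: the listing above has been mangled by the archive (angle brackets and quote characters were stripped). Below is my reconstruction; the loop bound of 10000 and the microsecond divisor of 1000000.0 are inferred from the ~80MB file size stated later in the thread (8K x 10000) and are not visible in the archived copy:

/* Reconstruction of the quoted test program; Solaris-specific (directio). */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>        /* open(), O_DSYNC; on Solaris also directio()/DIRECTIO_ON */
#include <sys/time.h>

int main(int argc, char **argv)
{
    struct timeval tim;
    gettimeofday(&tim, NULL);
    double t1 = tim.tv_sec + tim.tv_usec / 1000000.0;

    char a[8192];
    int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0660);
    /* int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC | O_DSYNC, 0660); */
    if (argv[2][0] == '1')
        directio(fd, DIRECTIO_ON);    /* Solaris UFS directio hint; ZFS has no equivalent (see thread) */

    int i;
    for (i = 0; i < 10000; ++i)       /* 10000 x 8K = ~80MB, matching the later post */
        pwrite(fd, a, sizeof(a), (off_t)i * 8192);
    close(fd);

    gettimeofday(&tim, NULL);
    double t2 = tim.tv_sec + tim.tv_usec / 1000000.0;
    printf("%f\n", t2 - t1);
    return 0;
}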
Re: [zfs-discuss] [?] - What is the recommended number of disks for a consumer PC with ZFS
References: Thread: ZFS effective short-stroking and connection to thin provisioning? http://opensolaris.org/jive/thread.jspa?threadID=127608 Confused about consumer drives and zfs can someone help? http://opensolaris.org/jive/thread.jspa?threadID=132253 Recommended RAM for ZFS on various platforms http://opensolaris.org/jive/thread.jspa?threadID=132072 Performance advantages of spool with 2x raidz2 vdevs vs. Single vdev - Spindles http://opensolaris.org/jive/thread.jspa?threadID=132127 -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Drive i/o anomaly
Hi, I have a low-power server with three drives in it, like so: matt@vault:~$ zpool status pool: rpool state: ONLINE scan: resilvered 588M in 0h3m with 0 errors on Fri Jan 7 07:38:06 2011 config: NAME STATE READ WRITE CKSUM rpool ONLINE 0 0 0 mirror-0ONLINE 0 0 0 c8t1d0s0 ONLINE 0 0 0 c8t0d0s0 ONLINE 0 0 0 cache c12d0s0 ONLINE 0 0 0 errors: No known data errors I'm running netatalk file sharing for mac, and using it as a time machine backup server for my mac laptop. When files are copying to the server, I often see periods of a minute or so where network traffic stops. I'm convinced that there's some bottleneck in the storage side of things because when this happens, I can still ping the machine and if I have an ssh window, open, I can still see output from a `top` command running smoothly. However, if I try and do anything that touches disk (eg `ls`) that command stalls. At the time it comes good, everything comes good, file copies across the network continue, etc. If I have a ssh terminal session open and run `iostat -nv 5` I see something like this: extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 1.2 36.0 153.6 4608.0 1.2 0.3 31.99.3 16 18 c12d0 0.0 113.40.0 7446.7 0.8 0.17.00.5 15 5 c8t0d0 0.2 106.44.1 7427.8 4.0 0.1 37.81.4 93 14 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.4 73.2 25.7 9243.0 2.3 0.7 31.69.8 34 37 c12d0 0.0 226.60.0 24860.5 1.6 0.27.00.9 25 19 c8t0d0 0.2 127.63.4 12377.6 3.8 0.3 29.72.2 91 27 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 44.20.0 5657.6 1.4 0.4 31.79.0 19 20 c12d0 0.2 76.04.8 9420.8 1.1 0.1 14.21.7 12 13 c8t0d0 0.0 16.60.0 2058.4 9.0 1.0 542.1 60.2 100 100 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.00.20.0 25.6 0.0 0.00.32.3 0 0 c12d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c8t0d0 0.0 11.00.0 1365.6 9.0 1.0 818.1 90.9 100 100 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.20.00.10.0 0.0 0.00.1 25.4 0 1 c12d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c8t0d0 0.0 17.60.0 2182.4 9.0 1.0 511.3 56.8 100 100 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.00.00.00.0 0.0 0.00.00.0 0 0 c12d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c8t0d0 0.0 16.60.0 2058.4 9.0 1.0 542.1 60.2 100 100 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.00.00.00.0 0.0 0.00.00.0 0 0 c12d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c8t0d0 0.0 15.80.0 1959.2 9.0 1.0 569.6 63.3 100 100 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.20.00.10.0 0.0 0.00.10.1 0 0 c12d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c8t0d0 0.0 17.40.0 2157.6 9.0 1.0 517.2 57.4 100 100 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.00.00.00.0 0.0 0.00.00.0 0 0 c12d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c8t0d0 0.0 18.20.0 2256.8 9.0 1.0 494.5 54.9 100 100 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.00.00.00.0 0.0 0.00.00.0 0 0 c12d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c8t0d0 0.0 14.80.0 1835.2 9.0 1.0 608.1 67.5 100 100 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.20.00.10.0 0.0 0.00.10.1 0 0 c12d0 0.01.40.00.6 0.0 0.00.00.2 0 0 c8t0d0 0.0 49.00.0 6049.6 6.7 0.5 137.6 11.2 100 55 c8t1d0 extended device statistics r/sw/s kr/s
[zfs-discuss] ZFS Newbie question
I've spent a few hours reading through the forums and wiki and honestly my head is spinning. I have been trying to study up on either buying or building a box that would allow me to add drives of varying sizes/speeds/brands (adding more later etc.) and still be able to use the full space of the drives (minus parity? [not sure if I got the terminology right]) with redundancy. I have found the "all-in-one" solution, the Drobo, however it has many caveats such as a proprietary setup, a limited number of drives (I am looking to eventually expand over 8 drives), and a price tag that is borderline criminal. From what I understand, using ZFS one could set up something like RAID 6 (RAID-Z2?) but with the ability to use drives of varying sizes/speeds/brands and the ability to add additional drives later. Am I about right? If so I will continue studying up on this; if not then I guess I need to continue exploring different options. Thanks!! Cheers, -Gaiko -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] deduplication requirements
Hi guys, I'm currently running 2 zpools, each in a raidz1 configuration, totaling around 16TB of usable data. I'm running it all on an OpenSolaris-based box with 2GB of memory and an old Athlon 64 3700 CPU. I understand this is very poor and underpowered for deduplication, so I'm looking at building a new system, but wanted some advice first. Here is what I've planned so far: Core i7 2600 CPU, 16GB DDR3 memory, 64GB SSD for ZIL (optional). Would this produce decent results for deduplication of 16TB worth of pools, or would I need more RAM still? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] 40MB repaired on a disk during scrub but no errors
Hey folks, While scrubbing, zpool status shows nearly 40MB repaired but 0 in each of the read/write/checksum columns for each disk. One disk has (repairing) to the right but once the scrub completes there's no mention that anything ever needed fixing. Any idea what would need to be repaired on that disk? Are there any other types of errors besides read/write/checksum? Previously, whenever a disk has required repair during scrub it's been either bad disk or loose cable connection and it's generated read, write and/or cksum errors. It also irks me a little that these repairs are only noted while the scrub is running. Once it's complete, it's as if those repairs never happened. If it's relevant, this is a 6 drive mirrored pool with a single SSD for L2Arc cache. Pool version 26 under Nexenta Core Platform 3.0 with a LSI 9200-16E and SATA disks. $ zpool status bigboy pool: bigboy state: ONLINE scan: scrub in progress since Sat Feb 5 02:22:18 2011 3.74T scanned out of 3.74T at 141M/s, 0h0m to go 37.9M repaired, 99.88% done [-config snip - all columns 0, one drive on the right has (repairing)] errors: No known data errors And then once the scrub completes: $ zpool status bigboy pool: bigboy state: ONLINE scan: scrub repaired 37.9M in 7h42m with 0 errors on Sat Feb 5 10:04:53 2011 [-config snip - all columns 0, the (repairing) note is now gone] errors: No known data errors -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS/Drobo (Newbie) Question
Thank you kebabber. I will try out Indiana and VirtualBox to play around with it a bit. Just to make sure I understand your example: if I, say, had 4x 2TB drives, 2x 750GB, and 2x 1.5TB drives, then I could make 3 groups (perhaps 1 raidz1 + 1 mirrored + 1 mirrored). In terms of accessing them, would they just be mounted like 3 partitions, or could it all be accessed like one big partition? Anywho, I have Indiana DL'ing now (very slow connection, so thought I would post while I wait). Cheers, -Gaiko -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
Le 7 févr. 2011 à 06:25, Richard Elling a écrit : On Feb 5, 2011, at 8:10 AM, Yi Zhang wrote: Hi all, I'm trying to achieve the same effect of UFS directio on ZFS and here is what I did: Solaris UFS directio has three functions: 1. improved async code path 2. multiple concurrent writers 3. no buffering Of the three, #1 and #2 were designed into ZFS from day 1, so there is nothing to set or change to take advantage of the feature. 1. Set the primarycache of zfs to metadata and secondarycache to none, recordsize to 8K (to match the unit size of writes) 2. Run my test program (code below) with different options and measure the running time. a) open the file without O_DSYNC flag: 0.11s. This doesn't seem like directio is in effect, because I tried on UFS and time was 2s. So I went on with more experiments with the O_DSYNC flag set. I know that directio and O_DSYNC are two different things, but I thought the flag would force synchronous writes and achieve what directio does (and more). Directio and O_DSYNC are two different features. b) open the file with O_DSYNC flag: 147.26s ouch how big a file ? Does the resuld holds if you don't truncate ? -r c) same as b) but also enabled zfs_nocacheflush: 5.87s Is your pool created from a single HDD? My questions are: 1. With my primarycache and secondarycache settings, the FS shouldn't buffer reads and writes anymore. Wouldn't that be equivalent to O_DSYNC? Why a) and b) are so different? No. O_DSYNC deals with when the I/O is committed to media. 2. My understanding is that zfs_nocacheflush essentially removes the sync command sent to the device, which cancels the O_DSYNC flag. Why b) and c) are so different? No. Disabling the cache flush means that the volatile write buffer in the disk is not flushed. In other words, disabling the cache flush is in direct conflict with the semantics of O_DSYNC. 3. Does ZIL have anything to do with these results? Yes. The ZIL is used for meeting the O_DSYNC requirements. This has nothing to do with buffering. More details are on the ZFS Best Practices Guide. -- richard Thanks in advance for any suggestion/insight! Yi #include fcntl.h #include sys/time.h int main(int argc, char **argv) { struct timeval tim; gettimeofday(tim, NULL); double t1 = tim.tv_sec + tim.tv_usec/100.0; char a[8192]; int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0660); //int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC|O_DSYNC, 0660); if (argv[2][0] == '1') directio(fd, DIRECTIO_ON); int i; for (i=0; i1; ++i) pwrite(fd, a, sizeof(a), i*8192); close(fd); gettimeofday(tim, NULL); double t2 = tim.tv_sec + tim.tv_usec/100.0; printf(%f\n, t2-t1); } ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Replace block devices to increase pool size
On Sun, February 6, 2011 08:41, Achim Wolpers wrote: I have a zpool built up from two vdevs (one mirror and one raidz). The raidz is built up from 4x1TB HDs. When I successively replace each 1TB drive with a 2TB drive, will the capacity of the raidz double after the last block device is replaced? You may have to manually set the property autoexpand=on; I found yesterday that I had to (in my case on a mirror that I was upgrading). It probably depends on what version you created things at and/or what version you're running now. I replaced the drives in one of the three mirror vdevs in my main pool over this last weekend, and it all went quite smoothly, but I did have to turn on autoexpand at the end of the process to see the new space. -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs-discuss Digest, Vol 64, Issue 13
On Sun, February 6, 2011 13:01, Michael Armstrong wrote: Additionally, the way I do it is to draw a diagram of the drives in the system, labelled with the drive serial numbers. Then when a drive fails, I can find out from smartctl which drive it is and remove/replace without trial and error. Having managed to muddle through this weekend without loss (though with a certain amount of angst and duplication of efforts), I'm in the mood to label things a bit more clearly on my system :-). smartctl doesn't seem to be on my system, though. I'm running snv_134. I'm still pretty badly lost in the whole repository / package thing with Solaris, most of my brain cells were already occupied with Red Hat, Debian, and Perl package information :-( . Where do I look? Are the controller port IDs, the C9T3D0 things that ZFS likes, reasonably stable? They won't change just because I add or remove drives, right; only maybe if I change controller cards? -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 10:26 AM, Roch roch.bourbonn...@oracle.com wrote: Le 7 févr. 2011 à 06:25, Richard Elling a écrit : On Feb 5, 2011, at 8:10 AM, Yi Zhang wrote: Hi all, I'm trying to achieve the same effect of UFS directio on ZFS and here is what I did: Solaris UFS directio has three functions: 1. improved async code path 2. multiple concurrent writers 3. no buffering Of the three, #1 and #2 were designed into ZFS from day 1, so there is nothing to set or change to take advantage of the feature. 1. Set the primarycache of zfs to metadata and secondarycache to none, recordsize to 8K (to match the unit size of writes) 2. Run my test program (code below) with different options and measure the running time. a) open the file without O_DSYNC flag: 0.11s. This doesn't seem like directio is in effect, because I tried on UFS and time was 2s. So I went on with more experiments with the O_DSYNC flag set. I know that directio and O_DSYNC are two different things, but I thought the flag would force synchronous writes and achieve what directio does (and more). Directio and O_DSYNC are two different features. b) open the file with O_DSYNC flag: 147.26s ouch how big a file ? Does the resuld holds if you don't truncate ? -r The file is 8K*1 about 80M. I removed the O_TRUNC flag and the results stayed the same... c) same as b) but also enabled zfs_nocacheflush: 5.87s Is your pool created from a single HDD? My questions are: 1. With my primarycache and secondarycache settings, the FS shouldn't buffer reads and writes anymore. Wouldn't that be equivalent to O_DSYNC? Why a) and b) are so different? No. O_DSYNC deals with when the I/O is committed to media. 2. My understanding is that zfs_nocacheflush essentially removes the sync command sent to the device, which cancels the O_DSYNC flag. Why b) and c) are so different? No. Disabling the cache flush means that the volatile write buffer in the disk is not flushed. In other words, disabling the cache flush is in direct conflict with the semantics of O_DSYNC. 3. Does ZIL have anything to do with these results? Yes. The ZIL is used for meeting the O_DSYNC requirements. This has nothing to do with buffering. More details are on the ZFS Best Practices Guide. -- richard Thanks in advance for any suggestion/insight! Yi #include fcntl.h #include sys/time.h int main(int argc, char **argv) { struct timeval tim; gettimeofday(tim, NULL); double t1 = tim.tv_sec + tim.tv_usec/100.0; char a[8192]; int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0660); //int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC|O_DSYNC, 0660); if (argv[2][0] == '1') directio(fd, DIRECTIO_ON); int i; for (i=0; i1; ++i) pwrite(fd, a, sizeof(a), i*8192); close(fd); gettimeofday(tim, NULL); double t2 = tim.tv_sec + tim.tv_usec/100.0; printf(%f\n, t2-t1); } ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS/Drobo (Newbie) Question
In zfs terminology each of the groups you have is a VDEV and a zpool can be made of a number of VDEVs. This zpool can then be mounted as a single filesystem, or you can split it into as many filesystems as you wish. So the answer is yes to all the configurations you asked about and a lot more :) Bye, Deano de...@cloudpixies.com -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Gaikokujin Kyofusho Sent: 05 February 2011 17:55 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] ZFS/Drobo (Newbie) Question Thank you kebabber. I will try out indiana and virtual box to play around with it a bit. Just to make sure I understand your example, if I say had a 4x2tb drives, 2x750gb, 2x1.5tb drives etc then i could make 3 groups (perhaps 1 raidz1 + 1 mirrored + 1 mirrored), in terms of accessing them would they just be mounted like 3 partitions or could it all be accessed like one big partition? Anywho, I have indiana DL'ing now (very slow connection so thought I would post while i wait). Cheers, -Gaiko -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS/Drobo (Newbie) Question
On Sat, Feb 5, 2011 at 9:54 AM, Gaikokujin Kyofusho gaikokujinkyofu...@gmail.com wrote: Just to make sure I understand your example, if I say had a 4x2tb drives, 2x750gb, 2x1.5tb drives etc then i could make 3 groups (perhaps 1 raidz1 + 1 mirrored + 1 mirrored), in terms of accessing them would they just be mounted like 3 partitions or could it all be accessed like one big partition? You could add them to one pool, and then create multiple filesystems inside the pool. Your total storage would be the sum of the drives' capacity after redundancy, or 3x2tb + 750gb + 1.5tb. It's not recommended to use different levels of redundancy in a pool, so you may want to consider using mirrors for everything. This also makes it easier to add or upgrade capacity later. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
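To make that sum concrete, a small sketch of how the usable space works out (my own figures; raw capacities, ignoring filesystem overhead and TB/TiB rounding):

/* Usable capacity of the mixed pool described above: raidz1 keeps (n-1) of n
 * disks' worth of space, a two-way mirror keeps half. */
#include <stdio.h>

int main(void)
{
    double raidz1_4x2tb = (4 - 1) * 2.0;  /* 4x 2TB raidz1   -> 6.0 TB usable */
    double mirror_750gb = 0.75;           /* 2x 750GB mirror -> 0.75 TB usable */
    double mirror_1_5tb = 1.5;            /* 2x 1.5TB mirror -> 1.5 TB usable */

    printf("usable: %.2f TB\n", raidz1_4x2tb + mirror_750gb + mirror_1_5tb);
    return 0;
}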
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 6:15 AM, Yi Zhang yizhan...@gmail.com wrote: On Mon, Feb 7, 2011 at 12:25 AM, Richard Elling richard.ell...@gmail.com wrote: Solaris UFS directio has three functions: 1. improved async code path 2. multiple concurrent writers 3. no buffering Thanks for the comments, Richard. All I wanted is to achieve 3 on ZFS. But as I said, apprently 2.a) below didn't give me that. Do you have any suggestion? Don't. Use a ZIL, which will meet the requirements for synchronous IO. Set primarycache to metadata to prevent caching reads. ZFS is a very different beast than UFS and doesn't require the same tuning. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs-discuss Digest, Vol 64, Issue 21
I obtained smartmontools (which includes smartctl) from the standard apt repository (I'm using Nexenta, however); in addition, it's necessary to use the device type sat,12 with smartctl to get it to read attributes correctly on this OS, AFAIK. Also, regarding device IDs on the system: from what I've seen they are assigned to ports and therefore do not change; however, upon changing a controller they will most likely change, unless it's the same chipset with exactly the same port configuration. Hope this helps. On 7 Feb 2011, at 18:04, zfs-discuss-requ...@opensolaris.org wrote: Having managed to muddle through this weekend without loss (though with a certain amount of angst and duplication of efforts), I'm in the mood to label things a bit more clearly on my system :-). smartctl doesn't seem to be on my system, though. I'm running snv_134. I'm still pretty badly lost in the whole repository / package thing with Solaris, most of my brain cells were already occupied with Red Hat, Debian, and Perl package information :-( . Where do I look? Are the controller port IDs, the C9T3D0 things that ZFS likes, reasonably stable? They won't change just because I add or remove drives, right; only maybe if I change controller cards? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 1:06 PM, Brandon High bh...@freaks.com wrote: On Mon, Feb 7, 2011 at 6:15 AM, Yi Zhang yizhan...@gmail.com wrote: On Mon, Feb 7, 2011 at 12:25 AM, Richard Elling richard.ell...@gmail.com wrote: Solaris UFS directio has three functions: 1. improved async code path 2. multiple concurrent writers 3. no buffering Thanks for the comments, Richard. All I wanted is to achieve 3 on ZFS. But as I said, apprently 2.a) below didn't give me that. Do you have any suggestion? Don't. Use a ZIL, which will meet the requirements for synchronous IO. Set primarycache to metadata to prevent caching reads. ZFS is a very different beast than UFS and doesn't require the same tuning. I already set primarycache to metadata, and I'm not concerned about caching reads, but caching writes. It appears writes are indeed cached judging from the time of 2.a) compared to UFS+directio. More specifically, 80MB/2s=40MB/s (UFS+directio) looks realistic while 80MB/0.11s=800MB/s (ZFS+primarycache=metadata) doesn't. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 10:29 AM, Yi Zhang yizhan...@gmail.com wrote: I already set primarycache to metadata, and I'm not concerned about caching reads, but caching writes. It appears writes are indeed cached judging from the time of 2.a) compared to UFS+directio. More specifically, 80MB/2s=40MB/s (UFS+directio) looks realistic while 80MB/0.11s=800MB/s (ZFS+primarycache=metadata) doesn't. You're trying to force a solution that isn't relevant for the situation. ZFS is not UFS, and solutions that are required for UFS to work correctly are not needed with ZFS. Yes, writes are cached, but all the POSIX requirements for synchronous IO are met by the ZIL. As long as your storage devices, be they SAN, DAS or somewhere in between respect cache flushes, you're fine. If you need more performance, use a slog device that respects cache flushes. You don't need to worry about whether writes are being cached, because any data that is written synchronously will be committed to stable storage before the write returns. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Sector size on 7K3000 drives?
Hi all, Does anyone here know whether the new 7K3000 drives from Hitachi use 4K sectors or not? The docs say "Sector size (variable, Bytes/sector): 512", but since it's variable, any idea what it might be? I'm planning to replace 7x3+1 drives on this system to try to get some free space on some full VDEVs. If the drives are in fact 4K-sector drives, will it be possible to remedy the performance penalty now that we have already used about 99% of the original drives? Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 1:51 PM, Brandon High bh...@freaks.com wrote: On Mon, Feb 7, 2011 at 10:29 AM, Yi Zhang yizhan...@gmail.com wrote: I already set primarycache to metadata, and I'm not concerned about caching reads, but caching writes. It appears writes are indeed cached judging from the time of 2.a) compared to UFS+directio. More specifically, 80MB/2s=40MB/s (UFS+directio) looks realistic while 80MB/0.11s=800MB/s (ZFS+primarycache=metadata) doesn't. You're trying to force a solution that isn't relevant for the situation. ZFS is not UFS, and solutions that are required for UFS to work correctly are not needed with ZFS. Yes, writes are cached, but all the POSIX requirements for synchronous IO are met by the ZIL. As long as your storage devices, be they SAN, DAS or somewhere in between respect cache flushes, you're fine. If you need more performance, use a slog device that respects cache flushes. You don't need to worry about whether writes are being cached, because any data that is written synchronously will be committed to stable storage before the write returns. -B -- Brandon High : bh...@freaks.com Maybe I didn't make my intention clear. UFS with directio is reasonably close to a raw disk from my application's perspective: when the app writes to a file location, no buffering happens. My goal is to find a way to duplicate this on ZFS. Setting primarycache didn't eliminate the buffering, using O_DSYNC (whose side effects include elimination of buffering) made it ridiculously slow: none of the things I tried eliminated buffering, and just buffering, on ZFS. From the discussion so far my feeling is that ZFS is too different from UFS that there's simply no way to achieve this goal... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
Maybe I didn't make my intention clear. UFS with directio is reasonably close to a raw disk from my application's perspective: when the app writes to a file location, no buffering happens. My goal is to find a way to duplicate this on ZFS. There really is no need to do this on ZFS. Using an SLOG device (ZIL on an SSD) will allow ZFS to do its caching transparently to the application. Successive read operations will read from the cache if that's available, and writes will go to the SLOG _and_ the ARC for successive commits. As long as the SLOG device supports cache flush, or has a supercap/BBU, your data will be safe. Setting primarycache didn't eliminate the buffering, using O_DSYNC (whose side effects include elimination of buffering) made it ridiculously slow: none of the things I tried eliminated buffering, and just buffering, on ZFS. From the discussion so far my feeling is that ZFS is too different from UFS that there's simply no way to achieve this goal... See above - ZFS is quite safe to use for this, given a good hardware configuration. Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Replace block devices to increase pool size
I have a zpool built up from two vdevs (one mirror and one raidz). The raidz is built up from 4x1TB HDs. When I successively replace each 1TB drive with a 2TB drive, will the capacity of the raidz double after the last block device is replaced? You may have to manually set property autoexpand=on; I found yesterday that I had to (in my case on a mirror that I was upgrading). Probably depends on what version you created things at and/or what version you're running now. I replaced the drives in one of the three mirror vdevs in my main pool over this last weekend, and it all went quite smoothly, but I did have to turn on autoexpand at the end of the process to see the new space. autoexpand is off by default. I guess this is in case someone does something like replacing two 500GB drives with two 1TB drives and then wants to replace those with new 500GB drives again, since setting autoexpand=on is quite simple, and the expansion is irreversible. If you have expanded a VDEV, the only way to shrink it is to back up, reconfigure ZFS and restore. Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sector size on 7K3000 drives?
On Mon, February 7, 2011 14:12, Roy Sigurd Karlsbakk wrote: Hi al Does anyone here that knows if the new 7K3000 drives from Hitachi uses 4k sectors or not? The docs say Sector size (variable, Bytes/sector): 512, but since it's variable, any idea what it might be? I'm planning to [...] This PDF data sheet for SATA says 512, without the word variable: http://tinyurl.com/4b6qgtc http://www.hitachigst.com/tech/techlib.nsf/techdocs/EC6D440C3F64DBCC8825782300026498/$file/US7K3000_ds.pdf Of course they could mean what's reported to the OS (the SAS models say 512 / 520 / 528). I'd e-mail or call Hitachi and ask them directly (the contact info is in the PDF). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 2:21 PM, Brandon High bh...@freaks.com wrote: On Mon, Feb 7, 2011 at 11:17 AM, Yi Zhang yizhan...@gmail.com wrote: Maybe I didn't make my intention clear. UFS with directio is reasonably close to a raw disk from my application's perspective: when the app writes to a file location, no buffering happens. My goal is to find a way to duplicate this on ZFS. Step back and consider *why* you need no buffering. I'm writing a database-like application which manages its own page buffer, so I want to disable the buffering at the OS/FS level. UFS with directio suits my need perfectly, but I also want to try it on ZFS because ZFS doesn't directly overwrite a page which is being modified (it allocates a new page instead), and thus it represents a different category of FS. I want to measure the performance difference of my app on UFS and ZFS and tell how FS-dependent my app is. From the discussion so far my feeling is that ZFS is so different from UFS that there's simply no way to achieve this goal... ZFS is not UFS, and solutions that are required for UFS to work correctly are not needed with ZFS. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 1:17 PM, Yi Zhang yizhan...@gmail.com wrote: On Mon, Feb 7, 2011 at 1:51 PM, Brandon High bh...@freaks.com wrote: Maybe I didn't make my intention clear. UFS with directio is reasonably close to a raw disk from my application's perspective: when the app writes to a file location, no buffering happens. My goal is to find a way to duplicate this on ZFS. You're still mixing directio and O_DSYNC. O_DSYNC is like calling fsync(2) after every write(2). fsync(2) is useful to obtain some limited transactional semantics, as well as for durability semantics. In ZFS you don't need to call fsync(2) to get those transactional semantics, but you do need to call fsync(2) get those durability semantics. Now, in ZFS fsync(2) implies a synchronous I/O operation involving significantly more than just the data blocks you wrote to. Which means that O_DSYNC on ZFS is significantly slower than on UFS. You can address this in one of two ways: a) you might realize that you don't need every write(2) to be durable, then stop using O_DSYNC, b) you might get a fast ZIL device. I'm betting that if you look carefully at your application's requirements you'll probably conclude that you don't need O_DSYNC at all. Perhaps you can tell us more about your application. Setting primarycache didn't eliminate the buffering, using O_DSYNC (whose side effects include elimination of buffering) made it ridiculously slow: none of the things I tried eliminated buffering, and just buffering, on ZFS. From the discussion so far my feeling is that ZFS is too different from UFS that there's simply no way to achieve this goal... You've not really stated your application's requirements. You may be convinced that you need O_DSYNC, but chances are that you don't. And yes, it's possible that you'd need O_DSYNC on UFS but not on ZFS. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
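To illustrate the equivalence Nico describes, here's a minimal sketch (my own code, not from the thread): opening with O_DSYNC behaves like issuing fdatasync() after every write(), so each write only returns once the data is on stable storage, which on ZFS means a ZIL commit plus a cache flush to the disk:

/* Sketch: the two functions below have the same durability semantics.
 * Error handling omitted for brevity. */
#include <fcntl.h>
#include <unistd.h>

ssize_t write_dsync_style(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0660);
    ssize_t n = write(fd, buf, len);   /* durable before it returns */
    close(fd);
    return n;
}

ssize_t write_then_sync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0660);
    ssize_t n = write(fd, buf, len);   /* may sit in the ARC for now... */
    fdatasync(fd);                     /* ...until we explicitly push it to stable storage */
    close(fd);
    return n;
}

Whether either variant is slow or fast then comes down to where the ZIL lives (spinning disk vs. a dedicated slog), not to any directio-style setting.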
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 2:42 PM, Nico Williams n...@cryptonector.com wrote: On Mon, Feb 7, 2011 at 1:17 PM, Yi Zhang yizhan...@gmail.com wrote: On Mon, Feb 7, 2011 at 1:51 PM, Brandon High bh...@freaks.com wrote: Maybe I didn't make my intention clear. UFS with directio is reasonably close to a raw disk from my application's perspective: when the app writes to a file location, no buffering happens. My goal is to find a way to duplicate this on ZFS. You're still mixing directio and O_DSYNC. O_DSYNC is like calling fsync(2) after every write(2). fsync(2) is useful to obtain some limited transactional semantics, as well as for durability semantics. In ZFS you don't need to call fsync(2) to get those transactional semantics, but you do need to call fsync(2) get those durability semantics. Now, in ZFS fsync(2) implies a synchronous I/O operation involving significantly more than just the data blocks you wrote to. Which means that O_DSYNC on ZFS is significantly slower than on UFS. You can address this in one of two ways: a) you might realize that you don't need every write(2) to be durable, then stop using O_DSYNC, b) you might get a fast ZIL device. I'm betting that if you look carefully at your application's requirements you'll probably conclude that you don't need O_DSYNC at all. Perhaps you can tell us more about your application. Setting primarycache didn't eliminate the buffering, using O_DSYNC (whose side effects include elimination of buffering) made it ridiculously slow: none of the things I tried eliminated buffering, and just buffering, on ZFS. From the discussion so far my feeling is that ZFS is too different from UFS that there's simply no way to achieve this goal... You've not really stated your application's requirements. You may be convinced that you need O_DSYNC, but chances are that you don't. And yes, it's possible that you'd need O_DSYNC on UFS but not on ZFS. Nico -- Please see my previous email for a high-level discussion of my application. I know that I don't really need O_DSYNC. The reason why I tried that is to get the side effect of no buffering, which is my ultimate goal. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Drive i/o anomaly
matt.connolly...@gmail.com said: extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 1.2 36.0 153.6 4608.0 1.2 0.3 31.99.3 16 18 c12d0 0.0 113.40.0 7446.7 0.8 0.17.00.5 15 5 c8t0d0 0.2 106.44.1 7427.8 4.0 0.1 37.81.4 93 14 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.4 73.2 25.7 9243.0 2.3 0.7 31.69.8 34 37 c12d0 0.0 226.60.0 24860.5 1.6 0.27.00.9 25 19 c8t0d0 0.2 127.63.4 12377.6 3.8 0.3 29.72.2 91 27 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 44.20.0 5657.6 1.4 0.4 31.79.0 19 20 c12d0 0.2 76.04.8 9420.8 1.1 0.1 14.21.7 12 13 c8t0d0 0.0 16.60.0 2058.4 9.0 1.0 542.1 60.2 100 100 c8t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.00.20.0 25.6 0.0 0.00.32.3 0 0 c12d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c8t0d0 0.0 11.00.0 1365.6 9.0 1.0 818.1 90.9 100 100 c8t1d0 . . . matt.connolly...@gmail.com said: I expect that the c8t0d0 WD Green is the lemon here and for some reason is getting stuck in periods where it can write no faster than about 2MB/s. Does this sound right? No, it's the opposite. The drive sitting at 100%-busy, c8t1d0, while the other drive is idle, is the sick one. It's slower than the other, has 9.0 operations waiting (queued) to finish. The other one is idle because it has already finished the write activity and is waiting for the slow one in the mirror to catch up. If you run iostat -xn without the interval argument, i.e. so it prints out only one set of stats, you'll see the average performance of the drives since last reboot. If the asvc_t figure is significantly larger for one drive than the other, that's a way to identify the one which has been slower over the long term. Secondly, what I wonder is why it is that the whole file system seems to hang up at this time. Surely if the other drive is doing nothing, a web page can be served by reading from the available drive (c8t1d0) while the slow drive (c8t0d0) is stuck writing slow. The available drive is c8t0d0 in this case. However, if ZFS is in the middle of a txg (ZFS transaction) commit, it cannot safely do much with the pool until that commit finishes. You can see that ZFS only lets 10 operations accumulate per drive (used to be 35), i.e. 9.0 in the wait column, and 1.0 in the actv column, so it's kinda stuck until the drive gets its work done. Maybe the drive is failing, or maybe it's one of those with large sectors that are not properly aligned with the on-disk partitions. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Feb 7, 2011, at 17:08, Yi Zhang wrote: On Mon, Feb 7, 2011 at 10:26 AM, Roch roch.bourbonn...@oracle.com wrote: On Feb 7, 2011, at 06:25, Richard Elling wrote: On Feb 5, 2011, at 8:10 AM, Yi Zhang wrote: Hi all, I'm trying to achieve the same effect of UFS directio on ZFS and here is what I did: Solaris UFS directio has three functions: 1. improved async code path 2. multiple concurrent writers 3. no buffering Of the three, #1 and #2 were designed into ZFS from day 1, so there is nothing to set or change to take advantage of the feature. 1. Set the primarycache of zfs to metadata and secondarycache to none, recordsize to 8K (to match the unit size of writes) 2. Run my test program (code below) with different options and measure the running time. a) open the file without O_DSYNC flag: 0.11s. This doesn't seem like directio is in effect, because I tried on UFS and time was 2s. So I went on with more experiments with the O_DSYNC flag set. I know that directio and O_DSYNC are two different things, but I thought the flag would force synchronous writes and achieve what directio does (and more). Directio and O_DSYNC are two different features. b) open the file with O_DSYNC flag: 147.26s ouch how big a file? Does the result hold if you don't truncate? OK, if it had been a 2TB file, I could have seen an opening. Not for 80M though. So it's baffling. Unless! It's not just the open which takes 147s, it's the whole run: 10,000 writes. 10,000 sync writes without an SSD would take about 150 seconds at 68 IO/s. Without the O_DSYNC flag, all writes go to memory, so it's expected to take 0.11s to push 80M at 750MB/sec (memcopy speed). O_DSYNC + zfs_nocacheflush is in between: every write transfers data to an unstable cache but then does not flush it. At some point the cache might overflow, and so some writes have high latency while the data is transferring from the disk cache to the disk platter. So those results are in line with what everybody has been seeing before. Note that to compare with UFS, since UFS does not cache flush after every sync write like ZFS correctly does, you have to compare UFS + write cache disabled to ZFS (with or without write cache). After deleting a zfs pool, the disk write cache is left on, and so a UFS filesystem will appear inordinately fast then, unless you turn off the write cache with format -e (cache, write_cache, disable). -r -r The file is 8K*10000, about 80M. I removed the O_TRUNC flag and the results stayed the same... c) same as b) but also enabled zfs_nocacheflush: 5.87s Is your pool created from a single HDD? My questions are: 1. With my primarycache and secondarycache settings, the FS shouldn't buffer reads and writes anymore. Wouldn't that be equivalent to O_DSYNC? Why a) and b) are so different? No. O_DSYNC deals with when the I/O is committed to media. 2. My understanding is that zfs_nocacheflush essentially removes the sync command sent to the device, which cancels the O_DSYNC flag. Why b) and c) are so different? No. Disabling the cache flush means that the volatile write buffer in the disk is not flushed. In other words, disabling the cache flush is in direct conflict with the semantics of O_DSYNC. 3. Does ZIL have anything to do with these results? Yes. The ZIL is used for meeting the O_DSYNC requirements. This has nothing to do with buffering. More details are on the ZFS Best Practices Guide. -- richard Thanks in advance for any suggestion/insight!
Yi #include fcntl.h #include sys/time.h int main(int argc, char **argv) { struct timeval tim; gettimeofday(tim, NULL); double t1 = tim.tv_sec + tim.tv_usec/100.0; char a[8192]; int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0660); //int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC|O_DSYNC, 0660); if (argv[2][0] == '1') directio(fd, DIRECTIO_ON); int i; for (i=0; i1; ++i) pwrite(fd, a, sizeof(a), i*8192); close(fd); gettimeofday(tim, NULL); double t2 = tim.tv_sec + tim.tv_usec/100.0; printf(%f\n, t2-t1); } ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
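A quick sanity check of Roch's arithmetic above (my own sketch; the only figures taken from the thread are the observed timings, the 8K x 10000 write pattern, and the ~68 IO/s estimate for a single disk that honors cache flushes):

/* Why 147s vs 0.11s is exactly what you'd expect on a single HDD. */
#include <stdio.h>

int main(void)
{
    const double writes   = 10000.0;          /* 8K pwrites in the test program */
    const double io_bytes = writes * 8192.0;  /* ~80 MB in total */

    printf("O_DSYNC, flush per write:  %.0f s   (10000 writes / 68 IO/s)\n",
           writes / 68.0);
    printf("async, memory speed:       %.2f s  (80 MB / 750 MB/s)\n",
           io_bytes / 750e6);
    /* O_DSYNC + zfs_nocacheflush lands in between (~6 s observed): each write
     * still waits for the ZIL I/O to reach the drive, but only as far as its
     * volatile cache, not the platters. */
    return 0;
}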
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 02/07/11 11:49, Yi Zhang wrote: The reason why I tried that is to get the side effect of no buffering, which is my ultimate goal. ultimate = final. you must have a goal beyond the elimination of buffering in the filesystem. if the writes are made durable by zfs when you need them to be durable, why does it matter that it may buffer data while it is doing so? - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 2:54 PM, Nico Williams n...@cryptonector.com wrote: On Mon, Feb 7, 2011 at 1:49 PM, Yi Zhang yizhan...@gmail.com wrote: Please see my previous email for a high-level discussion of my application. I know that I don't really need O_DSYNC. The reason why I tried that is to get the side effect of no buffering, which is my ultimate goal. ZFS cannot not buffer. The reason is that ZFS likes to batch transactions into as large a contiguous write to disk as possible. The ZIL exists to support fsync(2) operations that must commit before the rest of a ZFS transaction. In other words: there's always some amount of buffering of writes in ZFS. In that case, ZFS doesn't suit my needs. As to read buffering, why would you want to disable that? My application manages its own buffer and reads/writes go through that buffer first. I don't want double buffering. You still haven't told us what your application does. Or why you want to get close to the metal. Simply telling us that you need no buffering doesn't really help us help you -- with that approach you'll simply end up believing that ZFS is not appropriate for your needs, even though it well might be. It's like Berkeley DB at a high level, though it doesn't require transaction support, durability, etc. I'm measuring its performance and don't want the FS buffer to pollute my results (hence directio). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 3:14 PM, Bill Sommerfeld sommerf...@alum.mit.edu wrote: On 02/07/11 11:49, Yi Zhang wrote: The reason why I tried that is to get the side effect of no buffering, which is my ultimate goal. ultimate = final. you must have a goal beyond the elimination of buffering in the filesystem. if the writes are made durable by zfs when you need them to be durable, why does it matter that it may buffer data while it is doing so? - Bill If buffering is on, the running time of my app doesn't reflect the actual I/O cost. My goal is to accurately measure the time of I/O. With buffering on, ZFS would batch up a bunch of writes and change both the original I/O activity and the time. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Drive i/o anomaly
Thanks, Marion. (I actually got the drive labels mixed up in the original post... I edited it on the forum page: http://opensolaris.org/jive/thread.jspa?messageID=511057#511057 ) My suspicion was the same: the drive doing the slow i/o is the problem. I managed to confirm that by taking the other drive offline (c8t0d0 samsung), and the same stalls and slow i/o occurred. After putting the drive online (and letting the resilver complete) I took the slow drive (c8t1d0 western digital green) offline and the system ran very nicely. It is a 4k sector drive, but I thought zfs recognised those drives and didn't need any special configuration...? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Newbie question
On Sat, February 5, 2011 03:54, Gaikokujin Kyofusho wrote: From what I understand using ZFS one could setup something like RAID 6 (RAID-Z2?) but with the ability to use drives of varying sizes/speeds/brands and able to add additional drives later. Am I about right? If so I will continue studying up on this if not then I guess I need to continue exploring different options. Thanks!! IMHO, your best bet for this kind of configuration is to use mirror pairs, not RAIDZ*. Because... Things you can't do with RAIDZ*: You cannot remove a vdev from a pool. You cannot make a RAIDZ* vdev smaller (fewer disks). You cannot make a RAIDZ* vdev larger (more disks). To increase the storage capacity of a RAIDZ* vdev you need to replace all the drives, one at a time, waiting for resilver between replacements (resilver times can be VERY long with big modern drives). And during each resilver, your redundancy will be reduced by 1 -- meaning a RAIDZ array would have NO redundancy during the resilver. (And activity in the pool is high during the resilver -- meaning the chances of any marginal drive crapping out are higher than normal during the resilver.) With mirrors, you can add new space by adding simply two drives (add a new mirror vdev). You can upgrade an existing mirror by replacing only two drives. You can upgrade an existing mirror without reducing redundancy below your starting point ever -- you attach a new drive, wait for the resilver to complete (at this point you have a three-way mirror), then detach one of the original drives; repeat for another new drive and the other original drive. Obviously, using mirrors requires you to buy more drives for any given amount of usable space. I must admit that my 8-bay hot-swap ZFS server cost me a LOT more than a Drobo (but then I bought in 2006, too). -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS/Drobo (Newbie) Question
On Sat, February 5, 2011 11:54, Gaikokujin Kyofusho wrote: Thank you kebabber. I will try out indiana and virtual box to play around with it a bit. Just to make sure I understand your example, if I say had a 4x2tb drives, 2x750gb, 2x1.5tb drives etc then i could make 3 groups (perhaps 1 raidz1 + 1 mirrored + 1 mirrored), in terms of accessing them would they just be mounted like 3 partitions or could it all be accessed like one big partition? A ZFS pool can contain many vdevs; you could put the three groups you describe into one pool, and then assign one (or more) file-systems to that pool. Putting them all in one pool seems to me the natural way to handle it; they're all similar levels of redundancy. It's more flexible to have everything in one pool, generally. (You could also make separate pools; my experience, for what it's worth, argues for making pools based on redundancy and performance (and only worry about BIG differences), and assign file-systems to pools based on needs for redundancy and performance. And for my home system I just have one big data pool, currently consisting of 1x1TB, 2x400GB, 2x400GB, plus 1TB hot spare.) Or you could stick strictly to mirrors; 4 pools 2x2T, 2x2T, 2x750G, 2x1.5T. Mirrors are more flexible, give you more redundancy, and are much easier to work with. -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, February 7, 2011 14:49, Yi Zhang wrote: On Mon, Feb 7, 2011 at 3:14 PM, Bill Sommerfeld sommerf...@alum.mit.edu wrote: On 02/07/11 11:49, Yi Zhang wrote: The reason why I tried that is to get the side effect of no buffering, which is my ultimate goal. ultimate = final. you must have a goal beyond the elimination of buffering in the filesystem. if the writes are made durable by zfs when you need them to be durable, why does it matter that it may buffer data while it is doing so? - Bill If buffering is on, the running time of my app doesn't reflect the actual I/O cost. My goal is to accurately measure the time of I/O. With buffering on, ZFS would batch up a bunch of writes and change both the original I/O activity and the time. I'm not sure I understand what you're trying to measure (which seems to be your top priority). Achievable performance with ZFS would be better using suitable caching; normally that's the benchmark statistic people would care about. -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] deduplication requirements
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Michael Core i7 2600 CPU 16gb DDR3 Memory 64GB SSD for ZIL (optional) Would this produce decent results for deduplication of 16TB worth of pools or would I need more RAM still? What matters is the amount of unique data in your pool. I'll just assume it's all unique, but of course that's ridiculous because if it's all unique then why would you want to enable dedup. But anyway, I'm assuming 16T of unique data. The rule is a little less than 3G of ram for every 1T of unique data. In your case, 16*2.8 = 44.8G ram required in addition to your base ram configuration. You need at least 48G of ram. Or less unique data. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
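For what it's worth, zdb can estimate the dedup table for data that is already in a pool before dedup is enabled; a sketch (the pool name tank is hypothetical):

# simulate dedup on the existing data and print a DDT histogram,
# including the projected dedup ratio; this walks the whole pool,
# so it can take a long time and uses a fair amount of memory
zdb -S tank

Each unique block costs roughly a few hundred bytes of DDT, which (assuming something like 128K average block size) is where a rule of thumb of roughly 3G of RAM per 1T of unique data comes from.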
Re: [zfs-discuss] Drive i/o anomaly
matt.connolly...@gmail.com said: After putting the drive online (and letting the resilver complete) I took the slow drive (c8t1d0 western digital green) offline and the system ran very nicely. It is a 4k sector drive, but I thought zfs recognised those drives and didn't need any special configuration...? That's a nice confirmation of the cost of not doing anything special (:-). I hear the problem may be due to 4k drives which report themselves as 512-byte-sector drives, for boot/BIOS compatibility reasons. I've also seen various ways to force 4k alignment, and to check what the ashift value is on your pool's drives, etc. Googling "solaris zfs 4k sector align" will lead the way. Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
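A quick way to see what alignment a pool's vdevs actually got (a sketch; any imported pool will show up):

# ashift=9 means 512-byte alignment, ashift=12 means 4K alignment
zdb | grep ashift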
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 3:47 PM, Nico Williams n...@cryptonector.com wrote: On Mon, Feb 7, 2011 at 2:39 PM, Yi Zhang yizhan...@gmail.com wrote: On Mon, Feb 7, 2011 at 2:54 PM, Nico Williams n...@cryptonector.com wrote: ZFS cannot not buffer. The reason is that ZFS likes to batch transactions into as large a contiguous write to disk as possible. The ZIL exists to support fsync(2) operations that must commit before the rest of a ZFS transaction. In other words: there's always some amount of buffering of writes in ZFS. In that case, ZFS doesn't suit my needs. Maybe. See below. As to read buffering, why would you want to disable those? My application manages its own buffer and reads/writes go through that buffer first. I don't want double buffering. So your concern is that you don't want to pay twice the memory cost for buffering? If so, set primarycache as described earlier and drop the O_DSYNC flag. ZFS will then buffer your writes, but only for a little while, and you should want it to because ZFS will almost certainly do a better job of batching transactions than your application would. With ZFS you'll benefit from: advanced volume management, snapshots/clones, dedup, Merkle hash trees (i.e., corruption detection), encryption, and so on. You'll almost certainly not be implementing any of those in your application... You still haven't told us what your application does. Or why you want to get close to the metal. Simply telling us that you need no buffering doesn't really help us help you -- with that approach you'll simply end up believing that ZFS is not appropriate for your needs, even though it well might be. It's like the Berkeley DB on a high level, though it doesn't require transaction support, durability, etc. I'm measuring its performance and don't want FS buffering to pollute my results (hence directio). You're still mixing directio and O_DSYNC. You should do three things: a) set primarycache=metadata, b) set recordsize to whatever your application's page size is (e.g., 8KB), c) stop using O_DSYNC. Tell us how that goes. I suspect the performance will be much better. Nico -- This is actually what I did for 2.a) in my original post. My concern there is that ZFS' internal write buffering makes it hard to get a grip on my application's behavior. I want to present my application's raw I/O performance without too much outside factors... UFS plus directio gives me exactly (or close to) that but ZFS doesn't... Of course, in the final deployment, it would be great to be able to take advantage of ZFS' advanced features such as I/O optimization. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
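Nico's suggested settings as commands, for reference (a sketch; the dataset name tank/bench is hypothetical):

# cache only metadata in the ARC, and match recordsize to the
# application's 8K page size; recordsize only affects newly written files
zfs set primarycache=metadata tank/bench
zfs set secondarycache=none tank/bench
zfs set recordsize=8k tank/bench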
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 02/07/11 12:49, Yi Zhang wrote: If buffering is on, the running time of my app doesn't reflect the actual I/O cost. My goal is to accurately measure the time of I/O. With buffering on, ZFS would batch up a bunch of writes and change both the original I/O activity and the time. if batching main pool writes improves the overall throughput of the system over a more naive i/o scheduling model, don't you want your users to see the improvement in performance from that batching? why not set up a steady-state sustained workload that will run for hours, and measure how long it takes the system to commit each 1000 or 1 transactions in the middle of the steady state workload? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On 2011-Feb-07 14:22:51 +0800, Matthew Angelo bang...@gmail.com wrote: I'm actually more leaning towards running a simple 7+1 RAIDZ1. Running this with 1TB is not a problem but I just wanted to investigate at what TB size the scales would tip. It's not that simple. Whilst resilver time is proportional to device size, it's far more impacted by the degree of fragmentation of the pool. And there's no 'tipping point' - it's a gradual slope so it's really up to you to decide where you want to sit on the probability curve. I understand RAIDZ2 protects against failures during a rebuild process. This would be its current primary purpose. Currently, my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks and worst case assuming this is 2 days, this is my 'exposure' time. Unless this is a write-once pool, you can probably also assume that your pool will get more fragmented over time, so by the time your pool gets to twice its current capacity, it might well take 3 days to rebuild due to the additional fragmentation. One point I haven't seen mentioned elsewhere in this thread is that all the calculations so far have assumed that drive failures were independent. In practice, this probably isn't true. All HDD manufacturers have their off days - where whole batches or models of disks are cr*p and fail unexpectedly early. The WD EARS is simply a demonstration that it's WD's turn to turn out junk. Your best protection against this is to have disks from enough different batches that a batch failure won't take out your pool. PSU, fan and SATA controller failures are likely to take out multiple disks but it's far harder to include enough redundancy to handle this and your best approach is probably to have good backups. I will be running a hot (or maybe cold) spare. So I don't need to factor in the time it takes for a manufacturer to replace the drive. In which case, the question is more whether 8-way RAIDZ1 with a hot spare (7+1+1) is better than 9-way RAIDZ2 (7+2). In the latter case, your hot spare is already part of the pool so you don't lose the time-to-notice plus time-to-resilver before regaining redundancy. The downside is that actively using the hot spare may increase the probability of it failing. -- Peter Jeremy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
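To make the two layouts concrete, they would look roughly like this (pool and device names are hypothetical; the two commands are alternatives, not a sequence):

# 7+1 raidz1 with a hot spare: the spare only begins resilvering
# after a failure has been noticed
zpool create tank raidz1 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
    spare c1t8d0

# 7+2 raidz2 using the same nine disks: the extra redundancy is
# already resilvered, and any two disks can fail
zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0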
Re: [zfs-discuss] Migrating zpool to new drives with 4K Sectors
Except for metadata, which seems to be written in small pieces, wouldn't having a ZFS record size that is a multiple of 4k on a vdev that is 4k aligned work OK? Or can a ZFS record that's 16KB, for example, start at any sector in the vdev? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] deduplication requirements
On 2/7/2011 1:06 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Michael Core i7 2600 CPU 16gb DDR3 Memory 64GB SSD for ZIL (optional) Would this produce decent results for deduplication of 16TB worth of pools or would I need more RAM still? What matters is the amount of unique data in your pool. I'll just assume it's all unique, but of course that's ridiculous because if it's all unique then why would you want to enable dedup. But anyway, I'm assuming 16T of unique data. The rule is a little less than 3G of ram for every 1T of unique data. In your case, 16*2.8 = 44.8G ram required in addition to your base ram configuration. You need at least 48G of ram. Or less unique data. To follow up on Ned's estimation, please let us know what kind of data you're planning on putting in the Dedup'd zpool. That can really give us a better idea as to the number of slabs that the pool will have, which is what drives dedup RAM and L2ARC usage. You also want to use an SSD for L2ARC, NOT for ZIL (though, you *might* also want one for ZIL, depending on your write patterns). In all honesty, these days, it doesn't pay to dedup a pool unless you can count on large amounts of common data. Virtual Machine images, incremental backups, ISO images of data CD/DVDs, and some video are your best bet. Pretty much everything else is going to cost you more in RAM/L2ARC than it's worth. IMHO, you don't want Dedup unless you can *count* on a 10x savings factor. Also, for reasons discussed here before, I would not recommend a Core i7 for use as a fileserver CPU. It's an Intel Desktop CPU, and almost certainly won't support ECC RAM on your motherboard, and it is seriously overpowered for your use. See if you can find a nice socket AM3+ motherboard for a low-range Athlon X3/X4. You can get ECC RAM for it (even in a desktop motherboard), it will cost less, and perform at least as well. Dedup is not CPU intensive. Compression is, and you may very well want to enable that, but you're still very unlikely to hit a CPU bottleneck before RAM starvation or disk wait occurs. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
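For reference, cache (L2ARC) and log (slog/ZIL) devices are added to a pool separately; a sketch with hypothetical device names:

# use the SSD as a read cache (L2ARC), which also helps hold the DDT
zpool add tank cache c3t0d0

# a separate log device only accelerates synchronous writes
# (e.g. NFS, databases); add one only if the write pattern needs it
zpool add tank log c3t1d0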
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 2/7/2011 1:10 PM, Yi Zhang wrote: [snip] This is actually what I did for 2.a) in my original post. My concern there is that ZFS' internal write buffering makes it hard to get a grip on my application's behavior. I want to present my application's raw I/O performance without too much outside factors... UFS plus directio gives me exactly (or close to) that but ZFS doesn't... Of course, in the final deployment, it would be great to be able to take advantage of ZFS' advanced features such as I/O optimization. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss And, there's your answer. You seem to care about doing bare-metal I/O for tuning of your application, so you can do consistent measurements. Not for actual usage in production. Therefore, do what's implied in the above: develop your app on UFS w/directio to work out the application issues and tune it. When you deploy it, use ZFS and its caching techniques to get maximum (though not absolutely consistently measurable) performance for the already-tuned app. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On Feb 7, 2011, at 1:07 PM, Peter Jeremy wrote: On 2011-Feb-07 14:22:51 +0800, Matthew Angelo bang...@gmail.com wrote: I'm actually more leaning towards running a simple 7+1 RAIDZ1. Running this with 1TB is not a problem but I just wanted to investigate at what TB size the scales would tip. It's not that simple. Whilst resilver time is proportional to device size, it's far more impacted by the degree of fragmentation of the pool. And there's no 'tipping point' - it's a gradual slope so it's really up to you to decide where you want to sit on the probability curve. The tipping point won't occur for similar configurations. The tip occurs for different configurations. In particular, if the size of the N+M parity scheme is very large and the resilver times become very, very large (weeks) then a (M-1)-way mirror scheme can provide better performance and dependability. But I consider these to be extreme cases. I understand RAIDZ2 protects against failures during a rebuild process. This would be its current primary purpose. Currently, my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks and worse case assuming this is 2 days this is my 'exposure' time. Unless this is a write-once pool, you can probably also assume that your pool will get more fragmented over time, so by the time your pool gets to twice it's current capacity, it might well take 3 days to rebuild due to the additional fragmentation. One point I haven't seen mentioned elsewhere in this thread is that all the calculations so far have assumed that drive failures were independent. In practice, this probably isn't true. All HDD manufacturers have their off days - where whole batches or models of disks are cr*p and fail unexpectedly early. The WD EARS is simply a demonstration that it's WD's turn to turn out junk. Your best protection against this is to have disks from enough different batches that a batch failure won't take out your pool. The problem with considering the failures as interdependent is that you cannot get the failure rate information from the vendors. You could guess, or use your own, but it would not always help you make a better design decision. PSU, fan and SATA controller failures are likely to take out multiple disks but it's far harder to include enough redundancy to handle this and your best approach is probably to have good backups. The top 4 items that fail most often, in no particular order, are: fans, power supplies, memory, and disk. This is why you will see the enterprise class servers use redundant fans, multiple high-quality power supplies, ECC memory, and some sort of RAID. I will be running hot (or maybe cold) spare. So I don't need to factor in Time it takes for a manufacture to replace the drive. In which case, the question is more whether 8-way RAIDZ1 with a hot spare (7+1+1) is better than 9-way RAIDZ2 (7+2). In this case, raidz2 is much better for dependability because the spare is already resilvered. It also performs better, though the dependability gains tend to be bigger than the performance gains. In the latter case, your hot spare is already part of the pool so you don't lose the time-to-notice plus time-to-resilver before regaining redundancy. The downside is that actively using the hot spare may increase the probability of it failing. No. The disk failure rate data does not conclusively show that activity causes premature failure. Other failure modes dominate. 
-- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Feb 7, 2011, at 1:10 PM, Yi Zhang wrote: This is actually what I did for 2.a) in my original post. My concern there is that ZFS' internal write buffering makes it hard to get a grip on my application's behavior. I want to present my application's raw I/O performance without too much outside factors... UFS plus directio gives me exactly (or close to) that but ZFS doesn't... In the bad old days when processors only had one memory controller, one could make an argument that not copying was an important optimization. Today, with the fast memory controllers (plural) we have, memory copies don't hurt very much. Other factors will dominate. Of course, with dtrace it should be relatively easy to measure the copy. Of course, in the final deployment, it would be great to be able to take advantage of ZFS' advanced features such as I/O optimization. Nice save :-) otherwise we wonder why you don't just use raw disk if you are so concerned about memory copies :-) -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
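Richard's dtrace suggestion might look something like the following rough one-liner (an untested sketch; it sums the byte counts passed to the kernel bcopy routine, broken down by process name):

# aggregate bytes copied via kernel bcopy(), keyed by process name
dtrace -n 'fbt::bcopy:entry { @bytes[execname] = sum(arg2); }'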
Re: [zfs-discuss] Drive i/o anomaly
Observation below... On Feb 4, 2011, at 7:10 PM, Matt Connolly wrote: Hi, I have a low-power server with three drives in it, like so:

matt@vault:~$ zpool status
  pool: rpool
 state: ONLINE
  scan: resilvered 588M in 0h3m with 0 errors on Fri Jan 7 07:38:06 2011
config:
        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c8t1d0s0  ONLINE       0     0     0
            c8t0d0s0  ONLINE       0     0     0
        cache
          c12d0s0     ONLINE       0     0     0
errors: No known data errors

I'm running netatalk file sharing for Mac, and using it as a Time Machine backup server for my Mac laptop. When files are copying to the server, I often see periods of a minute or so where network traffic stops. I'm convinced that there's some bottleneck on the storage side of things, because when this happens I can still ping the machine, and if I have an ssh window open, I can still see output from a `top` command running smoothly. However, if I try to do anything that touches disk (e.g. `ls`), that command stalls. At the time it comes good, everything comes good, file copies across the network continue, etc. If I have an ssh terminal session open and run `iostat -nv 5` I see something like this:

                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.2   36.0  153.6  4608.0  1.2  0.3   31.9    9.3  16  18 c12d0
    0.0  113.4    0.0  7446.7  0.8  0.1    7.0    0.5  15   5 c8t0d0
    0.2  106.4    4.1  7427.8  4.0  0.1   37.8    1.4  93  14 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.4   73.2   25.7  9243.0  2.3  0.7   31.6    9.8  34  37 c12d0
    0.0  226.6    0.0 24860.5  1.6  0.2    7.0    0.9  25  19 c8t0d0
    0.2  127.6    3.4 12377.6  3.8  0.3   29.7    2.2  91  27 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   44.2    0.0  5657.6  1.4  0.4   31.7    9.0  19  20 c12d0
    0.2   76.0    4.8  9420.8  1.1  0.1   14.2    1.7  12  13 c8t0d0
    0.0   16.6    0.0  2058.4  9.0  1.0  542.1   60.2 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.2    0.0    25.6  0.0  0.0    0.3    2.3   0   0 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   11.0    0.0  1365.6  9.0  1.0  818.1   90.9 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.0    0.1     0.0  0.0  0.0    0.1   25.4   0   1 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   17.6    0.0  2182.4  9.0  1.0  511.3   56.8 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   16.6    0.0  2058.4  9.0  1.0  542.1   60.2 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   15.8    0.0  1959.2  9.0  1.0  569.6   63.3 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.0    0.1     0.0  0.0  0.0    0.1    0.1   0   0 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   17.4    0.0  2157.6  9.0  1.0  517.2   57.4 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   18.2    0.0  2256.8  9.0  1.0  494.5   54.9 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   14.8    0.0  1835.2  9.0  1.0  608.1   67.5 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.0    0.1     0.0  0.0  0.0    0.1    0.1   0   0 c12d0
    0.0    1.4    0.0     0.6  0.0  0.0    0.0    0.2   0   0 c8t0d0
    0.0   49.0    0.0  6049.6  6.7  0.5  137.6   11.2
Re: [zfs-discuss] deduplication requirements
On 6 February 2011 01:34, Michael michael.armstr...@gmail.com wrote: Hi guys, I'm currently running 2 zpools each in a raidz1 configuration, totalling around 16TB usable data. I'm running it all on an OpenSolaris based box with 2gb memory and an old Athlon 64 3700 CPU. I understand this is very poor and underpowered for deduplication, so I'm looking at building a new system, but wanted some advice first. Here is what I've planned so far: Core i7 2600 CPU 16gb DDR3 Memory 64GB SSD for ZIL (optional) http://ark.intel.com/Product.aspx?id=52213 The desktop Core i* range doesn't support ECC RAM at all; this could potentially be a pool breaker if you get a flipped bit in the wrong place (a significant metadata block). Just something to keep in mind. Also, Intel have issued a recall (ish) for all of the 6 series chipsets released so far: the PLL unit for the 3Gbit SATA ports on the chipset is driven too hard and will likely degrade over time (5~15% failure rate over three years). They are talking about a March~April time to fix in the channel. If you don't plan on using the 3Gbit SATA ports, then you're fine. Intel will make 1155 Xeons at some point, i.e. http://en.wikipedia.org/wiki/List_of_future_Intel_microprocessors#.22Sandy_Bridge.22_.2832_nm.29_8 They support ECC (just check for a specific QVL after launch; DDR3 ECC isn't necessarily the only thing you need to look for). I think the Feb 20 release date may have been pushed for the chipset respin. Cheers, ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and TRIM - No need for TRIM
On Sun, 6 Feb 2011, Orvar Korvar wrote: 1) Using SSD without TRIM is acceptable. The only drawback is that without TRIM, the SSD will write much more, which affects lifetime. Because when the SSD has written enough, it will break. Why do you think that the SSD should necessarily write much more? I don't follow that conclusion. If I can figure out how to design an SSD which does not necessarily write much more, I suspect that an actual SSD designer can do the same. USB sticks and Compact Flash cards need not apply. :-) Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and TRIM - No need for TRIM
On Mon, Feb 7 at 20:43, Bob Friesenhahn wrote: On Sun, 6 Feb 2011, Orvar Korvar wrote: 1) Using SSD without TRIM is acceptable. The only drawback is that without TRIM, the SSD will write much more, which effects life time. Because when the SSD has written enough, it will break. Why do you think that the SSD should necessarily write much more? I don't follow that conclusion. If I can figure out how to design a SSD which does not necessarily write much more, I suspect that an actual SSD designer can do the same. Blocks/sectors marked as being TRIM'd do not need to be maintained by the garbage collection engine. Depending on the design of the SSD, this can significantly reduce the write amplification of the SSD. -- Eric D. Mudama edmud...@bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Sil3124 Sata controller for ZFS on Sparc OpenSolaris Nevada b130
As part of a small home project, I have purchased a SIL3124 HBA in hopes of attaching an external drive/drive enclosure via eSATA. The host in question is an old Sun Netra T1 currently running OpenSolaris Nevada b130. The card in question is this Sil3124 card: http://www.newegg.com/product/product.aspx?item=N82E16816124003 although I did not purchase it from Newegg. I specifically purchased this card as I have seen specific reports of it working under Solaris/OpenSolaris distros on several Solaris mailing lists. After installing the card and associated components, I did numerous things in an attempt to see the single drive attached to my Netra, including a reconfiguration boot, several different devfsadm commands, and looking for components using scanpci and cfgadm. Although I am not functional yet (I can't see my drive with format or format -e), I believe I see my HBA with both prtdiag and prtconf. I will post some additional system information at the bottom of this note. To cut to the chase, after jumping on Yahoo for some RTFM stuff, it looks like there is a system package called *SUNWsi3124*. I looked on my Netra, and it isn't there. I reviewed my OpenSolaris Nevada b130 iso, and it also doesn't have a SUNWsi3124 package. On a whim, I looked on my Sun Ultra20m2 system (X64 AMD system) which is also running OpenSolaris Nevada b130, and this package is there. So it looks like the SUNWsi3124 package is x86/x64 only? I don't know if this specifically is why my Netra doesn't see my eSATA drive, but the SUNWsi3124 package is the best lead I have so far. Thanks for any comments, Jerry

# prtdiag
System Configuration: Sun Microsystems sun4u Netra T1 200 (UltraSPARC-IIe 500MHz)
System clock frequency: 100 MHz
Memory size: 2048 Megabytes

================================= CPUs =================================
                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---   ------   ----  ------  ------  ----
 0    0      0      500    0.2      13     1.4

=============================== IO Cards ===============================
     Bus#  Freq
Brd  Type  MHz   Slot  Name                        Model
---  ----  ----  ----  --------------------------  ----------------
 0   PCI-1  33    12   ebus
 0   PCI-1  33     3   pmu-pci10b9,7101
 0   PCI-1  33     3   lomp
 0   PCI-1  33     7   isa
 0   PCI-1  33    12   network-pci108e,1101        SUNW,pci-eri
 0   PCI-1  33    12   usb-pci108e,1103.1
 0   PCI-1  33    13   ide-pci10b9,5229
 0   PCI-1  33     5   network-pci108e,1101        SUNW,pci-eri
 0   PCI-1  33     5   usb-pci108e,1103.1
 0   PCI-2  33     8   scsi-glm                    Symbios,53C896
 0   PCI-2  33     8   scsi-glm                    Symbios,53C896
 0   PCI-2  33     5   raid-pci1095,7124

No failures found in System
============================

Relevant section from prtconf -v:
raid (driver not attached)
    Hardware properties:
        name='compatible' type=string items=3
            value='pci1095,7124' + 'pci1095,3124' + 'pciclass,010400'

Output from the prtdev.ksh script, from here: http://bolthole.com/solaris/HCL/
VendorID=0x1095, DeviceID=0x3124
Sub VendorID=0x1095, Sub DeviceID=0x7124
name: 'raid'
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
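A few hypothetical checks that might narrow this down (a sketch only; whether an si3124 driver exists at all for SPARC in this build is exactly the open question):

# is the driver package installed?
pkginfo SUNWsi3124

# is the card's PCI ID known to any driver?
grep -i 'pci1095,3124' /etc/driver_aliases

# if the si3124 driver is present but simply not bound to this ID,
# adding an alias and re-probing the devices might attach it
update_drv -a -i '"pci1095,3124"' si3124
devfsadm -i si3124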