Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Thu, 14 Feb 2008, Tim wrote: If you're going for best single file write performance, why are you doing mirrors of the LUNs? Perhaps I'm misunderstanding why you went from one giant raid-0 to what is essentially a raid-10. That decision was made because I also need data reliability. As mentioned before, the write rate peaked at 200MB/second using RAID-0 across 12 disks exported as one big LUN. Other firmware-based methods I tried typically offered about 170MB/second. Even a four disk firmware-managed RAID-5 with ZFS on top offered about 165MB/second. Given that I would like to achieve 300MB/second, a few tens of MB don't make much difference. It may be that I bought the wrong product, but perhaps there is a configuration change which will help make up some of the difference without sacrificing data reliability. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Will Murnane wrote: What is the workload for this system? Benchmarks are fine and good, but application performance is the determining factor of whether a system is performing acceptably. The system is primarily used for image processing where the image data is uncompressed and a typical file is 12MB. In some cases the files will be hundreds of MB or GB. The typical case is to read a file and output a new file. For some very large files, an uncompressed temporary file is edited in place with random access. I am the author of the application and need the filesystem to be fast enough that it will uncover any slowness in my code. :-) Perhaps iozone is behaving in a bad way; you might investigate That is always possible. Iozone (http://www.iozone.org/) has been around for a very long time and has seen a lot of improvement by many smart people so it does not seem very suspect. bonnie++: http://www.sunfreeware.com/programlistintel10.html I will check it out. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
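For readers who want to reproduce this kind of single-stream test, a minimal iozone invocation along these lines exercises sequential write and read with a 128K record size; the file name and sizes are placeholders, and the test file should be larger than RAM so the ARC cannot hide the disks:

    iozone -i 0 -i 1 -r 128k -s 64g -e -f /tank/iozone.tmp

The -e flag folds fsync() time into the result so that cached-but-unwritten data does not inflate the numbers.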
Re: [zfs-discuss] ZFS write throttling
On Fri, 15 Feb 2008, Roch Bourbonnais wrote: The latter appears to be bug 6429855. But the underlying behaviour doesn't really seem desirable; are there plans afoot to do any work on ZFS write throttling to address this kind of thing? Throttling is being addressed. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205 I have observed similar behavior when using 'iozone' on a large file to benchmark ZFS on my StorageTek 2540 array. Fsstat shows gaps of up to 30 seconds of no I/O when run on a 10 second update cycle but when I go to look at the lights on the array, I see that it is actually fully busy. It seems that the application is stalled during this load. It also seems that simple operations like 'ls' get stalled under such heavy load. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
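For anyone who wants to watch for the same stalls, the observation above can be reproduced with fsstat and zpool iostat on a 10 second cycle (the pool name and mount point are placeholders); the gaps show up as intervals with essentially zero filesystem activity while the array lights stay busy:

    fsstat /tank 10
    zpool iostat tank 10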
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Roch Bourbonnais wrote: What was the interlace on the LUN? The question was about LUN interlace, not interface. 128K to 1M works better. The segment size is set to 128K. The max the 2540 allows is 512K. Unfortunately, the StorageTek 2540 and CAM documentation does not really define what segment size means. Any compression? Compression is disabled. Does turning off checksums help the numbers (that would point to CPU-limited throughput)? I have not tried that, but this system is loafing during the benchmark. It has four 3GHz Opteron cores. Does this output from 'iostat -xnz 20' help to understand the issues?

                        extended device statistics
    r/s    w/s   kr/s     kw/s  wait  actv wsvc_t asvc_t  %w  %b device
    3.0    0.7   26.4      3.5   0.0   0.0    0.0    4.2   0   2 c1t1d0
    0.0  154.2    0.0  19680.3   0.0  20.7    0.0  134.2   0  59 c4t600A0B80003A8A0B096147B451BEd0
    0.0  211.5    0.0  26940.5   1.1  33.9    5.0  160.5  99 100 c4t600A0B800039C9B50A9C47B4522Dd0
    0.0  211.5    0.0  26940.6   1.1  33.9    5.0  160.4  99 100 c4t600A0B800039C9B50AA047B4529Bd0
    0.0  154.0    0.0  19654.7   0.0  20.7    0.0  134.2   0  59 c4t600A0B80003A8A0B096647B453CEd0
    0.0  211.3    0.0  26915.0   1.1  33.9    5.0  160.5  99 100 c4t600A0B800039C9B50AA447B4544Fd0
    0.0  152.4    0.0  19447.0   0.0  20.5    0.0  134.5   0  59 c4t600A0B80003A8A0B096A47B4559Ed0
    0.0  213.2    0.0  27183.8   0.9  34.1    4.2  159.9  90 100 c4t600A0B800039C9B50AA847B45605d0
    0.0  152.5    0.0  19453.4   0.0  20.5    0.0  134.5   0  59 c4t600A0B80003A8A0B096E47B456DAd0
    0.0  213.2    0.0  27177.4   0.9  34.1    4.2  159.9  90 100 c4t600A0B800039C9B50AAC47B45739d0
    0.0  213.2    0.0  27195.3   0.9  34.1    4.2  159.9  90 100 c4t600A0B800039C9B50AB047B457ADd0
    0.0  154.4    0.0  19711.8   0.0  20.7    0.0  134.0   0  59 c4t600A0B80003A8A0B097347B457D4d0
    0.0  211.3    0.0  26958.6   1.1  33.9    5.0  160.6  99 100 c4t600A0B800039C9B50AB447B4595Fd0

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Peter Tribble wrote: Each LUN is accessed through only one of the controllers (I presume the 2540 works the same way as the 2530 and 61X0 arrays). The paths are active/passive (if the active fails it will relocate to the other path). When I set mine up the first time it allocated all the LUNs to controller B and performance was terrible. I then manually transferred half the LUNs to controller A and it started to fly. I assume that you either altered the Access State shown for the LUN in the output of 'mpathadm show lu DEVICE' or you noticed and observed the pattern:

    Target Port Groups:
            ID:  3
            Explicit Failover:  yes
            Access State:  active
            Target Ports:
                    Name:  200400a0b83a8a0c
                    Relative ID:  0
            ID:  2
            Explicit Failover:  yes
            Access State:  standby
            Target Ports:
                    Name:  200500a0b83a8a0c
                    Relative ID:  0

I find this all very interesting and illuminating:

    for dev in c4t600A0B80003A8A0B096A47B4559Ed0 \
               c4t600A0B80003A8A0B096E47B456DAd0 \
               c4t600A0B80003A8A0B096147B451BEd0 \
               c4t600A0B80003A8A0B096647B453CEd0 \
               c4t600A0B80003A8A0B097347B457D4d0 \
               c4t600A0B800039C9B50A9C47B4522Dd0 \
               c4t600A0B800039C9B50AA047B4529Bd0 \
               c4t600A0B800039C9B50AA447B4544Fd0 \
               c4t600A0B800039C9B50AA847B45605d0 \
               c4t600A0B800039C9B50AAC47B45739d0 \
               c4t600A0B800039C9B50AB047B457ADd0 \
               c4t600A0B800039C9B50AB447B4595Fd0
    do
      echo "=== $dev ==="
      mpathadm show lu /dev/rdsk/$dev | grep 'Access State'
    done

    === c4t600A0B80003A8A0B096A47B4559Ed0 ===
            Access State:  active
            Access State:  standby
    === c4t600A0B80003A8A0B096E47B456DAd0 ===
            Access State:  active
            Access State:  standby
    === c4t600A0B80003A8A0B096147B451BEd0 ===
            Access State:  active
            Access State:  standby
    === c4t600A0B80003A8A0B096647B453CEd0 ===
            Access State:  active
            Access State:  standby
    === c4t600A0B80003A8A0B097347B457D4d0 ===
            Access State:  active
            Access State:  standby
    === c4t600A0B800039C9B50A9C47B4522Dd0 ===
            Access State:  active
            Access State:  standby
    === c4t600A0B800039C9B50AA047B4529Bd0 ===
            Access State:  standby
            Access State:  active
    === c4t600A0B800039C9B50AA447B4544Fd0 ===
            Access State:  standby
            Access State:  active
    === c4t600A0B800039C9B50AA847B45605d0 ===
            Access State:  standby
            Access State:  active
    === c4t600A0B800039C9B50AAC47B45739d0 ===
            Access State:  standby
            Access State:  active
    === c4t600A0B800039C9B50AB047B457ADd0 ===
            Access State:  standby
            Access State:  active
    === c4t600A0B800039C9B50AB447B4595Fd0 ===
            Access State:  standby
            Access State:  active

Notice that the first six LUNs are active on one controller while the second six LUNs are active on the other controller. Based on this, I should rebuild my pool by splitting my mirrors across this boundary. I am really happy that ZFS makes such things easy to try out.

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
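A sketch of what splitting the mirrors across that boundary might look like when building the pool, pairing one LUN from the first (controller A) group with one from the second (controller B) group; the pool name 'tank' is a placeholder and only two of the six pairs are shown:

    zpool create tank \
      mirror c4t600A0B80003A8A0B096147B451BEd0 c4t600A0B800039C9B50AA047B4529Bd0 \
      mirror c4t600A0B80003A8A0B096647B453CEd0 c4t600A0B800039C9B50AA447B4544Fd0
    # ...the remaining four mirror pairs follow the same pattern

With this layout each mirror half lives behind a different active controller, so a controller failure leaves every vdev with one working side.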
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Peter Tribble wrote: May not be relevant, but still worth checking - I have a 2530 (which ought to be the same, only SAS instead of FC), and got fairly poor performance at first. Things improved significantly when I got the LUNs properly balanced across the controllers. What do you mean by properly balanced across the controllers? Are you using the multipath support in Solaris 10 or are you relying on ZFS to balance the I/O load? Do some disks have more affinity for one controller than the other? With the 2540, there is a FC connection to each redundant controller. The Solaris 10 multipathing presumably load-shares the I/O to each controller. The controllers then perform some sort of magic to get the data to and from the SAS drives. The controller stats are below. I notice that controller B has seen a bit more activity than controller A, but the firmware does not provide a controller uptime value so it is possible that one controller was up longer than the other:

    Performance Statistics - A on Storage System Array-1
    Timestamp: Fri Feb 15 14:37:39 CST 2008
    Total IOPS:                1098.83
    Average IOPS:               355.83
    Read %:                      38.28
    Write %:                     61.71
    Total Data Transferred:  139284.41 KBps
    Read:                     53844.26 KBps
    Average Read:             17224.04 KBps
    Peak Read:               242232.70 KBps
    Written:                  85440.15 KBps
    Average Written:          26966.58 KBps
    Peak Written:            139918.90 KBps
    Average Read Size:          639.96 KB
    Average Write Size:         629.94 KB
    Cache Hit %:                 85.32

    Performance Statistics - B on Storage System Array-1
    Timestamp: Fri Feb 15 14:37:45 CST 2008
    Total IOPS:                1526.69
    Average IOPS:               497.32
    Read %:                      34.90
    Write %:                     65.09
    Total Data Transferred:  193594.58 KBps
    Read:                     68200.00 KBps
    Average Read:             24052.61 KBps
    Peak Read:               339693.55 KBps
    Written:                 125394.58 KBps
    Average Written:          37768.40 KBps
    Peak Written:            183534.66 KBps
    Average Read Size:          895.80 KB
    Average Write Size:         883.38 KB
    Cache Hit %:                 75.05

If I then go to the performance stats on an individual disk, I see

    Performance Statistics - Disk-08 on Storage System Array-1
    Timestamp: Fri Feb 15 14:43:36 CST 2008
    Total IOPS:                 196.33
    Average IOPS:                72.01
    Read %:                       9.65
    Write %:                     90.34
    Total Data Transferred:   25076.91 KBps
    Read:                      2414.11 KBps
    Average Read:              3521.44 KBps
    Peak Read:                48422.00 KBps
    Written:                  22662.79 KBps
    Average Written:           5423.78 KBps
    Peak Written:             28036.43 KBps
    Average Read Size:          127.29 KB
    Average Write Size:         127.77 KB
    Cache Hit %:                 89.30

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Luke Lonergan wrote: I only managed to get 200 MB/s write when I did RAID 0 across all drives using the 2540's RAID controller and with ZFS on top. Ridiculously bad. I agree. :-( While I agree that data is sent twice (actually up to 8X if striping across four mirrors) Still only twice the data that would otherwise be sent, in other words: the mirroring causes a duplicate set of data to be written. Right. But more little bits of data to be sent due to ZFS striping. Given that you're not even saturating the FC-AL links, the problem is in the hardware RAID. I suggest disabling read and write caching in the hardware RAID. Hardware RAID is not an issue in this case since each disk is exported as a LUN. Performance with ZFS is not much different than when hardware RAID was used. I previously tried disabling caching in the hardware and it did not make a difference in the results. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Bob Friesenhahn wrote: Notice that the first six LUNs are active to one controller while the second six LUNs are active to the other controller. Based on this, I should rebuild my pool by splitting my mirrors across this boundary. I am really happy that ZFS makes such things easy to try out. Now that I have tried this out, I can unhappily say that it made no measurable difference to actual performance. However it seems like a better layout anyway. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Albert Chin wrote: http://groups.google.com/group/comp.unix.solaris/browse_frm/thread/59b43034602a7b7f/0b500afc4d62d434?lnk=stq=#0b500afc4d62d434 This is really discouraging. Based on these newsgroup postings I am thinking that the Sun StorageTek 2540 was not a good investment for me, especially given that the $23K for it came right out of my own paycheck and it took me 6 months of frustration (the first shipment was damaged) to receive it. Regardless, this was the best I was able to afford unless I built the drive array myself. The page at http://www.sun.com/storagetek/disk_systems/workgroup/2540/benchmarks.jsp claims 546.22 MBPS for the large file processing benchmark. So I went to look at the actual SPC-2 full disclosure report and saw that for one stream, the average data rate is 105MB/second (compared with 102MB/second with RAID-5), rising to 284MB/second with 10 streams. The product obviously performs much better for reads than it does for writes and is better for multi-user performance than single-user. It seems like I am getting a good bit more performance from my own setup than what the official benchmark suggests (they used 72GB drives, 24 drives in total), so it seems that everything is working fine. This is a lesson for me, and I have certainly learned a fair amount about drive arrays, Fibre Channel, and ZFS in the process. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] 'du' is not accurate on zfs
I have a script which generates a file and then immediately uses 'du -h' to obtain its size. With Solaris 10 I notice that this often returns an incorrect value of '0' as if ZFS is lazy about reporting actual disk use. Meanwhile, 'ls -l' does report the correct size. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
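A small sequence that should reproduce the effect on a ZFS filesystem (the file name is arbitrary); du reports allocated blocks, which lag behind until the pending transaction group is committed, while ls -l reports the logical length immediately:

    dd if=/dev/zero of=testfile bs=1024k count=10
    ls -l testfile    # shows the full 10MB length right away
    du -h testfile    # may show 0 until the data is actually on disk
    sync
    du -h testfile    # after the transaction group commits, du catches up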
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Sat, 16 Feb 2008, Peter Tribble wrote: Agreed. My 2530 gives me about 450MB/s on writes and 800 on reads. That's zfs striped across 4 LUNs, each of which is hardware raid-5 (24 drives in total, so each raid-5 LUN is 5 data + 1 parity). Is this single-file bandwidth or multiple-file/thread bandwidth? According to Sun's own benchmark data, the 2530 was capable of 20MB/second more than the 2540 on writes for a single large file, and the difference went away after that. For multi-user activity the throughput clearly improves to be similar to what you describe. Most people are likely interested in maximizing multi-user performance, and particularly for reads. Visit http://www.storageperformance.org/results/benchmark_results_spc2/#sun_spc2 to see the various benchmark results. According to these results, for large-file writes the 2530/2540 compares well with other StorageTek products, including the more expensive 6140 and 6540 arrays. It also compares well with similarly-sized storage products from other vendors. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 'du' is not accurate on zfs
On Sat, 16 Feb 2008, Richard Elling wrote: ls -l shows the length. ls -s shows the size, which may be different than the length. You probably want size rather than du. That is true. Unfortunately 'ls -s' displays in units of disk blocks and does not also consider the 'h' option in order to provide a value suitable for humans. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
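As a workaround, the block count from 'ls -s' can be converted by hand; a rough sketch, assuming the Solaris default of 512-byte blocks (GNU ls reports 1024-byte units, so adjust accordingly):

    blocks=`ls -s testfile | awk '{print $1}'`
    echo "`expr $blocks \* 512` bytes allocated"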
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Sat, 16 Feb 2008, Joel Miller wrote: Here is how you can tell the array to ignore cache sync commands and the force unit access bits...(Sorry if it wraps..) Thanks to the kind advice of yourself and Mertol Ozyoney, there is a huge boost in write performance: Was: 154MB/second Now: 279MB/second The average service time for each disk LUN has dropped considerably. The numbers provided by 'zpool iostat' are very close to what is measured by 'iozone'. This is like night and day and gets me very close to my original target write speed of 300MB/second. Thank you very much! Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Sat, 16 Feb 2008, Mertol Ozyoney wrote: Please try to distribute LUNs between controllers and try to benchmark by disabling cache mirroring (it's different than disabling the cache). By the term disabling cache mirroring are you talking about Write Cache With Replication Enabled in the Common Array Manager? Does this feature maintain a redundant cache (two data copies) between controllers? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] filebench for Solaris 10?
Some of us are still using Solaris 10 since it is the version of Solaris released and supported by Sun. The 'filebench' software from SourceForge does not seem to install or work on Solaris 10. The 'pkgadd' command refuses to recognize the package, even when it is set to Solaris 2.4 mode. I was able to build the software, but observing what 'make install' does shows that it installs into the private home directory of some hard-coded user. The 'make package' command builds an unusable package similar to the one on SourceForge. Are the filebench maintainers aware of this problem? Will a package which works for Solaris 10 (which some of us are still using) be posted? Thanks, Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Recommendations for per-user NFS shared home directories?
I am attempting to create per-user ZFS filesystems under an exported /home ZFS filesystem. This would work fine except that the ownership/permissions settings applied to the mount point of those per-user filesystems on the server are not seen by NFS clients. Instead NFS clients see directory ownership of root:other (Solaris 9 clients), root:wheel (OS-X clients), and root:daemon (FreeBSD clients). Only Solaris 10 clients seem to preserve original ownership and permissions. Is there a way to resolve this problem? Thanks, Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
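For context, a minimal sketch of the kind of setup being described (pool and user names are placeholders); the sharenfs property is inherited by the per-user child filesystems, and the ownership is applied on the server side:

    zfs create -o mountpoint=/home tank/home
    zfs set sharenfs=rw tank/home
    zfs create tank/home/alice
    chown alice:staff /home/alice

The question above is why that final chown is not visible from the older NFS clients.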
Re: [zfs-discuss] Recommendations for per-user NFS shared home directories?
On Sun, 17 Feb 2008, Mattias Pantzare wrote: You should use automount for your mountings if you have many clients. Change the automount map and all clients will mount the new filesystem if needed. You can move some users to a new server with very little work, just change the mapping for that user. Yes, of course. This would be easy if I was running a homogeneous network, but instead I have to deal with several kinds of automounter, some of which seem to change between each major release. This seems like a good task for another day. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
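For reference, the usual wildcard map makes this mostly mechanical once the per-user filesystems exist; a sketch for a Solaris client, where the server name and export path are placeholders and other automounter implementations use their own map syntax:

    # /etc/auto_master already contains:   /home   auto_home
    # /etc/auto_home entry:
    *       fileserver:/home/&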
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Mon, 18 Feb 2008, Ralf Ramge wrote: I'm a bit disturbed because I think about switching to 2530/2540 shelves, but a maximum 250 MB/sec would disqualify them instantly, even Note that this is single-file/single-thread I/O performance. I suggest that you read the formal benchmark report for this equipment since it covers multi-thread I/O performance as well. The multi-user performance is considerably higher. Given ZFS's smarts, the JBOD approach seems like a good one as long as the hardware provides a non-volatile cache. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] filebench for Solaris 10?
On Tue, 19 Feb 2008, Marion Hakanson wrote: I've installed and run filebench (version 1.1.0) from the SourceForge packages on Solaris-10 here, both SPARC and x86_64, with no problems. Looks like I downloaded it 23-Jan-2008. This is what I get with the filebench-1.1.0_x86_pkg.tar.gz from SourceForge:

    # pkgadd -d .
    pkgadd: ERROR: no packages were found in /home/bfriesen/src/benchmark/filebench
    # ls
    install/   pkginfo    pkgmap     reloc/

My system has the latest package management patches applied. What am I missing? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] five megabytes per second with Microsoft iSCSI initiator (2.06)
It would be useful if people here who have used iSCSI on top of ZFS could share their performance experiences. It is very easy to waste a lot of time trying to realize unrealistic expectations. Hopefully iSCSI on top of ZFS normally manages to transfer much more than 5MB/second! Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
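As a point of comparison, the usual way to put iSCSI on top of ZFS at the time is a zvol exported as a target; a minimal sketch, assuming a build where the shareiscsi property and the iscsitgt-based target are available (names and sizes are placeholders):

    zfs create -V 100g tank/vol0
    zfs set shareiscsi=on tank/vol0
    iscsitadm list target -v     # confirm that a target was created for the zvol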
Re: [zfs-discuss] filebench for Solaris 10?
On Tue, 19 Feb 2008, Marion Hakanson wrote:

    # pkgadd -d .
    pkgadd: ERROR: no packages were found in /home/bfriesen/src/benchmark/filebench
    # ls
    install/   pkginfo    pkgmap     reloc/
    . . .

Um, cd .. and pkgadd -d . again. The package is the actual directory that you unpacked. Note the instructions for unpacking confused me a bit as well. I had expected to pkgadd -d . filebench, but pkgadd is smart enough to scan the entire -d directory for packages. Very odd. That worked. Thank you very much! It seems that filebench is unconventional in almost every possible way. Installing it based on the available documentation was an exercise in frustration. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
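For anyone hitting the same error, a distilled sketch of the working sequence, with the unpack location taken from the messages above:

    cd /home/bfriesen/src/benchmark   # the directory that *contains* the unpacked filebench/ directory
    pkgadd -d .                       # pkgadd scans the directory and finds the filebench package inside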
Re: [zfs-discuss] Preferred backup s/w
On advice of Joerg Schilling and not knowing what 'star' was, I decided to install it for testing. Star uses a very unorthodox build and install approach so the person building it has very little control over what it does. Unfortunately I made the mistake of installing it under /usr/local where it decided to remove the GNU tar I had installed there. Star does not support traditional tar command line syntax so it can't be used with existing scripts. Performance testing showed that it was no more efficient than the 'gtar' which comes with Solaris. It seems that 'star' does not support an 'uninstall' target so now I am forced to manually remove it from my system. It seems that the best way to deal with star is to install it into its own directory so that it does not interfere with existing software. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
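One possible way to keep star out of /usr/local entirely is to point its build at a private prefix; this is only a sketch, and the INS_BASE variable is an assumption about the Schily makefiles rather than something confirmed in this thread:

    make INS_BASE=/opt/star install   # INS_BASE is assumed to be the Schily makefile install prefix
    PATH=/opt/star/bin:$PATH          # pick it up explicitly without disturbing the system tar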
Re: [zfs-discuss] Preferred backup s/w
On Fri, 22 Feb 2008, Bob Friesenhahn wrote: where it decided to remove the GNU tar I had installed there. Star does not support traditional tar command line syntax so it can't be used with existing scripts. Performance testing showed that it was no more efficient than the 'gtar' which comes with Solaris. It seems There is something I should clarify in the above. Star is a stickler for POSIX command line syntax so syntax like 'tar -cvf foo.tar' or 'tar cvf foo.tar' does not work, but 'tar -c -v -f foo.tar' does work. Testing with Star, GNU tar, and Solaris cpio showed that Star and GNU tar were able to archive the content of my home directory with no complaint whereas Solaris cpio required specification of the 'ustar' format in order to deal with long file and path names, as well as large inode numbers. Solaris cpio complained about many things with my files (e.g. unresolved passwd and group info), but managed to produce the highest throughput when archiving to a disk file. I can not attest to the ability of these tools to deal with ACLs since I don't use them. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Preferred backup s/w
On Sat, 23 Feb 2008, Joerg Schilling wrote: Star typically needs 1/4 .. 1/3 of the CPU time needed by GNU tar and it uses two processes to do the work in parallel. If you found a case where star is not faster than GNU tar and where the speed is not limited by the filesystem or the I/O devices, this is a bug that will be fixed if you provide the needed information to repeat it. I re-ran my little test today and do see that 'star' produces a somewhat reduced overall run time but does not consume less CPU than GNU tar. This is just a test of the time to archive the files in my home directory. My home directory is in a zfs filesystem. The output is written to a file in the same storage pool but a different filesystem. This time around I used default block sizes rather than 128K. Overall throughput seems on the order of 40MB/second.

    gtar -cf gtar.tar /home/bfriesen  6.42s user 128.27s system 12% cpu 17:19.66 total
    -rw-r--r--   1 bfriesen home  37G Feb 23 10:55 gtar.tar

    star -c -f star.tar /home/bfriesen  4.11s user 142.65s system 15% cpu 16:03.41 total
    -rw-r--r--   1 bfriesen home  37G Feb 23 11:15 star.tar

    find /home/bfriesen -depth -print  0.55s user 3.52s system 6% cpu 1:01.61 total
    cpio -o -H ustar -O cpio.tar  11.47s user 122.28s system 11% cpu 18:38.97 total
    -rwxr-xr-x   1 bfriesen home  37G Feb 23 11:40 cpio.tar*

Notice that Sun's cpio marks its output file as executable, which is clearly a bug. Clearly none of these tools are adequate to deal with the massive data storage made easy with zfs storage pools. Zfs requires similarly innovative backup solutions to deal with it.

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] The old problem with tar, zfs, nfs and zil
On Mon, 25 Feb 2008, msl wrote: I mean, can you confirm that the zil_disable/zfs solaris nfs service, is a similar service like a standard xfs or ext3 linux/nfs solution (take into account the NFS service provided)? From what I have heard:

* Linux does not implement NFS writes correctly in that data is not flushed to disk before returning. Don't turn your Linux system off during application writes since user data will likely be lost when the system returns. Besides the applications losing data, running applications are likely to become confused.

* ZFS has had an issue in that requesting a fsync() of one file causes a sync of the entire filesystem. This is a huge performance glitch. Wikipedia says that it is fixed in Solaris Nevada. Someone should update this Wikipedia section: http://en.wikipedia.org/wiki/ZFS#Solaris_implementation_issues

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Sun, 17 Feb 2008, Mertol Ozyoney wrote: Hi Bob; When you have some spare time can you prepare a simple benchmark report in PDF that I can share with my customers to demonstrate the performance of 2540 ? While I do not claim that it is simple I have created a report on my configuration and experience. It should be useful for users of the Sun StorageTek 2540, ZFS, and Solaris 10 multipathing. See http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf or http://tinyurl.com/2djewn for the URL challenged. Feel free this share this document with anyone who is interested. Thanks Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Wed, 27 Feb 2008, Cyril Plisko wrote: http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf Nov 26, 2008 ??? May I borrow your time machine ? ;-) Are there any stock prices you would like to know about? Perhaps you are interested in the outcome of the elections? There was a time inversion layer in Texas. Fixed now ... Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can ZFS be event-driven or not?
On Wed, 27 Feb 2008, Nicolas Williams wrote: Maybe snapshot file whenever a write-filedescriptor is closed or somesuch? Again. Not enough. Some apps (many!) deal with multiple files. Or more significantly, with multiple pages. When using memory mapping the application may close its file descriptor, but then the underlying file is updated in a somewhat random fashion as dirty pages are written to disk. It seems that this hypothesis is without merit. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can ZFS be event-driven or not?
On Wed, 27 Feb 2008, Uwe Dippel wrote: As much as ZFS is revolutionary, it is far away from being the 'ultimate file system', if it doesn't know how to handle event-driven snapshots UFS == Ultimate File System ZFS == Zettabyte File System Perhaps you have these two confused? ZFS does not lay claim to being the ultimate file system. You can provide great benefit to society if you invent and implement a filesystem with all that ZFS offers, plus your remarkable ideas, provided that the result still provides the performance that users expect and there is sufficient storage space available. Consider this to be your life's mission. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can ZFS be event-driven or not?
On Thu, 28 Feb 2008, Uwe Dippel wrote: 1. The application (NFS - sftp) does not know about the state of writing? Sometimes applications know about the state of writing and sometimes they do not. Sometimes they don't even know they are writing. 2. Obviously nobody sees anything in having access to all versions of a file stored there? First it is necessary to determine what version means when it comes to a file. At the application level, the system presents a different view than what is actually stored on disk since the system uses several levels of write caching to improve performance. The only time that these should necessarily be the same is if the application uses a file descriptor to access the file (no memory mapping) and invokes fsync(). If memory mapping is used, the equivalent is msync() with the MS_SYNC option. Using fsync() or msync(MS_SYNC) blocks the application until the I/O is done. If a file is updated via memory mapping, then the data sent to the underlying file is based on the system's virtual memory system so the actually data sent to disk may not be coherent at all. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Patch 127729-07 not NFS patch!
The Sun Update Manager on my x86 Solaris 10 box describes this new patch as SunOS 5.10_x86 nfs fs patch (note use of nfs) but looking at the problem descriptions this is quite clearly a big ZFS patch that Solaris 10 users should pay attention to since it fixes a bunch of nasty bugs. Maybe someone can fix this fat-fingered patch description in Sun Update Manager? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic ZFS disk accesses
On Sat, 1 Mar 2008, Bill Shannon wrote: I think I've reached the limit of what I can do remotely. Now I have to repeat all these experiments when I'm sitting next to the disk and can actually hear it and see if the correlation remains. Then, it may be time to dig into the ksh93 code and figure out what it thinks it's doing. Fortunately, I've been there before... One thing that can make a shell periodically active is if it is checking for new mail. Check the ksh man page for descriptions of MAIL, MAILCHECK, MAILPATH. Perhaps whenever it checks for new mail, it also updates this file. Unsetting the MAIL environment variable may make the noise go away. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
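If the mail check turns out to be the culprit, a quick way to test the theory from the suspect shell (this only affects the current session):

    unset MAIL MAILPATH    # ksh stops polling a mailbox file when neither variable is set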
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Mon, 3 Mar 2008, Darren J Moffat wrote: I'm not convinced that single bit flips are the common failure mode for disks. Most enterprise class disks already have enough ECC to correct at least 8 bytes per block. and for consumer rather than enterprise class disks ? You are assuming that the ECC used for consumer disks is substantially different than that used for enterprise disks. That is likely not the case since ECC is provided by a chip which costs a few dollars. The only reason to use a lesser grade algorithm would be to save a small bit of storage space. Consumer disks use essentially the same media as enterprise disks. Consumer disks store a higher bit density on similar media. Consumer disks have less precise/consistent head controllers than enterprise disks. Consumer disks are less well-specified than enterprise disks. Due to the higher bit density we can expect more wrong bits to be read since we are pushing the media harder. Due to less consistent head controllers we can expect more incidences of reading or writing the wrong track or writing something which can't be read. Consumer disks are often used in an environment where they may be physically disturbed while they are writing or reading the data. Enterprise disks are usually used in very stable environments. The upshot of this is that we can expect more unrecoverable errors, but it seems unlikely that there will be more single bit errors recoverable at the ZFS level. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Tue, 4 Mar 2008, Richard Elling wrote: Also note: the checksums don't have enough information to recreate the data for very many bit changes. Hashes might, but I don't know anyone using sha256. It is indeed important to recognize that the checksums are a way to detect that the data is incorrect rather than a way to tell that the data is correct. There may be several permutations of wrong data which can result in the same checksum, but the probability of encountering those permutations due to natural causes is quite small. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send/recv question
On Fri, 7 Mar 2008, Rob Logan wrote:

    zfs send -i z/[EMAIL PROTECTED] z/[EMAIL PROTECTED] | bzip2 -c |\
      ssh host.com bzcat | zfs recv -v -F -d z
    zfs send -i z/[EMAIL PROTECTED] z/[EMAIL PROTECTED] | bzip2 -c |\
      ssh host.com bzcat | zfs recv -v -F -d z
    zfs send -i z/[EMAIL PROTECTED] z/[EMAIL PROTECTED] | bzip2 -c |\
      ssh host.com bzcat | zfs recv -v -F -d z

Since I see 'bzip2' mentioned here (a rather slow compressor), I should mention that based on a recommendation from a friend, I gave a compressor called 'lzop' (http://www.lzop.org/) a try due to its reputation for compression speed. Compressing the zfs send stream was causing the transfer to take much longer. Testing with 'lzop' showed that it was 2.5X faster than gzip on the Opteron CPU and that the compression was just a bit worse than gzip's default compression level. It seems that some assembly language is used for x86 and Opteron. I did not test the relative speed differences on SPARC. The benefit from a compressor depends on the speed of the pipe and the speed of the filesystem. If CPU and/or network is the bottleneck, then LZO compression may be the solution.

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
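A sketch of the same pipeline with lzop substituted for bzip2; the dataset, snapshot, and host names are placeholders:

    zfs send -i tank/fs@snap1 tank/fs@snap2 | lzop -c | \
        ssh host.example.com "lzop -dc | zfs recv -F -d tank"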
Re: [zfs-discuss] Preserve creator across send/receive
On Tue, 11 Mar 2008, Haik Aftandilian wrote: Or is there a way to manually set the creator of a fileystem? Not knowing any better I used a simple 'chown owner:group' syntax. :-) You could also use 'cpio -p' to transfer directory ownership based on the original master. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
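A minimal sketch of the cpio pass-mode approach (paths are placeholders); run as root so ownership is preserved, with -d creating directories and -m keeping modification times:

    cd /export/home.master && find . -depth -print | cpio -pdm /export/home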
Re: [zfs-discuss] zfs backups to tape
On Fri, 14 Mar 2008, Bill Shannon wrote: What's the best way to backup a zfs filesystem to tape, where the size of the filesystem is larger than what can fit on a single tape? ufsdump handles this quite nicely. Is there a similar backup program for zfs? Or a general tape management program that can take data from Previously it was suggested on this list to use a special version of tar called 'star' (ftp://ftp.berlios.de/pub/star). Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS I/O algorithms
On Sat, 15 Mar 2008, Richard Elling wrote: My observation is that each metaslab is, by default, 1 MByte in size. Each top-level vdev is allocated by metaslabs. ZFS tries to allocate a top-level vdev's metaslab before moving onto another one. So you should see eight 128kByte allocs per top-level vdev before the next top-level vdev is allocated. That said, the actual iops are sent in parallel. So it is not unusual to see many, most, or all of the top-level vdevs concurrently busy. Does this match your experience? I do see that all the devices are quite evenly busy. There is no doubt that the load balancing is quite good. The main question is if there is any actual striping going on (breaking the data into smaller chunks), or if the algorithm is simply load balancing. Striping trades IOPS for bandwidth. Using my application, I did some tests today. The application was used to do balanced read/write of about 500GB of data in some tens of thousands of reasonably large files. The application sequentially reads a file, then sequentially writes a file. Several copies (2-6) of the application were run at once for concurrency. What I noticed is that with hardly any CPU being used, the read+write bandwidth seemed to be bottlenecked at about 280MB/second with 'zpool iostat' showing very balanced I/O between the reads and the writes. The system I set up is performing quite a bit differently than I anticipated. The I/O is bottlenecked and I find that my application can do significant processing of the data without significantly increasing the application run time. So CPU time is almost free. If I was to assign a smaller block size for the filesystem, would that provide more of the benefits of striping or would it be detrimental to performance due to the number of I/Os? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
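For reference, the block size in question is the per-dataset recordsize property; a sketch with a placeholder dataset name, keeping in mind that the change only affects files written after the property is set:

    zfs set recordsize=32k tank/images
    zfs get recordsize tank/images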
Re: [zfs-discuss] ZFS I/O algorithms
On Sun, 16 Mar 2008, Richard Elling wrote: But where is the bottleneck? iostat will show bottlenecks in the physical disks and channels. vmstat or mpstat will show the bottlenecks in cpus. To see if the app is the bottleneck will require some analysis of the app itself. Is it spending its time blocked on I/O? The application is spending almost all the time blocked on I/O. I see that the number of device writes per second seems pretty high. The application is doing I/O in 128K blocks. How many IOPS does a modern 300GB 15K RPM SAS drive typically deliver? Of course the IOPS capacity depends on whether the access is random or sequential. At the application level, the access is completely sequential but ZFS is likely doing some extra seeks. iostat output (atime=off):

                     extended device statistics
    device    r/s    w/s  Mr/s  Mw/s wait actv  svc_t  %w  %b
    sd0       0.0    0.0   0.0   0.0  0.0  0.0    0.0   0   0
    sd1       0.0    0.0   0.0   0.0  0.0  0.0    2.8   0   0
    sd2       0.0    0.0   0.0   0.0  0.0  0.0    0.0   0   0
    sd10     80.4  170.7  10.0  19.9  0.0  9.2   36.5   0  54
    sd11     82.1  170.2  10.2  20.0  0.0 13.3   52.9   0  71
    sd12     79.3  168.3   9.9  20.0  0.0 13.1   53.1   0  69
    sd13     80.6  173.0  10.0  19.9  0.0  9.3   36.7   0  56
    sd14     80.9  167.8  10.1  20.0  0.0 13.4   53.8   0  70
    sd15     77.7  168.7   9.7  19.9  0.0  9.1   37.1   0  52
    sd16     77.3  170.6   9.6  20.0  0.0 13.3   53.7   0  70
    sd17     76.4  168.2   9.5  20.0  0.0  9.1   37.2   0  52
    sd18     76.7  172.2   9.5  19.9  0.0 13.5   54.2   0  70
    sd19     83.8  173.2  10.4  20.0  0.0 13.7   53.4   0  74
    sd20     73.3  174.3   9.1  20.0  0.0  9.1   36.9   0  56
    sd21     75.3  170.2   9.4  20.0  0.0 13.2   53.9   0  69
    nfs1      0.0    0.0   0.0   0.0  0.0  0.0    0.0   0   0

    % mpstat
    CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
    0 288 1 189 1018 413 815 26 102 880 30463 3 0 94
    1 185 1 180 6341 830 43 111 740 31173 2 0 94
    2 284 1 183 5216 617 27 98 670 49544 3 0 93
    3 176 1 239 748 353 555 25 76 620 39334 3 0 93

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Microsoft WinFS for ZFS?
On Mon, 17 Mar 2008, Orvar Korvar wrote: My question is, because WinFS database is running on top of NTFS, could a similar thing be done for ZFS? Implement a database running on top of ZFS, that has similar functionality as WinFS? Object-oriented content management could be run on any sort of underlying file system. It is just a layer on top. (I never understood the advantages of having a database on top of NTFS, maybe it would be pointless for ZFS? Can someone knowledgeable give some input to my question?) ZFS just provides storage. It seems that the problem with object-oriented content management is that a user interface needs to be provided, which is not standardized in any way. This user interface needs to be used to put content into the system, to find content in the system, and to use content from the system. There also needs to be a way to back everything up. If the content management knows about the internal structure of the objects, then it might provide a way to access a document so that all of the objects (e.g. figures) used by that document are visible and may be updated. There are likely some mainframe environments which do this sort of thing, but mainframes are essentially closed systems so the mainframe vendor has more control. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Microsoft WinFS for ZFS?
On Tue, 18 Mar 2008, Orvar Korvar wrote: Just as ZFS makes NTFS look like crap, I would like SUN to make something that makes WinFS look like crap! :o) Would it be possible to utilize the unique functions ZFS has, to revolutionize again? What possible advantages could ZFS provide for the database thingy? Are there any advantages to use ZFS instead, at all? Speculations are welcome! :o) ZFS is cool because it is very clean, nicely documented, and is very simple for the user. It would be quite wrong for Sun to diverge from this. There are many other things that Sun should focus on before worrying about content management. It would be useful if ZFS helped make using the SAN as easy as it makes using a collection of already accessible disks. ZFS is pretty, but it is layered on top of some very ugly looking things (e.g. multipath is super-ugly), so lets attend to those ugly things before worrying about adding frosting on top. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS I/O algorithms
On Wed, 19 Mar 2008, Bill Moloney wrote: When application IO sizes get small, the overhead in ZFS goes up dramatically. Thanks for the feedback. However, from what I have observed, it is not the full story at all. On my own system, when a new file is written, the write block size does not make a significant difference to the write speed. Similarly, read block size does not make a significant difference to the sequential read speed. I do see a large difference in rates when an existing file is updated sequentially. There is a many orders of magnitude difference for random I/O type updates. I think that there are some rather obvious reasons for the difference between writing a new file and updating an existing file. When writing a new file, the system can buffer up to a disk block's worth of data prior to issuing a disk I/O, or it can immediately write what it has, and since the write is sequential, it does not need to re-read prior to writing (but there may be more metadata I/Os). For the case of updating part of a disk block, there needs to be a read prior to the write if the block is not cached in RAM. If the system is short on RAM, it may be that ZFS issues many more write I/Os than if it has a lot of RAM. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS I/O algorithms
On Thu, 20 Mar 2008, Mario Goebbels wrote: Similarly, read block size does not make a significant difference to the sequential read speed. Last time I did a simple bench using dd, supplying the record size as blocksize to it instead of no blocksize parameter bumped the mirror pool speed from 90MB/s to 130MB/s. Indeed. However, as an interesting twist to things, in my own benchmark runs I see two behaviors. When the file size is smaller than the amount of RAM the ARC can reasonably grow to, the write block size does make a clear difference. When the file size is larger than RAM, the write block size no longer makes much difference and sometimes larger block sizes actually go slower. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS I/O algorithms
On Thu, 20 Mar 2008, Jonathan Edwards wrote: in that case .. try fixing the ARC size .. the dynamic resizing on the ARC can be less than optimal IMHO Is a 16GB ARC size not considered to be enough? ;-) I was only describing the behavior that I observed. It seems to me that when large files are written very quickly, that when the file becomes bigger than the ARC, that what is contained in the ARC is mostly stale and does not help much any more. If the file is smaller than the ARC, then there is likely to be more useful caching. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
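For completeness, capping the ARC on Solaris 10 is done from /etc/system and takes effect at the next boot; a sketch, assuming a release that supports the tunable, with the 4 GByte value only as an example:

    * /etc/system fragment: cap the ZFS ARC at 4 GBytes (value in bytes)
    set zfs:zfs_arc_max = 0x100000000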
Re: [zfs-discuss] Best practices for ZFS plaiding
On Wed, 26 Mar 2008, Tim wrote: No raid at all. The system should just stripe across all of the LUN's automagically, and since you're already doing your raid on the thumper's, they're *protected*. You can keep growing the zpool indefinitely, I'm not aware of any maximum disk limitation. The data may be protected, but the uptime will be dependent on the uptime of all of those systems. Downtime of *any* of the systems in a load-share configuration means downtime for the entire pool. Of course this is the case with any storage system as more hardware is added but autonomously administered hardware is more likely to encounter a problem. Local disk is usually more reliable than remote disk. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Status of ZFS boot for sparc?
On Wed, 26 Mar 2008, Lori Alt wrote: zfs boot support for sparc (included in the overall delivery of zfs boot, which includes install support, support for swap and dump zvols, and various other improvements) is still planned for Update 6. Does zfs boot have any particular firmware dependencies? Will it work on old SPARC systems? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Periodic flush
My application processes thousands of files sequentially, reading input files, and outputting new files. I am using Solaris 10U4. While running the application in a verbose mode, I see that it runs very fast but pauses about every 7 seconds for a second or two. This is while reading 50MB/second and writing 73MB/second (ARC cache miss rate of 87%). The pause does not occur if the application spends more time doing real work. However, it would be nice if the pause went away. I have tried turning down the ARC size (from 14GB to 10GB) but the behavior did not noticeably improve. The storage device is trained to ignore cache flush requests. According to the Evil Tuning Guide, the pause I am seeing is due to a cache flush after the uberblock updates. It does not seem like a wise choice to disable ZFS cache flushing entirely. Is there a better way other than adding a small delay into my application? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
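For reference, the Evil Tuning Guide setting being weighed (and declined) above is a single /etc/system line; this is only a sketch, assumes a Solaris 10 8/07 or later kernel, and is generally considered safe only when every device in the pool has a non-volatile write cache:

    * /etc/system fragment: stop ZFS from issuing cache flush requests at all
    set zfs:zfs_nocacheflush = 1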
Re: [zfs-discuss] Periodic flush
On Wed, 26 Mar 2008, Neelakanth Nadgir wrote: When you experience the pause at the application level, do you see an increase in writes to disk? This might be the regular syncing of the transaction group to disk. If I use 'zpool iostat' with a one second interval what I see is two or three samples with no write I/O at all followed by a huge write of 100 to 312MB/second. Writes reported at a lower rate are simply split across two sample intervals. It seems that writes are being cached and then issued all at once. This behavior assumes that the file may be written multiple times so that a delayed write is more efficient. If I run a script like

    while true
    do
      sync
    done

then the write data rate is much more consistent (at about 66MB/second) and the program does not stall. Of course this is not very efficient. Are the 'zpool iostat' statistics accurate?

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Periodic flush
On Thu, 27 Mar 2008, Neelakanth Nadgir wrote: This causes the sync to happen much faster, but as you say, suboptimal. Haven't had the time to go through the bug report, but probably CR 6429205 each zpool needs to monitor its throughput and throttle heavy writers will help. I hope that this feature is implemented soon, and works well. :-) I tested with my application outputting to a UFS filesystem on a single 15K RPM SAS disk and saw that it writes about 50MB/second and without the bursty behavior of ZFS. When writing to ZFS filesystem on a RAID array, zpool I/O stat reports an average (over 10 seconds) write rate of 54MB/second. Given that the throughput is not much higher on the RAID array, I assume that the bottleneck is in my application. Are the 'zpool iostat' statistics accurate? Yes. You could also look at regular iostat and correlate it. Iostat shows that my RAID array disks are loafing with only 9MB/second writes to each but with 82 writes/second. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] nfs and smb performance
On Fri, 28 Mar 2008, abs wrote: Sorry for being vague but I actually tried it with the cifs in zfs option, but I think I will try the samba option now that you mention it. Also is there a way to actually improve the nfs performance specifically? CIFS uses TCP. NFS uses either TCP or UDP, and usually UDP by default. In order to improve NFS client performance, it may be useful to increase the 'rsize' and 'wsize' client mount options to 32K. Solaris 10 defaults the buffer size to 32K but many other clients use 8K. Some clients support a '-a' option to specify the maximum read-ahead and tuning this value can help considerably for sequential access. Using gigabit ethernet with jumbo frames will improve performance even further. Notice that most of these tunings are for the client-side and not for the server. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
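As a concrete illustration of those client-side options, a mount invocation for a Linux or similar non-Solaris client might look roughly like this (a hedged sketch; the server name and paths are placeholders):

  mount -t nfs -o vers=3,proto=tcp,rsize=32768,wsize=32768 server:/export/images /mnt/images

The same rsize/wsize values can usually be placed in the client's fstab or automounter map instead of being given on the command line.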
Re: [zfs-discuss] nfs and smb performance
CIFS uses TCP. NFS uses either TCP or UDP, and usually UDP by default. For Sun systems, NFSv3 using 32kByte [rw]size over TCP has been the default configuration for 10+ years. Do you still see clients running NFSv2 over UDP? Yes, I see that TCP is the default in Solaris 9. Is it also the default in Solaris 8? I do know that tuning mount options made a considerable difference for FreeBSD 5.X and Apple's OS X Tiger. Apple's OS X Leopard does not seem to need tuning like previous versions did. OS X Tiger and earlier actually sent application writes directly to NFS so that performance was very dependent on application write size regardless of client NFS tunings. Unfortunately, not everyone is using Solaris. The Solaris 10 NFS client implementation really screams. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Problem importing pool from BSD 7.0 into Nexenta
On Mon, 31 Mar 2008, Tim wrote: Perhaps someone else can correct me if I'm wrong, but if you're using the whole disk, ZFS shouldn't be displaying a slice when listing your disks, should it? I've *NEVER* seen it do that on any of mine except when using partials/slices. I would expect:

  c1d1s8

to be:

  c1d1

Yes, this seems suspicious. It is also suspicious that some devices use 'p' (partition?) while others use 's' (slice?). The partitions may be FreeBSD partitions or some other type that Solaris is not expecting. FreeBSD can partition at a level visible to the BIOS and it can further sub-partition a FreeBSD partition for use in individual filesystems. Regardless, I am very interested to hear if ZFS pools can really be transferred back and forth between Solaris and FreeBSD. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris ZFS NAS Setup
On Mon, 7 Apr 2008, Ross wrote: However that doesn't necessarily mean it's ready for production use. ZFS will hang for 3 mins (180 seconds) waiting for the iSCSI client to timeout. Now I don't know about you, but HA to me doesn't mean Highly Available, but with occasional 3 minute breaks. Most of the client applications we would want to run on ZFS would be broken with a 3 minute delay returning data, and this was enough for us to give up on ZFS over iSCSI for now. It seems to me that this is a problem with the iSCSI client timeout parameters rather than ZFS itself. Three minutes is sufficient for use over the internet but seems excessive on a LAN. Have you investigated to see if the iSCSI client timeout parameters can be adjusted? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS volume export to USB-2 or Firewire?
Currently it is easy to share a ZFS volume as an iSCSI target. Has there been any thought toward adding the ability to share a ZFS volume via USB-2 or Firewire to a directly attached client? There is a substantial market for storage products which act like a USB-2 or Firewire drive. Some of these offer some form of RAID. It seems to me that ZFS with a server capability to appear as several USB-2 or Firewire drives (or eSATA) may be appealing for larger RAIDs of several terabytes. Is anyone aware of an application which can usefully share a ZFS volume (essentially a file) in this way? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of one single 'cp'
On my drive array (capable of 260MB/second single-process writes and 450MB/second single-process reads) 'zpool iostat' reports a read rate of about 59MB/second and a write rate of about 59MB/second when executing 'cp -r' on a directory containing thousands of 8MB files. This seems very similar to the performance you are seeing. The system indicators (other than disk I/O) are almost flatlined at zero while the copy is going on. It seems that a multi-threaded 'cp' could be much faster. With GNU xargs, find, and cpio, I think that it is possible to cobble together a much faster copy since GNU xargs supports --max-procs and --max-args arguments to allow executing commands concurrently with different sets of files. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
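As a rough, untested sketch of that idea (directory names are placeholders, GNU xargs is assumed, and file names containing whitespace are not handled):

  # run several cpio pass-mode copies in parallel, 64 files per invocation
  cd /source/dir
  find . -depth -print | \
    xargs --max-args=64 --max-procs=4 sh -c \
      'printf "%s\n" "$@" | cpio -pdum /dest/dir' sh

Parallel cpio processes may race while creating the same destination directories, so this is only a sketch of the approach rather than a polished tool.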
Re: [zfs-discuss] ls -lt for links slower than for regular files
On Tue, 8 Apr 2008, [EMAIL PROTECTED] wrote: a few seconds and the links list in, perhaps, 60 seconds. Is there a difference in what ls has to do when listing links versus listing regular files in ZFS that would cause a slowdown? Since you specified '-t' the links have to be dereferenced (find the file that is referred to) which results in opening the directory to see if the file exists, and what its properties are. With 50K+ files, opening the directory and finding the file will take tangible time. If there are multiple directories in the symbolic link path, then these directories need to be opened as well. Symbolic links are not free. More RAM may help if it results in keeping the directory data hot in the cache. If the links were hard links rather than symbolic links, then performance will be similar to a regular file (since it is then a regular file). Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS volume export to USB-2 or Firewire?
On Wed, 9 Apr 2008, Ross wrote: Well the first problem is that USB cables are directional, and you don't have the port you need on any standard motherboard. That Thanks for that info. I did not know that. Adding iSCSI support to ZFS is relatively easy since Solaris already supports TCP/IP and iSCSI. Adding USB support is much more difficult and isn't likely to happen since afaik the hardware to do it just doesn't exist. I don't believe that Firewire is directional but presumably the Firewire support in Solaris only expects to support certain types of devices. My workstation has Firewire but most systems won't have it. It seemed really cool to be able to put your laptop next to your Solaris workstation and just plug it in via USB or Firewire so it can be used as a removable storage device. Or Solaris could be used on appropriate hardware to create a more reliable portable storage device. Apparently this is not to be and it will be necessary to deal with iSCSI instead. I have never used iSCSI so I don't know how difficult it is to use as temporary removable storage under Windows or OS-X. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS volume export to USB-2 or Firewire?
On Wed, 9 Apr 2008, Richard Elling wrote: I just get my laptop within WiFi range and mount :-). I don't see any benefit to a wire which is slower than Ethernet, when an Ethernet port is readily available on almost all modern laptops. Under Windows or Mac, is this as convenient as plugging in a USB or Firewire disk or does it require system administrator type knowledge? If you go to Starbucks, does your laptop attempt to mount your iSCSI volume on a (presumably) unreachable network? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris ZFS NAS Setup
On Fri, 11 Apr 2008, Simon Breden wrote: Thanks myxiplx for the info on replacing a faulted drive. I think the X4500 has LEDs to show drive statuses so you can see which physical drive to pull and replace, but how does one know which physical disk to pull out when you just have a standard PC with drives directly plugged into on-motherboard SATA connectors -- i.e. with no status LEDs? This should be a wakeup call to make sure that this is all figured out in advance before the hardware fails. If you were to format the drive for a traditional filesystem you would need to know which one it was. Failure recovery should be no different except for the fact that the machine may be down, pressure is on, and the information you expected to use for recovery was on that machine. :-) This is a case where it is worthwhile maintaining a folder (in paper form) which contains important recovery information for your machines. Open up the machine in advance and put sticky labels on the drives with their device names. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
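For preparing such labels, the logical device names and serial numbers can be collected in advance with the usual Solaris tools (output and level of detail vary by controller and driver, so treat this as a starting point only):

  format </dev/null   # list the cXtYdZ names of all visible disks
  cfgadm -al          # show attachment points for hot-pluggable devices
  iostat -En          # show vendor, product, and serial number per device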
Re: [zfs-discuss] LZO compression?
On Sat, 12 Apr 2008, roland wrote: i'm really wondering that interest in alternative compression schemes is that low, especially due to the fact that lzo seems to compress better and be faster than lzjb. LZO seems to have a whole family of compressors. One reason why it is faster is that the author has worked really hard on a few CPU-specific optimizations. Is the license ok for Solaris? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 24-port SATA controller options?
On Mon, 14 Apr 2008, Blake Irvin wrote: The only supported controller I've found is the Areca ARC-1280ML. I want to put it in one of the 24-disk Supermicro chassis that Silicon Mechanics builds. For obvious reasons (redundancy and throughput), it makes more sense to purchase two 12 port cards. I see that there is an option to populate more cache RAM. I would be interested to know what actual throughput that one card is capable of. The CDW site says 300MB/s. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of one single 'cp'
On Mon, 14 Apr 2008, Jeff Bonwick wrote: disks=`format </dev/null | grep c.t.d | nawk '{print $2}'` I had to change the above line to

  disks=`format </dev/null | grep ' c.t' | nawk '{print $2}'`

in order to match my multipathed devices.

./diskqual.sh
c1t0d0 130 MB/sec
c1t1d0 13422 MB/sec
c4t600A0B80003A8A0B096A47B4559Ed0 190 MB/sec
c4t600A0B80003A8A0B096E47B456DAd0 202 MB/sec
c4t600A0B80003A8A0B096147B451BEd0 186 MB/sec
c4t600A0B80003A8A0B096647B453CEd0 176 MB/sec
c4t600A0B80003A8A0B097347B457D4d0 189 MB/sec
c4t600A0B800039C9B50A9C47B4522Dd0 174 MB/sec
c4t600A0B800039C9B50AA047B4529Bd0 197 MB/sec
c4t600A0B800039C9B50AA447B4544Fd0 223 MB/sec
c4t600A0B800039C9B50AA847B45605d0 224 MB/sec
c4t600A0B800039C9B50AAC47B45739d0 223 MB/sec
c4t600A0B800039C9B50AB047B457ADd0 219 MB/sec
c4t600A0B800039C9B50AB447B4595Fd0 223 MB/sec

My 'cp -r' performance is about the same as Henrik's. The 'cp -r' performance is much less than disk benchmark tools would suggest. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Confused by compressratio
On Tue, 15 Apr 2008, Luke Scharf wrote: AFAIK, ext3 supports sparse files just like it should -- but it doesn't dynamically figure out what to write based on the contents of the file. Since zfs inspects all data anyway in order to compute the block checksum, it can easily know if a block is all zeros. For ext3, inspecting all blocks for zeros would be viewed as unnecessary overhead. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 24-port SATA controller options?
On Tue, 15 Apr 2008, Keith Bierman wrote: Perhaps providing the computations rather than the conclusions would be more persuasive on a technical list ; No doubt. The computations depend considerably on the size of the disk drives involved. The odds of experiencing media failure on a single 1TB SATA disk are quite high. Consider that this media failure may occur while attempting to recover from a failed disk. There have been some good articles on this in USENIX Login magazine. ZFS raidz1 and raidz2 are NOT directly equivalent to RAID5 and RAID6 so the failure statistics would be different. Regardless, single disk failure in a raidz1 substantially increases the risk that something won't be recoverable if there is a media failure while rebuilding. Since ZFS duplicates its own metadata blocks, it is most likely that some user data would be lost but the pool would otherwise recover. If a second disk drive completely fails, then you are toast with raidz1. RAID5 and RAID6 rebuild the entire disk while raidz1 and raidz2 only rebuild existing data blocks so raidz1 and raidz2 are less likely to experience media failure if the pool is not full. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 24-port SATA controller options?
On Tue, 15 Apr 2008, Maurice Volaski wrote: 4 drive failures over 5 years. Of course, YMMV, especially if you drive drunk :-) Note that there is a difference between drive failure and media data loss. In a system which has been running fine for a while, the chance of a second drive failing during rebuild may be low, but the chance of block-level media failure is not. However, computers do not normally run in a vacuum. Many failures are caused by something like a power glitch, temperature cycle, or the flap of a butterfly's wings. Unless your environment is completely stable and the devices are not dependent on some of the same things (e.g. power supplies, chassis, SATA controller, air conditioning) then what caused one device to fail may very well cause another device to fail. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
On Tue, 15 Apr 2008, Brandon High wrote: I think RAID-Z is different, since the stripe needs to spread across all devices for protection. I'm not sure how it's done. My understanding is that RAID-Z is indeed different and does NOT have to spread across all devices for protection. It can use less than the total available devices and since parity is distributed the parity could be written to any drive. I am sure that someone will correct me if the above is wrong. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Periodic flush
On Tue, 15 Apr 2008, Mark Maybee wrote: going to take 12sec to get this data onto the disk. This impedance mis-match is going to manifest as pauses: the application fills the pipe, then waits for the pipe to empty, then starts writing again. Note that this won't be smooth, since we need to complete an entire sync phase before allowing things to progress. So you can end up with IO gaps. This is probably what the original submitter is seeing. Yes. With an application which also needs to make best use of available CPU, these I/O gaps cut into available CPU time (by blocking the process) unless the application uses multithreading and an intermediate write queue (more memory) to separate the CPU-centric parts from the I/O-centric parts. While the single-threaded application is waiting for data to be written, it is not able to read and process more data. Since reads take time to complete, being blocked on write stops new reads from being started so the data is ready when it is needed. There is one down side to this new model: if a write load is very bursty, e.g., a large 5GB write followed by 30secs of idle, the new code may be less efficient than the old. In the old code, all This is also a common scenario. :-) Presumably the special slow I/O code would not kick in unless the burst was large enough to fill quite a bit of the ARC. Real time throttling is quite a challenge to do in software. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 24-port SATA controller options?
On Wed, 16 Apr 2008, David Magda wrote: RAID5 and RAID6 rebuild the entire disk while raidz1 and raidz2 only rebuild existing data blocks so raidz1 and raidz2 are less likely to experience media failure if the pool is not full. While the failure statistics may be different, I think any comparison would be apples-to-apples. Note that if the pool is only 10% full, then it is 10X less likely to experience a media failure during rebuild than traditional RAID-5/6 with the same disks. In addition to this, zfs replicates metadata and writes the copies to different disks depending on the redundancy strategy. A traditional filesystem on traditional RAID does not have this same option (having no knowledge of the underlying disks) even though it does replicate some essential metadata (multiple super blocks). Since my time on this list, the vast majority of reports have been of the nature "my pool did not come back up after system crash" or "the pool stopped responding", and not that their properly redundant pool lost some user data. This indicates that the storage principles are quite sound but the implementation (being relatively new) still has a few rough edges. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS raidz write performance:what to expect from SATA drives on ICH9R (AHCI)
On Thu, 17 Apr 2008, Tim wrote: Along those lines, I'd *strongly* suggest running Jeff's script to pin down whether one drive is the culprit: But that script only tests read speed and Pascal's read performance seems fine. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Solaris 10U5 ZFS features?
Even though I am on a bunch of Sun propaganda lists, I have not yet spotted an announcement for Solaris 10U5 even though it is now available for download. Sun's formal web site is useless for comparing what is in different update releases since its notion of What's New is a comparison with Solaris 9 and Solaris 8, which are as old as dirt and it is not clear if and when this summary gets updated. Can someone please post a summary of any new ZFS features or significant fixes which are in Solaris 10U5? Is there value to upgrading a system to this release over and above what is provided by patches? Thanks, Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS raidz write performance:what to expect from SATA drives on ICH9R
On Fri, 18 Apr 2008, Pascal Vandeputte wrote: Thanks for all the replies! Some output from iostat -x 1 while doing a dd of /dev/zero to a file on a raidz of c1t0d0s3, c1t1d0 and c1t2d0 using bs=1048576: [ data removed ] It's all a little fishy, and kw/s doesn't differ much between the drives (but this could be explained as drive(s) with longer wait queues holding back the others I guess?). Your data does strongly support my hypothesis that using a slice on 'sd0' would slow down writes. It may also be that your boot drive is a different type and vintage from the other drives. Testing with output from /dev/zero is not very good since zfs treats blocks of zeros specially. I have found 'iozone' (http://www.iozone.org/) to be quite useful for basic filesystem throughput testing. Hmm, doesn't look like one drive holding back another one, all of them seem to be equally slow at writing. Note that if drives are paired, or raidz requires a write to all drives, then the write rate is necessarily limited to the speed of the slowest device. I suspect that your c1t1d0 and c1t2d0 drives are similar type and vintage whereas the boot drive was delivered with the computer and has different performance characteristics (double whammy). Usually drives delivered with computers are selected by the computer vendor based on lowest cost in order to decrease the cost of the entire computer. SATA drives are cheap these days so perhaps you can find a way to add a fourth drive which is at least as good as the drives you are using for c1t1d0 and c1t2d0. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
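A basic iozone run for this sort of sequential write/read test might look like the following (the parameters and file location are only illustrative; the file size should comfortably exceed RAM so the ARC does not hide the disks):

  iozone -i 0 -i 1 -s 8g -r 128k -f /tank/iozone.tmp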
Re: [zfs-discuss] ZFS raidz write performance:what to expect from SATA drives on ICH9R
On Fri, 18 Apr 2008, Pascal Vandeputte wrote: - does Solaris require a swap space on disk No, Solaris does not require a swap space. However you do not have a lot of memory so when there is not enough virtual memory available, programs will fail to allocate memory and quit running. There is an advantage to having a swap area since then Solaris can put rarely used pages in swap to improve overall performance. The memory can then be used for useful caching (e.g. ZFS ARC), or for your applications. In addition to using a dedicated partition, you can use a file on UFS for swap ('man swap') and ZFS itself is able to support a swap volume. I don't think that you can put a normal swap file on ZFS so you would want to use ZFS's built-in support for that. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
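A minimal sketch of the ZFS swap volume approach mentioned above (the pool name, volume name, and size are placeholders):

  zfs create -V 2g tank/swapvol
  swap -a /dev/zvol/dsk/tank/swapvol
  swap -l                              # verify the new swap device is listed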
Re: [zfs-discuss] ZFS raidz write performance:what to expect from SATA drives on ICH9R
On Sun, 20 Apr 2008, A Darren Dunham wrote: I think these paragraphs are referring to two different concepts with swap. Swapfiles or backing store in the first, and virtual memory space in the second. The swap area is mis-named since Solaris never swaps. Some older operating systems would put an entire program in the swap area when the system ran short on memory and would have to swap between programs. Solaris just pages (a virtual memory function) and it is very smart about how and when it does it. Only dirty pages which are not write-mapped to a file in the filesystem need to go in the swap area, and only when the system runs short on RAM. Solaris is a quite-intensely memory-mapped system. The memory mapping allows a huge amount of sharing of shared library files, program text images, and unmodified pages shared after fork(). The end result is a very memory-efficient OS. Now if we could just get ZFS ARC and Gnome Desktop to not use any memory, we would be in nirvana. :-) Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS raidz write performance:what to expect from SATA drives on ICH9R
On Sat, 19 Apr 2008, michael schuster wrote: that's true most of the time ... unless free memory gets *really* low, then Solaris *does* start to swap (ie page out pages by process). IIRC, the threshold for swapping is minfree (measured in pages), and the value that needs to fall below this threshold is freemem. Most people here are likely too young to know what swapping really is. Swapping is not the same as the paging that Solaris does. With swapping the kernel knows that this address region belongs to this process and we are short of RAM so block copy the process to the swap area, and only remember that it exists via the process table. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] backup for x4500?
On Sun, 20 Apr 2008, Peter Tribble wrote: Does anyone here have experience of this with multi-TB filesystems and any of these solutions that they'd be willing to share with me please? My experience so far is that anything past a terabyte and 10 million files, and any backup software struggles. What is the cause of the struggling? Does the backup host run short of RAM or CPU? If backups are incremental, is a large portion of time spent determining the changes to be backed up? What is the relative cost of many small files vs large files? How does 'zfs send' performance compare with a traditional incremental backup system? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] backup for x4500?
On Sun, 20 Apr 2008, Peter Tribble wrote: What is the cause of the struggling? Does the backup host run short of RAM or CPU? If backups are incremental, is a large portion of time spent determining the changes to be backed up? What is the relative cost of many small files vs large files? It's just the fact that, while the backup completes, it can take over 24 hours. Clearly this takes you well over any backup window. It's not so much that the backup software is defective; it's an indication that traditional notions of backup need to be rethought. There is no doubt about that. However, there are organizations with hundreds of terabytes online and they manage to survive somehow. I receive bug reports from people with 600K files in a single subdirectory. Terabyte-sized USB drives are available now. When you say that the backup can take over 24 hours, are you talking only about the initial backup, or incrementals as well? I have one small (200G) filesystem that takes an hour to do an incremental with no changes. (After a while, it was obvious we don't need to do that every night.) That is pretty outrageous. It seems that your backup software is suspect since it must be severely assaulting the filesystem. I am using 'rsync' (version 3.0) to do disk-to-disk network backups (with differencing) to a large Firewire type drive and have not noticed any performance issues. I do not have 10 million files though (I have about half of that). Since zfs supports really efficient snapshots, a backup system which is aware of snapshots can take snapshots and then backup safely even if the initial dump takes several days. Really smart software could perform both initial dump and incremental dump simultaneously. The minimum useful incremental backup interval would still be limited to the time required to do one incremental backup. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
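Combining the two ideas above, rsync differencing against a ZFS snapshot as a stable source might look roughly like this (the filesystem and path names are placeholders, and this is only a sketch):

  zfs snapshot tank/home@nightly
  rsync -aH --delete /tank/home/.zfs/snapshot/nightly/ /backup/home/
  zfs destroy tank/home@nightly

Using the snapshot as the rsync source means the files cannot change underneath the backup while it runs.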
[zfs-discuss] ZFS for write-only media?
Are there any plans to support ZFS for write-only media such as optical storage? It seems that if mirroring or even raidz is used, ZFS would be a good basis for long term archival storage. Has this been considered? I expect that it is possible today by using files as the underlying media and then copying those individual files to optical storage. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for write-only media?
On Mon, 21 Apr 2008, Dana H. Myers wrote: Bob Friesenhahn wrote: Are there any plans to support ZFS for write-only media such as optical storage? It seems that if mirroring or even zraid is used that ZFS would be a good basis for long term archival storage. I'm just going to assume that write-only here means write-once, read-many, since it's far too late for an April Fool's joke. Yes, of course. Such as to CD-R, DVD-RW, or more exotic technologies such as holographic drives (300GB drives are on the market). For example, with two CD-R drives it should be possible to build a ZFS mirror on two CDs, but the I/O to these devices may need to be done in a linear sequential fashion at a rate sufficient to keep the writer happy, so temporary files (or memory-based buffering) likely need to be used. No one wants to be faced with a situation in which two copies are made to CD but both copies are deemed to be bad when they are read. ZFS could make that situation much better. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for write-only media?
On Mon, 21 Apr 2008, Mark A. Carlson wrote: Maybe what you want is to archive files off to optical media? Perhaps ADM - http://opensolaris.org/os/project/adm ? That looks interesting, but true archiving is needed. The level of archiving for this application is that copies would be kept thousands of feet underground in a stable salt mine on continents 'A' and 'B'. An alternative is special temperature, humidity, and pressure controlled above-ground bunkers. It is desired that the data be preserved for hundreds or a thousand years, which would of course require copying to more modern media every so often. The cost to create the original data is up to $200 million (today's cost) and it can not be recreated. The size of the originals to be archived ranges from 2TB to 400TB depending on how deep the archiving is. The existing archive approach is in analog form but it is found that there is noticeable degradation after 50 or 100 years which is not possible to fully correct. When I saw a discussion of these requirements today, ZFS immediately came to mind due to its many media-independent error detection and correction features, and the fact that it is open source. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for write-only media?
On Mon, 21 Apr 2008, Mark A. Carlson wrote: Interesting problem. And yes you are right, there are a number of problems to solve here, see: http://blogs.sun.com/mac/en_US/entry/open_archive Standards and open source are clearly the way to go. Many open source applications have already been demonstrated to last far longer than their commercial counterparts. ZFS is open sourced but it is perhaps not mature and widespread enough yet to be seen as a stable long-term storage standard. The problem is a long term problem so there seems to be opportunity here for ZFS if it is adapted somewhat to address archiving. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for write-only media?
On Tue, 22 Apr 2008, Ralf Bertling wrote: Hi Bob, If I was willing to do that I would simply build a pool from file-based storage consisting of n ISO images. It would involve the following steps:

1. create blank ISO images of the size of your media
2. zpool create wormyz raidz2 image1.iso image2.iso image3.iso ...
3. move your data to the pool
4. export the pool
5. burn the media

If you need to recover, copy the data from the device using dd conv=sync,noerror Yes, I know that this will work and it is what I thought of. But I was thinking that perhaps ZFS would be able to attach to the read-only pool. At the moment it is likely not willing to attach to read-only devices since part of its function depends on writing. The problem here is that by putting the data away from your machine, you lose the chance to scrub it on a regular basis, i.e. there is always the risk of silent corruption. Running a scrub is pointless since the media is not writeable. :-) I am not an expert, but the MTTDL is in thousands of years when using raidz2 with a hot-spare and regular scrubbing. A thousand years ago, knights were storming castle walls. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
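A condensed sketch of the file-backed approach described above, using a simple mirror instead of raidz2 (the sizes, file names, and burning step are placeholders and untested):

  mkfile 4g /export/stage/disc1.img /export/stage/disc2.img
  zpool create archive mirror /export/stage/disc1.img /export/stage/disc2.img
  cp -r /data/to/preserve /archive/
  zpool export archive
  # burn disc1.img and disc2.img to two separate discs with your preferred tool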
Re: [zfs-discuss] ZFS for write-only media?
On Tue, 22 Apr 2008, Jonathan Loran wrote: But that's the point. You can't correct silent errors on write once media because you can't write the repair. Yes, you can correct the error (at time of read) due to having both redundant media, and redundant blocks. That is a normal function of ZFS. It just not possible to correct the failed block on the media by re-writing it or moving its data to a new location. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for write-only media?
On Tue, 22 Apr 2008, Jonathan Loran wrote: I suppose with ditto blocks, this has some merit. Someone needs to characterize how errors propagate on different types of WORM media. Perhaps this has already been done. In my experience, when DVD-Rs go south, they really go bad at once. Not a lot of small bit errors. But a full analysis would be good. Probably it would make the most sense to write mirrored WORM disks with different technology to hedge your bets. It does not really matter since ZFS supports various forms of RAID, including arbitrary mirroring. If possible, the media can be purchased from different vendors so there is less chance of similar bit-rot across the lot. With $40 to $200 million spent per project, a few extra copies is in the noise. :-) Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Diverse, Dispersed, Distributed, Unscheduled RAID volumes
On Fri, 25 Apr 2008, Richard Elling wrote: No. ZFS is not a distributed file system. While the results might not be pretty, if each PC exports a drive via iSCSI and mirroring is used with plenty of PCs in each mirror, it seems like it would work but with likely dismal performance if a PC was turned off (retries and 3+ minute iSCSI failure recovery logic). There would be additional dismal performance when the PC is turned back on due to cumulative resilvering. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs data corruption
On Sat, 26 Apr 2008, Carson Gaspar wrote: It's not safe to jump to this conclusion. Disk drivers that support FMA won't log error messages to /var/adm/messages. As more support for I/O FMA shows up, you won't see random spew in the messages file any more. mode=large financial institution paying support customer That is a Very Bad Idea. Please convey this to whoever thinks that they're helping by not sysloging I/O errors. If this shows up in Solaris 11, we will Not Be Amused. Lack of off-box error logging will directly cause loss of revenue. /mode I am glad to hear that your large financial institution (Bear Stearns?) is contributing to the OpenSolaris project. :-) Today's systems are very complex and may contain many tens of disks. Syslog is a bottleneck and often logs to local files, which grow very large, and hinder system performance while many log messages are being reported. If syslog is to a remote host, then the network is also impacted. If a device (or several inter-related devices) is/are experiencing problems, it seems best to isolate and diagnose it, with one intelligent notification rather than spewing hundreds of thousands of low-level error messages to a system logger. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS - Implementation Successes and Failures
On Mon, 28 Apr 2008, Dominic Kay wrote: I'm not looking to replace the Best Practices or Evil Tuning guides but to take a slightly different slant. If you have been involved in a ZFS implementation small or large and would like to discuss it either in confidence or as a referenceable case study that can be written up, I'd be grateful if you'd make contact. Back in February I set up ZFS on a 12-disk StorageTek 2540 array and documented my experience (at that time) in the white paper available at http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf. Since then I am still quite satisfied. ZFS has yet to report a bad block or cause me any trouble at all. The only complaint I would have is that 'cp -r' performance is less than would be expected given the raw bandwidth capacity. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs performance so bad on my system
On Tue, 29 Apr 2008, Krzys wrote: I am not sure; I had a very OK system when I originally built it and when I originally started to use zfs, but now it is so horribly slow. I do believe that the number of snaps that I have is causing it. This seems like a bold assumption without supportive evidence.

# zpool list
NAME      SIZE    USED    AVAIL   CAP   HEALTH   ALTROOT
mypool    278G    255G    23.0G   91%   ONLINE   -
mypool2   1.59T   1.54T   57.0G   96%   ONLINE   -

Very full! For example I am trying to copy a 1.4G file from my /var/mail to the /d/d1 directory, which is a zfs file system on the mypool2 pool. It takes 25 minutes to copy it, while copying it to the tmp directory only takes a few seconds. What's wrong with this? Why is it so slow to copy that file to my zfs file system? Not good. Some filesystems get slower when they are almost full since they have to work harder to find resources and verify quota limits. I don't know if that applies to ZFS. However, it may be that you have one or more disks which are experiencing many soft errors (several re-tries before success) and maybe you should look into that first. ZFS runs on top of a bunch of other subsystems and drivers so if those other subsystems and drivers are slow to respond then ZFS will be slow. With your raidz2 setup, all it takes is one slow disk to slow everything down. I suggest using 'iostat -e' to check for device errors, and 'iostat -x' (while doing the copy) to look for suspicious device behavior. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
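Concretely, a first look could be along these lines (run the second command in another window while the slow copy is in progress):

  iostat -e      # cumulative soft/hard/transport error counts per device
  iostat -x 5    # per-device service times and %b sampled every 5 seconds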
Re: [zfs-discuss] share zfs hierarchy over nfs
On Tue, 29 Apr 2008, Tim Wood wrote: but that makes it sound like this issue was resolved by changing the NFS client behavior in Solaris. Since my NFS client machines are going to be Linux machines that doesn't help me any. Yes, Solaris 10 does nice helpful things that other OSs don't do. I use per-user ZFS filesystems so I encountered the same problem. It is necessary to force the automounter to request the full mount path. On Solaris and OS-X Leopard client systems I use an /etc/auto_home like

  # Home directory map for automounter
  #
  *    freddy:/home/&

which also works for Solaris 9 without depending on the Solaris 10 feature. For FreeBSD (which uses the am-utils automounter) I figured out this horrific looking map incantation:

  * type:=nfs;rhost:=freddy;rfs:=/home/${key};fs:=${autodir}/${rhost}${rfs};opts:=rw,grpid,resvport,vers=3,proto=tcp,nosuid,nodev

So for Linux, I think that you will also need to figure out an indirect-map incantation which works for its own broken automounter. Make sure that you read all available documentation for the Linux automounter so you know which parts don't actually work. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] share zfs hierarchy over nfs
On Tue, 29 Apr 2008, Jonathan Loran wrote: Oh contraire Bob. I'm not going to boost Linux, but in this department, they've tried to do it right. If you use Linux autofs V4 or higher, you can use Sun style maps (except there's no direct maps in V4. Need V5 for direct maps). For our home directories, which use an indirect map, we just use the Solaris map, thus:

  auto_home:
  *    zfs-server:/home/&

Sorry to be so off (ZFS) topic. I am glad to hear that the Linux automounter has moved forward since my experience with it a couple of years ago, when indirect maps were documented but also documented not to actually work. :-) I don't think that this discussion is off-topic. Filesystems are so easy to create with ZFS that it has become popular to create per-user filesystems. It would be useful if the various automounter incantations to make everything work would appear in a ZFS-related Wiki somewhere. This can be an embarrassing situation for the system administrator who thinks that everything is working fine due to testing with Solaris 10 clients. So he switches all the home directories to ZFS per-user filesystems overnight. Imagine the frustration and embarrassment when that poor system administrator returns the next day and finds that many users can not access their home directories! Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS still crashing after patch
On Thu, 1 May 2008, Rustam wrote: Today my production server crashed 4 times. THIS IS NIGHTMARE! Self-healing file system?! For me ZFS is SELF-KILLING filesystem. I cannot fsck it, there's no such tool. I cannot scrub it, it crashes 30-40 minutes after scrub starts. I cannot use it, it crashes a number of times every day! And with every crash number of checksum failures is growing: Is your ZFS pool configured with redundancy (e.g. mirrors, raidz) or is it non-redundant? If non-redundant, then there is not much that ZFS can really do if a device begins to fail. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS still crashing after patch
On Thu, 1 May 2008, Rustam wrote: operating system: 5.10 Generic_127112-07 (i86pc) Seems kind of old. I am using Generic_127112-11 here. Probably many hundreds of nasty bugs have been eliminated since the version you are using. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS still crashing after patch
On Mon, 5 May 2008, Marcelo Leal wrote: Hello, If you believe that the problem can be related to ZIL code, you can try to disable it to debug (isolate) the problem. If it is not a fileserver (NFS), disabling the zil should not impact consistency. In what way is NFS special when it comes to ZFS consistency? If NFS consistency is lost by disabling the zil then local consistency is also lost. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
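For reference, the debugging-only ZIL switch being discussed was typically flipped with mdb on live systems of that era; the variable name may differ between releases, and this is only for isolating a problem, never for production use:

  echo zil_disable/W0t1 | mdb -kw    # disable the ZIL (debug only)
  echo zil_disable/W0t0 | mdb -kw    # restore normal behavior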
Re: [zfs-discuss] ZFS still crashing after patch
On Mon, 5 May 2008, eric kustarz wrote: That's not true: http://blogs.sun.com/erickustarz/entry/zil_disable Perhaps people are using consistency to mean different things here... Consistency means that fsync() assures that the data will be written to disk so no data is lost. It is not the same thing as no corruption. ZFS will happily lose some data in order to avoid some corruption if the system loses power. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS still crashing after patch
On Mon, 5 May 2008, Marcelo Leal wrote: I'm calling consistency, a coherent local view... I think that was one option to debug (if not a NFS server), without generate a corrupted filesystem. In other words your flight reservation will not be lost if the system crashes. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and disk usage management?
On Mon, 5 May 2008, [EMAIL PROTECTED] wrote: The problem is the fact that NFS mounts cannot be done across filesystems as implemented with ZFS and Solaris 10. For example, we have client machines mounting to /groups/accounting... but we also have clients mounting to /groups directly. On my system I have a /home filesystem, and then I have additional logical-per user filesystems underneath. I know that I can mount /home directly but I currently automount the per-user filesystems since otherwise user permissions and filesystem quotas are not visible to the client for anything other than Solaris 10. I assume that ZFS quotas are enforced even if the current size and space free is not included in the user visible 'df'. Is that not true? Presumably applications get some unexpected error when the quota limit is hit since the client OS does not know the real amount of space free. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
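For context, the per-user layout referred to above is typically created along these lines (the pool name, filesystem names, and quota are placeholders):

  zfs create tank/home/username
  zfs set quota=10g tank/home/username
  zfs set sharenfs=rw tank/home/username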
Re: [zfs-discuss] ZFS and Linux
On Tue, 6 May 2008, Bill McGonigle wrote: That file says 'Copyright 2007 Sun Microsystems, Inc.', though, so Sun has the rights to do this. But being GPLv2 code, why do I have any patent rights to include/redistribute that grub code in my (theoretical) product (let's assume it does something that is covered By releasing this bit of code to Grub under the GPL v2 license, Sun has effectively transferred rights to use that scrap of code (in any context) regardless of any Sun patents which may apply. However, it seems that the useful ZFS patents would be for writing/updating the filesystem rather than reading from it. You can be sure that Sun put as little ZFS code in Grub as was possible (and not just for license reasons). Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sanity check -- x4500 storage server for enterprise file service
On Wed, 7 May 2008, Paul B. Henson wrote: I was thinking about allocating 2 drives for the OS (SVM mirroring, pending ZFS boot support), two hot spares, and allocating the other 44 drives as mirror pairs into a single pool. While this will result in lower available space than raidz, my understanding is that it should provide much better performance. Is there anything potentially problematic about this configuration? Low-level disk performance analysis is not really my field. It sounds quite solid. The load should be quite nicely distributed across the mirrors. It seems like kind of a waste to allocate 1TB to the operating system; would there be any issue in taking a slice of those boot disks and creating a zfs mirror with them to add to the pool? You don't want to go there. Keep in mind that there is currently no way to reclaim a device after it has been added to the pool other than substituting another device for it. Also, the write performance to these slices would be less than normal. If I were you, I would keep more disks spare in the beginning and see how the system is working. If everything is working great, then add more disks to the pool. Once disks are added to the pool, they are committed. An advantage of load-shared mirrors is that more pairs can be added at any time. You need enough disks in the system to satisfy current disk space and I/O rate requirements, but it is not necessary to start off with all the disks added to the pool. Disks added earlier will be initially more loaded up than disks added later. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
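As a sketch of that layout (device names are placeholders; on an x4500 the members of each pair would be spread across different controllers):

  zpool create tank \
    mirror c0t0d0 c1t0d0 \
    mirror c0t1d0 c1t1d0 \
    mirror c0t2d0 c1t2d0
  # ...continue for as many pairs as you want to commit now; grow later with:
  zpool add tank mirror c0t3d0 c1t3d0
  zpool add tank spare c5t7d0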
Re: [zfs-discuss] Image with DD from ZFS partition
On Wed, 7 May 2008, Hans wrote: Hello, can I create an image from ZFS with the dd command? When I work with Linux I use partimage to create an image from one partition and store it on another, so I can restore it after an error. partimage does not work with ZFS, so I must use the dd command. I think so: dd if=/dev/sda1 of=/backup/image. Can I create an image this way, and restore it the other way: dd if=/backup/image of=/dev/sda1? When I have two partitions with ZFS, can I boot from the live CD and mount one partition to use it as a backup target? Or is it possible to create an ext2 partition and use a Linux rescue CD to back up the ZFS partition with dd? While the methods you describe are not the zfs way of doing things, they should work. The zfs pool would need to be offlined (taken completely out of service, via zpool export) before backing it up via raw devices with dd. Every raw device in the pool would need to be backed up at that time in order to make a valid restore possible. Once the devices in the pool have been copied, the pool can be re-imported to activate it. This approach is quite a lot of work and the pool is not available during this time. It is much better to do things the zfs way since then the pool can still be completely active. Taking a snapshot takes less than a second. Then you can send the filesystems to be backed up to a file or to another system. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
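A minimal sketch of that zfs way (the pool, filesystem, and file names are placeholders):

  zfs snapshot mypool/data@backup1
  zfs send mypool/data@backup1 > /backup/mypool-data-backup1.zfs
  # restore later with:
  zfs receive mypool/restored < /backup/mypool-data-backup1.zfs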
Re: [zfs-discuss] Sanity check -- x4500 storage server for enterprise file service
On Thu, 8 May 2008, Ross wrote: protected even if a disk fails. I found this post quite an interesting read: http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl Richard's blog entry does not tell the whole story. ZFS does not protect against memory corruption errors and CPU execution errors except for in the validated data path. It also does not protect you against kernel bugs, corrosion, meteorite strikes, or civil unrest. As a result, the MTTDL plots (which only consider media reliability and redundancy) become quite incorrect as they reach stratospheric levels. Note that Richard does include a critical disclaimer: The MTTDL calculation is one attribute of Reliability, Availability, and Serviceability (RAS) which we can also calculate relatively easily. Notice the operative word one. The law of diminishing returns still applies. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sanity check -- x4500 storage server for enterprise file service
On Thu, 8 May 2008, Ross Smith wrote: True, but I'm seeing more and more articles pointing out that the risk of a secondary failure is increasing as disks grow in size, and Quite true. While I'm not sure of the actual error rates (Western Digital list their unrecoverable rates as 1 in 10^15), I'm very conscious that if you have any one disk fail completely, you are then reliant on being able to read without error every single bit of data from every other disk in that raid set. I'd much rather have dual parity and know that single bit errors are still easily recoverable during the rebuild process. I understand the concern. However, the published unrecoverable rates are for the completely random write/read case. ZFS validates the data read for each read and performs a repair if a read is faulty. Doing a zfs scrub forces all of the data to be read and repaired if necessary. Assuming that the data is read (and repaired if necessary) on a periodic basis, the chance that an unrecoverable read will occur will surely be dramatically lower. This of course assumes that the system administrator pays attention and proactively replaces disks which are reporting unusually high and increasing read failure rates. It is a simple matter of statistics. If you have read a disk block successfully 1000 times, what is the probability that the next read from that block will spontaneously fail? How about if you have read from it successfully a million times? Assuming a reasonably designed storage system, the most likely cause of data loss is human error due to carelessness or confusion. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss