[zfs-discuss] file system under heavy load, how to find out what the cause is?
Has anyone any idea what's going on here?

Cheers

Carsten
--
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6
CaCert Assurer | Get free certificates from http://www.cacert.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Resilver/scrub times?
Hi

On Sunday 19 December 2010 11:12:32 Tobias Lauridsen wrote:
> sorry to bring the old one up, but I think it is better than making a new one?? Is there someone who has some resilver times from a raidz1/2 pool with 5TB+ of data on it?

If you just look at the discussion over the past day (or week), you will see that the resilver time depends on the amount of writes hitting the system while it resilvers. On an idle system you might be able to guesstimate it by taking the disk size and the number of IOPS of the disk and the system into account; usually a couple of hours should be about right.

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
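As a back-of-the-envelope version of that guesstimate (the numbers below are illustrative assumptions, not measurements from this thread): a resilver has to rewrite roughly the used space of the replaced disk, so dividing that by a sustained rebuild rate gives a lower bound for an idle pool.

  # assumed: ~500 GB of used data on the replaced disk and ~50 MB/s
  # sustained resilver rate on an otherwise idle vdev
  echo "500 * 1024 / 50 / 3600" | bc -l      # roughly 2.8 hours

On a busy pool the effective rate can easily drop by an order of magnitude, which is why the resilver time really depends on the concurrent write load.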
Re: [zfs-discuss] resilver that never finishes
Hi all,

one of our systems just developed something remotely similar:

s06:~# zpool status
  pool: atlashome
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 67h18m, 100.00% done, 0h0m to go
config:

        NAME              STATE     READ WRITE CKSUM
        atlashome         DEGRADED     0     0     0
          raidz2-0        DEGRADED     0     0     0
            c0t0d0        ONLINE       0     0     0
            c1t0d0        ONLINE       0     0     0
            c5t0d0        ONLINE       0     0     0
            replacing-3   DEGRADED     0     0     0
              c7t0d0s0/o  FAULTED      0     0     0  corrupted data
              c7t0d0      ONLINE       0     0     0  678G resilvered
[...]

It has been 100% done for more than a day now; the system is running fully patched Solaris 10 (patch reference from September 10th or 13th, I believe). Has someone an idea how it is possible to resilver 678G of data on a 500G drive?

s06:~# iostat -En c7t0d0
c7t0d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: HITACHI HDS7250S  Revision: AV0A  Serial No:
Size: 500.11GB 500107861504 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 197 Predictive Failure Analysis: 0

I still have to upgrade the zpool version, but wanted to wait for the resilver to complete. Any ideas?

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver that never finishes
Hi

On Saturday 18 September 2010 10:02:42 Ian Collins wrote:
> I see this all the time on a troublesome Thumper. I believe this happens because the data in the pool is continuously changing.

Ah ok, that may be it; there is one particularly active user on this box right now. Interesting, I've never seen this in the past. Is there really an end to this, and do I just have to wait?

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Suggested RaidZ configuration...
Hi

On Monday 06 September 2010 17:53:44 hatish wrote:
> I'm setting up a server with 20x1TB disks. Initially I had thought to set up the disks using 2 RaidZ2 groups of 10 disks. However, I have just read the Best Practices guide, and it says your group shouldn't have more than 9 disks. So I'm thinking a better configuration would be 2 x 7-disk RaidZ2 + 1 x 6-disk RaidZ2. However, that is 14TB worth of data instead of 16TB. What are your suggestions and experiences?

Another point is that within one pool all vdevs should be equal, i.e. not mixed like 2x7 and 1x6 (you will most likely need to force that configuration anyway). First, I'd assess what you want/expect from this file system in the end: maximum performance, maximum reliability or maximum size - as always, pick two ;)

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
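If you do end up preferring two equal vdevs despite the more-than-9-disks note in the guide, a minimal sketch of that layout (device names are placeholders, not taken from the original post):

  zpool create tank \
    raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c0t8d0 c0t9d0 \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0

Whether 10-disk raidz2 vdevs perform well enough for your workload is something only a test with your own data will tell.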
Re: [zfs-discuss] ZFS development moving behind closed doors
On Sunday 15 August 2010 11:56:22 Joerg Moellenkamp wrote:
> And by the way: Wasn't there a comment by Linus Torvalds recently that people should move their low-quality code into the codebase??? ;)

Yeah, that code should be put into the staging part of the codebase, so that (more) people can work on it and improve insufficient-quality code with a great idea behind it until it meets the quality of the mainline kernel.

As you rightly pointed out, this is a development model which works nicely with open source in an open environment, where developers are spread all around the globe and have widely varying programming skills. I don't think that something like this would work in a (possibly much smaller) corporate environment/software engineering group.

That said, I think it's actually a very good thing to have this opportunity to push low-quality/non-conforming software into a controlled environment for polishing.

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Find out which of many FS from a zpool is busy?
Hi all,

sorry if this is in some FAQ - then I've clearly missed it.

Is there an easy or at least straightforward way to determine which of n ZFS file systems is currently under heavy NFS load? Once upon a time, when one had old-style file systems and exported these as a whole, iostat -x came in handy; however, with zpools this is not the case anymore, right?

Imagine:

zpool create tank ...   (many devices here)
zfs set sharenfs=on tank
zfs create tank/a
zfs create tank/b
zfs create tank/c
[...]
zfs create tank/z

Now you have this lovely number of ZFS file systems, but how do you find out which user is currently (ab)using the system most?

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Find out which of many FS from a zpool is busy?
Hi

On Thursday 22 April 2010 16:33:51 Peter Tribble wrote:
> fsstat? Typically along the lines of fsstat /tank/* 1

Sh**, I knew about fsstat but never ever even tried to run it on many file systems at once. D'oh. *sigh* Well, at least a good one for the archives...

Thanks a lot!

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
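For the archives, a small sketch that saves typing all the mountpoints by hand (it assumes the datasets use their default mountpoints; the pipeline is illustrative, not from the original post):

  # report per-filesystem activity once per second for every dataset in tank
  fsstat $(zfs list -H -o mountpoint -r tank | grep '^/') 1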
[zfs-discuss] ZFS-8000-8A: Able to go back to normal without destroying whole pool?
Hi all,

on Friday night two disks in one raidz2 vdev decided to die within a couple of minutes. Swapping drives and resilvering one at a time worked quite OK; however, now I'm faced with a nasty problem:

s07:~# zpool status -v
  pool: atlashome
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed with 1 errors on Sat Apr 10 10:23:14 2010
[...]

errors: Permanent errors have been detected in the following files:

        atlashome/BACKUP/userA:0x962de4

The web page is pretty generic (IMHO). I would like to restore this file from backup (or actually from its origin) or simply unlink it permanently. But how do I find this blob, and how do I fix it without restoring the full pool (is it really a pool issue, or a file system one)?

TIA for any advice

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
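One way to hunt down such an entry (a sketch based on the assumption that the hex value after the dataset name is the object id, which for a plain file equals its inode number - this is not from the original thread):

  # 0x962de4 (hex) = 9842148 (decimal); look for a file with that inode
  # number inside the affected dataset
  find /atlashome/BACKUP/userA -xdev -inum 9842148
  # restore or unlink whatever shows up, then scrub again
  zpool scrub atlashome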
Re: [zfs-discuss] ZFS on a 11TB HW RAID-5 controller
Hi

On Wednesday 24 March 2010 17:01:31 Dusan Radovanovic wrote:
> connected to P212 controller in RAID-5. Could someone direct me or suggest what I am doing wrong. Any help is greatly appreciated.

I don't know what is going wrong, but here is how I would get around it: configure the HW RAID controller to act as a dumb JBOD controller and thus make the 12 disks visible to the OS. Then you can start playing around with ZFS on these disks, e.g. creating different pools:

zpool create testpool raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
  raidz c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0 c0t11d0

(Caveat: this is off the top of my head and might be very wrong.) This would create something like RAID50. Then I would start reading, reading and testing and testing :)

HTH

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
Hi all On Thursday 18 March 2010 13:54:52 Joerg Schilling wrote: If you have no technical issues to discuss, please stop insulting people/products. We are on OpenSolaris and we don't like this kind of discussions on the mailing lists. Please act collaborative. May I suggest this to both of you. It has been widely discussed here already that the output of zfs send cannot be used as a backup. Depends on the exact definition of backup, e.g. if I may take this from wikipedia: In information technology, a backup or the process of backing up refers to making copies of data so that these additional copies may be used to restore the original after a data loss event. In this regard zfs send *could* be a tool for a backup provided you have the means of decrypting/deciphering the blob coming out of it. OTOH if I used zfs send to replicate data to another machine/location together with zfs receive and put a label backup onto the receiver this would also count as a backup from where you can restore everything and/or partially. In case of 'star' the blob coming out of it might also be useless if you don't have star (or other tools) around for deciphering it - very unlikely, but still possible ;) Of course your (plural!) definition of backup may vary, thus I would propose first to settle on this before exchanging blows... Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs/sol10u8 less stable than in sol10u5?
Hi all, it might not be a ZFS issue (and thus on the wrong list), but maybe there's someone here who might be able to give us a good hint: We are operating 13 x4500 and started to play with non-Sun blessed SSDs in there. As we were running Solaris 10u5 before and wanted to use them as log devices we upgraded to the latest and greatest 10u8 and changed the zpool layout[1]. However, on the first machine we found many, many problems with various disks failing in different vdevs (I wrote about this in December on this list IIRC). After going through this with Sun they gave us hints but mostly blamed (maybe rightfully the Intel X25e in there), we considered the 2.5 to 2.5 converter to be at fault as well. Thus we did the next test by placing the SSD into the tray without a conversion unit, but that box (a different one) failed with the same problems. Now, we learned from this experience and did the same to another box but without the SSD, i.e. jumpstarted the box and installed 10u8, redid the zpool and started to fill data in. In today's scrub suddenly this happened: s09:~# zpool status pool: atlashome state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: resilver in progress for 0h9m, 3.89% done, 4h2m to go config: NAME STATE READ WRITE CKSUM atlashome DEGRADED 0 0 0 raidz1 ONLINE 0 0 0 c0t0d0ONLINE 0 0 0 c1t0d0ONLINE 0 0 0 c4t0d0ONLINE 0 0 0 c6t0d0ONLINE 0 0 0 c7t0d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t1d0ONLINE 0 0 0 c1t1d0ONLINE 0 0 0 c4t1d0ONLINE 0 0 0 c5t1d0ONLINE 0 0 0 c6t1d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c7t1d0ONLINE 0 0 1 c0t2d0ONLINE 0 0 0 c1t2d0ONLINE 0 0 2 c4t2d0ONLINE 0 0 0 c5t2d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c6t2d0ONLINE 0 0 0 c7t2d0ONLINE 0 0 0 c0t3d0ONLINE 0 0 0 c1t3d0ONLINE 0 0 0 c4t3d0ONLINE 0 0 0 raidz1 DEGRADED 0 0 0 c5t3d0ONLINE 0 0 0 c6t3d0ONLINE 0 0 0 c7t3d0ONLINE 0 0 0 c1t4d0ONLINE 0 0 1 spare DEGRADED 0 0 0 c4t4d0 DEGRADED 5 011 too many errors c0t4d0 ONLINE 0 0 0 5.38G resilvered raidz1 ONLINE 0 0 0 c5t4d0ONLINE 0 0 0 c6t4d0ONLINE 0 0 0 c7t4d0ONLINE 0 0 0 c0t5d0ONLINE 0 0 0 c1t5d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c4t5d0ONLINE 0 0 0 c5t5d0ONLINE 0 0 0 c6t5d0ONLINE 0 0 0 c7t5d0ONLINE 0 0 0 c0t6d0ONLINE 0 0 1 raidz1 ONLINE 0 0 0 c1t6d0ONLINE 0 0 0 c4t6d0ONLINE 0 0 0 c5t6d0ONLINE 0 0 0 c6t6d0ONLINE 0 0 0 c7t6d0ONLINE 0 0 1 raidz1 ONLINE 0 0 0 c0t7d0ONLINE 0 0 0 c1t7d0ONLINE 0 0 0 c4t7d0ONLINE 0 0 0 c5t7d0ONLINE 0 0 0 c6t7d0ONLINE 0 0 0 spares c0t4d0 INUSE currently in use c7t7d0 AVAIL Also similar to the other hosts were the much, much higher Soft/Hard error count in iostat: s09:~# iostat -En|grep Soft c2t0d0 Soft Errors: 1 Hard Errors: 2 Transport
Re: [zfs-discuss] How to grow ZFS on growing pool?
Hi Jörg,

On Tuesday 02 February 2010 16:40:50 Joerg Schilling wrote:
> After that, the zpool did notice that there is more space:
> zpool list
> NAME   SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
> test   476M  1,28M   475M    0%  ONLINE  -

That's the size already after the initial creation; after exporting and importing it again:

# zpool list
NAME   SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
test   976M   252K   976M    0%  ONLINE  -

> the ZFS however did not grow:
> zfs list
> NAME   USED  AVAIL  REFER  MOUNTPOINT
> test   728K   251M   297K  /test

# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test   139K   549M  37.5K  /test

I think you fell into the trap that zpool just adds up all rows; this is especially visible on a Thumper when it is under heavy load, where the read and write operations per time slice for each vdev seem to be just the sums of the individual devices underneath. But this still does not explain why the pool is larger after exporting and re-importing it.

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
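For the archives: on releases that already have the autoexpand pool property, the export/import dance should not be needed; a sketch, assuming the property exists on your build (it was added after the early Solaris 10 updates):

  zpool set autoexpand=on test
  # or grow an individual vdev after its underlying device has been enlarged
  zpool online -e test c0t0d0       # device name is a placeholder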
Re: [zfs-discuss] x4500...need input and clarity on striped/mirrored configuration
On Thursday 21 January 2010 10:29:16 Edward Ned Harvey wrote:
> zpool create -f testpool mirror c0t0d0 c1t0d0 mirror c4t0d0 c6t0d0 mirror c0t1d0 c1t1d0 mirror c4t1d0 c5t1d0 mirror c6t1d0 c7t1d0 mirror c0t2d0 c1t2d0 mirror c4t2d0 c5t2d0 mirror c6t2d0 c7t2d0 mirror c0t3d0 c1t3d0 mirror c4t3d0 c5t3d0 mirror c6t3d0 c7t3d0 mirror c0t4d0 c1t4d0 mirror c4t4d0 c6t4d0 mirror c0t5d0 c1t5d0 mirror c4t5d0 c5t5d0 mirror c6t5d0 c7t5d0 mirror c0t6d0 c1t6d0 mirror c4t6d0 c5t6d0 mirror c6t6d0 c7t6d0 mirror c0t7d0 c1t7d0 mirror c4t7d0 c5t7d0 mirror c6t7d0 c7t7d0 mirror c7t0d0 c7t4d0
>
> This looks good. But you probably want to stick a spare in there, and add an SSD disk specified by log

May I jump in here and ask how people are using SSDs reliably in an x4500? So far we have had very little success with X25-E drives and a converter from 3.5 to 2.5 inches; two systems have shown pretty bad instabilities with that. Has anyone had success here?

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] x4500...need input and clarity on striped/mirrored configuration
Hi On Friday 22 January 2010 07:04:06 Brad wrote: Did you buy the SSDs directly from Sun? I've heard there could possibly be firmware that's vendor specific for the X25-E. No. So far I've heard that they are not readily available as certification procedures are still underway (apart from this the 8850 firmware should be ok, but that's just what I've heard). C ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Help needed to find out where the problem is
Hi all, On Thursday 26 November 2009 17:38:42 Cindy Swearingen wrote: Did anything about this configuration change before the checksum errors occurred? No, This machine is running in this configuration for a couple of weeks now The errors on c1t6d0 are severe enough that your spare kicked in. Yes and overnight more spare would have kicked in if available: s13:~# zpool status pool: atlashome state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: resilver completed after 5h46m with 0 errors on Thu Nov 26 15:55:22 2009 config: NAME STATE READ WRITE CKSUM atlashome DEGRADED 0 0 0 raidz1 ONLINE 0 0 0 c0t0d0ONLINE 0 0 0 c1t0d0ONLINE 0 0 0 c5t0d0ONLINE 0 0 0 c7t0d0ONLINE 0 0 0 c8t0d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t1d0ONLINE 0 0 0 c1t1d0ONLINE 0 0 0 c5t1d0ONLINE 0 0 1 c6t1d0ONLINE 0 0 6 c7t1d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c8t1d0ONLINE 0 0 0 c0t2d0ONLINE 0 0 0 c1t2d0ONLINE 0 0 0 c5t2d0ONLINE 0 0 3 c6t2d0ONLINE 0 0 1 raidz1 ONLINE 0 0 0 c7t2d0ONLINE 0 0 0 c8t2d0ONLINE 0 0 1 c0t3d0ONLINE 0 0 0 c1t3d0ONLINE 0 0 0 c5t3d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c6t3d0ONLINE 0 0 0 c7t3d0ONLINE 0 0 0 c8t3d0ONLINE 0 0 0 c0t4d0ONLINE 0 0 0 c1t4d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c5t4d0ONLINE 0 0 0 c7t4d0ONLINE 0 0 0 c8t4d0ONLINE 0 0 0 c0t5d0ONLINE 0 0 1 c1t5d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c5t5d0ONLINE 0 0 0 c6t5d0ONLINE 0 0 0 c7t5d0ONLINE 0 0 0 c8t5d0ONLINE 0 0 1 c0t6d0ONLINE 0 0 0 raidz1 DEGRADED 0 0 0 spare DEGRADED 0 0 0 c1t6d0 DEGRADED 6 017 too many errors c8t7d0 ONLINE 0 0 0 130G resilvered c5t6d0ONLINE 0 0 0 c6t6d0DEGRADED 0 041 too many errors c7t6d0DEGRADED 1 014 too many errors c8t6d0ONLINE 0 0 1 raidz1 ONLINE 0 0 0 c0t7d0ONLINE 0 0 0 c1t7d0ONLINE 0 0 1 c5t7d0ONLINE 0 0 0 c6t7d0ONLINE 0 0 0 c7t7d0ONLINE 0 0 0 logs c6t4d0 ONLINE 0 0 0 spares c8t7d0 INUSE currently in use errors: No known data errors You can use the fmdump -eV command to review the disk errors that FMA has detected. This command can generate a lot of output but you can see if the checksum errors on the disks are transient or if they occur repeatedly. Hmm, The output does not seem to stop. After about 1.3 GB of file size I stopped it. There seem to be a few different types here: Nov 04 2009 15:54:08.039456458 ereport.fs.zfs.checksum nvlist version: 0 class = ereport.fs.zfs.checksum ena = 0x403c56a7d4a1 detector = (embedded nvlist) nvlist version: 0 version = 0x0 scheme = zfs pool = 0xea7c0de1586275c7 vdev = 0xfca535aa8bbc70d1 (end detector) pool = atlashome pool_guid = 0xea7c0de1586275c7 pool_context = 0
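Since the verbose fmdump output is unmanageable at that size, one could at least count which ereport classes are piling up; a sketch (the grep pattern uses the field name visible in the excerpt above):

  # count FMA error events per class
  fmdump -eV | grep 'class =' | sort | uniq -c | sort -n
  # the one-line-per-event summary is much smaller than -eV
  fmdump -e | tail -50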
Re: [zfs-discuss] Help needed to find out where the problem is
Hi Bob

On Friday 27 November 2009 17:19:22 Bob Friesenhahn wrote:
> It is interesting that in addition to being in the same vdev, the disks encountering serious problems are all target 6. Besides something at the zfs level, there could be some issue at the device driver, or underlying hardware level. Or maybe just bad luck. As I recall, Albert Chin-A-Young posted about a pool failure where many devices in the same raidz2 vdev spontaneously failed somehow (in his case the whole pool was lost). He is using different hardware but this looks somewhat similar.

It looks quite similar to this one:

http://www.mail-archive.com/storage-disc...@opensolaris.org/msg06125.html

We swapped the drive, the resilvering is almost through, and the vdev is showing a large number of errors:

        raidz1            DEGRADED     0     0     1
          spare           DEGRADED     0     0 8.81M
            replacing     DEGRADED     0     0     0
              c1t6d0s0/o  FAULTED      6     0    17  corrupted data
              c1t6d0      ONLINE       0     0     0  120G resilvered
            c8t7d0        ONLINE       0     0     0  120G resilvered
          c5t6d0          ONLINE       0     0     0
          c6t6d0          DEGRADED     0     0    41  too many errors
          c7t6d0          DEGRADED     1     0    14  too many errors
          c8t6d0          ONLINE       0     0     1

If having all sixes is a problem, maybe we should try to use a diagonal approach the next time (or solve the n-queens problem on a rectangular Thumper layout)... I guess after resilvering the next step will be zpool clear and a new scrub, but I fear that will show errors again.

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
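For reference, the plan mentioned above in command form (nothing exotic, just the standard sequence):

  zpool clear atlashome
  zpool scrub atlashome
  zpool status -v atlashome     # watch whether new checksum errors show up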
Re: [zfs-discuss] Help needed to find out where the problem is
Hi Ross, On Friday 27 November 2009 21:31:52 Ross Walker wrote: I would plan downtime to physically inspect the cabling. There is not much cabling as the disks are directly connected to a large backplane (Sun Fire X4500) Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Help needed to find out where the problem is
Hi all, on a x4500 with a relatively well patched Sol10u8 # uname -a SunOS s13 5.10 Generic_141445-09 i86pc i386 i86pc I've started a scrub after about 2 weeks of operation and have a lot of checksum errors: s13:~# zpool status pool: atlashome state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: resilver in progress for 1h17m, 8.96% done, 13h5m to go config: NAME STATE READ WRITE CKSUM atlashome DEGRADED 0 0 0 raidz1 ONLINE 0 0 0 c0t0d0ONLINE 0 0 0 c1t0d0ONLINE 0 0 0 c5t0d0ONLINE 0 0 0 c7t0d0ONLINE 0 0 0 c8t0d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c0t1d0ONLINE 0 0 0 c1t1d0ONLINE 0 0 0 c5t1d0ONLINE 0 0 0 c6t1d0ONLINE 0 0 6 c7t1d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c8t1d0ONLINE 0 0 0 c0t2d0ONLINE 0 0 0 c1t2d0ONLINE 0 0 0 c5t2d0ONLINE 0 0 2 c6t2d0ONLINE 0 0 1 raidz1 ONLINE 0 0 0 c7t2d0ONLINE 0 0 0 c8t2d0ONLINE 0 0 0 c0t3d0ONLINE 0 0 0 c1t3d0ONLINE 0 0 0 c5t3d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c6t3d0ONLINE 0 0 0 c7t3d0ONLINE 0 0 0 c8t3d0ONLINE 0 0 0 c0t4d0ONLINE 0 0 0 c1t4d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c5t4d0ONLINE 0 0 0 c7t4d0ONLINE 0 0 0 c8t4d0ONLINE 0 0 0 c0t5d0ONLINE 0 0 1 c1t5d0ONLINE 0 0 0 raidz1 ONLINE 0 0 0 c5t5d0ONLINE 0 0 0 c6t5d0ONLINE 0 0 0 c7t5d0ONLINE 0 0 0 c8t5d0ONLINE 0 0 1 c0t6d0ONLINE 0 0 0 raidz1 DEGRADED 0 0 0 spare DEGRADED 0 0 0 c1t6d0 DEGRADED 6 017 too many errors c8t7d0 ONLINE 0 0 0 11.8G resilvered c5t6d0ONLINE 0 0 0 c6t6d0ONLINE 0 0 0 c7t6d0ONLINE 0 0 1 c8t6d0ONLINE 0 0 1 raidz1 ONLINE 0 0 0 c0t7d0ONLINE 0 0 0 c1t7d0ONLINE 0 0 1 c5t7d0ONLINE 0 0 0 c6t7d0ONLINE 0 0 0 c7t7d0ONLINE 0 0 0 logs c6t4d0 ONLINE 0 0 0 spares c8t7d0 INUSE currently in use So far, it seems that the pool survived it, but I'm a bit worried how to trace down the problem of this. Any suggestion how to proceed? Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] how to discover disks?
Hi

Hua-Ying Ling wrote:
> How do I discover the disk name to use for zfs commands such as c3d0s0? I tried using the format command but it only gave me the first 4 letters: c3d1. Also, why do some commands accept only 4-letter disk names and others require 6 letters?

Usually I find cfgadm -a helpful enough for that (maybe adding '| grep disk' to it).

Why sometimes 4 and sometimes 6 characters:

c3d1   - this would be disk #1 on controller #3
c3d0s0 - this would be slice #0 (partition) on disk #0 on controller #3

Usually there is also a t0 (target) in there, e.g.:

cfgadm -a | grep disk | head
sata0/0::dsk/c0t0d0  disk  connected  configured  ok
sata0/1::dsk/c0t1d0  disk  connected  configured  ok
sata0/2::dsk/c0t2d0  disk  connected  configured  ok
sata0/3::dsk/c0t3d0  disk  connected  configured  ok
sata0/4::dsk/c0t4d0  disk  connected  configured  ok
sata0/5::dsk/c0t5d0  disk  connected  configured  ok
sata0/6::dsk/c0t6d0  disk  connected  configured  ok
sata0/7::dsk/c0t7d0  disk  connected  configured  ok
sata1/0::dsk/c1t0d0  disk  connected  configured  ok
sata1/1::dsk/c1t1d0  disk  connected  configured  ok

HTH

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import: Cannot mount,
Hi

A small addendum: it seems that all sub-filesystems below /atlashome/BACKUP are already mounted when the system tries to mount /atlashome/BACKUP itself:

# zfs get all atlashome/BACKUP | head -15
NAME              PROPERTY       VALUE                  SOURCE
atlashome/BACKUP  type           filesystem             -
atlashome/BACKUP  creation       Thu Oct  9 16:30 2008  -
atlashome/BACKUP  used           9.95T                  -
atlashome/BACKUP  available      1.78T                  -
atlashome/BACKUP  referenced     172K                   -
atlashome/BACKUP  compressratio  1.47x                  -
atlashome/BACKUP  mounted        no                     -
atlashome/BACKUP  quota          none                   default
atlashome/BACKUP  reservation    none                   default
atlashome/BACKUP  recordsize     32K                    inherited from atlashome
atlashome/BACKUP  mountpoint     /atlashome/BACKUP      default
atlashome/BACKUP  sharenfs       on                     inherited from atlashome
atlashome/BACKUP  checksum       on                     default
atlashome/BACKUP  compression    on                     local

while

# ls -l /atlashome/BACKUP | wc -l
33

Is there any way to force zpool import to re-order that? I could delete all the stuff under BACKUP, but given its size I don't really want to.

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import: Cannot mount,
Hi

Mark J Musante wrote:
> Do a zpool export first, and then check to see what's in /atlashome. My bet is that the BACKUP directory is still there. If so, do an rmdir on /atlashome/BACKUP and then try the import again.

Sorry, I meant to copy this earlier:

s11 console login: root
Password:
Last login: Mon Jun 29 10:37:47 on console
Sun Microsystems Inc.   SunOS 5.10      Generic January 2005
s11:~# zpool export atlashome
s11:~# ls -l /atlashome
/atlashome: No such file or directory
s11:~# zpool import atlashome
cannot mount '/atlashome/BACKUP': directory is not empty
s11:~# ls -l /atlashome/BACKUP/ | wc -l
33
s11:~#

Thus you see that zpool import probably does the wrong thing(TM) (or mounts in the wrong order). Any idea?

Cheers

Carsten

PS: I opened a case for this, but am still waiting for the call back. Once the problem is solved, I can post the case ID for further reference.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import: Cannot mount,
Hi Mark,

Mark J Musante wrote:
> OK, looks like you're running into CR 6827199. There's a workaround for that as well. After the zpool import, manually zfs umount all the datasets under /atlashome/BACKUP. Once you've done that, the BACKUP directory will still be there. Manually mount the dataset that corresponds to /atlashome/BACKUP, and then try 'zfs mount -a'.

I did that (I needed to rmdir the directories under BACKUP) and then it finally worked - and best of all, even after a reboot it was able to mount all file systems again. Great, and a lot of thanks!

One question: where can I find more about CR 6827199? I logged into sun.com with my service-contract-enabled log-in but I cannot find it there (or the search function does not like me too much).

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
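For the archives, the workaround as a condensed sketch (dataset names as in this thread; 'grep /BACKUP/' simply selects the child datasets, the loop is illustrative):

  zpool import atlashome                    # children mount, BACKUP itself does not
  for fs in $(zfs list -H -o name -r atlashome/BACKUP | grep '/BACKUP/' | sort -r); do
      zfs umount $fs                        # deepest datasets first
  done
  rmdir /atlashome/BACKUP/*                 # remove the stray empty directories
  zfs mount atlashome/BACKUP
  zfs mount -a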
Re: [zfs-discuss] How do I mirror zfs rpool, x4500?
Hi Tim, Tim wrote: How does any of that affect an x4500 with onboard controllers that can't ever be moved? Well, consider one box being installed from CD (external USB-CD) and another one which is jumpstarted via the network. The results usually are two different boot device names :( Q: Is there an easy way to reset this without breaking everything? Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] seeking in ZFS when data is compressed
Hi all,

I was just reading http://blogs.sun.com/dap/entry/zfs_compression and would like to know what people's experience is with enabling compression in ZFS. In principle I don't think it's a bad thing, especially when the CPUs are fast enough that it improves performance because the hard drives would otherwise be too slow. However, I'm missing two aspects:

o What happens when a user opens a file and does a lot of seeking inside it? For example, our scientists use a data format where quite compressible data is contained in stretches, and the file header contains a dictionary of where each stretch of data starts. If these files are compressed on disk, what will happen with ZFS? Will it just make educated guesses, or does it have to read all of the typically 30-150 MB of the file and then do the seeking from buffer caches?

o Another problem I see (but which probably isn't one): a user is accessing a file via an NFS-exported ZFS, appending a line of text, closing the file (and hopefully also flushing everything correctly). Then the user opens it again, appends another line of text, ... Imagine this happening a few times per second. How will ZFS react to this pattern? Will it only open the final record of the file, uncompress it, add data, recompress it, flush it to disk and report that back to the user's processes? Is there a potential problem here?

Cheers (and sorry if these questions are stupid ones)

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] seeking in ZFS when data is compressed
Hi Richard, Richard Elling wrote: Files are not compressed in ZFS. Blocks are compressed. Sorry, yes, I was not specific enough. If the compression of the blocks cannot gain more than 12.5% space savings, then the block will not be compressed. If your file contains compressable parts and uncompressable parts, then (depending on the size/blocks) it may be partially compressed. I guess the block size is related (or equal) to the record size set for this file system, right? What will happen then if I have a file which contains a header which fits into 1 or 2 blocks, and is followed by stretches of data which are say 500kB each (for simplicity) which could be visualized as sitting in a rectangle with M rows and N columns. Since the file system has no way of knowing details on the file, it will cut the file into blocks and store it compressed or uncompressed as you have written. However, what happens if the typical usage pattern is read only columns of the rectangle, i.e. read the header, seek to the start of stretch #1, then seeking to stretch #N+1, ... Can ZFS make educated guesses where the seek targets might be or will it read the file block by block until it reaches the target position, in the latter case it might be quite inefficient if the file is huge and has a large variance in compressibility. The file will be cached in RAM. When the file is closed and synced, the data will be written to the ZIL and ultimately to the data set. I don't think there is a fundamental problem here... you should notice the NFS sync behaviour whether the backing store is ZFS or some other file system. Using a slog or nonvolatile write cache will help performance for such workloads. Thanks, that's answer I was hoping for :) They are good questions :-) Good :) Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
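For completeness, the knobs being discussed can be inspected per dataset; a small sketch (the dataset name is a placeholder):

  zfs get recordsize,compression,compressratio tank/data
  # note: compression only applies to blocks written after it was enabled
  zfs set compression=on tank/data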
Re: [zfs-discuss] seeking in ZFS when data is compressed
Darren, Richard,

thanks a lot for the very good answers. Regarding the seeking, I was probably misled by the belief that the block size was like an impenetrable block into which as much data as possible is squeezed (like .Z files would be if you first compressed and then cut the data into blocks).

Thanks a lot!

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS: unreliable for professional usage?
Hi,

I've followed this thread a bit and I think there are some correct points on every side of the discussion, but here I see a misconception (at least I think it is one):

D. Eckert wrote:
> (..) Dave made a mistake pulling out the drives without exporting them first. For sure UFS/XFS/EXT4/... doesn't like that kind of operation either, but only with ZFS do you risk losing ALL your data. That's the point! (...) I did that many times after performing the umount cmd with ufs/reiserfs filesystems on USB external drives. And they never complained or got corrupted.

Think of ZFS as an entity which cannot live without the underlying zpool. You can have reiserfs, jfs, ext?, xfs - you name it - on any logical device; it will only live on this one device, and when you umount it, it's safe to power it off, yank the disk out, whatever, since there is no other layer between the file system and the logical disk partition/slice/... However, as soon as you add another layer (say RAID, which in this analogy is roughly the zpool) you might also lose data: take a RAID0 setup, umount reiserfs/ufs/whatever, take a disc out of the RAID and destroy it or change a few sectors on it. When you then mount the file system again, it's utterly broken and lost. Or - which might be worse - you might end up with silent data corruption you will never notice unless you try to open the data block which is damaged.

However, in your case you have a checksum error in the file system on a single hard disk which might have been caused by some accident. ZFS is good in the respect that it can tell you that something is broken, but without a mirror or parity device it won't be able to fix the data out of thin air.

I cannot claim to fully understand what happened to your devices, so please take my written stuff with a grain of salt.

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Introducing zilstat
Hi Richard, Richard Elling schrieb: Yes. I've got a few more columns in mind, too. Does anyone still use a VT100? :-) Only when using ILOM ;) (anyone using 72 char/line MUA, sorry to them, the following lines are longer): Thanks for the great tool, it showed something very interesting yesterday: s06: TIME N-MBytes N-MBytes/s N-Max-Rate B-MBytes B-MBytes/s B-Max-Rate s06: 2009 Feb 4 14:37:11 5 0 0 10 0 1 s06: 2009 Feb 4 14:37:26 6 0 1 12 0 1 s06: 2009 Feb 4 14:37:41 4 0 0 10 0 1 s06: 2009 Feb 4 14:37:56 5 0 1 11 0 1 s06: 2009 Feb 4 14:38:11 6 0 1 11 0 2 s06: 2009 Feb 4 14:38:26 7 0 1 13 0 2 s06: 2009 Feb 4 14:38:41 10 0 2 17 1 3 s06: 2009 Feb 4 14:38:56 4 0 0 9 0 1 s06: 2009 Feb 4 14:39:11 5 0 1 11 0 1 s06: 2009 Feb 4 14:39:26 7 0 0 13 0 1 s06: 2009 Feb 4 14:39:41 7 0 2 13 0 3 s06: 2009 Feb 4 14:39:56 6 0 1 11 0 2 s06: 2009 Feb 4 14:40:11 6 0 1 12 0 1 s06: 2009 Feb 4 14:40:26 6 0 0 13 0 1 s06: 2009 Feb 4 14:40:41 5 0 0 10 0 1 s06: 2009 Feb 4 14:40:56 6 0 1 12 0 1 s06: 2009 Feb 4 14:41:11 4 0 0 9 0 1 [..] so far, the box was almost idle, a little bit later: s06: 2009 Feb 4 14:53:41 2 0 0 5 0 0 s06: 2009 Feb 4 14:53:56 1 0 0 3 0 0 s06: 2009 Feb 4 14:54:11 1 0 0 4 0 0 s06: 2009 Feb 4 14:54:26 1 0 0 3 0 0 s06: 2009 Feb 4 14:54:41 2 0 0 5 0 0 s06: 2009 Feb 4 14:54:56604 40171702 46198 s06: 2009 Feb 4 14:55:11816 54130939 62154 s06: 2009 Feb 4 14:55:26 2 0 0 4 0 0 s06: 2009 Feb 4 14:55:41 2 0 0 4 0 0 s06: 2009 Feb 4 14:55:56 1 0 0 3 0 0 s06: 2009 Feb 4 14:56:11 3 0 0 6 0 1 s06: 2009 Feb 4 14:56:26 1 0 0 3 0 0 [...] s06: 2009 Feb 4 16:13:11 1 0 0 3 0 0 s06: 2009 Feb 4 16:13:26 2 0 0 5 0 0 s06: 2009 Feb 4 16:13:41389 25 97477 31119 s06: 2009 Feb 4 16:13:56505 33193599 39218 s06: 2009 Feb 4 16:14:11 2 0 0 4 0 0 s06: 2009 Feb 4 16:14:26 3 0 0 5 0 1 s06: 2009 Feb 4 16:14:41 1 0 0 3 0 0 s06: 2009 Feb 4 16:14:56 2 0 0 6 0 1 s06: 2009 Feb 4 16:15:11 4 0 2 10 0 4 s06: 2009 Feb 4 16:15:26 0 0 0 1 0 0 s06: 2009 Feb 4 16:15:41128 8 94168 11123 s06: 2009 Feb 4 16:15:56 1081 72212 1305 87279 s06: 2009 Feb 4 16:16:11262 17 99317 21122 s06: 2009 Feb 4 16:16:26 0 0 0 0 0 0 just showing a few bursts... Given that this is the output of 'zilstat.ksh -M -t 15' I guess we should really look into a fast device for it, right? Do you have any hint, which numbers are reasonable on a X4500 and which are approaching serious problems? Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Expert hint for replacing 3.5 SATA drive in X4500 with SSD for ZIL
Hi all,

we would like to replace one of the 3.5 inch SATA drives in our Thumpers with an SSD (and put the ZIL on this device). We are currently looking into this in a bit more detail and would like to ask for input from people who already have experience with single- vs. multi-level cell SSDs, read- and write-optimized devices (if these really exist) and so on.

If possible I would like this discussion to take place on-list, but if people want to suggest brand names/model numbers I'll be happy to accept them off-list as well.

Thanks a lot in advance

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
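For context, the operation we are aiming at is the usual slog attach; a minimal sketch (the device name is a placeholder for wherever the SSD ends up):

  # add the SSD as a separate intent log device to the existing pool
  zpool add atlashome log c4t7d0
  zpool status atlashome            # it should now show up under "logs"

Keep in mind that older pool versions do not allow removing a log device once it has been added, so trying this on a scratch pool first seems prudent.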
Re: [zfs-discuss] Expert hint for replacing 3.5 SATA drive in X4500 with SSD for ZIL
Just a brief addendum: something like this (or a fully DRAM-based device, if available in a 3.5 inch form factor) might also be interesting to test: http://www.platinumhdd.com/ - any thoughts?

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] checksum errors on Sun Fire X4500
Hi Jay, Jay Anderson schrieb: I have b105 running on a Sun Fire X4500, and I am constantly seeing checksum errors reported by zpool status. The errors are showing up over time on every disk in the pool. In normal operation there might be errors on two or three disks each day, and sometimes there are enough errors so it reports too many errors, and the disk goes into a degraded state. I have had to remove the spares from the pool because otherwise the spares get pulled into the pool to replace the drives. There are no reported hardware problems with any of the drives. I have run scrub multiple times, and this also generates checksum errors. After the scrub completes the checksums continue to occur during normal operation. This problem also occurred with b103. Before that Solaris 10u4 was installed on the server, and it never had any checksum errors. With the OpenSolaris builds I am running CIFS Server, and that's the only difference in server function from when Solaris 10u4 was installed on it. Is this a known issue? Any suggestions or workarounds? We had something similar two or three disk slots which started to act weird and failed quite often - usually starting with a high error rate. After exchanging two hard drives, the Sun hotline initiated to exchange the backplane - essentially the chassis was replaced. Since then, we have not encountered anything like this anymore. So it *might* be the backplane or a broken Marvell controller, but it's hard to judge. HTH Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hung when import zpool
Hi

Qin Ming Hua wrote:
> bash-3.00# zpool import mypool
> ^C^C
> it hung when I tried to re-import the zpool, has anyone seen this before?

How long did you wait? Once a zpool import here took 1-2 hours to complete (it was seemingly stuck at a ~30 GB filesystem on which it needed to do some work).

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Benchmarking ZFS via NFS
Hi all, among many other things I recently restarted benchmarking ZFS over NFS3 performance between X4500 (host) and Linux clients. I've just iozone quite a while ago and am still a bit at a loss understanding the results. The automatic mode is pretty ok (and generates nice 3D plots for the people higher up the ladder), but someone gave a hint to use multiple threads for testing the ops/s and here I'm a bit at a loss how to understand the results and if the values are reasonable or not. Here is the current example - can anyone with deeper knowledge tell me if these are reasonable values to start with? Thanks a lot Carsten Iozone: Performance Test of File I/O Version $Revision: 3.315 $ Compiled for 64 bit mode. Build: linux-AMD64 Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins Al Slater, Scott Rhine, Mike Wisner, Ken Goss Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner, Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root. Run began: Wed Jan 7 09:31:49 2009 Multi_buffer. Work area 16777216 bytes OPS Mode. Output is in operations per second. Record Size 8 KB SYNC Mode. File size set to 4194304 KB Command line used: ../iozone3_315/src/current/iozone -m -t 8 -T -O -r 8k -o -s 4G iozone Time Resolution = 0.01 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. Throughput test with 8 threads Each thread writes a 4194304 Kbyte file in 8 Kbyte records Children see throughput for 8 initial writers =4925.20 ops/sec Parent sees throughput for 8 initial writers =4924.65 ops/sec Min throughput per thread = 615.61 ops/sec Max throughput per thread = 615.69 ops/sec Avg throughput per thread = 615.65 ops/sec Min xfer= 524219.00 ops Children see throughput for 8 rewriters=4208.45 ops/sec Parent sees throughput for 8 rewriters =4208.42 ops/sec Min throughput per thread = 525.88 ops/sec Max throughput per thread = 526.22 ops/sec Avg throughput per thread = 526.06 ops/sec Min xfer= 523944.00 ops Children see throughput for 8 readers = 11986.99 ops/sec Parent sees throughput for 8 readers = 11986.46 ops/sec Min throughput per thread =1481.13 ops/sec Max throughput per thread =1512.71 ops/sec Avg throughput per thread =1498.37 ops/sec Min xfer= 513361.00 ops Children see throughput for 8 re-readers= 12017.70 ops/sec Parent sees throughput for 8 re-readers = 12017.22 ops/sec Min throughput per thread =1486.72 ops/sec Max throughput per thread =1520.35 ops/sec Avg throughput per thread =1502.21 ops/sec Min xfer= 512761.00 ops Children see throughput for 8 reverse readers = 25741.62 ops/sec Parent sees throughput for 8 reverse readers= 25735.91 ops/sec Min throughput per thread =3141.50 ops/sec Max throughput per thread =3282.11 ops/sec Avg throughput per thread =3217.70 ops/sec Min xfer= 501956.00 ops Children see throughput for 8 stride readers=1434.73 ops/sec Parent sees throughput for 8 stride readers =1434.71 ops/sec Min throughput per thread = 122.51 ops/sec Max throughput per thread = 297.87 ops/sec Avg throughput per thread = 179.34 ops/sec Min xfer= 215638.00 ops Children see throughput for 8 random readers= 529.83 ops/sec Parent sees throughput for 8 random readers = 529.83 ops/sec Min throughput per thread = 55.63 ops/sec Max throughput per thread = 101.03 ops/sec Avg throughput per thread =
Re: [zfs-discuss] Benchmarking ZFS via NFS
Hi Bob. Bob Friesenhahn wrote: Here is the current example - can anyone with deeper knowledge tell me if these are reasonable values to start with? Everything depends on what you are planning do with your NFS access. For example, the default blocksize for zfs is 128K. My example tests performance when doing I/O with small 8K blocks (like a database), which will severely penalize zfs configured for 128K blocks. [...] My plans don't count in here, I need to optimize what the users want and they don't have a clue what they will do in 6 months from now, so I guess all detailed planning will fail anyway and I'm just searching for the one size fits almost all... My experience with iozone is that it refuses to run on an NFS client of a Solaris server using ZFS since it performs a test and then refuses to work since it says that the filesystem is not implemented correctly. Commenting a line of code in iozone will get over this hurdle. This seems to be a religious issue with the iozone maintainer. Interesting, I've been running this on a Linux client accessing a ZFS file system from one of our Thumpers without any source modifications and problems. Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 'zfs recv' is very slow
Hi,

Brent Jones wrote:
> Using mbuffer can speed it up dramatically, but this seems like a hack without addressing a real problem with zfs send/recv. Trying to send any meaningfully sized snapshots from, say, an X4540 takes up to 24 hours, for as little as a 300GB change rate.

I have not found a solution yet either. But it seems to depend highly on the distribution of file sizes, the number of files per directory, or whatever. The last tests I made still showed more than 50 hours for 700 GB and ~45 hours for 5 TB (both were null tests where zfs send wrote to /dev/null).

Cheers from a still puzzled

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
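For reference, one common way of wiring mbuffer into send/receive (host name, port and sizes are placeholders; the receiving side has to be started first):

  # on the receiving host
  mbuffer -I 9090 -s 128k -m 1G | zfs receive -F tank/backup
  # on the sending host
  zfs send tank/data@snap | mbuffer -O receiver:9090 -s 128k -m 1G

The large memory buffer keeps the sender streaming while the receiver works through bursty, metadata-heavy stretches; it does not address the underlying send/recv slowness, as Brent says.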
Re: [zfs-discuss] ZFS send fails incremental snapshot
Hi Brent, Brent Jones wrote: I am using 2008.11 with the Timeslider automatic snapshots, and using it to automatically send snapshots to a remote host every 15 minutes. Both sides are X4540's, with the remote filesystem mounted read-only as I read earlier that would cause problems. The snapshots send fine for several days, I accumulate many snapshots at regular intervals, and they are sent without any problems. Then I will get the dreaded: cannot receive incremental stream: most recent snapshot of pdxfilu02 does not match incremental source Which command line are you using? Maybe you need to do a rollback first (zfs receive -F)? Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
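A sketch of what the rollback variant would look like for an incremental stream (snapshot and host names are placeholders, not taken from Brent's setup):

  zfs send -i tank/fs@2009-01-01 tank/fs@2009-01-02 | \
      ssh pdxfilu02 zfs receive -F tank/fs

The -F makes the receiving side roll back to its most recent snapshot before applying the increment, which helps when the read-only filesystem has been modified in the meantime, e.g. by atime updates.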
Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?
Hi Marc (and all the others),

Marc Bevand wrote:
> So Carsten: Mattias is right, you did not simulate a silent data corruption error. hdparm --make-bad-sector just introduces a regular media error that *any* RAID level can detect and fix.

OK, I'll need to go back to our tests performed months ago, but my feeling now is that we didn't do it right in the first place. It will take some time to retest that.

Cheers

Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?
Hi Marc, Marc Bevand wrote: Carsten Aulbert carsten.aulbert at aei.mpg.de writes: In RAID6 you have redundant parity, thus the controller can find out if the parity was correct or not. At least I think that to be true for Areca controllers :) Are you sure about that ? The latest research I know of [1] says that although an algorithm does exist to theoretically recover from single-disk corruption in the case of RAID-6, it is *not* possible to detect dual-disk corruption with 100% certainty. And blindly running the said algorithm in such a case would even introduce corruption on a third disk. Well, I probably need to wade through the paper (and recall Galois field theory) before answering this. We did a few tests in a 16 disk RAID6 where we wrote data to the RAID, powered the system down, pulled out one disk, inserted it into another computer and changed the sector checksum of a few sectors (using hdparm's utility makebadsector). The we reinserted this into the original box, powered it up and ran a volume check and the controller did indeed find the corrupted sector and repaired the correct one without destroying data on another disk (as far as we know and tested). For the other point: dual-disk corruption can (to my understanding) never be healed by the controller since there is no redundant information available to check against. I don't recall if we performed some tests on that part as well, but maybe we should do that to learn how the controller will behave. As a matter of fact at that point it should just start crying out loud and tell me, that it cannot recover for that. But the chance of this happening should be relatively small unless the backplane/controller had a bad hiccup when writing that stripe. This is the reason why, AFAIK, no RAID-6 implementation actually attempts to recover from single-disk corruption (someone correct me if I am wrong). As I said I know that our Areca 1261ML does detect and correct those errors - if these are single-disk corruptions The exception is ZFS of course, but it accomplishes single and dual-disk corruption self-healing by using its own checksum, which is one layer above RAID-6 (therefore unrelated to it). Yes, very helpful and definitely desirable to have :) [1] http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf Thanks for the pointer Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?
Hi all, Bob Friesenhahn wrote: My understanding is that ordinary HW raid does not check data correctness. If the hardware reports failure to successfully read a block, then a simple algorithm is used to (hopefully) re-create the lost data based on data from other disks. The difference here is that ZFS does check the data correctness (at the CPU) for each read while HW raid depends on the hardware detecting a problem, and even if the data is ok when read from disk, it may be corrupted by the time it makes it to the CPU. AFAIK this is not done during the normal operation (unless a disk asked for a sector cannot get this sector). ZFS's scrub algorithm forces all of the written data to be read, with validation against the stored checksum. If a problem is found, then an attempt to correct is made from redundant storage using traditional RAID methods. That's exactly what volume checking for standard HW controllers does as well. Read all data and compare it with parity. This is exactly the point why RAID6 should always be chosen over RAID5, because in the event of a wrong parity check and RAID5 the controller can only say, oops, I have found a problem but cannot correct it - since it does not know if the parity is correct or any of the n data bits. In RAID6 you have redundant parity, thus the controller can find out if the parity was correct or not. At least I think that to be true for Areca controllers :) Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?
Hi Bob, Bob Friesenhahn wrote: AFAIK this is not done during the normal operation (unless a disk asked for a sector cannot get this sector). ZFS checksum validates all returned data. Are you saying that this fact is incorrect? No sorry, too long in front of a computer today I guess: I was referring to hardware RAID controllers, AFAIK these usually do not check the validity of data unless a disc returns an error. My knowledge regarding ZFS is exactly that, that data is checked in the CPU against the stored checksum. That's exactly what volume checking for standard HW controllers does as well. Read all data and compare it with parity. What if the data was corrupted prior to parity generation? Well, that is bad luck, same is true if your ZFS box has faulty memory and the computed checksum is right for the data on disk, but wrong in the sense of the file under consideration. Sorry for the confusion Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SMART data
Mam Ruoc wrote: Carsten wrote: I will ask my boss about this (since he is the one mentioned in the copyright line of smartctl ;)), please stay tuned. How is this going? I'm very interested too... Not much happening right now, December meetings, holiday season, ... But thanks for pinging me - I tend to forget such things. Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SMART data
Hi all, Miles Nordin wrote: rl == Rob Logan [EMAIL PROTECTED] writes: rl the sata framework uses the sd driver so its: yes but this is a really tiny and basically useless amount of output compared to what smartctl gives on Linux with SATA disks, where SATA disks also use the sd driver (the same driver Linux uses for SCSI disks). In particular, the reallocated sector count and raw read error rates are missing, as is the very useful offline self test interface and the sometimes useful last-5-errors log. I will ask my boss about this (since he is the one mentioned in the copyright line of smartctl ;)), please stay tuned. Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Asymmetric zpool load
Ross wrote: Aha, found it! It was this thread, also started by Carsten :) http://www.opensolaris.org/jive/thread.jspa?threadID=78921tstart=45 Did I? Darn, I need to get a brain upgrade. But yes, there it was mainly focused on zfs send/receive being slow - but maybe these are also linked. What I will try today/this week: Put some stress on the system with bonnie and other tools and try to find slow disks and see if this could be the main problem but also look into more vdevs and then possible move to raidz to somehow compensate for lost disk space. Since we have 4 cold spares on the shelf plus a SMS warnings on disk failures (that is if fma catches them) the risk involved should be tolerable. More later. Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Asymmetric zpool load
Carsten Aulbert wrote: Put some stress on the system with bonnie and other tools and try to find slow disks and see if this could be the main problem but also look into more vdevs and then possible move to raidz to somehow compensate for lost disk space. Since we have 4 cold spares on the shelf plus a SMS warnings on disk failures (that is if fma catches them) the risk involved should be tolerable. First result with bonnie during the writing intelligently... phase I see this in a 2 minute average: zpool iostats: capacity operationsbandwidth pool used avail read write read write -- - - - - - - atlashome 1.70T 19.2T225 1.49K 342K 107M raidz2 550G 6.28T 74409 114K 32.6M c0t0d0 - - 0314 32.3K 2.51M c1t0d0 - - 0315 31.8K 2.52M c4t0d0 - - 0313 31.3K 2.52M c6t0d0 - - 0315 32.3K 2.51M c7t0d0 - - 0326 32.8K 2.50M c0t1d0 - - 0309 33.9K 2.52M c1t1d0 - - 0313 33.4K 2.51M c4t1d0 - - 0314 33.4K 2.52M c5t1d0 - - 0308 32.8K 2.52M c6t1d0 - - 0314 31.3K 2.51M c7t1d0 - - 0311 31.8K 2.52M c0t2d0 - - 0309 31.8K 2.52M c1t2d0 - - 0313 31.8K 2.51M c4t2d0 - - 0315 31.8K 2.52M c5t2d0 - - 0307 32.8K 2.52M raidz2 567G 6.26T 64529 96.5K 36.3M c6t2d0 - - 1368 74.2K 2.79M c7t2d0 - - 1366 74.2K 2.80M c0t3d0 - - 1364 75.8K 2.80M c1t3d0 - - 1365 75.2K 2.80M c4t3d0 - - 1368 76.8K 2.80M c5t3d0 - - 1362 76.3K 2.80M c6t3d0 - - 1366 77.9K 2.80M c7t3d0 - - 1365 76.8K 2.80M c0t4d0 - - 1361 76.8K 2.80M c1t4d0 - - 1363 75.8K 2.80M c4t4d0 - - 1366 76.3K 2.80M c6t4d0 - - 1364 78.4K 2.80M c7t4d0 - - 1370 78.9K 2.79M c0t5d0 - - 1365 77.3K 2.80M c1t5d0 - - 1364 74.7K 2.80M raidz2 620G 6.64T 86582 131K 37.9M c4t5d0 - - 18382 1.16M 2.74M c5t5d0 - - 10380 674K 2.74M c6t5d0 - - 18378 1.15M 2.73M c7t5d0 - - 9384 628K 2.74M c0t6d0 - - 18377 1.16M 2.74M c1t6d0 - - 10383 680K 2.75M c4t6d0 - - 19379 1.21M 2.73M c5t6d0 - - 10383 691K 2.75M c6t6d0 - - 19379 1.21M 2.73M c7t6d0 - - 10383 676K 2.72M c0t7d0 - - 18374 1.19M 2.75M c1t7d0 - - 10381 676K 2.74M c4t7d0 - - 19380 1.22M 2.74M c5t7d0 - - 10382 696K 2.74M c6t7d0 - - 18381 1.17M 2.74M c7t7d0 - - 9386 631K 2.75M -- - - - - - - iostat -Mnx 120: extended device statistics r/sw/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device 0.00.00.00.0 0.0 0.00.00.0 0 0 c2t0d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c3t0d0 0.01.40.00.0 0.0 0.01.50.4 0 0 c5t0d0 0.6 351.50.02.6 0.4 0.11.20.2 3 8 c7t0d0 0.6 336.30.02.6 0.1 0.10.40.2 3 7 c0t0d0 0.6 340.80.02.6 0.2 0.10.60.2 3 7 c1t0d0 0.6 330.60.02.6 0.1 0.10.30.2 3 7 c5t1d0 0.6 336.70.02.6 0.1 0.10.30.2 3 7 c4t0d0 0.6 331.80.02.6 0.1 0.10.30.2 3 7 c0t1d0 0.6 339.00.02.6 0.4 0.11.10.2 3 7 c7t1d0 0.6 335.40.02.6 0.1 0.10.40.2 3 7 c1t1d0 0.6 329.20.02.6 0.1 0.10.30.2 3 7 c5t2d0 0.6 343.70.02.6 0.3 0.10.70.2 3 7 c4t1d0 0.6 331.80.02.6 0.1 0.10.30.2 2 7 c0t2d0 1.2 396.30.12.9 0.3 0.10.70.2 4 8 c7t2d0 0.6 336.70.02.6 0.1 0.10.40.2 3 7 c1t2d0 0.6 341.90.02.6 0.2 0.10.70.2 3 7 c4t2d0 1.3 390.70.12.9 0.3 0.10.80.2 4 9 c5t3d0 1.3 396.70.12.9 0.3 0.10.80.2 4 9 c7t3d0 1.3 393.60.12.9 0.2 0.10.60.2 4 9 c0t3d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c5t4d0 1.3 396.20.12.9 0.2 0.10.5
[zfs-discuss] Asymmetric zpool load
Hi all,

We are running pretty large vdevs since our initial testing showed that this setup was not too far off the optimum. Under real-world load, however, we see some rather odd behaviour. The system is an X4500 with 500 GB drives, and right now it appears to be under heavy load: e.g. ls takes minutes to return on only a few hundred entries, while top shows 10% kernel time and the rest idle.

zpool iostat -v atlashome 60 shows (not the first output):

                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
-----------   -----  -----  -----  -----  -----  -----
atlashome     2.11T  18.8T  2.29K     36  71.7M   138K
  raidz2       466G  6.36T    493     11  14.9M  34.1K
    c0t0d0        -      -     48      5  1.81M  3.52K
    c1t0d0        -      -     48      5  1.81M  3.46K
    c4t0d0        -      -     48      5  1.81M  3.27K
    c6t0d0        -      -     48      5  1.81M  3.40K
    c7t0d0        -      -     47      5  1.81M  3.40K
    c0t1d0        -      -     47      5  1.81M  3.20K
    c1t1d0        -      -     47      6  1.81M  3.59K
    c4t1d0        -      -     47      6  1.81M  3.53K
    c5t1d0        -      -     47      5  1.81M  3.33K
    c6t1d0        -      -     48      6  1.81M  3.67K
    c7t1d0        -      -     48      6  1.81M  3.66K
    c0t2d0        -      -     48      5  1.82M  3.42K
    c1t2d0        -      -     48      6  1.81M  3.56K
    c4t2d0        -      -     48      6  1.81M  3.54K
    c5t2d0        -      -     48      5  1.81M  3.41K
  raidz2       732G  6.10T    800     12  24.6M  52.3K
    c6t2d0        -      -    139      5  7.52M  4.54K
    c7t2d0        -      -    139      5  7.52M  4.81K
    c0t3d0        -      -    140      5  7.52M  4.98K
    c1t3d0        -      -    139      5  7.51M  4.47K
    c4t3d0        -      -    139      5  7.51M  4.82K
    c5t3d0        -      -    139      5  7.51M  4.99K
    c6t3d0        -      -    139      5  7.52M  4.44K
    c7t3d0        -      -    139      5  7.52M  4.78K
    c0t4d0        -      -    139      5  7.52M  4.97K
    c1t4d0        -      -    139      5  7.51M  4.60K
    c4t4d0        -      -    139      5  7.51M  4.86K
    c6t4d0        -      -    139      5  7.51M  4.99K
    c7t4d0        -      -    139      5  7.51M  4.52K
    c0t5d0        -      -    139      5  7.51M  4.78K
    c1t5d0        -      -    138      5  7.51M  4.94K
  raidz2       960G  6.31T  1.02K     12  32.2M  52.0K
    c4t5d0        -      -    178      5  9.29M  4.79K
    c5t5d0        -      -    178      5  9.28M  4.64K
    c6t5d0        -      -    179      5  9.29M  4.44K
    c7t5d0        -      -    178      4  9.26M  4.26K
    c0t6d0        -      -    178      5  9.28M  4.78K
    c1t6d0        -      -    178      5  9.20M  4.58K
    c4t6d0        -      -    178      5  9.26M  4.25K
    c5t6d0        -      -    177      4  9.21M  4.18K
    c6t6d0        -      -    178      5  9.29M  4.69K
    c7t6d0        -      -    177      5  9.26M  4.61K
    c0t7d0        -      -    177      5  9.29M  4.34K
    c1t7d0        -      -    177      5  9.24M  4.28K
    c4t7d0        -      -    177      5  9.29M  4.78K
    c5t7d0        -      -    177      5  9.27M  4.75K
    c6t7d0        -      -    177      5  9.29M  4.34K
    c7t7d0        -      -    177      5  9.27M  4.28K
-----------   -----  -----  -----  -----  -----  -----

Questions:
(a) Why does the first vdev not get an equal share of the load?
(b) Why is a large raidz2 so bad? When I use a standard Linux box with hardware raid6 over 16 disks I usually get more bandwidth and at least about the same small-file performance.
(c) Would the use of several smaller vdevs help much? And which layout would be a good compromise between space, performance and reliability? 46 disks have so few prime factors...

Thanks a lot

Carsten
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Asymmetric zpool load
Hi Miles,

Miles Nordin wrote: ca == Carsten Aulbert [EMAIL PROTECTED] writes:

ca (a) Why does the first vdev not get an equal share of the load?

I don't know. But if you don't add all the vdevs before writing anything, there's no magic to make them balance themselves out. Stuff stays where it's written. I'm guessing you did add them at the same time, and they still filled up unevenly?

Yes, they were created all in one go (even on the same command line) and only filled afterwards - either naturally over time or via zfs send/receive (all on Sol10u5). So yes, it seems they fill up unevenly. The 'zpool iostat' output that I showed is where I found how the data is spread among the vdevs.

ca (b) Why is a large raidz2 so bad? When I use a standard Linux box with hardware raid6 over 16 disks I usually get more bandwidth and at least about the same small-file performance.

Obviously there are all kinds of things going on, but... the standard answer is that traditional RAID5/6 doesn't have to do full-stripe I/O. ZFS is more like FreeBSD's RAID3: it gets around the NVRAM-less RAID5 write hole by always writing a full stripe, which means all spindles seek together and you get the seek performance of one drive (per vdev). Linux RAID5/6 just gives up and accepts a write hole, AIUI, but because the stripes are much fatter than a filesystem block, you'll sometimes get the record you need by seeking only a subset of the drives rather than all of them, which means the drives you didn't seek have the chance to fetch another record. If you're saying you get worse performance than a single spindle, I'm not sure why.

No, I think a single disk would be much less performant. However, I'm a bit disappointed by the overall performance of the boxes, and right now we have users who experience extremely slow performance.

But thanks already for the insight.

Cheers Carsten
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Asymmetric zpool load
Bob Friesenhahn wrote: You may have one or more slow disk drives which slow down the whole vdev due to long wait times. If you can identify those slow disk drives and replace them, then overall performance is likely to improve. The problem is that under severe load, the vdev with the highest backlog will be used the least; it takes only one slow disk to slow down the whole vdev.

Hmm, since I only started with Solaris this year, is there a way to identify a slow disk? In principle these should all be identical Hitachi Deathstar^WDeskstar drives, with only the usual manufacturing spread between them.

ZFS commits the writes to all involved disks in a raidz2 before proceeding with the next write. With so many disks, you are asking for quite a lot of fortuitous luck in that everything must be working optimally. Compounding the problem, I understand that when the stripe width exceeds the number of segmented blocks from the data to be written (ZFS is only willing to dice to a certain minimum size), only a subset of the disks will be used, wasting potential I/O bandwidth. Your stripes are too wide.

Ah, ok, that's one of the first reasonable explanations (which I understand) of why such large vdevs might be bad. So far I had not been able to track that down and had only found the standard magic rule not to exceed 10 drives - our (synthetic) tests had not shown any significant drawbacks. But I guess we might be bitten by it now.

(c) Would the use of several smaller vdevs help much? And which layout would be a good compromise between space, performance and reliability? 46 disks have so few prime factors...

Yes, more vdevs should definitely help quite a lot for dealing with real-world multi-user loads. One raidz/raidz2 vdev provides (at most) the IOPs of a single disk. There is a point of diminishing returns and your layout has gone far beyond this limit.

Thanks for the insight, I guess I need to experiment with empty boxes to get into a better state!

Cheers Carsten
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
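To put rough numbers on the "one raidz vdev ~ one disk" rule (back-of-envelope only, assuming ~80 random IOPs per 7200 rpm SATA drive):

  3 wide raidz2 vdevs (current layout) : ~3  x 80 = ~240 random IOPs
  8 smaller raidz vdevs                : ~8  x 80 = ~640 random IOPs
  23 mirror pairs                      : ~23 x 80 = ~1840 random IOPs (reads can do even better)

Sequential bandwidth hardly changes between these layouts, which would explain why synthetic streaming tests did not show the problem.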
[zfs-discuss] zfs snapshot stalled?
Hi all, I've just seen something weird. On a zpool which looks a bit busy right now (~100 read ops/s, 100 kB/s) I started a zfs snapshot about an hour ago. Until now, taking a snapshot usually took a few seconds at most, even for largish ~TByte file systems. I don't know whether the read I/Os are related to the snapshot itself or whether another user is causing them, since I did not look before taking the snapshot. My remaining questions after searching the web: (1) Is it common for snapshots to take this long? (2) Is there a way to stop it if one assumes something went wrong, i.e. is there a special signal I could send the process? Thanks for any hint Carsten -- Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics Callinstrasse 38, 30167 Hannover, Germany Phone/Fax: +49 511 762-17185 / -17193 http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs snapshot stalled?
Hi again, brief update: the process ended successfully (at least a snapshot was created) after close to 2 hrs. Since the load is still the same as before taking the snapshot I blame other users' processes reading from that array for the long snapshot duration. Carsten Aulbert wrote: My remaining questions after searching the web: (1) Is it common that snapshots can take this long? (2) Is there a way to stop it if one assumes somethings went wrong? I.e. is there a special signal I could send it? Thanks for any hint Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving zfs send performance
Hi, Miles Nordin wrote: r == Ross [EMAIL PROTECTED] writes: r figures so close to 10MB/s. All three servers are running r full duplex gigabit though

there is one tricky way 100Mbit/s could still bite you, but it's probably not happening to you. It mostly affects home users with unmanaged switches: http://www.smallnetbuilder.com/content/view/30212/54/ http://virtualthreads.blogspot.com/2006/02/beware-ethernet-flow-control.html because the big switch vendors all use pause frames safely: http://www.networkworld.com/netresources/0913flow2.html -- pause frames as interpreted by netgear are harmful

That rings a bell. Ross, are you using NFS via UDP or TCP? Could it be that your network shows different performance for different transport types? For our network we have disabled pause frames completely and rely only on TCP's internal mechanisms to prevent flooding/blocking.

Carsten

PS: the job with ~25k files adding up to about 800 GB is now done - zfs send took only 52 hrs and the speed was ~4.5 MB/s :(
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
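A quick way to check which transport the NFS mounts actually use on a client (the exact output format varies a bit between releases, but the proto= field is what matters):

  nfsstat -m
  # look for proto=tcp or proto=udp in the options printed for each mount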
Re: [zfs-discuss] Improving zfs send performance
Hi Scott,

Scott Williamson wrote: You seem to be using dd for write testing. In my testing I noted that there was a large difference in write speed between using dd to write from /dev/zero and using other files. Writing from /dev/zero always seemed to be fast, reaching the maximum of ~200MB/s, while cp performed worse the fewer vdevs there were.

You are right, the write benchmarks were done with dd just to have some bulk figures, since zeros can usually be generated fast enough.

This also impacted the zfs send speed, as with fewer vdevs in RaidZ2 the disks seemed to spend most of their time seeking during the send.

That seems a bit too simplistic to me. If you compare raidz with raidz2 it seems that raidz2 is not too bad with fewer vdevs. I wish there was a way for zfs send to avoid so many seeks. The 1 TB file system is still being sent, now for close to 48 hours.

Cheers Carsten

PS: We still have a spare thumper sitting around, maybe I'll give it a try with 5 vdevs
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
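If anyone wants to reproduce the dd-versus-real-files difference, a rough comparison looks like this (untested sketch, paths are placeholders; /dev/zero produces a purely sequential stream as fast as dd can issue writes, so it tends to flatter the pool compared with copying many small files):

  # bulk streaming write, usually the best case (~10 GB)
  dd if=/dev/zero of=/atlashome/test/zeros bs=128k count=80000
  # more realistic: copy an existing tree with many small files and time it
  ptime cp -r /atlashome/someuser /atlashome/test/copy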
Re: [zfs-discuss] Improving zfs send performance
Hi Ross,

Ross wrote: Now though I don't think it's network at all. The end result from that thread is that we can't see any errors in the network setup, and using nicstat and NFS I can show that the server is capable of 50-60MB/s over the gigabit link. Nicstat also shows clearly that both zfs send / receive and mbuffer are only sending 1/5 of that amount of data over the network. I've completely run out of ideas of my own (but I do half expect there's a simple explanation I haven't thought of). Can anybody think of a reason why both zfs send / receive and mbuffer would be so slow?

Try to separate the two things:
(1) Try /dev/zero -> mbuffer --- network --- mbuffer -> /dev/null. That should give you wire speed.
(2) Try zfs send | mbuffer > /dev/null. That should give you an idea how fast zfs send really is locally.

Carsten
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
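Concretely, something along these lines should do (sketch only - host name, port, snapshot name and the buffer sizes are placeholders taken from elsewhere in this thread):

  # (1) network-only test
  # on the receiver:
  mbuffer -I 9090 -s 128k -m 2048M > /dev/null
  # on the sender:
  dd if=/dev/zero bs=128k count=80000 | mbuffer -O receiver:9090 -s 128k -m 2048M

  # (2) local zfs send test on the sender:
  zfs send pool/fs@snap | mbuffer -s 128k -m 2048M > /dev/null

If (1) runs at wire speed but (2) crawls, the bottleneck is zfs send reading from the pool, not the transport.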
Re: [zfs-discuss] Improving zfs send performance
Hi all, Carsten Aulbert wrote: More later. OK, I'm completely puzzled right now (and sorry for this lengthy email). My first (and currently only idea) was that the size of the files is related to this effect, but that does not seem to be the case: (1) A 185 GB zfs file system was transferred yesterday with a speed of about 60 MB/s to two different servers. The histogram of files looks like: 2822 files were investigated, total size is: 185.82 Gbyte Summary of file sizes [bytes]: zero: 2 1 - 2 0 2 - 4 1 4 - 8 3 8 - 16 26 16 - 32 8 32 - 64 6 64 - 128 29 128 - 25611 256 - 51213 512 - 1024 17 1024 - 2k33 2k - 4k 45 4k - 8k9044 8k - 16k 60 16k - 32k41 32k - 64k19 64k - 128k 22 128k - 256k 12 256k - 512k 5 512k - 1024k 1218 ** 1024k - 2M16004 * 2M - 4M 46202 4M - 8M 0 8M - 16M 0 16M - 32M 0 32M - 64M 0 64M - 128M0 128M - 256M 0 256M - 512M 0 512M - 1024M 0 1024M - 2G0 2G - 4G 0 4G - 8G 0 8G - 16G 1 (2) Currently a much larger file system is being transferred, the same script (even the same incarnation, i.e. process) is now running close to 22 hours: 28549 files were investigated, total size is: 646.67 Gbyte Summary of file sizes [bytes]: zero: 4954 ** 1 - 2 0 2 - 4 0 4 - 8 1 8 - 161 16 - 32 0 32 - 64 0 64 - 128 1 128 - 256 0 256 - 512 9 512 - 1024 71 1024 - 2k 1 2k - 4k1095 ** 4k - 8k8449 * 8k - 16k 2217 16k - 32k 503 *** 32k - 64k 1 64k - 128k1 128k - 256k 1 256k - 512k 0 512k - 1024k 0 1024k - 2M0 2M - 4M 0 4M - 8M 16 8M - 16M 0 16M - 32M 0 32M - 64M 11218 64M - 128M0 128M - 256M 0 256M - 512M 0 512M - 1024M 0 1024M - 2G0 2G - 4G 5 4G - 8G 1 8G - 16G 3 16G - 32G 1 When watching zpool iostat I get this (30 second average, NOT the first output): capacity operationsbandwidth pool used avail read write read write -- - - - - - - atlashome 3.54T 17.3T137 0 4.28M 0 raidz2 833G 6.00T 1 0 30.8K 0 c0t0d0 - - 1 0 2.38K 0 c1t0d0 - - 1 0 2.18K 0 c4t0d0 - - 0 0 1.91K 0 c6t0d0 - - 0 0 1.76K 0 c7t0d0 - - 0 0 1.77K 0 c0t1d0 - - 0 0 1.79K 0 c1t1d0 - - 0 0 1.86K 0 c4t1d0 - - 0 0 1.97K 0 c5t1d0 - - 0 0 2.04K 0 c6t1d0 - - 1 0 2.25K 0 c7t1d0 - - 1 0 2.31K 0 c0t2d0 - - 1 0 2.21K 0 c1t2d0 - - 0 0 1.99K 0 c4t2d0 - - 0 0 1.99K 0 c5t2d0 - - 1 0 2.38K 0 raidz21.29T 5.52T 67 0 2.09M 0 c6t2d0 - - 58 0 143K 0 c7t2d0 - - 58 0 141K 0 c0t3d0 - - 53 0 131K 0 c1t3d0 - - 53 0 130K 0 c4t3d0 - - 58 0 143K 0 c5t3d0 - - 58 0 145K 0 c6t3d0 - - 59 0 147K 0 c7t3d0 - - 59 0 146K 0 c0t4d0 - - 59 0 145K 0 c1t4d0 - - 58 0 145K 0 c4t4d0 - - 58 0 145K 0 c6t4d0 - - 58 0 143K 0 c7t4d0 - - 58 0 143K 0 c0t5d0 - - 58 0 145K 0 c1t5d0 - - 58 0 144K 0 raidz21.43T 5.82T 69 0 2.16M 0 c4t5d0 - - 62 0 141K 0 c5t5d0 - - 60 0 138K 0 c6t5d0 - - 59 0 135K 0 c7t5d0 - - 60 0 138K 0 c0t6d0 - - 62
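In case anyone wants to produce a similar file-size histogram, the gist of it is something like the following (a rough and slow sketch using standard tools - the path is a placeholder and the real script is a bit more elaborate):

  cd /atlashome/someuser
  find . -type f -exec ls -ln {} + | \
    nawk '{ b = 1; n = 0; while ($5 > b) { b *= 2; n++ } cnt[n]++ }
          END { for (i = 0; i <= 40; i++) if (i in cnt)
                  printf "<= %d bytes: %d files\n", 2^i, cnt[i] }'

It simply buckets the size column of ls -ln into powers of two, which is enough to see whether a file system is dominated by tiny files.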
Re: [zfs-discuss] Improving zfs send performance
Hi Ross,

Ross Smith wrote: Thanks, that got it working. I'm still only getting 10MB/s, so it's not solved my problem - I've still got a bottleneck somewhere, but mbuffer is a huge improvement over standard zfs send / receive. It makes such a difference when you can actually see what's going on.

I'm currently trying to investigate this a bit. One of our users' home directories is extremely slow to 'zfs send'. It started yesterday afternoon at about 1600+0200, is still running, and has copied less than 50% of the whole tree so far. On the receiving side zfs get tells me:

atlashome/BACKUP/XXX  used           193G   -
atlashome/BACKUP/XXX  available      17.2T  -
atlashome/BACKUP/XXX  referenced     193G   -
atlashome/BACKUP/XXX  compressratio  1.81x  -

So with the 1.81x compression ratio, the 193 GB stored correspond to close to 350 GB of transferred data, with about 500 GB still to go.

More later.

Carsten
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving zfs send performance
Hi Richard, Richard Elling wrote: Since you are reading, it depends on where the data was written. Remember, ZFS dynamic striping != RAID-0. I would expect something like this if the pool was expanded at some point in time. No, the RAID was set-up in one go right after jumpstarting the box. (2) The disks should be able to perform much much faster than they currently output data at, I believe it;s 2008 and not 1995. X4500? Those disks are good for about 75-80 random iops, which seems to be about what they are delivering. The dtrace tool, iopattern, will show the random/sequential nature of the workload. I need to read about his a bit and will try to analyze it. (3) The four cores of the X4500 are dying of boredom, i.e. idle 95% all the time. Has anyone a good idea, where the bottleneck could be? I'm running out of ideas. I would suspect the disks. 30 second samples are not very useful to try and debug such things -- even 1 second samples can be too coarse. But you should take a look at 1 second samples to see if there is a consistent I/O workload. -- richard Without doing too much statistics (yet, if needed I can easily do that) it looks like these: capacity operationsbandwidth pool used avail read write read write -- - - - - - - atlashome 3.54T 17.3T256 0 7.97M 0 raidz2 833G 6.00T 0 0 0 0 c0t0d0 - - 0 0 0 0 c1t0d0 - - 0 0 0 0 c4t0d0 - - 0 0 0 0 c6t0d0 - - 0 0 0 0 c7t0d0 - - 0 0 0 0 c0t1d0 - - 0 0 0 0 c1t1d0 - - 0 0 0 0 c4t1d0 - - 0 0 0 0 c5t1d0 - - 0 0 0 0 c6t1d0 - - 0 0 0 0 c7t1d0 - - 0 0 0 0 c0t2d0 - - 0 0 0 0 c1t2d0 - - 0 0 0 0 c4t2d0 - - 0 0 0 0 c5t2d0 - - 0 0 0 0 raidz21.29T 5.52T133 0 4.14M 0 c6t2d0 - -117 0 285K 0 c7t2d0 - -114 0 279K 0 c0t3d0 - -106 0 261K 0 c1t3d0 - -114 0 282K 0 c4t3d0 - -118 0 294K 0 c5t3d0 - -125 0 308K 0 c6t3d0 - -126 0 311K 0 c7t3d0 - -118 0 293K 0 c0t4d0 - -119 0 295K 0 c1t4d0 - -120 0 298K 0 c4t4d0 - -120 0 291K 0 c6t4d0 - -106 0 257K 0 c7t4d0 - - 96 0 236K 0 c0t5d0 - -109 0 267K 0 c1t5d0 - -114 0 282K 0 raidz21.43T 5.82T123 0 3.83M 0 c4t5d0 - -108 0 242K 0 c5t5d0 - -104 0 236K 0 c6t5d0 - -104 0 239K 0 c7t5d0 - -107 0 245K 0 c0t6d0 - -108 0 248K 0 c1t6d0 - -106 0 245K 0 c4t6d0 - -108 0 250K 0 c5t6d0 - -112 0 258K 0 c6t6d0 - -114 0 261K 0 c7t6d0 - -110 0 253K 0 c0t7d0 - -109 0 248K 0 c1t7d0 - -109 0 246K 0 c4t7d0 - -108 0 243K 0 c5t7d0 - -108 0 244K 0 c6t7d0 - -106 0 240K 0 c7t7d0 - -109 0 244K 0 -- - - - - - - the iops vary between about 70 - 140, interesting bit is that the first raidz2 does not get any hits at all :( Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
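Regarding Richard's pointer: iopattern is part of the DTraceToolkit rather than stock Solaris, so it has to be fetched separately. From memory, usage is roughly like this (treat as a sketch):

  # one summary line every 10 seconds, 6 times
  ./iopattern 10 6

The %RAN / %SEQ columns show how random the disk workload is, which is exactly the question here.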
Re: [zfs-discuss] Improving zfs send performance
Hi again,

Brent Jones wrote: Scott, can you tell us the configuration that you're using that is working for you? Were you using RaidZ, or RaidZ2? I'm wondering what the sweet spot is to get a good compromise in vdevs and usable space/performance.

Some time ago I ran some tests to find this, looping over layouts:
(1) create a new zpool
(2) copy a user's home directory onto it (always the same one, ~25 GB IIRC)
(3) zfs send it to /dev/null
(4) evaluate, then continue with the next layout

I did this for fully mirrored setups as well as raidz and raidz2; the results were mixed: https://n0.aei.uni-hannover.de/cgi-bin/twiki/view/ATLAS/ZFSBenchmarkTest#ZFS_send_performance_relevant_fo

One caveat: in retrospect this particular home directory seems to have been a "good" one, i.e. one which was quite fast to send. If you don't want to bother with the table: the mirrored setups never exceeded 58 MB/s and got faster the more small mirrors were used. RaidZ had its sweet spot with a configuration of '6 6 6 6 6 6 5 5', i.e. 5 or 6 disks per RaidZ and 8 vdevs. RaidZ2 was best at '10 9 9 9 9', i.e. 5 vdevs, but not much worse with only 3 - which is what we are currently using to get more storage space (it gains us about 2 TB per box).

Cheers Carsten
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
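For reference, the '10 9 9 9 9' layout from the tests above would be created in one go roughly like this (a sketch only - the device names are the 46 data disks of our X4500s, and the exact grouping of disks into vdevs is arbitrary):

  zpool create atlashome \
    raidz2 c0t0d0 c1t0d0 c4t0d0 c6t0d0 c7t0d0 c0t1d0 c1t1d0 c4t1d0 c5t1d0 c6t1d0 \
    raidz2 c7t1d0 c0t2d0 c1t2d0 c4t2d0 c5t2d0 c6t2d0 c7t2d0 c0t3d0 c1t3d0 \
    raidz2 c4t3d0 c5t3d0 c6t3d0 c7t3d0 c0t4d0 c1t4d0 c4t4d0 c6t4d0 c7t4d0 \
    raidz2 c0t5d0 c1t5d0 c4t5d0 c5t5d0 c6t5d0 c7t5d0 c0t6d0 c1t6d0 c4t6d0 \
    raidz2 c5t6d0 c6t6d0 c7t6d0 c0t7d0 c1t7d0 c4t7d0 c5t7d0 c6t7d0 c7t7d0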
Re: [zfs-discuss] Improving zfs send performance
Hi again,

Thomas Maier-Komor wrote: Carsten Aulbert schrieb: Hi Thomas, I don't know socat or what benefit it gives you, but have you tried using mbuffer to send and receive directly (options -I and -O)?

I thought we had tried that in the past and socat seemed faster, but I just made a brief test and got (/dev/zero -> remote /dev/null) 330 MB/s with mbuffer+socat and 430 MB/s with mbuffer alone.

Additionally, try to set the block size of mbuffer to the recordsize of zfs (usually 128k): receiver$ mbuffer -I sender:1 -s 128k -m 2048M | zfs receive sender$ zfs send blabla | mbuffer -s 128k -m 2048M -O receiver:1

We are using 32k since many of our users use tiny files (and then I need to reduce the buffer size because of this 'funny' error): mbuffer: fatal: Cannot address so much memory (32768*65536=21474836481544040742911). Does this qualify for a bug report?

Thanks for the nudge to look into this again!

Cheers Carsten
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
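For completeness, the recordsize to match against can be read per file system (128k unless it has been tuned; the dataset name is just a placeholder):

  zfs get recordsize atlashome/someuser

Note that mbuffer's -s only affects how the stream is chunked inside the buffer, not what zfs send emits, so it may be worth measuring whether 32k actually helps for small-file data sets.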
[zfs-discuss] Improving zfs send performance
Hi all, although I'm running all this on a Sol10u5 X4500, I hope I may ask this question here. If not, please let me know where to head to.

We are running several X4500s with only 3 raidz2 vdevs per pool, since we want quite a bit of storage space[*], but the performance we get when using zfs send is sometimes really lousy. Of course this depends on what's in the file system, but when doing a few backups today I have seen the following:

receiving full stream of atlashome/[EMAIL PROTECTED] into atlashome/BACKUP/[EMAIL PROTECTED]
in @ 11.1 MB/s, out @ 11.1 MB/s, 14.9 GB total, buffer 0% full
summary: 14.9 GByte in 45 min 42.8 sec - average of 5708 kB/s

So, a mere 15 GB were transferred in 45 minutes; another user's home, which is quite large (7TB), took more than 42 hours to be transferred. Since all this is going over a 10 Gb/s network and the CPUs are all idle, I would really like to know
* why zfs send is so slow and
* how I can improve the speed.

Thanks a lot for any hint

Cheers Carsten

[*] we have run quite a few tests with more vdevs but were not able to improve the speeds substantially. For this particular bad file system I still need to histogram the file sizes. -- Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics Callinstrasse 38, 30167 Hannover, Germany Phone/Fax: +49 511 762-17185 / -17193 http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving zfs send performance
Hi, Darren J Moffat wrote: What are you using to transfer the data over the network?

Initially just plain ssh, which was way too slow; now we use mbuffer on both ends and push the data over a socket via socat - I know that mbuffer can do the network transport itself, but in a few tests socat seemed to be faster. Sorry for not mentioning this in the first email.

Cheers Carsten
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
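The pipeline looks roughly like this (sketch from memory - port number, host and dataset names are placeholders):

  # on the receiver:
  socat -u TCP-LISTEN:9090,reuseaddr STDOUT | mbuffer -m 2048M | zfs receive -d atlashome/BACKUP
  # on the sender:
  zfs send atlashome/someuser@backup | mbuffer -m 2048M | socat -u STDIN TCP:receiver:9090

The mbuffer on each side decouples the bursty zfs send/receive from the network, which is where the visible in/out rate display comes from.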
Re: [zfs-discuss] Improving zfs send performance
Hi Thomas,

Thomas Maier-Komor wrote: Carsten, the summary looks like you are using mbuffer. Can you elaborate on what options you are passing to mbuffer? Maybe changing the blocksize to be consistent with the recordsize of the zpool could improve performance. Is the buffer running full or is it empty most of the time? Are you sure that the network connection is 10Gb/s all the way through from machine to machine?

Well spotted :) Right now plain mbuffer with plenty of buffer (-m 2048M) on both ends, and I have not seen any buffer exceed the 10% watermark level. The network connections are via Neterion XFrame II Sun Fire NICs, then via CX4 cables to our core switch, where both boxes are directly connected (Woven Systems EFX 1000). netperf tells me that the TCP performance is close to 7.5 GBit/s duplex, and if I use cat /dev/zero | mbuffer | socat --- network --- socat | mbuffer > /dev/null I easily see speeds of about 350-400 MB/s, so I think the network is fine.

Cheers Carsten
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
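For the record, the raw TCP check was done roughly like this (sketch; netserver has to be running on the other box already, and option details may differ between netperf versions):

  netperf -H receiver -t TCP_STREAM -l 30   # sender -> receiver
  netperf -H receiver -t TCP_MAERTS -l 30   # reverse direction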
[zfs-discuss] zpool replace is stuck
Hi, on a Solaris 10u5 box (X4500) with the latest patches (Oct 8) one disk was marked as failed. We replaced it yesterday; I configured the new disk via cfgadm and told ZFS to replace it:

cfgadm -c configure sata1/4
zpool replace atlashome c1t4d0

Initially it looked good and resilvering started, but when I looked a few hours later I found the zpool still degraded and the replacement disk also marked as failed, although the resilvering looked complete (according to zpool status):

s08:~# zpool status
  pool: atlashome
 state: DEGRADED
 scrub: resilver completed with 0 errors on Thu Oct 9 13:51:52 2008
config:
        NAME            STATE     READ WRITE CKSUM
        atlashome       DEGRADED     0     0     0
          raidz2        ONLINE       0     0     0
          [...]
          raidz2        DEGRADED     0     0     0
          [...]
            c0t4d0      ONLINE       0     0     0
            replacing   DEGRADED     0 3.00K    18
              c1t4d0s0/o UNAVAIL     0   277     0  cannot open
              c1t4d0 *               0     0     0
            c4t4d0      ONLINE       0     0     0
          [...]

* at that point this disk was OFFLINE IIRC, now it is marked ONLINE, see below

I tried to get it back online by disconnecting the SATA port and then reconnecting it. Apparently that worked (the disk has stayed ONLINE for more than 12 hours), but I'm still stuck in the same place: ZFS seems to think that the replacement is still going on and I don't know how to continue. I'm currently backing up the files from that box (luckily only about 1 TB), but I would like to know how to solve this:
(1) export/import the file system after the backup?
(2) destroy the pool and re-initialise it?
(3) anything else?

Thanks a lot for a brief hint!

Cheers Carsten -- Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics Callinstrasse 38, 30167 Hannover, Germany Phone/Fax: +49 511 762-17185 / -17193 http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
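One possibility that is sometimes suggested for a stuck 'replacing' vdev (hedged - only worth trying once the resilver really has completed, and only after the backup): detach the stale old half of the replacing pair so the replace can finish, then check the pool again:

  zpool detach atlashome c1t4d0s0/o
  zpool status -v atlashome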