Re: [zfs-discuss] Disable ZIL - persistent
On 05 August, 2011 - Darren J Moffat sent me these 0,9K bytes:

> On 08/05/11 13:11, Edward Ned Harvey wrote:
>> After a certain rev, I know you can set the sync property, and it takes
>> effect immediately, and it's persistent across reboots. But that doesn't
>> apply to Solaris 10. My question: Is there any way to make "Disabled ZIL"
>> a normal mode of operations in Solaris 10? Particularly: If I do this
>>
>>   echo zil_disable/W0t1 | mdb -kw
>>
>> then I have to remount the filesystem. It's kind of difficult to do this
>> automatically at boot time, and impossible (as far as I know) for rpool.
>> The only solution I see is to write some startup script which applies it
>> to filesystems other than rpool. Which feels kludgy. Is there a better way?

echo "set zfs:zil_disable = 1" >> /etc/system

Or use the mdb poke above if you don't want to zap /etc/system..

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
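For reference, the persistent route suggested above, spelled out as an /etc/system fragment (the comment lines are mine; a reboot is needed, after which it covers everything, rpool included):

```
* /etc/system fragment (Solaris 10): disable the ZIL globally at next boot.
* Unlike the runtime mdb poke, no per-filesystem remount is needed.
set zfs:zil_disable = 1
```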
Re: [zfs-discuss] SSD vs hybrid drive - any advice?
On 25 July, 2011 - Erik Trimble sent me these 2,0K bytes:

> On 7/25/2011 3:32 AM, Orvar Korvar wrote:
>> How long have you been using a SSD? Do you see any performance
>> decrease? I mean, ZFS does not support TRIM, so I wonder about long
>> term effects...
>
> Frankly, for the kind of use that ZFS puts on a SSD, TRIM makes no
> impact whatsoever. TRIM is primarily useful for low-volume changes -
> that is, for a filesystem that generally has few deletes over time
> (i.e. rate of change is low). Using a SSD as a ZIL or L2ARC device
> puts a very high write load on the device (even as an L2ARC, there is
> a considerably higher write load than a typical filesystem use). SSDs
> in such a configuration can't really make use of TRIM, and depend on
> the internal SSD controller block re-allocation algorithms to improve
> block layout.
>
> Now, if you're using the SSD as primary media (i.e. in place of a Hard
> Drive), there is a possibility that TRIM could help. I honestly can't
> be sure that it would help, however, as ZFS's Copy-on-Write nature
> means that it tends to write entire pages of blocks, rather than just
> small blocks. Which is fine from the SSD's standpoint.

You still need the flash erase cycle.

> On a related note: I've been using a OCZ Vertex 2 as my primary drive
> in a laptop, which runs Windows XP (no TRIM support). I haven't
> noticed any dropoff in performance in the year it's been in service.
> I'm doing typical productivity laptop-ish things (no compiling, etc.),
> so it appears that the internal SSD controller is more than smart
> enough to compensate even without TRIM.
>
> Honestly, I think TRIM isn't really useful for anyone. It took too
> long to get pushed out to the OSes, and the SSD vendors seem to have
> just compensated by making a smarter controller able to do better
> reallocation. Which, to me, is the better ideal, in any case.

Bullshit. I just got a OCZ Vertex 3, and the first fill was 450-500MB/s.
Second and subsequent fills are at half that speed.

I'm quite confident that it's due to the flash erase cycle that's
needed, and if stuff can be TRIM:ed (and thus flash erased as well),
speed would be regained. Overwriting a previously used block requires a
flash erase, and if that can be done in the background when the timing
is not critical, instead of just before you can actually write the
block you want, performance will increase.

/Tomas
Re: [zfs-discuss] monitoring ops
> Matt Harrison iwasinnamuk...@genestate.com wrote:
>> Hi list, I want to monitor the read and write ops/bandwidth for a
>> couple of pools and I'm not quite sure how to proceed. I'm using
>> rrdtool so I either want an accumulated counter or a gauge. According
>> to the ZFS admin guide, running zpool iostat without any parameters
>> should show the activity since boot. On my system (OSOL snv_133) it's
>> only showing ops in the single digits for a system with a month's
>> uptime and many GB of transfers. So, is there a way to get this
>> output correctly, or is there a better way to do this? Thanks

Average activity since boot...

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
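Since the no-argument `zpool iostat` prints the (here apparently broken) since-boot average, a gauge for rrdtool is easier to get from interval samples: ask for two samples and keep only the second, which covers just the last interval. A sketch; the pool name, interval, rrd path, and column positions are assumptions:

```shell
#!/bin/sh
# Feed one zpool iostat interval sample into rrdtool as a GAUGE.
# The first sample from "zpool iostat <pool> <secs> 2" is the since-boot
# average, so only the second (tail -1) is used.

# Convert zpool iostat's human-readable suffixes (K/M/G) to plain numbers.
to_bytes() {
    echo "$1" | awk '
        /K$/ { printf "%.0f\n", $1 * 1024; next }
        /M$/ { printf "%.0f\n", $1 * 1048576; next }
        /G$/ { printf "%.0f\n", $1 * 1073741824; next }
        { printf "%.0f\n", $1 }'
}

# sample="$(zpool iostat tank 10 2 | tail -1)"
# set -- $sample   # fields: name alloc free rops wops rbw wbw
# rrdtool update /var/rrd/tank.rrd "N:$(to_bytes $6):$(to_bytes $7)"
```

The commented lines show the intended wiring; `to_bytes` is the only part that runs without a pool.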
Re: [zfs-discuss] Zpool with data errors
On 21 June, 2011 - Todd Urie sent me these 5,9K bytes:

> I have a zpool that shows the following from a zpool status -v:
>
>   brsnnfs0104:[/var/spool/cron/scripts]# zpool status -v ABC0101
>     pool: ABC0101
>    state: ONLINE
>   status: One or more devices has experienced an error resulting in
>           data corruption. Applications may be affected.
>   action: Restore the file in question if possible. Otherwise restore
>           the entire pool from backup.
>      see: http://www.sun.com/msg/ZFS-8000-8A
>    scrub: none requested
>   config:
>
>       NAME                              STATE   READ WRITE CKSUM
>       ABC0101                           ONLINE     0     0    10
>         /dev/vx/dsk/ABC01dg/ABC0101_01  ONLINE     0     0     2
>         /dev/vx/dsk/ABC01dg/ABC0101_02  ONLINE     0     0     8
>         /dev/vx/dsk/ABC01dg/ABC0101_03  ONLINE     0     0    10
>
>   errors: Permanent errors have been detected in the following files:
>
>     /clients/ABC0101/rep/local/bfm/web/htdocs/tmp/rscache/717b52282ea059452621587173561360
>     /clients/ABC0101/rep/local/bfm/web/htdocs/tmp/rscache/6e6a9f37c4d13fdb3dcb8649272a2a49
>     /clients/ABC0101/rep/d0/prod1/reports/ReutersCMOLoad/ReutersCMOLoad.ABCntss001.20110620.141330.26496.ROLLBACK_FOR_UPDATE_COUPONS.html
>     /clients/ABC0101/rep/local/bfm/web/htdocs/tmp/G2_0.related_detail_loader.1308593666.54643.n5cpoli3355.data
>     /clients/ABC0101/rep/d0/prod1/reports/gp_reports/ALLMNG/20110429/F_OLPO82_A.gp.ABCIM_GA.nlaf.xml.gz
>     /clients/ABC0101/rep/d0/prod1/reports/gp_reports/ALLMNG/20110429/UNVLXCIAFI.gp.ABCIM_GA.nlaf.xml.gz
>     /clients/ABC0101/rep/d0/prod1/reports/gp_reports/ALLMNG/20110429/UNIVLEXCIA.gp.BARCRATING_ABC.nlaf.xml.gz
>
> I think that a scrub at least has the possibility to clear this up. A
> quick search suggests that others have had some good experience with
> using scrub in similar circumstances. I was wondering if anyone could
> share some of their experiences, good and bad, so that I can assess
> the risk and probability of success with this approach. Also, any
> other ideas would certainly be appreciated.

As you have no ZFS based redundancy, it can only detect that some
blocks delivered from the devices (SAN I guess?) were broken according
to the checksum. If you had raidz/mirror in zfs, it would have
corrected the problems and written back correct data to the
malfunctioning device. Now it does not. A scrub only reads the data and
verifies that data matches checksums.

/Tomas
Re: [zfs-discuss] Server with 4 drives, how to configure ZFS?
On 21 June, 2011 - Nomen Nescio sent me these 0,4K bytes:

> Hello Marty!
>
>> With four drives you could also make a RAIDZ3 set, allowing you to
>> have the lowest usable space, poorest performance and worst resilver
>> times possible.
>
> That's not funny. I was actually considering this :p

4-way mirror would be way more useful. But you have to admit, it would
probably be somewhat reliable!

/Tomas
Re: [zfs-discuss] Wired write performance problem
On 08 June, 2011 - Donald Stahl sent me these 0,6K bytes:

>> One day, the write performance of zfs degraded. The write performance
>> decreased from 60MB/s to about 6MB/s in sequential write. Command:
>> date;dd if=/dev/zero of=block bs=1024*128 count=1;date
>
> See this thread:
> http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45
> And search in the page for: metaslab_min_alloc_size
> Try adjusting the metaslab size and see if it fixes your performance
> problem.

And if pool usage is 90%, then there's another problem (the algorithm
for finding free space changes).

/Tomas
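For reference, the tunable from that thread is usually poked at runtime along these lines. The 0x1000 (4 KB) value is the one commonly cited in that discussion, not something verified against this poster's system, and mdb is stubbed to a plain echo here so the sketch runs anywhere; drop the stub and run as root on a real box:

```shell
#!/bin/sh
# Runtime tweak sketch (Solaris, not persistent across reboots): shrink
# metaslab_min_alloc_size so a near-full pool stops hunting for large
# free regions. mdb's /Z write format takes the value in hex, so "1000"
# means 0x1000 = 4096 bytes.
mdb() { echo "mdb $*"; }    # stub for illustration; remove to run for real

echo 'metaslab_min_alloc_size/Z 1000' | mdb -kw
```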
Re: [zfs-discuss] RealSSD C300 - Crucial CT064M4SSD2
On 08 June, 2011 - Eugen Leitl sent me these 0,5K bytes:

> Anyone running a Crucial CT064M4SSD2? Any good, or should I try
> getting a RealSSD C300, as long as these are still available?

Haven't tried any of those, but how about one of these:

OCZ Vertex3 (Sandforce SF-2281, sataIII, MLC, to be used for l2arc):

  shazoo:~# gdd if=/dev/rdsk/c0t5E83A97F98CEFE5Dd0s0 of=/dev/null bs=1024k count=1024
  1024+0 records in
  1024+0 records out
  1073741824 bytes (1.1 GB) copied, 2.21005 s, 486 MB/s

OCZ Vertex2 EX (Sandforce SF-1500, sataII, SLC and supercap, to be used
for zil):

  shazoo:~# gdd if=/dev/rdsk/c0t5E83A97F1471E0A4d0s0 of=/dev/null bs=1024k count=1024
  1024+0 records in
  1024+0 records out
  1073741824 bytes (1.1 GB) copied, 3.93114 s, 273 MB/s

This is in a x4170m2 with Solaris10.

/Tomas
Re: [zfs-discuss] SATA disk perf question
On 01 June, 2011 - Paul Kraus sent me these 0,9K bytes:

> I figure this group will know better than any other I have contact
> with: is 700-800 I/Ops reasonable for a 7200 RPM SATA drive (1 TB Sun
> badged Seagate ST31000N in a J4400)? I have a resilver running and am
> seeing about 700-800 writes/sec. on the hot spare as it resilvers.
> There is no other I/O activity on this box, as this is a remote
> replication target for production data. I have the replication
> disabled until the resilver completes.

700-800 seq ones perhaps.. for random, you can divide by 10.

/Tomas
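The divide-by-10 rule of thumb falls out of simple latency arithmetic (my numbers, not from the thread): a random write pays an average seek plus half a rotation, while sequential writes pay almost neither.

```shell
# Back-of-the-envelope random IOPS for a spinning disk: a 7200 RPM drive
# spins 120 rev/s, so average rotational latency is half a revolution,
# ~4.17 ms; add a typical ~8.5 ms average seek and one random IOP costs
# ~12.7 ms, i.e. ~79 IOPS. 700-800 ops/s therefore has to be largely
# sequential or coalesced I/O.
random_iops() {  # args: rpm, average seek time in ms
    awk -v rpm="$1" -v seek="$2" 'BEGIN {
        rot = (60000 / rpm) / 2          # half-rotation latency, ms
        printf "%d\n", 1000 / (seek + rot)
    }'
}

random_iops 7200 8.5     # prints 78
random_iops 15000 3.5    # prints 181 -- why fast SAS disks cost more
```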
Re: [zfs-discuss] changing vdev types
On 01 June, 2011 - Eric Sproul sent me these 0,8K bytes:

> On Wed, Jun 1, 2011 at 2:54 PM, Matt Harrison
> iwasinnamuk...@genestate.com wrote:
>> Hi list, I've got a pool that's got a single raidz1 vdev. I've just
>> got some more disks in and I want to replace that raidz1 with a
>> three-way mirror. I was thinking I'd just make a new pool and copy
>> everything across, but then of course I've got to deal with the name
>> change. Basically, what is the most efficient way to migrate the pool
>> to a completely different vdev?
>
> Since you can't mix vdev types in a single pool, you'll have to create
> a new pool. But you can use zfs send/recv to move the datasets, so

You can mix as much as you want to, but you can't remove a vdev (yet).

/Tomas
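The "deal with the name change" part has a standard answer: export the new pool and re-import it under the old name. A sketch of the whole copy-then-rename sequence; pool names "tank"/"newtank" are placeholders, and zpool/zfs are stubbed to echo so the steps can be traced without real disks (drop the two stub lines to run it for real):

```shell
#!/bin/sh
# Migrate a pool's datasets to a pool built on different vdevs, then take
# over the old pool's name.
zpool() { echo "zpool $*"; }    # stubs for illustration only
zfs()   { echo "zfs $*"; }

zfs snapshot -r tank@migrate
# zfs send -R tank@migrate | zfs recv -Fdu newtank   # copy all datasets
zpool destroy tank              # only after verifying the copy!
zpool export newtank
zpool import newtank tank       # re-import under the old name
```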
Re: [zfs-discuss] Experiences with 10.000+ filesystems
On 31 May, 2011 - Khushil Dep sent me these 4,5K bytes:

> The adage that I adhere to with ZFS features is "just because you can
> doesn't mean you should!". I would suspect that with that many
> filesystems the normal zfs-tools would also take an inordinate length
> of time to complete their operations - scale according to size.

I've done a not too scientific test on reboot times for Solaris 10 vs
11 with regard to many filesystems... Quad Xeon machines with single
raid10 and one boot environment. Using more BEs with LU in sol10 will
make the situation even worse, as it's LU that's taking time
(re)mounting all filesystems over and over and over and over again.

http://www8.cs.umu.se/~stric/tmp/zfs-many.png

As the picture shows, don't try 10000 filesystems with nfs on sol10.
Creating more filesystems gets slower and slower the more you have as
well.

> Generally snapshots are quick operations but 10,000 such operations
> would I believe take enough time to complete as to present operational
> issues - breaking these into sets would alleviate some? Perhaps if you
> are starting to run into many thousands of filesystems you would need
> to re-examine your rationale in creating so many.

On a different setup, we have about 750 datasets where we would like to
use a single recursive snapshot, but when doing that all file access
will be frozen for varying amounts of time (sometimes half an hour or
way more). Splitting it up into ~30 subsets, doing recursive snapshots
over those instead, has decreased the total snapshot time greatly and
cut the frozen time down to single digit seconds instead of minutes or
hours.

> My 2c. YMMV.
>
> -- Khush
>
> On Tuesday, 31 May 2011 at 11:08, Gertjan Oude Lohuis wrote:
>> "Filesystems are cheap" is one of ZFS's mottos. I'm wondering how far
>> this goes. Does anyone have experience with having more than 10.000
>> ZFS filesystems? I know that mounting this many filesystems during
>> boot will take considerable time. Are there any other disadvantages
>> that I should be aware of? Are zfs-tools still usable, like 'zfs
>> list', 'zfs get/set'? Would I run into any problems when snapshots
>> are taken (almost) simultaneously from multiple filesystems at once?
>>
>> Regards, Gertjan Oude Lohuis

/Tomas
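The "~30 subsets" split described above can be sketched as one recursive snapshot per subtree instead of a single pool-wide one, so each call only freezes its own subtree. Dataset names and the snapshot naming scheme are placeholders, and zfs is stubbed to echo so the flow can be traced without a real pool (drop the stub to run it):

```shell
#!/bin/sh
# Take per-subtree recursive snapshots instead of one pool-wide
# "zfs snapshot -r tank@...".
zfs() { echo "zfs $*"; }    # stub for illustration only

SNAP="backup-$(date -u +%Y%m%d%H%M)"

for ds in tank/home tank/mail tank/www; do
    zfs snapshot -r "$ds@$SNAP"
done
```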
Re: [zfs-discuss] Experiences with 10.000+ filesystems
On 31 May, 2011 - Gertjan Oude Lohuis sent me these 0,9K bytes:

> On 05/31/2011 03:52 PM, Tomas Ögren wrote:
>> I've done a not too scientific test on reboot times for Solaris 10 vs
>> 11 with regard to many filesystems...
>> http://www8.cs.umu.se/~stric/tmp/zfs-many.png
>> As the picture shows, don't try 10000 filesystems with nfs on sol10.
>> Creating more filesystems gets slower and slower the more you have as
>> well.
>
> Since all filesystems would be shared via NFS, this clearly is a nogo
> :). Thanks!
>
>> On a different setup, we have about 750 datasets where we would like
>> to use a single recursive snapshot, but when doing that all file
>> access will be frozen for varying amounts of time
>
> What version of ZFS are you using? Like Matthew Ahrens said: version
> 27 has a fix for this.

22, Solaris 10.

/Tomas
Re: [zfs-discuss] Monitoring disk seeks
On 19 May, 2011 - Sašo Kiselkov sent me these 0,6K bytes:

> Hi all, I'd like to ask whether there is a way to monitor disk seeks.
> I have an application where many concurrent readers (50) sequentially
> read a large dataset (10T) at a fairly low speed (8-10 Mbit/s). I can
> monitor read/write ops using iostat, but that doesn't tell me how
> contiguous the data is, i.e. when iostat reports 500 read ops, does
> that translate to 500 seeks + 1 read per seek, or 50 seeks + 10 reads,
> etc? Thanks!

Get DTraceToolkit and check out the various things under Disk and FS,
might help.

/Tomas
Re: [zfs-discuss] fuser vs. zfs
On 23 November, 2005 - Benjamin Lewis sent me these 3,0K bytes:

> Hello, I'm running Solaris Express build 27a on an amd64 machine and
> fuser(1M) isn't behaving as I would expect for zfs filesystems.
> Various google and ...
>
>   # fuser -c /
>   /: [lots of other PIDs] 20617tm [others] 20412cm [others]
>   # fuser -c /opt
>   /opt:
>
> Nothing at all for /opt. So it's safe to unmount? Nope: ...
>
> Has anyone else seen something like this?

Try something less ancient, Solaris 10u9 reports it just fine for
example. ZFS was pretty new-born when snv27 got out..

/Tomas
Re: [zfs-discuss] fuser vs. zfs
On 10 May, 2011 - Tomas Ögren sent me these 0,9K bytes:

> On 23 November, 2005 - Benjamin Lewis sent me these 3,0K bytes:
>> Hello, I'm running Solaris Express build 27a on an amd64 machine and
>> fuser(1M) isn't behaving as I would expect for zfs filesystems.
>> Various google and ...
>>
>>   # fuser -c /
>>   /: [lots of other PIDs] 20617tm [others] 20412cm [others]
>>   # fuser -c /opt
>>   /opt:
>>
>> Nothing at all for /opt. So it's safe to unmount? Nope: ...
>>
>> Has anyone else seen something like this?
>
> Try something less ancient, Solaris 10u9 reports it just fine for
> example. ZFS was pretty new-born when snv27 got out..

And for someone who is able to read as well, that mail was from 2005 -
when snv27 actually was less ancient ;) Seems like the moderator queue
from yesteryears just got flushed.. Sorry for the noise from my side..

/Tomas
Re: [zfs-discuss] primarycache=metadata seems to force behaviour of secondarycache=metadata
On 09 May, 2011 - Richard Elling sent me these 5,0K bytes:

> of the pool -- not likely to be a winning combination. This isn't a
> problem for the ARC because it has memory bandwidth, which is, of
> course, always greater than I/O bandwidth.

Slightly off topic, but we had an IBM RS/6000 43P with a PowerPC 604e
cpu, which had about 60MB/s memory bandwidth (which is kind of bad for
a 332MHz cpu) and its disks could do 70-80MB/s or so.. in some other
machine..

/Tomas
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 06 May, 2011 - Erik Trimble sent me these 1,8K bytes:

> If dedup isn't enabled, snapshot and data deletion is very light on
> RAM requirements, and generally won't need to do much (if any) disk
> I/O. Such deletion should take milliseconds to a minute or so.

.. or hours. We've had problems on an old raidz2 where a recursive
snapshot creation over ~800 filesystems could take quite some time, up
until the sata-scsi disk box ate the pool. Now we're using raid10 on a
scsi box, and it takes 3-15 minutes or so, during which sync writes
(NFS) are almost unusable. Using 2 fast usb sticks as l2arc, waiting
for a Vertex2EX and a Vertex3 to arrive for ZIL/L2ARC testing. IO to
the filesystems is quite low (50 writes, 500k data per sec average),
but snapshot times go waay up during backups.

/Tomas
Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express
On 27 April, 2011 - Matthew Anderson sent me these 3,2K bytes:

> Hi All, I've run into a massive performance problem after upgrading to
> Solaris 11 Express from oSol 134. Previously the server was performing
> a batch write every 10-15 seconds and the client servers (connected
> via NFS and iSCSI) had very low wait times. Now I'm seeing constant
> writes to the array with very low throughput and high wait times on
> the client servers. ZIL is currently disabled. There is currently one
> failed disk that is being replaced shortly.
>
> Is there any ZFS tunable to revert Solaris 11 back to the behaviour of
> oSol 134? I attempted to remove Sol 11 and reinstall 134 but it keeps
> freezing during install, which is probably another issue entirely...
>
> IOstat output is below. When running iostat -v 2, that level of write
> OPs and throughput is very constant.
>
>                  capacity     operations    bandwidth
>  pool          alloc   free   read  write   read  write
>  ------------  -----  -----  -----  -----  -----  -----
>  MirrorPool    12.2T  4.11T    153  4.63K  6.06M  33.6M
>    mirror      1.04T   325G     11    416   400K  2.80M
>      c7t0d0        -      -      5    114   163K  2.80M
>      c7t1d0        -      -      6    114   237K  2.80M
>    mirror      1.04T   324G     10    374   426K  2.79M
>      c7t2d0        -      -      5    108   190K  2.79M
>      c7t3d0        -      -      5    107   236K  2.79M
>    mirror      1.04T   324G     15    425   537K  3.15M
>      c7t4d0        -      -      7    115   290K  3.15M
>      c7t5d0        -      -      8    116   247K  3.15M
>    mirror      1.04T   325G     13    412   572K  3.00M
>      c7t6d0        -      -      7    115   313K  3.00M
>      c7t7d0        -      -      6    116   259K  3.00M
>    mirror      1.04T   324G     13    381   580K  2.85M
>      c7t8d0        -      -      7    111   362K  2.85M
>      c7t9d0        -      -      5    111   219K  2.85M
>    mirror      1.04T   325G     15    408   654K  3.10M
>      c7t10d0       -      -      7    122   336K  3.10M
>      c7t11d0       -      -      7    123   318K  3.10M
>    mirror      1.04T   325G     14    461   681K  3.22M
>      c7t12d0       -      -      8    130   403K  3.22M
>      c7t13d0       -      -      6    132   278K  3.22M
>    mirror       749G   643G      1    279   140K  1.07M
>      c4t14d0       -      -      0      0      0      0
>      c7t15d0       -      -      1     83   140K  1.07M
>    mirror      1.05T   319G     18    333   672K  2.74M
>      c7t16d0       -      -     11     96   406K  2.74M
>      c7t17d0       -      -      7     96   266K  2.74M
>    mirror      1.04T   323G     13    353   540K  2.85M
>      c7t18d0       -      -      7     98   279K  2.85M
>      c7t19d0       -      -      6    100   261K  2.85M
>    mirror      1.04T   324G     12    459   543K  2.99M
>      c7t20d0       -      -      7    118   285K  2.99M
>      c7t21d0       -      -      4    119   258K  2.99M
>    mirror      1.04T   324G     11    431   465K  3.04M
>      c7t22d0       -      -      5    116   195K  3.04M
>      c7t23d0       -      -      6    117   272K  3.04M
>    c8t2d0          0  29.5G      0      0      0      0

Btw, this disk seems alone, unmirrored and a bit small..?

>  cache             -      -      -      -      -      -
>    c8t3d0      59.4G  3.88M    113     64  6.51M  7.31M
>    c8t1d0      59.5G    48K     95     69  5.69M  8.08M
>
> Thanks
> -Matt

/Tomas
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On 27 April, 2011 - Edward Ned Harvey sent me these 0,6K bytes:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Erik Trimble
>>
>>> (BTW, is there any way to get a measurement of number of blocks
>>> consumed per zpool? Per vdev? Per zfs filesystem?)
>>
>> *snip*. you need to use zdb to see what the current block usage is
>> for a filesystem. I'd have to look up the particular CLI usage for
>> that, as I don't know what it is off the top of my head.
>
> Anybody know the answer to that one?

zdb -bb pool

/Tomas
Re: [zfs-discuss] Read-only vdev
On 08 April, 2011 - Karl Wagner sent me these 3,5K bytes:

> Hi everyone. I was just wondering if there was a way for a specific
> vdev in a pool to be read-only? I can think of several uses for this,
> but would need to know if it was possible before thinking them through
> properly.

I can't think of any, so what are your uses?

/Tomas
Re: [zfs-discuss] Assessing health/performance of individual drives in ZFS pool
On 07 April, 2011 - Russ Price sent me these 0,7K bytes:

> On 04/05/2011 03:01 PM, Tomas Ögren wrote:
>> On 05 April, 2011 - Joe Auty sent me these 5,9K bytes:
>>> Has this changed, or are there any other techniques I can use to
>>> check the health of an individual SATA drive in my pool short of
>>> what ZFS itself reports?
>>
>> Through scsi compat layer..
>> socker:~# smartctl -a -d scsi /dev/rdsk/c0t0d0s0
>
> Note that you can get more complete information by using -d sat,12
> than by using -d scsi. This works for me on both the onboard AHCI
> ports and the SATA drives connected to my Intel SASUC8I (LSI-based)
> HBA.

Excellent. Works fine on the internal disks of a HP DL160G6 for example.

/Tomas
Re: [zfs-discuss] Assessing health/performance of individual drives in ZFS pool
On 05 April, 2011 - Joe Auty sent me these 5,9K bytes:

> Hello, A while ago I was exploring running smartmontools in Solaris 10
> on my SATA drives, but at the time SATA drives were not supported. Has
> this changed, or are there any other techniques I can use to check the
> health of an individual SATA drive in my pool short of what ZFS itself
> reports?

Through scsi compat layer..

  socker:~# smartctl -a -d scsi /dev/rdsk/c0t0d0s0
  smartctl 5.40 2010-10-16 r3189 [i386-pc-solaris2.10] (local build)
  Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

  Serial number: WCAT26836798
  Device type: disk
  Local Time is: Tue Apr  5 22:00:23 2011 MEST
  Device supports SMART and is Enabled
  Temperature Warning Disabled or Not Supported
  SMART Health Status: OK

  Current Drive Temperature: 29 C

  Error Counter logging not supported

  SMART Self-test log
  Num  Test        Status      segment  LifeTime  LBA_first_err [SK ASC ASQ]
       Description             number   (hours)
  # 1  Default     Completed      -        293    -             [-   -    -]
  ...

/Tomas
Re: [zfs-discuss] Investigating a hung system
On 25 February, 2011 - Mark Logan sent me these 0,6K bytes:

> Hi, I'm investigating a hung system. The machine is running snv_159
> and was running a full build of Solaris 11. You cannot get any
> response from the console and you cannot ssh in, but it responds to
> ping. The output from ::arc shows:
>
>   arc_meta_used  = 3836 MB
>   arc_meta_limit = 3836 MB
>   arc_meta_max   = 3951 MB
>
> Is it normal for arc_meta_used == arc_meta_limit?

It means that it has cached as much metadata as it's allowed to during
the current circumstances (arc size).

> Does this explain the hang?

No..

/Tomas
Re: [zfs-discuss] ZFS Performance
On 25 February, 2011 - David Blasingame Oracle sent me these 2,6K bytes:

> Hi All, In reading the ZFS Best Practices, I'm curious if this
> statement is still true about 80% utilization.

It happens at about 90% for me.. all of a sudden, the mail server got
butt slow.. killed an old snapshot to get to 85% free or so, then it
got snappy again. S10u9 sparc.

> from:
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
>
> Storage Pool Performance Considerations: Keep pool space under 80%
> utilization to maintain pool performance. Currently, pool performance
> can degrade when a pool is very full and file systems are updated
> frequently, such as on a busy mail server. Full pools might cause a
> performance penalty, but no other issues.
>
> Dave

/Tomas
Re: [zfs-discuss] Best way/issues with large ZFS send?
On 16 February, 2011 - Richard Elling sent me these 1,3K bytes:

> On Feb 16, 2011, at 6:05 AM, Eff Norwood wrote:
>> I'm preparing to replicate about 200TB of data between two data
>> centers using zfs send. We have ten 10TB zpools that are further
>> broken down into zvols of various sizes in each data center. One DC
>> is primary and the other will be the replication target and there is
>> plenty of bandwidth between them (10 gig dark fiber). Are there any
>> gotchas that I should be aware of? Also, at what level should I be
>> taking the snapshot to do the zfs send? At the primary pool level or
>> at the zvol level? Since the targets are to be exact replicas, I
>> presume at the primary pool level (e.g. tank) rather than for every
>> zvol (e.g. tank/prod/vol1)?
>
> There is no such thing as a pool snapshot. There are only dataset
> snapshots.

.. but you can make a single recursive snapshot call that affects all
datasets.

> The trick to a successful snapshot+send strategy at this size is to
> start snapping early and often. You don't want to send 200TB, you want
> to send 2TB, 100 times :-) The performance tends to be bursty, so the
> fixed record size of the zvols can work to your advantage for capacity
> planning. Also, a buffer of some sort can help smooth out the
> utilization, see the threads on ZFS and mbuffer.
>
> -- richard

/Tomas
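The "send 2TB, 100 times" strategy boils down to a periodic incremental-send loop. A sketch; pool and host names are placeholders, the zfs/ssh lines are left commented since they need real pools, and mbuffer plays the smoothing-buffer role Richard mentions:

```shell
#!/bin/sh
# One replication round: snapshot, then send only the delta since the
# previous snapshot. Run from cron at whatever interval keeps deltas small.
PREV="$1"                          # last snapshot the target already has
CUR="repl-$(date -u +%Y%m%d%H%M)"

# zfs snapshot -r "tank@$CUR"
# zfs send -R -i "tank@$PREV" "tank@$CUR" \
#     | mbuffer -s 128k -m 1G \
#     | ssh replica-host "mbuffer -s 128k -m 1G | zfs recv -Fdu tank"
echo "$CUR"
```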
Re: [zfs-discuss] Sil3124 Sata controller for ZFS on Sparc OpenSolaris Nevada b130
On 08 February, 2011 - Robert Soubie sent me these 1,1K bytes:

> On 08/02/2011 07:10, Jerry Kemp wrote:
>> As part of a small home project, I have purchased a SIL3124 hba in
>> hopes of attaching an external drive/drive enclosure via eSATA. The
>> host in question is an old Sun Netra T1 currently running OpenSolaris
>> Nevada b130. The card in question is this Sil3124 card:
>> http://www.newegg.com/product/product.aspx?item=N82E16816124003
>> although I did not purchase it from Newegg. I specifically purchased
>> this card as I have seen specific reports of it working under
>> Solaris/OpenSolaris distros on several Solaris mailing lists.
>
> I use a non-eSata version of this card under Solaris Express 11 for a
> boot mirrored ZFS pool. And another one for a Windows 7 machine that
> does backups of the server. Bios and drivers are available from the
> Silicon Image site, but nothing for Solaris.

The problem itself is sparc vs x86 and firmware for the card. AFAIK,
there is no sata card with drivers for solaris sparc. Use a SAS card.

/Tomas
Re: [zfs-discuss] mix 4K WD drives with 512 WD drives?
On 26 January, 2011 - Benji sent me these 0,8K bytes: Those WD20EARS emulate 512 bytes sectors, so yes you can freely mix and match them with other regular 512 bytes drives. Some have reported slower read/write speeds but nothing catastrophic. For some workloads, 3x slower than it should be. Or you can create a new 4K aligned pool (composed of only 4K drives!) to really take advantage of them. For that, you will need a modified zpool command to sets the ashift value of the pool to 12. A 4k aligned pool will work perfectly on a 512b aligned disk, it's just the other way that's bad. I guess ZFS could start defaulting to 4k, but ideally it should do the right thing depending on content (although that's hard for disks that are lying). /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
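[Editor's sketch: on ZFS ports that later grew an explicit ashift option (illumos, FreeBSD, ZFS on Linux), the 4K-aligned pool can be created without a patched binary. The pool name "tank" and disk names are invented; stock OpenSolaris zpool of this era needed the modified command mentioned above instead.]

```shell
# Create a pool whose minimum block size is 2^12 = 4096 bytes
zpool create -o ashift=12 tank mirror c0t0d0 c0t1d0

# Verify the ashift actually used
zdb -C tank | grep ashift     # expect: ashift: 12
```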
Re: [zfs-discuss] stupid ZFS question - floating point operations
On 22 December, 2010 - Jerry Kemp sent me these 1,0K bytes: I have a coworker, whose primary expertise is in another flavor of Unix. This coworker lists floating point operations as one of ZFS's detriments. I'm not really sure what he means specifically, or where he got this reference from. Then maybe ask him first? Guilty until proven innocent isn't the regular path... In an effort to refute what I believe is an error or misunderstanding on his part, I have spent time on Yahoo, Google, the ZFS section of OpenSolaris.org, etc. I really haven't turned up much of anything that would prove or disprove his comments. The one thing I haven't done is to go through the ZFS source code, but it's been years since I have done any serious programming. If someone from Oracle, or anyone on this mailing list could point me towards any documentation, or give me a definitive word, I would sure appreciate it. If there were floating point operations going on within ZFS, at this point I am uncertain as to what they would be. TIA for any comments, Jerry ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How many files/directories in a ZFS filesystem?
On 09 December, 2010 - David Strom sent me these 0,7K bytes: Looking for a little help, please. A contact from Oracle (Sun) suggested I pose the question to this email list. We're using ZFS on Solaris 10 in an application where there are so many directory-subdirectory layers, and a lot of small files (~1-2Kb), that we ran out of inodes (over 30 million!). So, the ZFS question is, how can we see how many files/directories have been created in a ZFS filesystem? Equivalent to df -o i on a UFS filesystem. Short of doing a find zfs-mount-point | wc. GNU df can show it, and regular Solaris could too but chooses not to. statvfs() should be able to report it as well. In ZFS, you will run out of inodes at the same time as you run out of space. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
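[Editor's illustration of the find-and-count fallback mentioned above, run against a throwaway temp directory; on a filesystem with fixed inode tables, GNU df -i would report the used-inode count directly.]

```shell
# Count every object (directories included) below a mount point with find(1)
d=$(mktemp -d)
touch "$d/a" "$d/b" "$d/c"
find "$d" | wc -l    # prints 4: the directory itself plus three files
rm -rf "$d"
```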
Re: [zfs-discuss] accidentally added a drive?
On 05 December, 2010 - Chris Gerhard sent me these 0,3K bytes: Alas you are hosed. There is at the moment no way to shrink a pool which is what you now need to be able to do. back up and restore I am afraid. .. or add a mirror to that drive, to keep some redundancy. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
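[Editor's sketch of the mirror-the-stray-drive suggestion above; the pool name "tank" and disk names are invented. zpool attach turns a single-disk top-level vdev into a mirror.]

```shell
# Attach a second disk to the accidentally-added vdev so it gains redundancy
zpool attach tank c9t8d0 c9t9d0

# Watch the resilver onto the new mirror half complete
zpool status tank
```

This restores redundancy but does not undo the add: the pool keeps the extra top-level vdev forever.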
Re: [zfs-discuss] How to create a checkpoint?
On 08 November, 2010 - Peter Taps sent me these 0,7K bytes: Folks, My understanding is that there is a way to create a zfs checkpoint before doing any system upgrade or installing new software. If there is a problem, one can simply roll back to the stable checkpoint. I am familiar with snapshots and clones. However, I am not clear on how to manage checkpoints. I would appreciate your help in how I can create, destroy and roll back to a checkpoint, and how I can list all the checkpoints. You are probably referring to snapshots, as ZFS does not have checkpoints (a checkpoint is pretty much the same thing as a snapshot). /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
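[Editor's sketch of the snapshot lifecycle the poster is after; the dataset name is invented.]

```shell
zfs snapshot rpool/export@pre-upgrade   # create the "checkpoint"
zfs list -t snapshot                    # list all snapshots
zfs rollback rpool/export@pre-upgrade   # roll back (discards changes made since)
zfs destroy rpool/export@pre-upgrade    # drop it when no longer needed
```

Note that rollback only returns to the most recent snapshot unless -r is given to destroy the intervening ones.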
Re: [zfs-discuss] Unknown Space Gain
On 20 October, 2010 - Krunal Desai sent me these 1,5K bytes: Huh, I don't actually ever recall enabling that. Perhaps that is connected to the message I started getting every minute recently in the kernel buffer, Oct 20 12:20:49 megatron pcplusmp: [ID 805372 kern.info] pcplusmp: ide (ata) instance 3 irq 0xf vector 0x45 ioapic 0x2 intin 0xf is bound to cpu 0 Oct 20 12:21:49 megatron pcplusmp: [ID 805372 kern.info] pcplusmp: ide (ata) instance 3 irq 0xf vector 0x45 ioapic 0x2 intin 0xf is bound to cpu 1 I just disabled it (zfs set com.sun\:auto-snapshot=false tank, correct?), will see if the log messages disappear. Did the filesystem kill off some snapshots or something in an effort to free up space? Probably. zfs list -t all to see all the snapshots as well. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What is the 1000 bit?
On 19 October, 2010 - Linder, Doug sent me these 1,2K bytes: Nicolas Williams [mailto:nicolas.willi...@oracle.com] wrote: It's the sticky bit. Nowadays it's only useful on directories, and really it's generally only used with 777 permissions. The chmod(1) Thanks. It doesn't seem harmful. But it does make me wonder why it's showing up on my newly-created zpool. I literally created the pool with one command, created a file (mkfile) with the second command, and did an ls with the third. I can't imagine how I could have done anything to set that bit. Is this a ZFS weirdness? It's mkfile. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
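[Editor's demonstration of the "1000 bit" being discussed: mode 1644 is the sticky bit plus rw-r--r--. Solaris mkfile(1M) historically created its files with the sticky bit set, which is why it showed up here; below it is set by hand with GNU tools.]

```shell
f=$(mktemp)
chmod 1644 "$f"
ls -l "$f"        # permission string ends in 'T' (sticky set, not executable)
stat -c %a "$f"   # prints 1644 (GNU stat; octal mode including special bits)
rm -f "$f"
```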
Re: [zfs-discuss] Moving camp, lock stock and barrel
On 11 October, 2010 - Harry Putnam sent me these 0,5K bytes: Harry Putnam rea...@newsguy.com writes: [...] I can't get X up ... it just went to a black screen, after seeing the main login screen, logging in to consol and calling: WHOOPS, omitted some information here... Calling: `startx /usr/bin/dbus-launch --exit-with-session gnome-session' from console. Which is how I've been starting X for some time. This thread started out way off-topic from ZFS discuss (the filesystem) and has continued off course. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Finding corrupted files
On 06 October, 2010 - Stephan Budach sent me these 2,1K bytes: Hi, I recently discovered some - or at least one - corrupted file on one of my ZFS datasets, which caused an I/O error when trying to send a ZFS snapshot to another host:

zpool status -v obelixData
  pool: obelixData
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME                 STATE     READ WRITE CKSUM
        obelixData           ONLINE       4     0     0
          c4t21D023038FA8d0  ONLINE       0     0     0
          c4t21D02305FF42d0  ONLINE       4     0     0

errors: Permanent errors have been detected in the following files:

0x949:0x12b9b9
obelixData/jvmprepr...@2010-10-02_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
obelixData/jvmprepr...@backupsnapshot_2010-10-05-08:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
obelixData/jvmprepr...@2010-09-24_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor 6_210/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
/obelixData/JvMpreprint/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps

Now, a scrub would reveal corrupted blocks on the devices, but is there a way to identify damaged files as well? Is this a trick question or something? The filenames are right over your question...?
/Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver that never finishes
On 19 September, 2010 - Markus Kovero sent me these 0,5K bytes: Hi, The drives and the chassis are fine; what I am questioning is how can it be resilvering more data to a device than the capacity of the device? If data on the pool has changed during the resilver, the resilver counter will not update accordingly, and it will show resilvering 100% for the time needed to catch up. I believe this was fixed recently, by displaying how many blocks it has checked vs how many to check... /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS online device management
On 11 September, 2010 - besson3c sent me these 0,6K bytes: Hello, I found in the release notes for Solaris 10 9/10: Oracle Solaris ZFS online device management, which allows customers to make changes to filesystem configurations without taking data offline. Can somebody kindly clarify what sort of filesystem configuration changes can be made this way? See below. Does this include, say, changing a 6 disk RAID-Z to two 3 disk RAID-Z sets striped? Nope. You can add/remove mirrors of disks online (assuming you start with a non-raidz vdev). You can expand a pool by adding more vdevs. You can not transform a raidz from one form to another. You can not remove a vdev. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] performance leakage when copy huge data
On 08 September, 2010 - Fei Xu sent me these 5,9K bytes: I dug deeper into it and might have found some useful information. I attached an X25 SSD for ZIL to see if it helps, but no luck. I ran iostat -xnz for more details and got interesting results, as below (maybe too long). Some explanation: 1. c2d0 is the SSD for ZIL 2. c0t3d0, c0t20d0, c0t21d0, c0t22d0 is the source pool. ...

                    extended device statistics
   r/s   w/s    kr/s    kw/s wait actv wsvc_t  asvc_t %w %b device
   0.3   0.0     1.2     0.0  0.0  0.0    0.0     0.1  0  0 c2d0
   0.1  17.7     0.1    51.7  0.0  0.1    0.2     4.1  0  7 c3d0
   0.1   2.1     0.0    79.8  0.0  0.0    0.1     4.0  0  0 c0t2d0
   0.2   0.0     7.1     0.0  0.1  2.3  278.5 11365.1  1 46 c0t3d0

Service time here is crap. 11 seconds to reply.

   0.1   2.2     0.0    79.9  0.0  0.0    0.1     3.7  0  0 c0t5d0
   0.1   2.3     0.0    80.0  0.0  0.0    0.1     9.2  0  0 c0t6d0
   0.1   2.5     0.0    80.1  0.0  0.0    0.1     3.8  0  0 c0t10d0
   0.1   2.4     0.0    80.0  0.0  0.0    0.1     9.5  0  0 c0t11d0
   1.9   0.0   133.0     0.0  0.1  2.8   60.2  1520.6  2 51 c0t20d0

1.5 seconds to reply. crap.

                    extended device statistics
   r/s   w/s    kr/s    kw/s wait actv wsvc_t  asvc_t %w %b device
   ...
   0.7   0.0    39.1     0.0  0.0  0.6   64.0   884.1  1 10 c0t3d0
   ...
   2.1   0.0   135.8     0.0  0.1  5.2   67.8  2498.1  3 88 c0t21d0
   ...
                    extended device statistics
   r/s   w/s    kr/s    kw/s wait actv wsvc_t  asvc_t %w %b device
   ...
   3.5   0.0   246.8     0.0  0.0  0.8    6.3   229.8  1 20 c0t3d0
   ...
   0.7   0.0    29.2     0.0  0.0  0.6    0.0   911.0  0 12 c0t21d0
   1.9   0.0   138.7     0.0  0.1  4.7   73.0  2428.6  2 66 c0t22d0
   ...

Service times here are crap. Disks are malfunctioning in some way. If your source disks can take seconds (or 10+ seconds) to reply, then of course your copy will be slow. The disk is probably having a hard time reading the data or something. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris 10u9
On 08 September, 2010 - Edward Ned Harvey sent me these 0,6K bytes: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of David Magda The 9/10 Update appears to have been released. Some of the more noticeable ZFS stuff that made it in: More at: http://docs.sun.com/app/docs/doc/821-1840/gijtg Awesome! Thank you. :-) Log device removal in particular, I feel is very important. (Got bit by that one.) Now when is dedup going to be ready? ;-) It's not in U9 at least:

...
16 stmf property support
17 Triple-parity RAID-Z
18 Snapshot user holds
19 Log device removal
20 Compression using zle (zero-length encoding)
21 Reserved
22 Received properties
...

scratchy:~# zfs create -o dedup=on kaka/kex
cannot create 'kaka/kex': 'dedup' is readonly
scratchy:~# zfs set dedup=on kaka
cannot set property for 'kaka': 'dedup' is readonly

/Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs lists discrepancy after added a new vdev to pool
On 27 August, 2010 - Darin Perusich sent me these 2,1K bytes: Hello All, I'm sure this has been discussed previously but I haven't been able to find an answer to this. I've added another raidz1 vdev to an existing storage pool and the increased available storage isn't reflected in the 'zfs list' output. Why is this? The system in question is running Solaris 10 5/09 s10s_u7wos_08, kernel Generic_139555-08. The system does not have the latest patches, which might be the cure. Thanks! Here's what I'm seeing. zpool create datapool raidz1 c1t50060E800042AA70d0 c1t50060E800042AA70d1 Just FYI, this is an inefficient variant of a mirror: more CPU required and lower performance. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Directory tree renaming -- disk usage
On 09 August, 2010 - David Dyer-Bennet sent me these 1,2K bytes: If I have a directory with a bazillion files in it (or, let's say, a directory subtree full of raw camera images, about 15MB each, totalling say 50GB) on a ZFS filesystem, and take daily snapshots of it (without altering it), the snapshots use almost no extra space, I know. If I now rename that directory, and take another snapshot, what happens? Do I get two copies of the unchanged data now, or does everything still reference the same original data (file content)? Seems like the new directory tree contains the same old files, same inodes and so forth, so it shouldn't be duplicating the data as I understand it; is that correct? The files haven't changed, unless you rename the directory by creating a new one, copying stuff over and removing the old. The only change is the name of the directory. This would, obviously, be fairly easy to test; and, if I removed the snapshots afterward, wouldn't take space permanently (have to make sure that the scheduler doesn't do one of my permanent snapshots during the test). But I'm interested in the theoretical answer in any case. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
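[Editor's demonstration of the point above: a rename leaves the files' inode numbers untouched, which is why snapshots keep sharing the data blocks. GNU stat is used here; on Solaris, ls -i would show the same thing. Paths are invented.]

```shell
d=$(mktemp -d)
mkdir "$d/photos"
touch "$d/photos/img001.raw"
before=$(stat -c %i "$d/photos/img001.raw")
mv "$d/photos" "$d/photos-2010"               # rename the directory
after=$(stat -c %i "$d/photos-2010/img001.raw")
[ "$before" = "$after" ] && echo "same inode"  # prints "same inode"
rm -rf "$d"
```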
Re: [zfs-discuss] vdev using more space
On 04 August, 2010 - Karl Rossing sent me these 5,4K bytes: Hi, We have a server running b134. The server runs xen and uses a vdev as the storage. The xen image is running nevada 134. I took a snapshot last night to move the xen image to another server.

NAME                            USED  AVAIL  REFER  MOUNTPOINT
vpool/host/snv_130             32.8G  11.3G  37.7G  -
vpool/host/snv_...@2010-03-31  3.27G      -  13.8G  -
vpool/host/snv_...@2010-08-03   436M      -  37.7G  -

It's also worth noting that vpool/host/snv_130 is a clone of at least two other snapshots. I then did a zfs send of vpool/host/snv_...@2010-08-03 and got a 39GB file. A zfs send of vpool/host/snv_...@2010-03-31 gave a file of 15GB. This is probably data + metadata or similar. I don't understand why the file is 39GB, since df -h inside of the xen image drive vpool/host/snv_130 shows:

Filesystem           size  used  avail  capacity  Mounted on
rpool/ROOT/snv_130    39G   12G    22G       35%  /

It would be nice if the zfs send file would be roughly the same size as the space used inside of the xen machine. The filesystem on the inside might have touched all the blocks, but not informed the outer ZFS (because it can't) that some blocks are freed. One way of making it smaller is to enable compression on the outer zvol, disable compression on the inner filesystem and then fill the inner filesystem with null bytes (dd if=/dev/zero of=file bs=1024k) and remove that file, then remove compression (if you want). This is just a temporary fix: as the filesystem is used on the inside (with copy-on-write), the outer one will grow back again. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
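[Editor's sketch of the zero-fill trick described above. The dataset name comes from the post; the temp-file path inside the guest is invented. Compression on the outer zvol collapses the zeroed blocks to (almost) nothing.]

```shell
# On the host: compress the outer zvol
zfs set compression=on vpool/host/snv_130

# Inside the xen guest: overwrite all free space with zeros, then free it
dd if=/dev/zero of=/var/tmp/zero bs=1024k   # runs until the disk is full
rm /var/tmp/zero

# Back on the host, optionally:
zfs set compression=off vpool/host/snv_130
```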
Re: [zfs-discuss] When is the L2ARC refreshed if on a separate drive?
On 03 August, 2010 - valrh...@gmail.com sent me these 1,2K bytes: I'm running a mirrored pair of 2 TB SATA drives as my data storage drives on my home workstation, a Core i7-based machine with 10 GB of RAM. I recently added a sandforce-based 60 GB SSD (OCZ Vertex 2, NOT the pro version) as an L2ARC to the single mirrored pair. I'm running B134, with ZFS pool version 22, with dedup enabled. If I understand correctly, the dedup table should be in the L2ARC on the SSD, and I should have enough RAM to keep the references to that table in memory, and that this is therefore a well-performing solution. My question is what happens at power off. Does the cache device essentially get cleared, and the machine has to rebuild it when it boots? Or is it persistent? That is, should performance improve after a little while following a reboot, or is it always constant once it builds the L2ARC once? L2ARC is currently cleared at boot. There is an RFE to make it persistent. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] never ending resilver
On 05 July, 2010 - Roy Sigurd Karlsbakk sent me these 1,9K bytes: - Original Message - If you have one zpool consisting of only one large raidz2, then you have a slow raid. To reach high speed, you need maximum 8 drives in each raidz2. So one of the reasons it takes time, is because you have too many drives in your raidz2. Everything would be much faster if you split your zpool into two raidz2, each consisting of 7 or 8 drives. Then it would be fast. Keeping the VDEVs small is one thing, but this is about resilvering spending far more time than reported. The same applies to scrubbing at times. Would it be hard to rewrite the reporting mechanisms in ZFS to report something more likely, than just a first guess? ZFS scrub reports tremendous times at start, but slows down after it's worked its way through the metadata. What ZFS is doing when the system still scrubs after 100 hours at 100% is beyond my knowledge. I believe it's something like this:
* When starting, it notes the number of blocks to visit
* .. visiting blocks ...
* .. adding more data (which then will be beyond the original 100%) .. and visiting blocks ...
* .. reaching the initial last block, which since then has gotten lots of new friends afterwards.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6899970 /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool status output confusion
On 27 May, 2010 - Per Jorgensen sent me these 1,0K bytes: I get the following output when I run a zpool status, but I am a little confused about why c9t8d0 is more left-aligned than the rest of the disks in the pool. What does it mean? Because someone forced it in without redundancy (or created it as such). Your pool is bad, as c9t8d0 is without redundancy. If it fails, your pool is toast. zpool history should be able to tell when it happened, at least.

$ zpool status blmpool
  pool: blmpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        blmpool     ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c9t0d0  ONLINE       0     0     0
            c9t1d0  ONLINE       0     0     0
            c9t3d0  ONLINE       0     0     0
            c9t4d0  ONLINE       0     0     0
            c9t5d0  ONLINE       0     0     0
            c9t6d0  ONLINE       0     0     0
            c9t7d0  ONLINE       0     0     0
          c9t8d0    ONLINE       0     0     0

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] creating a fast ZIL device for $200
On 26 May, 2010 - sensille sent me these 4,5K bytes: Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of sensille The basic idea: the main problem when using a HDD as a ZIL device are the cache flushes in combination with the linear write pattern of the ZIL. This leads to a whole rotation of the platter after each write, because after the first write returns, the head is already past the sector that will be written next. My idea goes as follows: don't write linearly. Track the rotation and write to the position the head will hit next. This might be done by a re-mapping layer or integrated into ZFS. This works only because ZIL device are basically write-only. Reads from this device will be horribly slow. The reason why hard drives are less effective as ZIL dedicated log devices compared to such things as SSD's, is because of the rotation of the hard drives; the physical time to seek a random block. There may be a possibility to use hard drives as dedicated log devices, cheaper than SSD's with possibly comparable latency, if you can intelligently eliminate the random seek. If you have a way to tell the hard drive Write this data, to whatever block happens to be available at minimum seek time. Thanks for rephrasing my idea :) The only thing I'd like to point out is that ZFS doesn't do random writes on a slog, but nearly linear writes. This might even be hurting performance more than random writes, because you always hit the worst case of one full rotation. A simple test would be to change write block X write block X+1 write block X+2 into write block X write block X+4 write block X+8 or something, so it might manage to send the command before the head has travelled over to block X+4 etc.. I guess basically, you want to do something like TCQ/NCQ, but without the Q.. placing writes optimally.. 
So you believe you can know the drive geometry, the instantaneous head position, and the next available physical block address in software? No need for special hardware? That's cool. I hope there aren't any gotchas as-yet undiscovered. Yes, I already did a mapping of several drives. I measured at least the track length, the interleave needed between two writes and the interleave if a track-to-track seek is involved. Of course you can always learn more about a disk, but that's a good starting point. Since X, X+1, X+2 seems to be the worst case, try just skipping over a few blocks.. Double (or so) the performance for a single software tweak would surely be welcome. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
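[Editor's back-of-envelope numbers for the full-rotation penalty discussed in this thread: if every ZIL commit waits one complete revolution, the per-write latency floor is 60000/rpm milliseconds. Spindle speeds are the common ones, not specific to any drive in the thread.]

```shell
# Worst-case added latency per sync write: one full platter rotation
for rpm in 5400 7200 15000; do
  awk -v rpm="$rpm" 'BEGIN { printf "%5d rpm: %.2f ms per rotation\n", rpm, 60000/rpm }'
done
```

At 7200 rpm that is 8.33 ms per commit, i.e. roughly 120 sync writes/s from a linear slog; which is exactly why shaving most of the rotation by skipping ahead a few sectors could plausibly multiply slog throughput.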
Re: [zfs-discuss] don't mount a zpool on boot
On 20 May, 2010 - John Andrunas sent me these 0,3K bytes: Can I make a pool not mount on boot? I seem to recall reading somewhere how to do it, but can't seem to find it now. zpool export thatpool zpool import thatpool when you want it back. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
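[Editor's sketch expanding on the export/import answer above; the pool name comes from the post's suggestion. The cachefile=none import is an alternative on builds with the cachefile pool property: the pool stays imported now but is not auto-imported at the next boot.]

```shell
zpool export thatpool                     # pool is gone until imported again
zpool import thatpool                     # bring it back by hand

# Alternative: import without recording it in the cache file
zpool import -o cachefile=none thatpool   # usable now, forgotten at reboot
```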
Re: [zfs-discuss] Very serious performance degradation
On 18 May, 2010 - Philippe sent me these 6,0K bytes: Hi, The 4 disks are Western Digital ATA 1TB (one is slightly different): 1 x ATA-WDC WD10EACS-00D-1A01-931.51GB 3 x ATA-WDC WD10EARS-00Y-0A80-931.51GB I've done lots of tests (speed tests + SMART reports) with each of these 4 disks on another system (another computer, running Windows 2003 x64), and everything was fine! The 4 disks operate well, at 50-100 MB/s (tested with HDTune). And the access time: 14ms. The controller is an LSI Logic SAS 1068-IR (MPT BIOS 6.12.00.00 - 31/10/2006). Here are some stats: 1) cp of a big file to a ZFS filesystem (128K recordsize): = iostat -x 30

                 extended device statistics
device    r/s   w/s    kr/s    kw/s wait actv  svc_t %w %b
sd0       0.0   0.0     0.0     0.0  0.0  0.0    0.0  0  0
sd1       0.3   0.3    17.6     2.3  0.0  0.0   19.5  0  0
sd2      11.5   6.0   350.1   154.5  0.0  0.3   19.5  0  4
sd3      12.5   5.7   351.4   154.5  0.0  0.5   27.1  0  5
sd4      15.9   6.3   615.1   153.8  0.0  1.3   58.2  0  8
sd5      15.1   8.1   600.4   150.7  0.0  7.6  326.7  0 31

                 extended device statistics
device    r/s   w/s    kr/s    kw/s wait actv  svc_t %w %b
sd0       0.0   0.0     0.0     0.0  0.0  0.0    0.0  0  0
sd1      41.3   0.0  5289.7     0.0  0.0  1.3   31.0  0  4
sd2       4.2  24.1   214.0  1183.0  0.0  0.5   19.4  0  4
sd3       3.7  23.6   227.2  1183.0  0.0  2.1   78.5  0 12
sd4       6.6  26.4   374.2  1179.4  0.0 10.1  306.5  0 35
sd5       4.3  31.0   369.6   973.3  0.0 22.0  622.0  0 96

                 extended device statistics
device    r/s   w/s    kr/s    kw/s wait actv  svc_t %w %b
sd0       0.0   0.0     0.0     0.0  0.0  0.0    0.0  0  0
sd1      17.1   0.0  2184.6     0.0  0.0  0.5   30.6  0  2
sd2       1.6  12.3   116.4   570.9  0.0  0.6   41.3  0  3
sd3       1.6  12.1   107.6   570.9  0.0 10.3  754.7  0 33
sd4       2.1  12.6   187.1   569.4  0.0  9.4  634.7  0 28
sd5       0.4  21.7    25.6   700.6  0.0 29.5 1338.1  0 96

Umm.. The service times of sd3..5 are waay too high to be good working disks. 21 writes shouldn't take 1.3 seconds. Some of your disks are not feeling well, possibly doing block reallocation like mad all the time, or block recovery of some form. Service times should be closer to what sd1 and sd2 are doing. sd2, 3 and 4 seem to be getting about the same amount of read+write, but their service times are 15-20 times higher.
This will lead to crap performance (and probably broken array in a while). /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Using WD Green drives?
On 17 May, 2010 - Dan Pritts sent me these 1,6K bytes: On Thu, May 13, 2010 at 06:09:55PM +0200, Roy Sigurd Karlsbakk wrote: 1. even though they're 5900, not 7200, benchmarks I've seen show they are quite good Minor correction: they are 5400rpm. Seagate makes some 5900rpm drives. The green drives have a reasonable raw throughput rate, due to the extremely high platter density nowadays. However, due to their low spin speed, their average access time is significantly slower than 7200rpm drives. For bulk archive data containing large files, this is less of a concern. Regarding slow resilvering times, in the absence of other disk activity, I think that should really be limited by the throughput rate, not the relatively slow random I/O performance... again assuming large files (and low fragmentation, which if the archive is write-and-never-delete is what I'd expect). One test I saw suggests 60MB/sec avg throughput on the 2TB drives. That works out to 9.25 hours to read the entire 2TB. At a conservative 50MB/sec it's 11 hours. This assumes that you have enough I/O bandwidth and CPU on the system to saturate all your disks. If there's other disk activity during a resilver, though, it turns into random I/O. Which is slow on these drives. Resilver does a whole lot of random I/O itself; it reads the filesystem tree, not block 0, block 1, block 2... You won't get 60MB/s sustained, not even close. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
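[Editor's check of the arithmetic quoted above: reading 2TB (taken as 2,000,000 MB) sequentially at 60MB/s and at the conservative 50MB/s.]

```shell
awk 'BEGIN {
  mb = 2e6                                       # 2TB expressed in MB
  printf "60MB/s: %.1f hours\n", mb/60/3600      # prints 9.3 hours
  printf "50MB/s: %.1f hours\n", mb/50/3600      # prints 11.1 hours
}'
```

This matches the post's 9.25/11-hour figures; as the reply notes, it is a best case that a resilver's random I/O will not reach.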
Re: [zfs-discuss] Using WD Green drives?
On 13 May, 2010 - Roy Sigurd Karlsbakk sent me these 2,9K bytes: - Brian broco...@vt.edu wrote: (1) They seem to have a firmware setting (that may not be modifiable depending on revision) that has to do with the drive parking the heads after 8 seconds of inactivity to save power. These drives are rated for a certain number of park/unpark operations -- I think 300,000. Using these drives in a NAS results in a lot of park/unpark. 8 seconds? Is it really that low? Yes. My disk went through 180k in like 2-3 months.. Then I told smartd to poll the disk every 5 seconds to prevent it from falling asleep. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
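[Editor's sketch of the keep-the-drive-awake workaround described above. The device path is invented; smartd's -i flag sets its polling interval in seconds, and smartctl's -n never makes it query the drive regardless of power state. Checking the Load_Cycle_Count attribute shows how fast the parks are accumulating.]

```shell
# Poll all configured disks every 5 seconds so the idle timer never fires
smartd -i 5

# Or a crude keep-alive loop without smartd:
while sleep 5; do smartctl -n never -i /dev/rdsk/c0t0d0 >/dev/null; done &

# Watch the park counter to confirm it has stopped climbing
smartctl -A /dev/rdsk/c0t0d0 | grep Load_Cycle_Count
```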
Re: [zfs-discuss] Problems (bug?) with slow bulk ZFS filesystem creation
On 10 May, 2010 - charles sent me these 0,8K bytes: Hi, This thread refers to Solaris 10, but it was suggested that I post it here as ZFS developers may well be more likely to respond. http://forums.sun.com/thread.jspa?threadID=5438393messageID=10986502#10986502 Basically, after about 1000 ZFS filesystem creations, the creation time slows down to around 4 seconds, and gets progressively worse. This is not the case for a normal mkdir, which creates thousands of directories very quickly. I wanted users' home directories (60,000 of them) all to be individual ZFS filesystems, but there seems to be a bug/limitation due to the prohibitive creation time. If you're going to share them over NFS, you'll be looking at even worse times. In my experience, you don't want to go over 1-2k filesystems due to various scalability problems, especially if you're doing NFS as well. It will be slow to create and slow when (re)booting, but other than that it might be ok.. Look into the ZFS userquota/groupquota instead. That's what I did, and it's partly because of these issues that userquota/groupquota got implemented, I guess. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
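[Editor's sketch of the userquota route suggested above: one filesystem for all homes, with per-user quotas. Pool/filesystem and user names are invented; this needs a ZFS version with userquota support (Solaris 10 10/09 or later).]

```shell
zfs create tank/home                    # one filesystem, not 60,000
zfs set userquota@alice=10G tank/home   # per-user limit
zfs get userquota@alice tank/home       # confirm the setting
zfs userspace tank/home                 # per-user usage report
```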
Re: [zfs-discuss] Loss of L2ARC SSD Behaviour
On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes: On Wed, 5 May 2010, Edward Ned Harvey wrote: In the L2ARC (cache) there is no ability to mirror, because cache device removal has always been supported. You can't mirror a cache device, because you don't need it. How do you know that I don't need it? The ability seems useful to me. The gain is quite minimal.. If the first device fails (which doesn't happen too often I hope), then it will be read from the normal pool once and then stored in ARC/L2ARC again. It just behaves like a cache miss for that specific block... If this happens often enough to become a performance problem, then you should throw away that L2ARC device because it's broken beyond usability. /Tomas
Re: [zfs-discuss] Loss of L2ARC SSD Behaviour
On 05 May, 2010 - Michael Sullivan sent me these 0,9K bytes: Hi, I have a question I cannot seem to find an answer to. I know I can set up a stripe of L2ARC SSD's with, say, 4 SSD's. I know if I set up ZIL on SSD and the SSD goes bad, then the ZIL will be relocated back to the pool. I'd probably have it mirrored anyway, just in case. However you cannot mirror the L2ARC, so... Given a new enough OpenSolaris.. Otherwise, your pool is screwed iirc. What I want to know is what happens if one of those SSD's goes bad? What happens to the L2ARC? Is it just taken offline, or will it continue to perform even with one drive missing? L2ARC is a pure cache thing; if it gives bad data (checksum error), it will be ignored, and if you yank it, it will be ignored. It's very safe to have crap hardware there (as long as it doesn't start messing up some bus or similar). Cache devices can be added/removed at any time as well. /Tomas
Re: [zfs-discuss] Performance drop during scrub?
On 29 April, 2010 - Tomas Ögren sent me these 5,8K bytes: On 29 April, 2010 - Roy Sigurd Karlsbakk sent me these 10K bytes: I got this hint from Richard Elling, but haven't had time to test it much. Perhaps someone else could help? roy Interesting. If you'd like to experiment, you can change the limit of the number of scrub I/Os queued to each vdev. The default is 10, but that is too close to the normal limit. You can see the current scrub limit via:

# echo zfs_scrub_limit/D | mdb -k
zfs_scrub_limit:
zfs_scrub_limit:        10

you can change it with:

# echo zfs_scrub_limit/W0t2 | mdb -kw
zfs_scrub_limit:        0xa = 0x2
# echo zfs_scrub_limit/D | mdb -k
zfs_scrub_limit:
zfs_scrub_limit:        2

In theory, this should help your scenario, but I do not believe this has been exhaustively tested in the lab. Hopefully, it will help. -- richard If I'm reading the code right, it's only used when creating a new vdev (import, zpool create, maybe at boot).. So I took an alternate route: http://pastebin.com/hcYtQcJH (spa_scrub_maxinflight used to be 0x46 (70 decimal) due to 7 devices * zfs_scrub_limit(10) = 70..) With these lower numbers, our pool is much more responsive over NFS.. But taking snapshots is quite bad.. A single recursive snapshot over ~800 filesystems took about 45 minutes, with NFS operations taking 5-10 seconds.. Snapshots usually take 10-30 seconds..

scrub: scrub in progress for 0h40m, 0.10% done, 697h29m to go
scrub: scrub in progress for 1h41m, 2.10% done, 78h35m to go

This is chugging along.. The server is a Fujitsu RX300 with a Quad Xeon 1.6GHz, 6G ram, 8x400G SATA through a U320SCSI-SATA box - Infortrend A08U-G1410, Sol10u8. Should have enough oompf, but when you combine a snapshot with a scrub/resilver, sync performance gets abysmal.. Should probably try adding a ZIL when u9 comes, so we can remove it again if performance goes crap.
/Tomas
Re: [zfs-discuss] Performance drop during scrub?
On 29 April, 2010 - Richard Elling sent me these 2,5K bytes: With these lower numbers, our pool is much more responsive over NFS.. But taking snapshots is quite bad.. A single recursive snapshot over ~800 filesystems took about 45 minutes, with NFS operations taking 5-10 seconds.. Snapshots usually take 10-30 seconds.. scrub: scrub in progress for 0h40m, 0.10% done, 697h29m to go scrub: scrub in progress for 1h41m, 2.10% done, 78h35m to go This is chugging along.. The server is a Fujitsu RX300 with a Quad Xeon 1.6GHz, 6G ram, 8x400G SATA through a U320SCSI-SATA box - Infortrend A08U-G1410, Sol10u8. slow disks == poor performance I know they're not fast, but it shouldn't take 10-30 seconds to create a directory. They do perfectly well in all combinations, except when a scrub comes along (or sometimes when a snapshot feels like taking 45 minutes instead of 4.5 seconds). iostat says the disks aren't 100% busy, the storage box itself doesn't seem to be busy, yet with zfs they go downhill in some conditions.. Should have enough oompf, but when you combine a snapshot with a scrub/resilver, sync performance gets abysmal.. Should probably try adding a ZIL when u9 comes, so we can remove it again if performance goes crap. A separate log will not help. Try faster disks. /Tomas
Re: [zfs-discuss] Question about du and compression
On 29 April, 2010 - Roy Sigurd Karlsbakk sent me these 1,2K bytes: Hi all Is there a good way to do a du that tells me how much data is there in case I want to move it to, say, a USB drive? Most filesystems don't have compression, but we're using it on (most of) our zfs filesystems, and it can be troublesome for someone who wants to copy a set of data somewhere to find it's twice as big as reported by du. GNU du has --apparent-size which reports the file size instead of how much disk space it uses.. compression and sparse files will make this differ, and you can't really tell them apart. /Tomas
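The difference between `du` and `du --apparent-size` is easy to see with a sparse file, which under-reports on disk the same way a compressed ZFS dataset does (a small demonstration; any GNU userland will do):

```shell
# A 10 MiB sparse file: large apparent size, almost no blocks allocated.
tmpdir=$(mktemp -d)
truncate -s 10M "$tmpdir/sparse"

# Apparent (logical) size vs actual disk usage, both in KiB.
apparent=$(du --apparent-size -k "$tmpdir/sparse" | cut -f1)
ondisk=$(du -k "$tmpdir/sparse" | cut -f1)

echo "apparent=${apparent}K on-disk=${ondisk}K"
# The apparent size stays 10240K regardless of how few blocks the
# filesystem actually allocated -- that is the number you want when
# sizing a destination without compression.

rm -r "$tmpdir"
```

For a compressed dataset the relationship is inverted (on-disk smaller than apparent), but `--apparent-size` is still the right number for "how big will this be after copying elsewhere".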
Re: [zfs-discuss] Performance drop during scrub?
On 28 April, 2010 - Eric D. Mudama sent me these 1,6K bytes: On Wed, Apr 28 at 1:34, Tonmaus wrote: Zfs scrub needs to access all written data on all disks and is usually disk-seek or disk I/O bound so it is difficult to keep it from hogging the disk resources. A pool based on mirror devices will behave much more nicely while being scrubbed than one based on RAIDz2. Experience seconded entirely. I'd like to repeat that I think we need more efficient load balancing functions in order to keep housekeeping payload manageable. Detrimental side effects of scrub should not be a decision point for choosing certain hardware or redundancy concepts in my opinion. While there may be some possible optimizations, I'm sure everyone would love the random performance of mirror vdevs, combined with the redundancy of raidz3 and the space of a raidz1. However, as in all systems, there are tradeoffs. To scrub a long lived, full pool, you must read essentially every sector on every component device, and if you're going to do it in the order in which your transactions occurred, it'll wind up devolving to random IO eventually. You can choose to bias your workloads so that foreground IO takes priority over scrub, but then you've got the cases where people complain that their scrub takes too long. There may be knobs for individuals to use, but I don't think overall there's a magic answer. We have one system with a raidz2 of 8 SATA disks.. If we start a scrub, then you can kiss any NFS performance goodbye.. A single mkdir or creating a file can take 30 seconds.. Single write()s can take 5-30 seconds.. Without the scrub, it's perfectly fine. Local performance during scrub is fine. NFS performance becomes useless. This means we can't do a scrub, because doing so will basically disable the NFS service for a day or three. If the scrub were less aggressive and took a week to perform, it would probably not kill the performance as badly..
/Tomas
Re: [zfs-discuss] Mac OS X clients with ZFS server
On 22 April, 2010 - Rich Teer sent me these 1,1K bytes: Hi all, I have a server running SXCE b130 and I use ZFS for all file systems. I also have a couple of workstations running the same OS, and all is well. But I also have a MacBook Pro laptop running Snow Leopard (OS X 10.6.3), and I have troubles creating files on exported ZFS file systems. From the laptop, I can read and write existing files on the exported ZFS file systems just fine, but I can't create new ones. My understanding is that Mac OS makes extensive use of file attributes so I was wondering if this might be the cause of the problem (I know ZFS supports file attributes, but I wonder if I have to utter some magic incantation to get them working properly with Mac OS). I've noticed some issues with copying files to an smb share from Mac OS X clients over the last week.. haven't had time to investigate it fully, but it sure seems EA related.. Copying a file from smb to smb (via the client) works as long as the file hasn't gotten any EA yet.. If I for instance set the 'hide file extension' attribute, then it's not working anymore. Enabling EA on an existing file works, but creating a file with EA doesn't.. So it seems like a Finder bug.. Copying via terminal (and cp) works. At the moment I have a workaround: I use sftp to copy the files from the laptop to the server. But this is a pain in the ass and I'm sure there's a way to make this just work properly! /Tomas
Re: [zfs-discuss] Identifying what zpools are exported
On 21 April, 2010 - Justin Lee Ewing sent me these 0,3K bytes: So I can obviously see what zpools I have imported... but how do I see pools that have been exported? Kind of like being able to see deported volumes using vxdisk -o alldgs list. 'zpool import' /Tomas
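Run with no pool argument, `zpool import` only scans and lists importable (e.g. exported) pools without importing anything:

```shell
# List pools that are available for import but not currently imported.
zpool import

# If the disks live somewhere other than the default device directory,
# point the scan at it explicitly:
zpool import -d /dev/dsk
```

Adding a pool name (`zpool import tank`) would actually perform the import, so the bare form is the vxdisk-style listing the poster asked for.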
Re: [zfs-discuss] Secure delete?
On 12 April, 2010 - Bob Friesenhahn sent me these 0,9K bytes: On Sun, 11 Apr 2010, James Van Artsdalen wrote: OpenSolaris needs support for the TRIM command for SSDs. This command is issued to an SSD to indicate that a block is no longer in use and the SSD may erase it in preparation for future writes. There does not seem to be very much `need' since there are other ways that a SSD can know that a block is no longer in use so it can be erased. In fact, ZFS already uses an algorithm (COW) which is friendly for SSDs. Zfs is designed for high throughput, and TRIM does not seem to improve throughput. Perhaps it is most useful for low-grade devices like USB dongles and compact flash. For flash to overwrite a block, it needs to clear it first.. so yes, by clearing it out in the background (right after the block is freed) instead of just before the timing-critical write(), you can make stuff go faster. /Tomas
Re: [zfs-discuss] Secure delete?
On 12 April, 2010 - David Magda sent me these 0,7K bytes: On Mon, April 12, 2010 10:48, Tomas Ögren wrote: On 12 April, 2010 - Bob Friesenhahn sent me these 0,9K bytes: Zfs is designed for high throughput, and TRIM does not seem to improve throughput. Perhaps it is most useful for low-grade devices like USB dongles and compact flash. For flash to overwrite a block, it needs to clear it first.. so yes, clearing it out in the background (after the block is freed) instead of just before the timing-critical write(), you can make stuff go faster. Except that ZFS does not overwrite blocks because it is copy-on-write. So CoW will enable infinite storage, so you never have to write to the same place again? Cool. /Tomas
Re: [zfs-discuss] L2ARC L2_Size kstat fluctuate
On 09 April, 2010 - Abdullah Al-Dahlawi sent me these 27K bytes: Hi all I ran an OLTP-Filebench workload. I set ARC max size = 2 GB, L2ARC SSD device size = 32 GB, working set (dataset) = 10 GB, 10 files, 1 GB each. After running the workload for 6 hours and monitoring kstat, I have noticed that l2_size from kstat has reached 10 GB, which is great. However, l2_size then started to drop all the way to 7 GB, which means that the workload will go back to the HDD to retrieve some data that is no longer on the L2ARC device. I understand that the L2ARC size reflected by zpool iostat is much larger because of COW, and that l2_size from kstat is the actual size of L2ARC data. So can anyone tell me why I am losing my working set from l2_size actual data!!! Maybe the data in the L2ARC was invalidated, because the original data was rewritten? /Tomas
Re: [zfs-discuss] L2ARC L2_Size kstat fluctuate
On 09 April, 2010 - Abdullah Al-Dahlawi sent me these 5,3K bytes: Hi Tomas I understand from a previous post http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg36914.html that if the data gets invalidated, the L2ARC size that is shown by zpool iostat is the one that changes (always growing because of COW), not the actual size shown by kstat, which represents the size of the up-to-date data in L2ARC. My only conclusion for this fluctuation in kstat l2_size is that the data has indeed been invalidated and did not make it back to L2ARC from the tail of the ARC!!! Am I right? Sounds plausible. /Tomas
Re: [zfs-discuss] compression property not received
On 08 April, 2010 - Cindy Swearingen sent me these 2,6K bytes: Hi Daniel, D'oh... I found a related bug when I looked at this yesterday but I didn't think it was your problem because you didn't get a busy message. See this RFE: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6700597 Solaris 10 'man zfs', under 'receive': -u  File system that is associated with the received stream is not mounted. /Tomas
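A sketch of the `-u` flag mentioned above, with hypothetical dataset and snapshot names: receiving without mounting sidesteps mount-time interference with the received stream's properties.

```shell
# Send a snapshot and receive it unmounted on the destination
# (dataset names are made up for illustration).
zfs send tank/data@snap1 | zfs receive -u backup/data

# Properties such as compression arrive with the stream; inspect them
# before ever mounting the received filesystem.
zfs get compression backup/data
```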
Re: [zfs-discuss] L2ARC Workingset Size
On 08 April, 2010 - Abdullah Al-Dahlawi sent me these 12K bytes: Hi Richard Thanks for your comments. OK, ZFS is COW, I understand, but this also means a waste of valuable space on my L2ARC SSD device; more than 60% of the space is consumed by COW!!! I do not get it? The rest can and will be used if L2ARC needs it. It's not wasted, it's just a number that doesn't match what you think it should be. /Tomas
Re: [zfs-discuss] L2ARC Workingset Size
On 02 April, 2010 - Abdullah Al-Dahlawi sent me these 128K bytes: Hi all I ran a workload that reads/writes within 10 files, each file 256M, i.e. (10 * 256M = 2.5GB total Dataset Size). I have set the ARC max size to 1 GB in the /etc/system file. In the worst case, let us assume that the whole dataset is hot, meaning my working set size = 2.5GB. My SSD flash size = 8GB and is being used for L2ARC. No slog is used in the pool. My file system record size = 8K, meaning 2.5% of 8GB is used for the L2ARC directory in ARC, which ultimately means that available ARC is 1024M - 204.8M = 819.2M Available ARC (Am I Right?) Seems about right. Now the Question ... After running the workload for 75 minutes, I have noticed that the L2ARC device has grown to 6 GB !!! No, 6GB of the area has been touched by Copy on Write; not all of it is in use anymore though. What is in L2ARC beyond my 2.5GB working set?? Something else has been added to L2ARC [ snip lots of data ] This is your last one:

module: zfs  instance: 0  name: arcstats  class: misc
c               1073741824
c_max           1073741824
c_min            134217728
[...]
l2_size         2632226304
l2_write_bytes  6486009344
p                775528448

Roughly 6GB has been written to the device, and slightly less than 2.5GB is actually in use. /Tomas
Re: [zfs-discuss] Change a zpool to raidz
On 12 March, 2010 - Erik Trimble sent me these 0,7K bytes: Ian Garbutt wrote: I was wondering if there is any way of converting a zpool which only has one LUN in it to a raidz zpool that has 3 or more LUNs in it? Thanks No. Adding, removing, or otherwise changing disks in a RAIDZ is not possible without destroying data in the pool. You'll have to copy the data from the single LUN pool somewhere else, destroy the pool, then recreate it as a RAIDZ with the 3 LUNs. What you can do is:
1. Create a new raidz pool with lun2, lun3 and a sparse file the same size as lun2/lun3.
2. Get rid of the file.
3. Copy data over from lun1 (old single lun thing) to the degraded raidz (lun2, lun3, missing file).
4. Destroy the old pool.
5. Replace the missing file with lun1.
With this method, the pool is lacking redundancy between step 4 and 5, but it requires no extra space. /Tomas
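The degraded-raidz migration described above might look like this as commands. LUN names and sizes are hypothetical, and note again that the data has no redundancy between the destroy and the replace:

```shell
# Step 1: sparse file as a stand-in third member (size must match the LUNs).
mkfile -n 100g /tmp/fakelun
zpool create newpool raidz c1t2d0 c1t3d0 /tmp/fakelun

# Step 2: take the placeholder out; the raidz keeps running degraded.
zpool offline newpool /tmp/fakelun
rm /tmp/fakelun

# Step 3: copy the data across, e.g. with send/recv:
#   zfs snapshot -r oldpool@move
#   zfs send -R oldpool@move | zfs receive -d newpool

# Step 4: free up the original LUN.
zpool destroy oldpool

# Step 5: resilver the old LUN in place of the missing file.
zpool replace newpool /tmp/fakelun c1t1d0
```

The resilver in step 5 restores full raidz redundancy.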
Re: [zfs-discuss] full backup == scrub?
On 08 March, 2010 - Chris Banal sent me these 0,8K bytes: Assuming no snapshots, do full backups (i.e. tar or cpio) eliminate the need for a scrub? No, a backup won't read redundant copies of the data, which a scrub will. /Tomas
Re: [zfs-discuss] Snapshot recycle freezes system activity
On 08 March, 2010 - Miles Nordin sent me these 1,8K bytes: gm == Gary Mills mi...@cc.umanitoba.ca writes: gm destroys the oldest snapshots and creates new ones, both gm recursively. I'd be curious if you try taking the same snapshots non-recursively instead, does the pause go away? According to my testing, that would give you a much longer period of slightly slower operation, but a shorter period of per-filesystem really-slowness, given recursive snapshots over lots of independent filesystems. Because recursive snapshots are special: they're supposed to atomically synchronize the cut-point across all the filesystems involved, AIUI. I don't see that recursive destroys should be anything special though. From my experiences on a homedir file server with about 700 filesystems and ~65 snapshots on each, giving about 45k snapshots.. In the beginning, the snapshots took zero time to create.. Now when we have snapshots spanning over a year, it's not as fast. We then turned to only doing daily snapshots (for online backups in addition to regular backups), but they could take up to 45 minutes sometimes, with regular nfs work being abysmal. So we started tuning some stuff, and doing hourly snapshots actually helped (probably keeping some data structures warm in ARC). Down to 2-3 minutes or so for a recursive snapshot. So we tried adding 2x 4GB USB sticks (Kingston Data Traveller Mini Slim) as metadata L2ARC, and that seems to have pushed the snapshot times down to about 30 seconds. http://www.acc.umu.se/~stric/tmp/snaptimes.png y axis is mm:ss, so a value of 450 is 4 minutes, 50 seconds.. not all linear ;) x axis is just snapshot number, higher == newer.. Large spikes are snapshots at the same time as daily backups. In snapshots 67..100 in the picture, I removed the L2ARC USB sticks and the times increased and started fluctuating.. I'll give it a few days and put the L2ARC back.. Even cheap $10 USB sticks can help, it seems.
gm Is it destroying old snapshots or creating new ones that gm causes this dead time? sort of seems like you should tell us this, not the other way around. :) Seriously though, isn't that easy to test? And I'm curious myself too. /Tomas
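The metadata-only L2ARC experiment above boils down to two commands. Pool and device names are hypothetical; the `secondarycache` property controls what the cache devices are allowed to hold:

```shell
# Add two cheap USB sticks as cache (L2ARC) devices to the pool.
zpool add tank cache c5t0d0 c6t0d0

# Restrict the L2ARC to metadata only, so the slow sticks never have to
# serve bulk file data -- just the structures that snapshot creation walks.
zfs set secondarycache=metadata tank

# Watch the cache devices fill up over time.
zpool iostat -v tank 10
```

Because L2ARC is a pure cache, the sticks can be yanked again at any time with `zpool remove` without risking pool data.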
Re: [zfs-discuss] Snapshot recycle freezes system activity
On 08 March, 2010 - Bill Sommerfeld sent me these 0,4K bytes: On 03/08/10 12:43, Tomas Ögren wrote: So we tried adding 2x 4GB USB sticks (Kingston Data Traveller Mini Slim) as metadata L2ARC and that seems to have pushed the snapshot times down to about 30 seconds. Out of curiosity, how much physical memory does this system have?

System Memory:
  Physical RAM:  6134 MB
  Free Memory:    190 MB
  LotsFree:        94 MB
ARC Size:
  Current Size:            1890 MB (arcsize)
  Target Size (Adaptive):  2910 MB (c)
  Min Size (Hard Limit):    638 MB (zfs_arc_min)
  Max Size (Hard Limit):   5110 MB (zfs_arc_max)
ARC Size Breakdown:
  Most Recently Used Cache Size:   67%  1959 MB (p)
  Most Frequently Used Cache Size: 32%   950 MB (c-p)

It does some mail server stuff as well. The two added USB sticks grew to about 3.2GB of metadata L2ARC, with about 6.5M files total on the system. /Tomas
Re: [zfs-discuss] Wildcards to zfs list
On 07 March, 2010 - David Dyer-Bennet sent me these 1,1K bytes: There isn't some syntax I'm missing to use wildcards in zfs list to list snapshots, is there? I find nothing in the man page, and nothing I've tried works (yes, I do understand that normally wildcards are expanded by the shell, and I don't expect bash to have zfs-specific stuff like that in it by default). Given that bash passes through wildcards that don't expand to anything (or you can always force it with quoting), zfs list *could* use those to filter the snapshot list; that would be convenient. In the meantime, I can split what the user enters at the @ and use grep to filter the output from zfs list. zfs list -t snapshot ? Add -o name -H if you only want the names.. (I'm running 2009.06, which is based on snv_111b, so if this capability has appeared since then in some form, I'd really like to know; I'll be updating to the next stable release.) -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info /Tomas
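The suggested snapshot listing plus a grep filter, in one line (dataset and snapshot names are hypothetical):

```shell
# All snapshot names, one per line, no header (-H) -- easy to filter.
zfs list -t snapshot -o name -H

# Poor man's wildcard: everything matching a pattern after the @.
zfs list -t snapshot -o name -H | grep '@daily-2010'
```

This is exactly the split-at-@-and-grep approach the poster describes, just without parsing the full `zfs list` table.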
Re: [zfs-discuss] How to verify ecc for ram is active and enabled?
On 03 March, 2010 - casper@sun.com sent me these 0,8K bytes: Is there a method to view the status of the RAM's ECC single or double bit errors? I would like to confirm that ECC on my Xeon E5520 and ECC RAM are performing their role, since memtest is ambiguous. I am running a memory test on a P6T6 WS, E5520 Xeon, 2GB Samsung ECC modules and this is what is on the screen: Chipset: Core IMC (ECC : Detect / Correct) However, further down ECC is identified as being off. Yet there is a column for ECC Errs. I don't know how to interpret this. Is ECC active or not? Off, but only disabled by memtest, I believe. Memtest doesn't want potential errors to be hidden by ECC, so it disables ECC to see them if they occur. You can enable it in the memtest menu. Casper /Tomas
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 02 March, 2010 - Carson Gaspar sent me these 0,5K bytes: I strongly suggest that folks who are thinking about this examine what NetApp does when exporting NTFS security model qtrees via NFS. It constructs a mostly bogus set of POSIX permission info based on the ACL. All access is enforced based on the actual ACL. Sadly for NFSv3 clients there is no way to see what the actual ACL is, but it is properly enforced. ZFS recently stopped doing something similar to this (faking POSIX draft ACLs), because it can cause data (ACL) corruption: a client sees a faked ACL over NFS, modifies it and sends it back.. /Tomas
Re: [zfs-discuss] ZFS Storage system with 72 GB memory constantly has 11 GB free memory
On 26 February, 2010 - Ronny Egner sent me these 0,6K bytes: Dear All, our storage system running opensolaris b133 + ZFS has a lot of memory for caching. 72 GB total. While testing we observed free memory never falls below 11 GB. Even if we create a ram disk, free memory drops below 11 GB but will be back at 11 GB shortly after (I assume the ARC cache is shrunken in this context). As far as I know ZFS is designed to use all memory except 1 GB for caching http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#arc_init http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#arc_reclaim_needed So you have a max limit which it won't try to go past, but also a "keep this much free for the rest of the system". Both are a bit too protective for a pure ZFS/NFS server in my opinion (but can be tuned). You can check most variables with f.ex: echo freemem/D | mdb -k On one server here, I have in /etc/system:

* http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache
* about 7.8*1024*1024*1024, must be < physmem*pagesize (206*4096=8446861312 right now)
set zfs:zfs_arc_max = 835000
set zfs:zfs_arc_meta_limit = 70
* some tuning
set ncsize = 50
set nfs:nrnode = 5

And I've done runtime modifications to swapfs_minfree to force usage of another chunk of memory. /Tomas
Re: [zfs-discuss] Freeing unused space in thin provisioned zvols
On 26 February, 2010 - Lutz Schumann sent me these 2,2K bytes: Hello list, ZFS can be used for both file level (zfs) and block level access (zvol). When using zvols, those are always thin provisioned (space is allocated on first write). We use zvols with comstar to do iSCSI and FC access - and excuse me in advance - but this may also be a more comstar related question then. When reading from a freshly created zvol, no data comes from disk. All reads are satisfied by ZFS and comstar returns 0's (I guess) for all reads. Now if a virtual machine writes to the zvol, blocks are allocated on disk. Reads are now partially from disk (for all blocks written) and from the ZFS layer (all unwritten blocks). If the virtual machine (which may be vmware / xen / hyperv) deletes blocks / frees space within the zvol, this also means a write - usually in the metadata area only. Thus the underlying storage system does not know which blocks in a zvol are really used. So reducing size in zvols is really difficult / not possible. Even if one deletes everything in the guest, the blocks stay allocated. If one zeros out all blocks, even more space is allocated. For this purpose TRIM (ATA) / PUNCH (SCSI) have been introduced. With these commands the guest can tell the storage which blocks are not used anymore. Those commands are not available in Comstar today :( However I had the idea that comstar could get the same result the way vmware did it some time ago with vmware tools. Idea: - If the guest writes a block with 0's only, the block is freed again - if someone reads this block again, it will get the same 0's it would get if the 0's had been written - The checksum of an all-0 block can be hard-coded for SHA1 / Fletcher, so the comparison for "is this a 0-only block" is easy. With this in place, a host wishing to free thin provisioned zvol space can fill the unused blocks with 0's easily with simple tools (e.g. dd if=/dev/zero of=/MYFILE bs=1M; rm /MYFILE) and the space is freed again on the zvol side.
Does anyone know why this is not incorporated into ZFS? What you can do until then is to enable compression (like lzjb) on the zvol, then do your dd dance in the client, and then disable the compression again. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
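To make the dd dance concrete, here is a hedged sketch of that workaround; the dataset name (tank/vm1) and guest mount point are examples, not from the thread:

```shell
# On the storage host: temporarily enable cheap compression on the zvol.
zfs set compression=lzjb tank/vm1

# Inside the guest: fill the free space with zeros, then remove the filler.
# With compression on, all-zero blocks are stored as (nearly) nothing, so the
# previously allocated blocks are replaced by holes on the backend.
dd if=/dev/zero of=/MYFILE bs=1M    # runs until the guest filesystem is full
rm /MYFILE
sync

# Back on the storage host: turn compression off again and check the result.
zfs set compression=off tank/vm1
zfs get used,compressratio tank/vm1
```

Note the compression only needs to be on while the zeros are being written; blocks written afterwards are stored uncompressed as before.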
[zfs-discuss] Observations about compressability of metadata L2ARC
Hello. I got an idea.. How about creating a ramdisk, making a pool out of it, then making compressed zvols and adding those as l2arc.. Instant compressed arc ;) So I did some tests with secondarycache=metadata...

                                   capacity     operations    bandwidth
pool                             used  avail   read  write   read  write
ftp                             5.07T  1.78T    198     17  11.3M  1.51M
  raidz2                        1.72T   571G     58      5  3.78M   514K
  ...
  raidz2                        1.64T   656G     75      6  3.78M   524K
  ...
  raidz2                        1.70T   592G     64      5  3.74M   512K
  ...
cache                               -      -      -      -      -      -
  /dev/zvol/dsk/ramcache/ramvol   84.4M  7.62M     4     17  45.4K   233K
  /dev/zvol/dsk/ramcache/ramvol2  84.3M  7.71M     4     17  41.5K   233K
  /dev/zvol/dsk/ramcache/ramvol3    84M     8M     4     18  42.0K   236K
  /dev/zvol/dsk/ramcache/ramvol4  84.8M  7.25M     3     17  39.1K   225K
  /dev/zvol/dsk/ramcache/ramvol5  84.9M  7.08M     3     14  38.0K   193K

NAME              RATIO   COMPRESS
ramcache/ramvol   1.00x   off
ramcache/ramvol2  4.27x   lzjb
ramcache/ramvol3  6.12x   gzip-1
ramcache/ramvol4  6.77x   gzip
ramcache/ramvol5  6.82x   gzip-9

This was after 'find /ftp' had been running for about 1h, along with all the background noise of its regular nfs serving tasks. I took an image of the uncompressed one (ramvol) and ran that through regular gzip and got 12-14x compression, probably due to the smaller block size (default 8k) in the zvols.. So I tried with both 8k and 64k.. After not running that long (but at least filled), I got:

NAME              RATIO    COMPRESS  VOLBLOCK
ramcache/ramvol   1.00x    off       8K
ramcache/ramvol2  5.57x    lzjb      8K
ramcache/ramvol3  7.56x    lzjb      64K
ramcache/ramvol4  7.35x    gzip-1    8K
ramcache/ramvol5  11.68x   gzip-1    64K

Not sure how to measure the cpu usage of the various compression levels for (de)compressing this data.. It does show that having metadata in ram compressed could be a big win though, if you have cpu cycles to spare.. Thoughts? /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se - 070-5858487 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
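For anyone wanting to repeat the experiment, a hedged reconstruction of the setup described above; the exact commands are not in the post, so the ramdisk name and sizes are guesses:

```shell
# Create a ramdisk-backed pool and a compressed zvol on it.
ramdiskadm -a l2ram 512m
zpool create ramcache /dev/ramdisk/l2ram
zfs create -V 100m -o volblocksize=64k -o compression=gzip-1 ramcache/ramvol

# Use the compressed zvol as L2ARC for the data pool, metadata only.
zpool add ftp cache /dev/zvol/dsk/ramcache/ramvol
zfs set secondarycache=metadata ftp

# After some traffic, see how well the cached metadata compressed.
zfs get compressratio ramcache/ramvol
```

The compressratio numbers in the tables above come from exactly this kind of query, one per zvol/compression setting.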
Re: [zfs-discuss] l2arc current usage (population size)
On 21 February, 2010 - Felix Buenemann sent me these 0,7K bytes: Am 20.02.10 03:22, schrieb Tomas Ögren: On 19 February, 2010 - Christo Kutrovsky sent me these 0,5K bytes: How do you tell how much of your l2arc is populated? I've been looking for a while now, can't seem to find it. Must be easy, as this blog entry shows it over time: http://blogs.sun.com/brendan/entry/l2arc_screenshots And follow up, can you tell how much of each data set is in the arc or l2arc? kstat -m zfs (p, c, l2arc_size) arc_stat.pl is good, but doesn't show l2arc.. zpool iostat -v poolname would also do the trick for l2arc. No, it will show how much of the disk has been visited (dirty blocks) but not how much it occupies right now. At least very obvious difference if you add a zvol as cache.. If it had supported TRIM or similar, they would probably be about the same though. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] l2arc current usage (population size)
On 21 February, 2010 - Richard Elling sent me these 1,3K bytes: On Feb 21, 2010, at 9:18 AM, Tomas Ögren wrote: On 21 February, 2010 - Felix Buenemann sent me these 0,7K bytes: Am 20.02.10 03:22, schrieb Tomas Ögren: On 19 February, 2010 - Christo Kutrovsky sent me these 0,5K bytes: How do you tell how much of your l2arc is populated? I've been looking for a while now, can't seem to find it. Must be easy, as this blog entry shows it over time: http://blogs.sun.com/brendan/entry/l2arc_screenshots And follow up, can you tell how much of each data set is in the arc or l2arc? kstat -m zfs (p, c, l2arc_size) arc_stat.pl is good, but doesn't show l2arc.. zpool iostat -v poolname would also do the trick for l2arc. No, it will show how much of the disk has been visited (dirty blocks) but not how much it occupies right now. At least very obvious difference if you add a zvol as cache.. If it had supported TRIM or similar, they would probably be about the same though. Don't confuse the ZIL with L2ARC. TRIM will do little for L2ARC devices. I was mostly thinking about the telling the backing device that block X isn't in use anymore, not the performance part.. If I have an L2ARC backed by a zvol without compression, the used size will grow until it's full, even if L2ARC doesn't use all of it currently. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] l2arc current usage (population size)
On 19 February, 2010 - Christo Kutrovsky sent me these 0,5K bytes: Hello, How do you tell how much of your l2arc is populated? I've been looking for a while now, can't seem to find it. Must be easy, as this blog entry shows it over time: http://blogs.sun.com/brendan/entry/l2arc_screenshots And follow up, can you tell how much of each data set is in the arc or l2arc? kstat -m zfs (p, c, l2arc_size) arc_stat.pl is good, but doesn't show l2arc.. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
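A hedged example of pulling those counters; on S10/OpenSolaris the ARC and L2ARC statistics live in the zfs:0:arcstats kstat:

```shell
# Target size (c), MRU/MFU balance point (p), and current L2ARC payload.
kstat -p zfs:0:arcstats:c zfs:0:arcstats:p zfs:0:arcstats:l2_size

# Or dump the whole ARC state (including l2arc lines) via mdb:
echo ::arc | mdb -k
```

l2_hdr_size in the same kstat shows how much main-memory ARC space is spent tracking the L2ARC contents.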
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance - napp-it + benchmarks
On 18 February, 2010 - Günther sent me these 1,1K bytes: hello, there is a new beta v. 0.220 of napp-it, the free webgui for nexenta(core) 3. new: - bonnie benchmarks included (see screenshot: http://www.napp-it.org/bench.png) - bug fixes. if you look at the benchmark screenshot: - pool daten: zfs3 of 7 x wd 2TB raid edition (WD2002FYPS), dedup and compress enabled - pool z3ssdcache: zfs3 of 4 sas Seagate 15k/s (ST3146855SS), dedup and compress enabled + ssd read cache (supertalent ultradrive 64GB). i was surprised about the sequential write/rewrite result. the wd 2 TB drives perform very well only in sequential write of characters but are horribly bad in blockwise write/rewrite. the 15k sas drives with ssd read cache perform 20x better (10MB/s - 200 MB/s) Most probably due to lack of ram to hold the dedup tables, which your second version fixes with an l2arc. Try the same test without dedup, or with the same l2arc in both, instead of comparing apples to canoes. download: http://www.napp-it.org howto setup: http://www.napp-it.org/napp-it.pdf gea -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] improve meta data performance
On 18 February, 2010 - Chris Banal sent me these 1,8K bytes: We have a SunFire X4500 running Solaris 10U5 which does about 5-8k nfs ops of which about 90% are meta data. In hind sight it would have been significantly better to use a mirrored configuration but we opted for 4 x (9+2) raidz2 at the time. We can not take the downtime necessary to change the zpool configuration. We need to improve the meta data performance with little to no money. Does anyone have any suggestions? Is there such a thing as a Sun supported NVRAM PCI-X card compatible with the X4500 which can be used as an L2ARC? See if it helps sticking a few cheap USB sticks in there, and set secondarycache=metadata.. For instance Kingston DT Slim Mini are not that bad performers and cost close to nothing. I've got two in a server here, and reading random 4k blocks they do 1500 iops each which is probably more than your current disks. Or if you can stick an Intel X25-M/E in there through SATA/SAS. You can add/remove L2ARCs at will and they don't need to be 100% reliable either, so if you add several of them they will be raid0'd for performance. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
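A hedged sketch of the suggestion above; the pool name and device names are examples:

```shell
# Add cheap flash devices as L2ARC. Cache devices may fail without
# endangering the pool, and reads are spread across all of them.
zpool add tank cache c5t0d0 c6t0d0

# Only push metadata to the cache devices, since metadata is what
# this nfs workload is bound by.
zfs set secondarycache=metadata tank

# Watch the cache devices fill up and start serving reads.
zpool iostat -v tank 5
```

Removal is just as easy (zpool remove tank c5t0d0), so trying this carries little risk.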
Re: [zfs-discuss] How to get a list of changed files between two snapshots?
On 03 February, 2010 - Frank Cusack sent me these 0,7K bytes: On February 3, 2010 12:04:07 PM +0200 Henu henrik.he...@tut.fi wrote: Is there a possibility to get a list of changed files between two snapshots? Currently I do this manually, using basic file system functions offered by OS. I scan every byte in every file manually and it ^^^ On February 3, 2010 10:11:01 AM -0500 Ross Walker rswwal...@gmail.com wrote: Not a ZFS method, but you could use rsync with the dry run option to list all changed files between two file systems. That's exactly what the OP is already doing ... rsync by default compares metadata first, and only checks through every byte if you add the -c (checksum) flag. I would say rsync is the best tool here. The find -newer blah suggested in other posts won't catch newer files with an old timestamp (which could happen for various reasons, like being copied with kept timestamps from somewhere else). /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
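A hedged illustration of the rsync dry-run approach; here two plain directories stand in for the snapshots (against real ZFS snapshots you would point rsync at /pool/fs/.zfs/snapshot/NAME/ instead):

```shell
# Two throwaway directories playing the role of "old" and "new" snapshots.
old=$(mktemp -d); new=$(mktemp -d)
echo v1 > "$old/kept";    cp "$old/kept" "$new/kept"
echo v1 > "$old/changed"; echo "v2 but longer" > "$new/changed"
echo v1 > "$old/deleted"              # only exists in the old snapshot

# -a archive mode, -n dry run (report only, change nothing),
# -i itemize each difference, --delete so removals are reported too.
rsync -ani --delete "$new/" "$old/"
# expect an itemized line for "changed" and a "*deleting" line for "deleted"

rm -rf "$old" "$new"
```

Without -c, rsync decides by size and mtime, which is exactly the cheap metadata comparison discussed above; add -c only if you need byte-level certainty.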
Re: [zfs-discuss] ZFS compressed ration inconsistency
On 01 February, 2010 - antst sent me these 0,6K bytes: Probably I'm missing something here, but what I see on my system:

zfs list -o used,ratio,compression,name export/home/user
89.6G  2.86x  gzip-4  export/home/user

cmsmaster ~ # du -hs /export/home/user/
90G     /export/home/user/
cmsmaster ~ # du -hsb /export/home/user/
380781942931    /export/home/user/

89.6G*2.86=256.26G is way too far from 354.63G reported by du. What's wrong? From the GNU du man page:

  -b, --bytes           equivalent to `--apparent-size --block-size=1'
  --apparent-size       print apparent sizes, rather than disk usage;

/Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
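The apparent-vs-allocated distinction is easy to demonstrate with a sparse file, where the two sizes diverge wildly even without compression (GNU du; a hedged stand-in for the poster's dataset):

```shell
f=$(mktemp)
# 10 MiB apparent size, but only the final byte is ever written,
# so almost nothing is allocated on disk.
dd if=/dev/zero of="$f" bs=1 count=1 seek=$((10*1024*1024 - 1)) 2>/dev/null

du -b "$f"   # apparent size: 10485760 bytes (what du -hsb reports)
du -k "$f"   # allocated size: a few KiB (what plain du reports)

rm -f "$f"
```

du -b counts logical bytes; the 89.6G used figure counts allocated bytes, so multiplying it by the compressratio only ever approximates the apparent size.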
Re: [zfs-discuss] ZFS ARC
On 01 February, 2010 - tester sent me these 0,4K bytes: Hi, I have heard references to ARC releasing memory when the demand is high. Can someone please point me to the code path from the point of such a detection to ARC release? http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#arc_reclaim_needed /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Filesystem Quotas
On 20 January, 2010 - Mr. T Doodle sent me these 1,0K bytes: I currently have one filesystem / (root), is it possible to put a quota on let's say /var? Or would I have to move /var to it's own filesystem in the same pool? Only filesystems can have different settings. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] L2ARC in Cluster is picked up althought not part of the pool
On 20 January, 2010 - Richard Elling sent me these 2,7K bytes: Hi Lutz, On Jan 20, 2010, at 3:17 AM, Lutz Schumann wrote: Hello, we tested clustering with ZFS and the setup looks like this: - 2 head nodes (nodea, nodeb) - head nodes contain l2arc devices (nodea_l2arc, nodeb_l2arc) This makes me nervous. I suspect this is not in the typical QA test plan. - two external jbods - two mirror zpools (pool1, pool2) - each mirror is a mirror of one disk from each jbod - no ZIL (anyone know a well priced SAS SSD?) We want active/active and added the l2arc to the pools. - pool1 has nodea_l2arc as cache - pool2 has nodeb_l2arc as cache Everything is great so far. One thing to note is that nodea_l2arc and nodeb_l2arc are named identically! (c0t2d0 on both nodes). What we found is that during tests, the pool just picked up the device nodeb_l2arc automatically, although it was never explicitly added to the pool pool1. This is strange. Each vdev is supposed to be uniquely identified by its GUID. This is how ZFS can identify the proper configuration when two pools have the same name. Can you check the GUIDs (using zdb) to see if there is a collision? 
Reproducible:

itchy:/tmp/blah# mkfile 64m disk1
itchy:/tmp/blah# zfs create -V 64m rpool/blahcache
itchy:/tmp/blah# zpool create blah /tmp/blah/disk1
itchy:/tmp/blah# zpool add blah cache /dev/zvol/dsk/rpool/blahcache
itchy:/tmp/blah# zpool status blah
  pool: blah
 state: ONLINE
 scrub: none requested
config:
        NAME                             STATE   READ WRITE CKSUM
        blah                             ONLINE     0     0     0
          /tmp/blah/disk1                ONLINE     0     0     0
        cache
          /dev/zvol/dsk/rpool/blahcache  ONLINE     0     0     0
errors: No known data errors
itchy:/tmp/blah# zpool export blah
itchy:/tmp/blah# zdb -l /dev/zvol/dsk/rpool/blahcache
LABEL 0
    version=15
    state=4
    guid=6931317478877305718
itchy:/tmp/blah# zfs destroy rpool/blahcache
itchy:/tmp/blah# zfs create -V 64m rpool/blahcache
itchy:/tmp/blah# dd if=/dev/zero of=/dev/zvol/dsk/rpool/blahcache bs=1024k count=64
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.559299 seconds, 120 MB/s
itchy:/tmp/blah# zpool import -d /tmp/blah
  pool: blah
    id: 16691059548146709374
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
        blah                             ONLINE
          /tmp/blah/disk1                ONLINE
        cache
          /dev/zvol/dsk/rpool/blahcache
itchy:/tmp/blah# zdb -l /dev/zvol/dsk/rpool/blahcache
LABEL 0
LABEL 1
LABEL 2
LABEL 3
itchy:/tmp/blah# zpool import -d /tmp/blah blah
itchy:/tmp/blah# zpool status
  pool: blah
 state: ONLINE
 scrub: none requested
config:
        NAME                             STATE   READ WRITE CKSUM
        blah                             ONLINE     0     0     0
          /tmp/blah/disk1                ONLINE     0     0     0
        cache
          /dev/zvol/dsk/rpool/blahcache  ONLINE     0     0     0
errors: No known data errors
itchy:/tmp/blah# zdb -l /dev/zvol/dsk/rpool/blahcache
LABEL 0
    version=15
    state=4
    guid=6931317478877305718
...

It did indeed overwrite my formerly clean blahcache. Smells like a serious bug. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se -- richard We had a setup stage when pool1 was configured on nodea with nodea_l2arc and pool2 was configured on nodeb without an l2arc. Then we did a failover. 
Then pool1 picked up the (until then) unconfigured nodeb_l2arc. Is this intended? Why is an L2ARC device automatically picked up if the device name is the same? In a later stage we had both pools configured with the corresponding l2arc device (po...@nodea with nodea_l2arc and po...@nodeb with nodeb_l2arc). Then we also did a failover. The l2arc device of the pool failing over was marked as too many corruptions instead of missing. So from these tests it looks like ZFS just picks up the device with the same name and replaces the l2arc without looking at the device signatures to only consider devices being part of a pool. We have not tested with a data disk as c0t2d0 but if the same behaviour
Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device
On 06 January, 2010 - Thomas Burgess sent me these 5,8K bytes: I think the confusing part is that the 64gb version seems to use a different controller altogether It does. I couldn't find any SNV125-S2/40's in stock so I got 3 SNV125-S2/64's thinking it would be the same, only bigger. Looks like it was stupid on my part. Now I understand why I got such a good deal. Well, I have yet to try them... maybe they won't be so bad... on newegg they get a lot of good ratings. Either way I doubt using them for the rpool will hurt me... just a little more expensive than the compact flash cards I was going to get. I've ordered a 40G which should be coming in a week or so, I'll do some ZIL/L2ARC testing with it and report back. Random 4k writes seem to be quite alright: http://benchmarkreviews.com/index.php?option=com_content&task=view&id=392&Itemid=60&limit=1&limitstart=6 /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file expiration date/time
On 30 December, 2009 - Dennis Yurichev sent me these 0,7K bytes: Hi. Why can't each file also have an expiration date/time field, i.e. a date/time when the operating system will delete it automatically? This could be usable for backups, camera raw files, internet browser cached files, etc. Using extended attributes + cron, you could provide the same service yourself, and other similar (or not) things people would like to do, without the developers providing it for you in the fs.. Start at 'man fsattr' /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
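A hedged sketch of how that could look with Solaris extended attributes via runat(1); the attribute name "expires", the date format, and the paths are all my invention:

```shell
# Tag a file with an expiry date, stored as a named extended attribute.
echo 2010-06-30 | runat /export/scratch/shot0001.raw 'cat > expires'

# Read the tag back.
runat /export/scratch/shot0001.raw cat expires

# Nightly cron job: delete any tagged file whose date has passed.
today=$(date +%Y-%m-%d)
find /export/scratch -type f | while read -r f; do
    exp=$(runat "$f" cat expires 2>/dev/null) || continue   # untagged: skip
    [ "$exp" \< "$today" ] && rm -f "$f"
done
```

The ISO date format makes the string comparison double as a date comparison; untagged files are simply left alone.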
Re: [zfs-discuss] FW: ARC not using all available RAM?
On 21 December, 2009 - Tristan Ball sent me these 4,5K bytes: Richard Elling wrote: On Dec 20, 2009, at 12:25 PM, Tristan Ball wrote: I've got an opensolaris snv_118 machine that does nothing except serve up NFS and ISCSI. The machine has 8G of ram, and I've got an 80G SSD as L2ARC. The ARC on this machine is currently sitting at around 2G, the kernel is using around 5G, and I've got about 1G free. ... What I'm trying to find out is is my ARC relatively small because... 1) ZFS has decided that that's all it needs (the workload is fairly random), and that adding more wont gain me anything.. 2) The system is using so much ram for tracking the L2ARC, that the ARC is being shrunk (we've got an 8K record size) 3) There's some other memory pressure on the system that I'm not aware of that is periodically chewing up then freeing the ram. 4) There's some other memory management feature that's insisting on that 1G free. My bet is on #4 ... http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#arc_reclaim_needed See line 1956 .. I tried some tuning on a pure nfs server (although s10u8) here, and got it to use a bit more of the last 1GB out of 8G.. I think it was swapfs_minfree that I poked with a sharp stick. No idea if anything else that relies on it could break, but the machine has been fine for a few weeks here now and using more memory for ARC.. ;) /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
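For reference, a hedged sketch of the sharp-stick poke; the value below is an example, not what I used, and lowering swapfs_minfree shrinks the VM system's reserve, so tread carefully:

```shell
echo "swapfs_minfree/E" | mdb -k          # print the current value, in pages
echo "swapfs_minfree/Z0t4096" | mdb -kw   # example: 4096 pages = 16 MB at 4 KB/page

# persistent equivalent, in /etc/system:
#   set swapfs_minfree=4096
```

As with the zil_disable pokes elsewhere in this thread, an mdb write takes effect immediately but does not survive a reboot; only the /etc/system line is persistent.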
Re: [zfs-discuss] zfs allow - internal error
On 09 December, 2009 - Andrew Robert Nicols sent me these 1,6K bytes: I've just done a fresh install of Solaris 10 u8 (2009.10) onto a Thumper. Running zfs allow gives the following delightful output: -bash-3.00$ zfs allow internal error: /usr/lib/zfs/pyzfs.py not found I've confirmed it on a second thumper, also running Solaris 10 u8 installed about 2 months ago. Has anyone else seen this? Yes. You haven't got SUNWPython installed, which is wrongly marked as belonging to the GNOME2 cluster. Install SUNWPython and SUNWPython-share and it'll work. Some ZFS stuff (userspace, allow, ..) started using python in u8. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
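A hedged sketch of the fix; the Product path depends on how your install media is laid out:

```shell
pkginfo SUNWPython SUNWPython-share     # confirm they are missing
pkgadd -d /cdrom/cdrom0/Solaris_10/Product SUNWPython SUNWPython-share

# zfs allow on a dataset should now print delegations
# instead of the pyzfs.py "internal error".
zfs allow rpool
```

The same missing packages break zfs userspace, so checking both after the install is worthwhile.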
[zfs-discuss] ARC Ghost lists, why have them and how much ram is used to keep track of them? [long]
Hello. We have a file server running S10u8 which is a disk backend to a caching ftp/http frontend cluster (homebrew) which currently has about 4.4TB of data, which obviously doesn't fit in the 8GB of ram the machine has. arc_summary currently says:

System Memory:
     Physical RAM:  8055 MB
     Free Memory :  1141 MB
     LotsFree:       124 MB

ARC Size:
     Current Size:             3457 MB (arcsize)
     Target Size (Adaptive):   3448 MB (c)
     Min Size (Hard Limit):     878 MB (zfs_arc_min)
     Max Size (Hard Limit):    7031 MB (zfs_arc_max)

ARC Size Breakdown:
     Most Recently Used Cache Size:   93%  3231 MB (p)
     Most Frequently Used Cache Size:  6%   217 MB (c-p)
...
CACHE HITS BY CACHE LIST:
     Anon:                        3%   377273490             [ New Customer, First Cache Hit ]
     Most Recently Used:          9%  1005243026 (mru)       [ Return Customer ]
     Most Frequently Used:       81%  9113681221 (mfu)       [ Frequent Customer ]
     Most Recently Used Ghost:    2%   284232070 (mru_ghost) [ Return Customer Evicted, Now Back ]
     Most Frequently Used Ghost:  3%   361458550 (mfu_ghost) [ Frequent Customer Evicted, Now Back ]

And some info from echo ::arc | mdb -k:

     arc_meta_used  = 2863 MB
     arc_meta_limit = 3774 MB
     arc_meta_max   = 4343 MB

Now to the questions.. As I've understood it, ARC keeps a list of newly evicted data from the ARC in the ghost lists, for example to be used for L2ARC (or?). In mdb -k:

ARC_mfu_ghost::print
...
    arcs_lsize = [ 0x2341ca00, 0x4b61d200 ]
    arcs_size = 0x6ea39c00
...
ARC_mru_ghost::print
    arcs_lsize = [ 0x65646400, 0xd24e00 ]
    arcs_size = 0x6636b200
ARC_mru::print
    arcs_lsize = [ 0x2b9ae600, 0x38646e00 ]
    arcs_size = 0x758ae800
ARC_mfu::print
    arcs_lsize = [ 0, 0x4d200 ]
    arcs_size = 0x1043a000

Does this mean that currently, 1770MB+1635MB is wasted just for statistics, and 1880+260MB is used for actual cached data, or do these numbers just refer to how much data they keep stats for? So basically, what is the point of the ghost lists and how much ram are they actually using? 
Also, since this machine just has 2 purposes in life - sharing data over nfs and taking backups of the same data, I'd like to get those 1141MB of free memory to be actually used.. Can I set zfs_arc_max (can't find any runtime tunable, only /etc/system one, right?) to 8GB. If it runs out of memory, it'll set no_grow and shrink a little, right? Currently, data can use all of ARC if it wants, but metadata can use a maximum of $arc_meta_max. Since there's no chance of caching all of the data, but there's a high chance of caching a large proportion of the metadata, I'd like reverse limits; limit data size to 1GB or so (due to buffers currently being handled, setting primarycache=metadata will give crap performance in my testing) and let metadata take as much as it'd like.. Is there a chance of getting something like this? /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
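There is no "cap data, free metadata" knob today; the closest existing controls are /etc/system settings that bound total ARC and metadata, not data. Hedged example values for an 8 GB machine (note /etc/system comments start with *):

```
* /etc/system fragment - example values, takes effect at boot
set zfs:zfs_arc_max = 0x1C0000000          * cap total ARC at 7 GB
set zfs:zfs_arc_meta_limit = 0x140000000   * allow metadata up to 5 GB of that
```

Raising arc_meta_limit while leaving arc_max high at least lets metadata win the competition more often, which is the direction this workload wants.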
Re: [zfs-discuss] rquota didnot show userquota (Solaris 10)
On 26 November, 2009 - Willi Burmeister sent me these 1,7K bytes: Hi, we have a new fileserver running on X4275 hardware with Solaris 10U8. On this fileserver we created one test dir with quota and mounted it on another Solaris 10 system. Here the quota command did not show the used quota. Does this feature only work with OpenSolaris, or is it intended to work on Solaris 10? ZFS userspace quota doesn't support rquotad reporting. (.. yet?) /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS user quota, userused updates?
On 20 October, 2009 - Matthew Ahrens sent me these 2,2K bytes: The user/group used can be out of date by a few seconds, same as the used and referenced properties. You can run sync(1M) to wait for these values to be updated. However, that doesn't seem to be the problem you are encountering here. Can you send me the output of: zfs list zpool1/sd01_mail zfs get all zpool1/sd01_mail zfs userspace -t all zpool1/sd01_mail ls -ls /export/sd01/mail zdb -vvv zpool1/sd01_mail On a related note, there is a way to still have quota used even after all files are removed, S10u8/SPARC: # zfs create rpool/quotatest # zfs set userqu...@stric=5m rpool/quotatest # zfs userspace -t all rpool/quotatest TYPE NAME USED QUOTA POSIX Group root 3K none POSIX User root 3K none POSIX User stric 0 5M # chmod a+rwt /rpool/quotatest stric% cd /rpool/quotatest;tar jxvf /somewhere/gimp-2.2.10.tar.bz2 ... wait and it will start getting Disc quota exceeded, might have to help it by running 'sync' in another terminal stric% sync stric% rm -rf gimp-2.2.10 stric% sync ... now it's all empty.. but... # zfs userspace -t all rpool/quotatest TYPE NAME USED QUOTA POSIX Group root 3K none POSIX Group tdb 3K none POSIX User root 3K none POSIX User stric3K 5M Can be repeated for even more lost blocks, I seem to get between 3 and 5 kB each time. I tried this last night, and when I got back in the morning, it had gone down to zero again. Haven't done any more verifying than that. It doesn't seem to trigger if I just write a big file with dd which gets me into DQE, but unpacking a tarball seems to trigger it. My tests has been as above. 
Output from all of the above + zfs list, zfs get all, zfs userspace, ls -l and zdb -vvv is at: http://www.acc.umu.se/~stric/tmp/zfs-userquota.txt /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS user quota, userused updates?
On 20 October, 2009 - Matthew Ahrens sent me these 0,7K bytes: Tomas Ögren wrote: On a related note, there is a way to still have quota used even after all files are removed, S10u8/SPARC: In this case there are two directories that have not actually been removed. They have been removed from the namespace, but they are still open, eg due to some process's working directory being in them. Only a few processes in total were involved in this dir.. cd into the fs, untar the tarball, remove it all, cd out, run sync. Quota usage still remains. This is confirmed by your zdb output, there are 2 directories on the delete queue. You can force it to be flushed by unmounting and re-mounting your filesystem. .. which isn't such a good workaround for a busy home directory server which I will use this in shortly... I have to say a big thank you for this userquota anyway, because I tried the one fs per user way first, and it just didn't scale to our 3-4000 users, but I still want to use ZFS. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
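The remount workaround from the thread, spelled out against the quotatest dataset from the earlier example:

```shell
# Unmount/remount forces the delete queue to be flushed.
zfs umount rpool/quotatest
zfs mount rpool/quotatest

# The leaked 3-5 kB per user should be gone now.
zfs userspace -t all rpool/quotatest
```

On a busy home directory server the umount will fail while files are open, which is exactly why this is a poor workaround.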
[zfs-discuss] Interesting bug with picking labels when expanding a slice where a pool lives
Hi. We've got some test machines which amongst others has zpools in various sizes and placements scribbled all over the disks. 0. HP DL380G3, Solaris10u8, 2x16G disks; c1t0d0 c1t1d0 1. Took a (non-emptied) disk, created a 2GB slice0 and a ~14GB (to the last cyl) slice7. 2. zpool create striclek c1t1d0s0 3. zdb -l /dev/rdsk/c1t1d0s0 shows 4 labels, each with the same guid and only c1t1d0s0 as vdev. All is well. 4. format, increase slice0 from 2G to 16G. remove slice7. label. 5. zdb -l /dev/rdsk/c1t1d0s0 shows 2 labels from the correct guid c1t1d0s0, it also shows 2 labels from some old guid (from an rpool which was abandoned long ago) belonging to a mirror(c1t0d0s0,c1t1d0s0). c1t0d0s0 is current boot disk with other rpool and other guid. 6. zpool export striclek;zpool import shows guid from the working pool, but that it's missing devices (although only lists c1t1d0s0 - ONLINE) 7. zpool import striclek doesn't work. zpool import theworkingguid doesn't work. If I resize the slice back to 2GB, all 4 labels shows the workingguid and import works again. Questions: * Why does 'zpool import' show the guid from label 0/1, but wants vdev conf as specified by label 2/3? * Is there no timestamp or such, so it would prefer label 0/1 as they are brand new and ignore label 2/3 which are waaay old. I can agree to being forced to scribble zeroes/junk all over the slice7 space which we're expanding to in step 4.. But stuff shouldn't fail this way IMO.. Maybe comparing timestamps and see that label 2/3 aren't so hot anymore and ignore them, or something.. zdb -l and zpool import dumps at: http://www.acc.umu.se/~stric/tmp/zdb-dump/ /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
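Until the label-vintage issue is addressed, a hedged sketch of the scribble-zeroes workaround conceded above; device names are from the example, and the dd is destructive, so double-check the target:

```shell
# Before growing slice 0: wipe the region that used to be slice 7 so no
# stale labels survive there (this destroys whatever lived on s7!).
dd if=/dev/zero of=/dev/rdsk/c1t1d0s7 bs=1024k

# After resizing s0 over that space, verify all four labels now carry
# the working guid, then import.
zdb -l /dev/rdsk/c1t1d0s0
zpool import striclek
```

The same precaution applies to the SAN case: zero the newly grown part of the LUN before letting ZFS see it.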
Re: [zfs-discuss] Interesting bug with picking labels when expanding a slice where a pool lives
On 19 October, 2009 - Cindy Swearingen sent me these 2,4K bytes: Hi Tomas, I think you are saying that you are testing what happens when you increase a slice under a live ZFS storage pool and then reviewing the zdb output of the disk labels. Increasing a slice under a live ZFS storage pool isn't supported and might break your pool. It also happens on a non-live pool, that is, if I export, increase the slice and then try to import.

r...@ramses:~# zpool export striclek
r...@ramses:~# format
Searching for disks...done
... increase c1t1d0s0
r...@ramses:~# zpool import striclek
cannot import 'striclek': one or more devices is currently unavailable

.. which is the way to increase a pool within a disk/device if I'm not mistaken.. Like if the storage comes off a SAN and you resize the LUN.. I think you are seeing some remnants of some old pools on your slices with zdb since this is how zpool import is able to import pools that have been destroyed. Yep, that's exactly what I see. The issue is that the new, good labels aren't trusted anymore (it also looks at old ones) and also that zpool import picks information from different labels and presents it as one piece of info. If I was using some SAN and my lun got increased, and the new storage space had some old scrap data on it, I could get hit by the same issue. Maybe I missed the point. Let me know. Cindy On 10/19/09 12:41, Tomas Ögren wrote: Hi. We've got some test machines which amongst others has zpools in various sizes and placements scribbled all over the disks. 0. HP DL380G3, Solaris10u8, 2x16G disks; c1t0d0 c1t1d0 1. Took a (non-emptied) disk, created a 2GB slice0 and a ~14GB (to the last cyl) slice7. 2. zpool create striclek c1t1d0s0 3. zdb -l /dev/rdsk/c1t1d0s0 shows 4 labels, each with the same guid and only c1t1d0s0 as vdev. All is well. 4. format, increase slice0 from 2G to 16G. remove slice7. label. 5. 
zdb -l /dev/rdsk/c1t1d0s0 shows 2 labels from the correct guid c1t1d0s0, it also shows 2 labels from some old guid (from an rpool which was abandoned long ago) belonging to a mirror(c1t0d0s0,c1t1d0s0). c1t0d0s0 is current boot disk with other rpool and other guid. 6. zpool export striclek;zpool import shows guid from the working pool, but that it's missing devices (although only lists c1t1d0s0 - ONLINE) 7. zpool import striclek doesn't work. zpool import theworkingguid doesn't work. If I resize the slice back to 2GB, all 4 labels shows the workingguid and import works again. Questions: * Why does 'zpool import' show the guid from label 0/1, but wants vdev conf as specified by label 2/3? * Is there no timestamp or such, so it would prefer label 0/1 as they are brand new and ignore label 2/3 which are waaay old. I can agree to being forced to scribble zeroes/junk all over the slice7 space which we're expanding to in step 4.. But stuff shouldn't fail this way IMO.. Maybe comparing timestamps and see that label 2/3 aren't so hot anymore and ignore them, or something.. zdb -l and zpool import dumps at: http://www.acc.umu.se/~stric/tmp/zdb-dump/ /Tomas ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Strange problem with liveupgrade on zfs (10u7 and u8)
On 14 October, 2009 - Brian sent me these 4,3K bytes:

I am having a strange problem with liveupgrade of a ZFS boot environment. I found a similar discussion on zones-discuss, but this happens for me on installs both with and without zones, so I do not think it is related to zones. I have been able to reproduce this on both sparc (ldom) and x86 (physical). I was originally trying to luupgrade to u8, but this is easily reproducible with 3 simple steps: lucreate, luactivate, reboot.

...

[b]lucreate -n sol10alt[/b]

Noticed the following warning during lucreate:
WARNING: split filesystem / file system type zfs cannot inherit mount point options - from parent filesystem / file type - because the two file systems have different types.

Got the same warning and the same end result; I was planning on filing it with Sun yesterday but haven't had time to do that yet. I got it on sparc (physical) too. I didn't install LU from the u8 iso, but it was patched with the latest LU patches through PCA.

[b]luactivate sol10alt[/b]

If you lumount, comment out those rpool/ROOT/ thingies, then luumount here, it'll work too.

[b]/usr/sbin/shutdown -g0 -i6 -y[/b]

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
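A sketch of that lumount workaround. I'm assuming the stray rpool/ROOT/ entries end up in the new BE's /etc/vfstab (the sample vfstab lines below are made up); on the real system you'd lumount sol10alt, fix its vfstab, then luumount before activating:

```shell
# Comment out any rpool/ROOT/ lines, leaving everything else alone.
# The here-doc stands in for <mountpoint>/etc/vfstab after lumount.
cat <<'EOF' | sed 's|^rpool/ROOT/|#&|'
rpool/ROOT/sol10alt	-	/	zfs	-	no	-
/dev/dsk/c1t0d0s1	-	-	swap	-	no	-
EOF
```

The `&` in the sed replacement re-inserts the matched text, so the line is commented rather than rewritten.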
Re: [zfs-discuss] Terrible ZFS performance on a Dell 1850 w/ PERC 4e/Si (Sol10U6)
On 09 October, 2009 - Brandon Hume sent me these 2,0K bytes:

I've got a mail machine here that I built using ZFS boot/root. It's been having some major I/O performance problems, which I posted about once before... but that post seems to have disappeared. Now I've managed to obtain another identical machine, and I've built it the same way as the original. Running Solaris 10 U6, fully patched as of 2009/10/06. It's using a mirrored disk via the PERC (LSI MegaRAID) controller.

The main problem seems to be ZFS. If I do the following on a UFS filesystem:

# /usr/bin/time dd if=/dev/zero of=whee.bin bs=1024000 count=x

... then I get real times of the following:

  x    time
  128  35.4
  256  1:01.8
  512  2:19.8

Is this minutes:seconds.millisecs? If so, you're looking at 3-4MB/s.. I would say something is wrong.

It's all very linear and fairly decent.

Decent?!

However, if I then destroy that filesystem and recreate it using ZFS (no special options or kernel variables set), performance degrades substantially. With the same dd, I get:

  x    time
  128  3:45.3
  256  6:52.7
  512  15:40.4

0.5MB/s .. that's floppy speed :P

So basically a 6.5x loss across the board. I realize that a simple 'dd' is an extremely weak test, but real-world use on these machines shows similar problems... long delays logging in, and running a command that isn't cached can take 20-30 seconds (even something as simple as 'psrinfo -vp'). Ironically, the machine works just fine for simple email, because the files are small and very transient and thus can exist quite easily just in memory. But more complex things, like a local copy of our mailmaps, cripple the machine.

.. because something is messed up, and for some reason ZFS seems to suffer from it worse than UFS does..

I'm about to rebuild the machine with the RAID controller in passthrough mode, and I'll see what that accomplishes. Most of the machines here are Linux and use the hardware RAID1, so I was/am hesitant to break standard that way.
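For reference, those real times convert to throughput as count x bs = count x 1024000 bytes divided by elapsed seconds:

```shell
# Throughput of each dd run, in decimal MB/s (times converted to seconds).
for run in "ufs 128 35.4" "ufs 256 61.8" "ufs 512 139.8" \
           "zfs 128 225.3" "zfs 256 412.7" "zfs 512 940.4"; do
  set -- $run
  awk -v fs="$1" -v n="$2" -v t="$3" \
    'BEGIN { printf "%s count=%d: %.1f MB/s\n", fs, n, n * 1024000 / t / 1e6 }'
done
```

That works out to roughly 3.7-4.2 MB/s on UFS and about 0.6 MB/s on ZFS, matching the 3-4MB/s and ~0.5MB/s figures above.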
Does anyone have any experience or suggestions for trying to make ZFS boot+root work well on this machine?

Check for instance 'iostat -xnzmp 1' while doing this and see if any disk is behaving badly, high service times etc.. Even your speedy 3-4MB/s is nowhere close to what you should be getting, unless you've connected a bunch of floppy drives to your PERC..

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
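One way to eyeball that iostat output is to flag devices with a high average service time; asvc_t is the 8th column of 'iostat -xn' output. The sample lines below are made up for illustration, not real output from the machine in question:

```shell
# Flag devices whose asvc_t exceeds 100 ms. The header line has the text
# "asvc_t" in column 8, which compares as 0 numerically, so it is skipped.
cat <<'EOF' | awk '$8 + 0 > 100 { print $11, "asvc_t=" $8 " ms" }'
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.0   50.0    8.0  400.0  0.0  2.0    0.0  250.3   0  99 c1t0d0
    0.5    1.0    4.0    8.0  0.0  0.0    0.0    5.1   0   1 c1t1d0
EOF
```

On healthy local disks, asvc_t should normally be in the single-digit-to-tens-of-ms range; hundreds of ms sustained points at the controller or a dying disk.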
Re: [zfs-discuss] Best way to convert checksums
On 02 October, 2009 - Ray Clark sent me these 4,4K bytes:

Data security. I migrated my organization from Linux to Solaris, driven away from Linux by the shortfalls of fsck on TB-size file systems, and towards Solaris by the features of ZFS. [...] Before taking rather disruptive actions to correct this, I decided to question my original decision and found schlie's post stating that a bug in fletcher2 makes it essentially a one-bit parity on the entire block: http://opensolaris.org/jive/thread.jspa?threadID=69655&tstart=30 While this is twice as good as any other file system in the world that has NO such checksum, this does not provide the security I migrated for. Especially given that I did not know what caused the original data loss, it is all I have to lean on.

...

That post refers to bug 6740597
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6740597
which also refers to
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=2178540

So it seems like it's fixed in snv_114 and s10u8, which won't help your s10u4 unless you update..

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
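For context on why the fletcher2 bug matters: the fletcher checksums are running sums over the input words, so every accumulator has to be updated correctly for the position mixing to work. The fletcher4 variant updates four accumulators per 32-bit word, roughly like this (toy input words 1, 2, 3; the real implementation runs 64-bit modular arithmetic over the whole block):

```shell
# Fletcher4-style accumulator updates: a is a plain sum, while b, c and d
# each accumulate the previous accumulator, mixing in word positions.
printf '1\n2\n3\n' | awk '
  { a += $1; b += a; c += b; d += c }
  END { printf "a=%d b=%d c=%d d=%d\n", a, b, c, d }'
```

For the words 1, 2, 3 this prints a=6 b=10 c=15 d=21; unlike a plain sum, reordering the words changes b, c and d, which is what gives the checksum its strength when all the accumulators are computed correctly.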
Re: [zfs-discuss] Solaris License with ZFS USER quotas?
On 28 September, 2009 - Jorgen Lundman sent me these 1,7K bytes:

Hello list,

We are unfortunately still experiencing some issues regarding our support license with Sun, or rather our Sun vendor. We need ZFS user quotas (that's not the ZFS file-system quota), which first appeared in snv_114. We would like to run something like snv_117 (we don't really care which version per se; that is just the version we have done the most testing with). But our vendor will only support Solaris 10. After weeks of wrangling, they have reluctantly agreed to let us run OpenSolaris 2009.06 (which does not have ZFS user quotas). When I approach Sun-Japan directly, I just get told that they don't speak English. When my Japanese colleagues approach Sun-Japan directly, it is suggested to us that we stay with our current vendor.

* Will there be official Solaris 10 or OpenSolaris releases with ZFS user quotas? (Will 2010.02 contain ZFS user quotas?)

http://sparcv9.blogspot.com/2009/08/solaris-10-update-8-1009-is-comming.html which is in no way official, says it'll be in 10u8, which should be coming within a month.

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se