Re: [zfs-discuss] Petabyte pool?
On Fri, Mar 15, 2013 at 06:09:34PM -0700, Marion Hakanson wrote:
> Greetings,
>
> Has anyone out there built a 1-petabyte pool? I've been asked to look into this, and was told low performance is fine, workload is likely to be write-once, read-occasionally, archive storage of gene sequencing data. Probably a single 10Gbit NIC for connectivity is sufficient.
>
> We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis, using 4TB nearline SAS drives, giving over 100TB usable space (raidz3). Back-of-the-envelope might suggest stacking up eight to ten of those, depending on whether you want a raw marketing petabyte or a proper power-of-two usable petabyte.
>
> I get a little nervous at the thought of hooking all that up to a single server, and am a little vague on how much RAM would be advisable, other than as much as will fit (:-). Then again, I've been waiting for something like pNFS/NFSv4.1 to be usable for gluing together multiple NFS servers into a single global namespace, without any sign of that happening anytime soon.
>
> So, has anyone done this? Or come close to it? Thoughts, even if you haven't done it yourself?
>
> Thanks and regards,
> Marion

We've come close:

admin@mes-str-imgnx-p1:~$ zpool list
NAME      SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
datapool  978T   298T  680T  30%  1.00x  ONLINE  -
syspool   278G   104G  174G  37%  1.00x  ONLINE  -

Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual-pathed to a couple of LSI SAS switches. Using Nexenta, but no reason you couldn't do this w/ $whatever. We did triple parity, and our vdev membership is set up such that we can lose up to three JBODs and still be functional (one vdev member disk per JBOD). This is with 3TB NL-SAS drives.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
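Marion's chassis count is easy to sanity check (a sketch; the ~110TB-usable-per-chassis figure is an assumption extrapolated from the "over 100TB usable" above, and real pools lose more to overhead):

```python
import math

# Assumption: each 45-slot chassis of 4TB nearline drives yields roughly
# 110 TB usable after raidz3 parity and pool overhead ("over 100TB" above).
usable_per_chassis_tb = 110

marketing_pb_tb = 1e15 / 1e12    # a "marketing" petabyte: 1000 TB
binary_pb_tb = 2**50 / 1e12      # a power-of-two petabyte: ~1125.9 TB

print(math.ceil(marketing_pb_tb / usable_per_chassis_tb))  # 10 chassis
print(math.ceil(binary_pb_tb / usable_per_chassis_tb))     # 11 chassis
```

With a slightly more generous usable-capacity assumption this lands in the "eight to ten" range quoted above, so the estimate holds together.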
Re: [zfs-discuss] Petabyte pool?
On Fri, Mar 15, 2013 at 06:31:11PM -0700, Marion Hakanson wrote:
> rvandol...@esri.com said:
> > We've come close:
> >
> > admin@mes-str-imgnx-p1:~$ zpool list
> > NAME      SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
> > datapool  978T   298T  680T  30%  1.00x  ONLINE  -
> > syspool   278G   104G  174G  37%  1.00x  ONLINE  -
> >
> > Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual-pathed to a couple of LSI SAS switches.
>
> Thanks Ray,
>
> We've been looking at those too (we've had good luck with our MD1200's). How many HBA's in the R720?
>
> Thanks and regards,
> Marion

We have qty 2 LSI SAS 9201-16e HBA's (Dell resold [1]).

Ray

[1] http://accessories.us.dell.com/sna/productdetail.aspx?c=us&l=en&s=hied&cs=65&sku=a4614101
Re: [zfs-discuss] [discuss] Hardware Recommendations: SAS2 JBODs
On Tue, Nov 13, 2012 at 03:08:04PM -0500, Peter Tripp wrote:
> Hi folks,
>
> I'm in the market for a couple of JBODs. Up until now I've been relatively lucky with finding hardware that plays very nicely with ZFS. All my gear currently in production uses LSI SAS controllers (3801e, 9200-16e, 9211-8i) with backplanes powered by LSI SAS expanders (Sun x4250, Sun J4400, etc). But I'm in the market for SAS2 JBODs to support a large number of 3.5-inch SAS disks (60+ 3TB disks to start).
>
> I'm aware of potential issues with SATA drives/interposers and the whole SATA Tunneling Protocol (STP) nonsense, so I'm going to stick to a pure SAS setup. Also, since I've had trouble in the past with daisy-chained SAS JBODs, I'll probably stick with one SAS 4x cable (SFF-8088) per JBOD, and unless there were a compelling reason for multipathing I'd probably stick to a single controller.
>
> If possible I'd rather buy 20-packs of enterprise SAS disks with 5yr warranties and have the JBOD come with empty trays, but would also consider buying disks with the JBOD if the price wasn't too crazy. Does anyone have any positive/negative experiences with any of the following with ZFS?
>
> * SuperMicro SC826E16-R500LPB (2U, 12 drives, dual 500W PS, single LSI SAS2X28 expander)
> * SuperMicro SC846BE16-R920B (4U, 24 drives, dual 920W PS, single unknown expander)
> * Dell PowerVault MD1200 (2U, 12 drives, dual 600W PS, dual unknown expanders)
> * HP StorageWorks D2600 (2U, 12 drives, dual 460W PS, single/dual unknown expanders)
>
> I'm leaning towards the SuperMicro stuff, but every time I order SuperMicro gear there's always something missing or wrongly configured, so some of the cost savings gets eaten up with my time figuring out where things went wrong and returning/ordering replacements. The Dell/HP gear I'm sure is fine, but buying disks from them gets pricey quick. The last time I looked they charged $150 extra per disk when the only added value was a proprietary sled and a shorter warranty (3yr vs 5yr).
>
> I'm open to other JBOD vendors too; I was just really curious what folks were using when they needed more than two dozen 3.5" SAS disks for use with ZFS.
>
> Thanks,
> -Peter

We've had good experiences with the Dell MD line. It's been MD1200 up until now, but we are keeping our eyes on their MD3260 (60-bay). You're right in that their costs are higher for disks and such, but since we are a big Dell shop it simplifies support significantly for us, and we have quick turnaround on parts anywhere in the world. If that weren't a significant issue, I'd go SuperMicro or DataON.

We used SuperMicro for quite a while with mixed experiences. Best bet was to find a chassis that works and stick with it as long as possible. :)

Even if you're not using Nexenta, their HCL is valuable for finding HW that is likely to work for you.

Ray
Re: [zfs-discuss] IOzone benchmarking
On Thu, May 03, 2012 at 07:35:45AM -0700, Edward Ned Harvey wrote:
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ray Van Dolson
>
> System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs. 16 vdevs of 15 disks each -- RAIDZ3. NexentaStor 3.1.2.

I think you'll get better, both performance and reliability, if you break each of those 15-disk raidz3's into three 5-disk raidz1's. Here's why:

Obviously, with raidz3, if any 3 of 15 disks fail, you're still in operation, and on the 4th failure, you're toast. Obviously, with raidz1, if any 1 of 5 disks fail, you're still in operation, and on the 2nd failure, you're toast. So it's all about computing the probability of 4 overlapping failures in the 15-disk raidz3 versus 2 overlapping failures in a smaller 5-disk raidz1. In order to calculate that, you need to estimate the time to resilver any one failed disk...

In ZFS, suppose you have a record of 128k, and suppose you have a 2-way mirror vdev. Then each disk writes 128k. If you have a 3-disk raidz1, then each disk writes 64k. If you have a 5-disk raidz1, then each disk writes 32k. If you have a 15-disk raidz3, then each disk writes 10.6k.

Assume you have a machine in production, you are doing autosnapshots, and your data is volatile. Over time, this serves to fragment your data, and after a year or two in production, your resilver will be composed almost entirely of random IO. Each of the non-failed disks must read its segment of the stripe in order to reconstruct the data that will be written to the new good disk. If you're in the 15-disk raidz3 configuration, your segment size is approx 3x smaller, which means approx 3x more IO operations.

Another way of saying that: assume the amount of data you will write to your pool is the same regardless of which architecture you chose. For discussion purposes, let's say you write 3T to your pool.
And let's momentarily assume your whole pool will be composed of 15 disks, in either a single raidz3 or in 3x 5-disk raidz1. If you use one big raidz3, then the 3T will require at least 24 million 128k records to hold it all, and each 128k record will be divided up onto all the disks. If you use the smaller raidz1, then only 1T will get written to each vdev, and you will only need 8 million records on each disk. Thus, to resilver the large vdev, you will require 3x more IO operations.

Worse still, on each IO request, you have to wait for the slowest of all the disks to return. If you were in a 2-way mirror situation, your seek time would be the average seek time of a single disk. But if you were in an infinite-disk situation, your seek time would be the worst-case seek time on every single IO operation, which is about 2x longer than the average seek time. So not only do you have 3x more seeks to perform, you have up to 2x longer to wait upon each seek...

Now, to put some numbers on this... A single 1T disk can sustain (let's assume) 1.0 Gbit/sec read/write sequential. This means resilvering the entire disk sequentially, including unused space (which is not what ZFS does), would require 2.2 hours. In practice, on my 1T disks, which are in a mirrored configuration, I find resilvering takes 12 hours. I would expect this to be ~4 days if I were using 5-disk raidz1, and ~12 days if I were using 15-disk raidz3.

Your disks are all 2T, so you should double all the times I just wrote. Your raidz3 should be able to resilver a single disk in approx 24 days. Your raidz1 should be able to do one in ~8 days. If you were using mirrors, ~1 day. Suddenly the prospect of multiple overlapping failures doesn't seem so unlikely.

Ed, thanks for taking the time to write this all out. Definitely food for thought.

Ray
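Ed's per-disk segment arithmetic can be sketched as follows (a simplified model that ignores metadata, partial records, and compression):

```python
# Per-disk write size for one full 128 KiB ZFS record across vdev shapes.
# Data is striped over the non-parity disks, so each data disk gets
# recordsize / data_disks bytes of every record.
RECORD = 128 * 1024  # bytes

def segment_kib(disks, parity):
    """KiB written per data disk for one full record (raidz vdevs)."""
    data_disks = disks - parity
    return RECORD / data_disks / 1024

print(segment_kib(3, 1))   # 3-disk raidz1  -> 64.0 KiB
print(segment_kib(5, 1))   # 5-disk raidz1  -> 32.0 KiB
print(segment_kib(15, 3))  # 15-disk raidz3 -> ~10.7 KiB ("10.6k" above)
# Smaller segments per record mean more records (hence more random IOs)
# to resilver the same amount of data -- the crux of Ed's argument.
```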
[zfs-discuss] IOzone benchmarking
I'm trying to run some IOzone benchmarking on a new system to get a feel for baseline performance. Unfortunately, the system has a lot of memory (144GB), but I have some time, so am approaching my runs as follows:

Throughput:
  iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls

IOPS:
  iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt

Not sure what I gain/lose by using threads or not. Am I off on this?

System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs. 16 vdevs of 15 disks each -- RAIDZ3. NexentaStor 3.1.2.

Ray
Re: [zfs-discuss] IOzone benchmarking
On Tue, May 01, 2012 at 03:21:05AM -0700, Gary Driggs wrote:
> On May 1, 2012, at 1:41 AM, Ray Van Dolson wrote:
> > Throughput: iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls
> > IOPS: iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt
>
> Do you expect to be reading or writing 36 or 288Gb files very often on this array? The largest file size I've used in my still lengthy benchmarks was 16Gb. If you use the sizes you've proposed, it could take several days or weeks to complete. Try a web search for iozone examples if you want more details on the command switches.
>
> -Gary

The problem is this box has 144GB of memory. If I go with a 16GB file size (which I did), then memory and caching influence the results pretty severely (I get around 3GB/sec for writes!).

Obviously, I could yank RAM for purposes of benchmarking. :)

Thanks,
Ray
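For what it's worth, the file sizes in the original commands already follow the common iozone guidance of an aggregate working set at least twice physical RAM, so the ARC can't absorb it (the 2x factor is the usual rule of thumb, not something stated in this thread):

```python
# Why "-s 36G" with 8 threads, and "-s 288G" single-stream, on a 144GB box:
# the aggregate file size should be at least ~2x physical RAM so results
# measure disks rather than cache.
ram_gb = 144
target_gb = 2 * ram_gb           # aggregate working set: 288 GB

threads = 8
per_thread_gb = target_gb / threads
print(per_thread_gb)  # 36.0 -> matches "-t 8 ... -s 36G"
print(target_gb)      # 288  -> matches the single-stream "-s 288G"
```

This is also why the 16GB runs above report ~3GB/sec: at that size the test mostly exercises the ARC.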
Re: [zfs-discuss] IOzone benchmarking
On Tue, May 01, 2012 at 07:18:18AM -0700, Bob Friesenhahn wrote:
> On Mon, 30 Apr 2012, Ray Van Dolson wrote:
> > I'm trying to run some IOzone benchmarking on a new system to get a feel for baseline performance.
>
> Unfortunately, benchmarking with IOzone is a very poor indicator of what performance will be like during normal use. Forcing the system to behave like it is short on memory only tests how the system will behave when it is short on memory. Testing multi-threaded synchronous writes with IOzone might actually mean something if it is representative of your workload.
>
> Bob

Sounds like IOzone may not be my best option here (though it does produce pretty graphs). bonnie++ actually gave me more realistic-sounding numbers, and I've been reading good things about fio.

Ray
Re: [zfs-discuss] ZFS on Linux vs FreeBSD
On Wed, Apr 25, 2012 at 05:48:57AM -0700, Paul Archer wrote:
> This may fall into the realm of a religious war (I hope not!), but recently several people on this list have said/implied that ZFS was only acceptable for production use on FreeBSD (or Solaris, of course) rather than Linux with ZoL.
>
> I'm working on a project at work involving a large(-ish) amount of data, about 5TB, working its way up to 12-15TB eventually, spread among a dozen or so nodes. There may or may not be a clustered filesystem involved (probably gluster if we use anything). I've been looking at ZoL as the primary filesystem for this data. We're a Linux shop, so I'd rather not switch to FreeBSD or any of the Solaris-derived distros -- although I have no problem with them, I just don't want to introduce another OS into the mix if I can avoid it.
>
> So, the actual questions are: Is ZoL really not ready for production use? If not, what is holding it back? Features? Performance? Stability? And if not, what kind of timeframe are we looking at to get past whatever is holding it back?

I can't comment directly on experiences with ZoL as I haven't used it, but it does seem to be under active development. That can be a good thing or a bad thing. :) I for one would be hesitant to use it for anything production-facing based solely on the youth of the effort.

That said, it might be worthwhile to check out the ZoL mailing lists and bug reports to see what types of issues the early adopters are running into, and whether they are showstoppers for you or risks you are willing to accept.

For your size requirements and your intent to use Gluster, it sounds like ext4 or xfs would be entirely suitable, and both are obviously more mature on Linux at this point.

Regardless, curious to hear which way you end up going and how things work out.

Ray
[zfs-discuss] Unable to allocate dma memory for extra SGL
Hi all;

We have a Solaris 10 U9 x86 instance running on Silicon Mechanics / SuperMicro hardware. Occasionally under high load (ZFS scrub for example), the box becomes non-responsive (it continues to respond to ping but nothing else works -- not even the local console). Our only solution is to hard reset, after which everything comes up normally. Logs are showing the following:

Jan 8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jan 8 09:44:08 prodsys-dmz-zfs2    MPT SGL mem alloc failed
Jan 8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jan 8 09:44:08 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
Jan 8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jan 8 09:44:08 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
Jan 8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jan 8 09:44:10 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
Jan 8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jan 8 09:44:10 prodsys-dmz-zfs2    MPT SGL mem alloc failed
Jan 8 09:44:11 prodsys-dmz-zfs2 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free

I am able to resolve the last error by adjusting upwards the duplicate request cache sizes, but have been unable to find anything on the MPT SGL errors. Anyone have any thoughts on what this error might be?
At this point, we are simply going to apply patches to this box (we do see an outstanding mpt patch):

147150-01 R-- 124 SunOS 5.10_x86: mpt_sas patch
147702-03 R--  21 SunOS 5.10_x86: mpt patch

But we have another identically configured box at the same patch level (admittedly with slightly less workload, though it also undergoes monthly zfs scrubs) which does not experience this issue.

Ray
Re: [zfs-discuss] Unable to allocate dma memory for extra SGL
On Tue, Jan 10, 2012 at 06:23:50PM -0800, Hung-Sheng Tsao (laoTsao) wrote:
> How much RAM is there? What is the zpool setup, and what are your HBA and HDD size and type?

Hmm, actually this system has only 6GB of memory. For some reason I thought it had more.

The controller is an LSISAS2008 (which oddly enough does not seem to be recognized by lsiutil). There are 23x1TB disks (SATA interface, not SAS unfortunately) in the system. Three RAIDZ2 vdevs of seven disks each plus one spare comprise a single zpool, with two zfs file systems mounted (no deduplication or compression in use).

There are two internally mounted Intel X25-E's -- these double as the root pool and ZIL devices. There is an 80GB X25-M mounted to the expander along with the 1TB drives, operating as L2ARC.

Thanks,
Ray
[zfs-discuss] ZFS + Dell MD1200's - MD3200 necessary?
We are looking at building a storage platform based on Dell HW + ZFS (likely Nexenta). We're going Dell because they can provide solid HW support globally.

Are any of you using the MD1200 JBOD with head units *without* an MD3200 in front? We are being told that the MD1200's won't daisy chain unless the MD3200 is involved. We would be looking to use some sort of LSI-based SAS controller in the Dell front-end servers.

Looking to confirm from folks who have this deployed in the wild. Perhaps you'd be willing to describe your setup as well, and anything we might need to take into consideration (thinking of the best option for getting ZIL/L2ARC devices into Dell R510 head units, for example, in a supported manner).

Thanks,
Ray
Re: [zfs-discuss] ZFS + Dell MD1200's - MD3200 necessary?
On Thu, Jan 05, 2012 at 06:07:33PM -0800, Craig Morgan wrote:
> Ray,
>
> If you are intending to go Nexenta then speak to your local Nexenta SE; we've got HSL-qualified solutions which cover our h/w support, and we've explicitly qualed some MD1200 configs with Dell for certain deployments to guarantee support via both Dell h/w support and ourselves. If you don't know who that would be, drop me a line and I'll find someone local to you.
>
> We tend to go with the LSI cards, but even there there are some issues with regard to Dell supply or over-the-counter.
>
> HTH,
> Craig

Hi Craig;

Yep, we are doing this. Just trying to sanity check the suggested config against what folks are doing in the wild, as our Dell partner doesn't seem to think it should/can be done without the MD3200. They may have ulterior motives of course. :)

Thanks,
Ray
Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)
On Fri, Dec 30, 2011 at 05:57:47AM -0800, Hung-Sheng Tsao (laoTsao) wrote:
> S11 now supports shadow migration, just for this purpose. AFAIK, I am not sure NexentaStor supports shadow migration.

Does not appear that it does (at least the shadow property is not in NexentaStor's zfs man page). Thanks for the pointer.

Ray
Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)
Thanks for your response, Richard.

On Fri, Dec 30, 2011 at 09:52:17AM -0800, Richard Elling wrote:
> On Dec 29, 2011, at 10:31 PM, Ray Van Dolson wrote:
> > Hi all;
> >
> > We have a dev box running NexentaStor Community Edition 3.1.1 w/ 24GB RAM (we don't run dedupe on production boxes -- and we do pay for Nexenta licenses on prd as well) and an 8.5TB pool with deduplication enabled (1.9TB or so in use). Dedupe ratio is only 1.26x.
>
> Yes, this workload is a poor fit for dedup.
>
> > The box has an SLC-based SSD as ZIL and a 300GB MLC SSD as L2ARC. The box has been performing fairly poorly lately, and we're thinking it's due to deduplication:
> >
> > # echo ::arc | mdb -k | grep arc_meta
> > arc_meta_used  = 5884 MB
> > arc_meta_limit = 5885 MB
>
> This can be tuned. Since you are on the community edition and thus have no expectation of support, you can increase this limit yourself. In the future, the limit will be increased OOB. For now, add something like the following to the /etc/system file and reboot.
>
> *** Parameter: zfs:zfs_arc_meta_limit
> ** Description: sets the maximum size of metadata stored in the ARC.
> **   Metadata competes with real data for ARC space.
> ** Release affected: NexentaStor 3.0, 3.1, not needed for 4.0
> ** Validation: none
> ** When to change: for metadata-intensive or deduplication workloads,
> **   having more metadata in the ARC can improve performance.
> ** Stability: NexentaStor issue #7151 seeks to change the default
> **   value to be larger than 1/4 of arc_max.
> ** Data type: integer
> ** Default: 1/4 of arc_max (bytes)
> ** Range: 1 to arc_max
> ** Changed by: YOUR_NAME_HERE
> ** Change date: TODAYS_DATE
> **
> *set zfs:zfs_arc_meta_limit = 1000

If we wanted to do this on a running system, would the following work?

# echo arc_meta_limit/Z 0x27100 | mdb -kw

(To up arc_meta_limit to 10GB.)

> > arc_meta_max = 5888 MB
> >
> > # zpool status -D
> > ...
> > DDT entries 24529444, size 331 on disk, 185 in core
> >
> > So, not only are we using up all of our metadata cache, but the DDT table is taking up a pretty significant chunk of that (over 70%). ARC sizing is as follows:
> >
> > p     = 15331 MB
> > c     = 16354 MB
> > c_min = 2942 MB
> > c_max = 23542 MB
> > size  = 16353 MB
> >
> > I'm not really sure how to determine how many blocks are on this zpool (is it the same as the # of DDT entries? -- deduplication has been on since pool creation). If I use a 64KB block size average, I get about 31 million blocks, but DDT entries are 24 million ...
>
> The zpool status -D output shows the number of blocks.
>
> > zdb -DD and zdb -bb | grep 'bp count' both do not complete (zdb says I/O error). Probably because the pool is in use and is quite busy.
>
> Yes, zdb is not expected to produce correct output for imported pools.
>
> > Without the block count I'm having a hard time determining how much memory we _should_ have. I can only speculate that it's more at this point. :) If I assume 24 million blocks is about accurate (from the zpool status -D output above), then at 320 bytes per block we're looking at about 7.1GB for DDT table size.
>
> That is the on-disk calculation. Use the in-core number for memory consumption. RAM needed if DDT is completely in ARC = 4,537,947,140 bytes (+)
>
> > We do have L2ARC, though I'm not sure how ZFS decides what portion of the DDT stays in memory and what can go to L2ARC -- if all of it went to L2ARC, then the references to this information in arc_meta would be (at 176 bytes * 24 million blocks) around 4GB -- which again is a good chunk of arc_meta_max.
>
> Some of the data might already be in L2ARC. But L2ARC access is always slower than RAM access by a few orders of magnitude.
>
> > Given that our dedupe ratio on this pool is fairly low anyways, am looking for strategies to back out. Should we just disable deduplication and then maybe bump up the size of arc_meta_max? Maybe also increase the size of arc.size as well (8GB left for the system seems higher than we need)?
>
> The arc_size is dynamic, but limited by another bug in Solaris to effectively 7/8 of RAM (fixed in illumos). Since you are unsupported, you can try to add the following to /etc/system along with the tunable above.
>
> *** Parameter: swapfs_minfree
> ** Description: sets the minimum space reserved for the rest of the
> **   system as swapfs grows. This value is also used to calculate the
> **   dynamic upper limit of the ARC size.
> ** Release affected: NexentaStor 3.0, 3.1, not needed for 4.0
> ** Validation: none
> ** When to change: the default setting of physmem/8 caps the ARC to
> **   approximately 7/8 of physmem, a value usually much smaller than
> **   arc_max. Choosing a lower limit for swapfs_minfree can allow the
> **   ARC to grow above 7/8 of physmem.
> ** Data
[zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)
Hi all;

We have a dev box running NexentaStor Community Edition 3.1.1 w/ 24GB RAM (we don't run dedupe on production boxes -- and we do pay for Nexenta licenses on prd as well) and an 8.5TB pool with deduplication enabled (1.9TB or so in use). Dedupe ratio is only 1.26x.

The box has an SLC-based SSD as ZIL and a 300GB MLC SSD as L2ARC. The box has been performing fairly poorly lately, and we're thinking it's due to deduplication:

# echo ::arc | mdb -k | grep arc_meta
arc_meta_used  = 5884 MB
arc_meta_limit = 5885 MB
arc_meta_max   = 5888 MB

# zpool status -D
...
DDT entries 24529444, size 331 on disk, 185 in core

So, not only are we using up all of our metadata cache, but the DDT table is taking up a pretty significant chunk of that (over 70%). ARC sizing is as follows:

p     = 15331 MB
c     = 16354 MB
c_min = 2942 MB
c_max = 23542 MB
size  = 16353 MB

I'm not really sure how to determine how many blocks are on this zpool (is it the same as the # of DDT entries? -- deduplication has been on since pool creation). If I use a 64KB block size average, I get about 31 million blocks, but DDT entries are 24 million.

zdb -DD and zdb -bb | grep 'bp count' both do not complete (zdb says I/O error). Probably because the pool is in use and is quite busy.

Without the block count I'm having a hard time determining how much memory we _should_ have. I can only speculate that it's more at this point. :)

If I assume 24 million blocks is about accurate (from the zpool status -D output above), then at 320 bytes per block we're looking at about 7.1GB for DDT table size.

We do have L2ARC, though I'm not sure how ZFS decides what portion of the DDT stays in memory and what can go to L2ARC -- if all of it went to L2ARC, then the references to this information in arc_meta would be (at 176 bytes * 24 million blocks) around 4GB -- which again is a good chunk of arc_meta_max.

Given that our dedupe ratio on this pool is fairly low anyways, am looking for strategies to back out.

Should we just disable deduplication and then maybe bump up the size of arc_meta_max? Maybe also increase the size of arc.size as well (8GB left for the system seems higher than we need)?

Is there a non-disruptive way to undeduplicate everything and expunge the DDT? zfs send/recv and then back perhaps (we have the extra space)?

Thanks,
Ray

[1] http://markmail.org/message/db55j6zetifn4jkd
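The DDT arithmetic above can be checked directly from the `zpool status -D` numbers (a sketch; the 320-byte on-disk and 185-byte in-core per-entry sizes are the figures quoted in this thread):

```python
# DDT memory math from the "zpool status -D" output in the post.
entries = 24529444          # DDT entries reported
on_disk_bytes = 320         # per-entry on-disk size used in the post's estimate
in_core_bytes = 185         # "size ... 185 in core" from zpool status -D

ddt_on_disk_gib = entries * on_disk_bytes / 2**30
ddt_in_core = entries * in_core_bytes
print(round(ddt_on_disk_gib, 1))  # ~7.3 GiB (the post rounds this to ~7.1GB)
print(ddt_in_core)                # 4537947140 bytes -- matching the
                                  # "RAM needed if DDT is completely in ARC"
                                  # figure given in the reply above
```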
Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)
On Thu, Dec 29, 2011 at 10:59:04PM -0800, Fajar A. Nugraha wrote:
> On Fri, Dec 30, 2011 at 1:31 PM, Ray Van Dolson rvandol...@esri.com wrote:
> > Is there a non-disruptive way to undeduplicate everything and expunge the DDT?
>
> AFAIK, no.
>
> > zfs send/recv and then back perhaps (we have the extra space)?
>
> That should work, but it's disruptive :D Others might provide a better answer, though.

Well, slightly _less_ disruptive perhaps. We can zfs send to another file system on the same system, but on a different set of disks. We then disable NFS shares on the original, do a final zfs send to sync, then share out the new undeduplicated file system with the same name. Hopefully the window here is short enough that NFS clients are able to recover gracefully.

We'd then wipe out the old zpool, recreate, and do the reverse to get the data back onto it.

Thanks,
Ray
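The cutover described here could be sketched roughly as follows (all pool, dataset, and snapshot names are hypothetical, and share handling varies by platform; this is the standard full-send-then-incremental pattern, not a tested procedure):

```shell
# Sketch: migrate a deduped dataset to a new (non-dedup) dataset, then cut over.
# Pool/dataset/snapshot names below are hypothetical.

# Bulk copy while clients stay online (dedup off on the target).
zfs set dedup=off newpool/data 2>/dev/null || true
zfs snapshot datapool/data@migrate1
zfs send datapool/data@migrate1 | zfs recv -F newpool/data

# Cutover window: stop NFS access, send only the delta, swap the share.
zfs set sharenfs=off datapool/data
zfs snapshot datapool/data@migrate2
zfs send -i @migrate1 datapool/data@migrate2 | zfs recv newpool/data
zfs set sharenfs=on newpool/data
```

The incremental send keeps the NFS outage proportional to the data written since the first snapshot, which is the "hopefully the window here is short enough" part above.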
[zfs-discuss] ZFS in front of MD3000i
We're setting up ZFS in front of an MD3000i (and attached MD1000 expansion trays). The rule of thumb is to let ZFS manage all of the disks, so we wanted to expose each MD3000i spindle via a JBOD mode of some sort. Unfortunately, it doesn't look like the MD3000i supports this (though this[1] post seems to reference an Enhanced JBOD mode), so we decided to create a whole bunch of 1-disk RAID0 LUNs and expose those. Great... except that the MD3000i only lets you create 16 LUNs, and we have 44 disks total. :)

Anyone tried this? I guess our best bet will be to just do all the RAID stuff on the MD3000i and export one LUN to ZFS.

Ray

[1] http://don.blogs.smugmug.com/2007/10/01/dell-md3000-great-das-db-storage/
Re: [zfs-discuss] Replacement for X25-E
On Thu, Sep 22, 2011 at 12:46:42PM -0700, Brandon High wrote: On Tue, Sep 20, 2011 at 12:21 AM, Markus Kovero markus.kov...@nebula.fi wrote: Hi, I was wondering do you guys have any recommendations as replacement for Intel X25-E as it is being EOL’d? Mainly as for log device. The Intel 311 seems like a good fit. It's a 20GB SLC device intended to act as a cache device with the Z68 chipset. It seems to perform similarly to the X25-E as well (3300 IOPS for random writes). Perhaps the drive can be overprovisioned as well? My impression was that Intel was classifying the 3xx series as non-Enterprise however, even with the SLC. I'm not sure what its rated lifetime is (1PB of data written?). Ray
Re: [zfs-discuss] Replacement for X25-E
On Thu, Sep 22, 2011 at 01:21:26PM -0700, Brandon High wrote: On Thu, Sep 22, 2011 at 12:53 PM, Ray Van Dolson rvandol...@esri.com wrote: It seems to perform similarly to the X-25E as well (3300 IOPS for random writes). Perhaps the drive can be overprovisioned as well? My impression was that Intel was classifying the 3xx series as non-Enterprise however. Even with the SLC. I don't think the 311 has any over-provisioning (other than the 7% from GB - GiB conversion). I believe it is an X25-E with only 5 channels populated. The upcoming enterprise models are MLC based and have greater over-provisioning AFAIK. The 20GB 311 only costs ~ $100 though. The 100GB Intel 710 costs ~ $650. The 311 is a good choice for home or budget users, and it seems that the 710 is much bigger than it needs to be for slog devices. My thoughts exactly. If the 311 is aimed at home users (wear-wise in _addition_ to marketing wise), then it doesn't really seem there is a suitable Intel replacement for the X-25E as far as an slog device is concerned. The drives are all way too big. :) We are currently looking at using the 320 or 710 overprovisioned (though the latter is likely more than we want to spend). Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Replacement for X25-E
On Thu, Sep 22, 2011 at 01:34:09PM -0700, Bob Friesenhahn wrote: On Thu, 22 Sep 2011, Brandon High wrote: The 20GB 311 only costs ~ $100 though. The 100GB Intel 710 costs ~ $650. The 311 is a good choice for home or budget users, and it seems that the 710 is much bigger than it needs to be for slog devices. Much too big is a good thing if it results in much more space available for wear-leveling. If the device is designed well, it should last longer. Bob Of course, at $650 a pop, if you're buying two Intel 710 100GB drives for either increased performance or redundancy, you could basically afford a DDRdrive... Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Intel 320 as ZIL?
On Fri, Aug 12, 2011 at 06:53:22PM -0700, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Ray Van Dolson For ZIL, I suppose we could get the 300GB drive and overcommit to 95%! What kind of benefit does that offer? I suppose, if you have a 300G drive and the OS can only see 30G of it, then the drive can essentially treat all the other 270G as having been TRIM'd implicitly, even if your OS doesn't support TRIM. It is certainly conceivable this could make a big difference. Perhaps this is it. Pulled the recommendation from Intel's Solid-State Drive 320 Series in Server Storage Applications whitepaper. Section 4.1: A small reduction in an SSD’s usable capacity can provide a large increase in random write performance and endurance. All Intel SSDs have more NAND capacity than what is available for user data. The unused capacity is called spare capacity. This area is reserved for internal operations. The larger the spare capacity, the more efficiently the SSD can perform random write operations and the higher the random write performance. On the Intel SSD 320 Series, the spare capacity reserved at the factory is 7% to 11% (depending on the SKU) of the full NAND capacity. For better random write performance and endurance, the spare capacity can be increased by reducing the usable capacity of the drive; this process is called over-provisioning. Have you already tested it? Anybody? Or is it still just theoretical performance enhancement, compared to using a normal sized drive in a normal mode? Haven't yet tested it, but hope to shortly. Ray
Re: [zfs-discuss] Intel 320 as ZIL?
On Mon, Aug 15, 2011 at 01:38:36PM -0700, Brandon High wrote: On Thu, Aug 11, 2011 at 1:00 PM, Ray Van Dolson rvandol...@esri.com wrote: Are any of you using the Intel 320 as ZIL? It's MLC based, but I understand its wear and performance characteristics can be bumped up significantly by increasing the overprovisioning to 20% (dropping usable capacity to 80%). Intel recently added the 311, a small SLC-based drive for use as a temp cache with their Z68 platform. It's limited to 20GB, but it might be a better fit for use as a ZIL than the 320. -B Looks interesting... specs around the same as the old X-25E. We have heard however, that Intel will be announcing a true successor to their X-25E line shortly. Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Intel 320 as ZIL?
On Thu, Aug 11, 2011 at 09:17:38PM -0700, Cooper Hubbell wrote: Which 320 series drive are you targeting, specifically? The ~$100 80GB variant should perform as well as the more expensive versions if your workload is more random from what I've seen/read. ESX NFS-attached datastore activity. Probably up to 100 VM's (about the same as we did with the X-25E). Larger drives would let us set overcommit pretty high :) For ZIL, I suppose we could get the 300GB drive and overcommit to 95%! Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Intel 320 as ZIL?
Are any of you using the Intel 320 as ZIL? It's MLC based, but I understand its wear and performance characteristics can be bumped up significantly by increasing the overprovisioning to 20% (dropping usable capacity to 80%). Anyone have experience with this? Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Intel 320 as ZIL?
On Thu, Aug 11, 2011 at 01:10:07PM -0700, Ian Collins wrote: On 08/12/11 08:00 AM, Ray Van Dolson wrote: Are any of you using the Intel 320 as ZIL? It's MLC based, but I understand its wear and performance characteristics can be bumped up significantly by increasing the overprovisioning to 20% (dropping usable capacity to 80%). A log device doesn't have to be larger than a few GB, so that shouldn't be a problem. I've found even low cost SSDs make a huge difference to the NFS write performance of a pool. We've been using the X-25E (SLC-based). It's getting hard to find, and since we're trying to stick to Intel drives (Nexenta certifies them), and Intel doesn't have a new SLC drive available until late September, we're hoping an overprovisioned 320 could fill the gap until then and perform at least as well as the X-25E. Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
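For context, adding a pair of SSDs like those X-25Es as a mirrored slog is a single command; the pool and device names below are examples only:

```shell
# Attach a mirrored log device to an existing pool.
zpool add tank log mirror c4t0d0 c4t1d0

# The pool status should now show a separate "logs" section.
zpool status tank
```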
[zfs-discuss] Adjusting HPA from Solaris on Intel 320 SSD's
Is there a way to tweak the HPA (Host Protected Area) on an Intel 320 SSD using native Solaris commands? In this case, we'd like to shrink the usable space so as to improve performance per recommendation in Intel Solid-State Drive 320 Series in Server Storage Applications section 4.1. hdparm on Linux is referenced, and it may be doable via the Intel Solid State Drive Toolbox, but would be great to be able to tweak and query this from Solaris / OpenSolaris / NexentaStor. Did come across this[1] thread from 2007, but it's not clear if 'format' or some other utility gained this functionality since. Thanks, Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
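For reference, the Linux-side hdparm route the whitepaper alludes to looks like the following; the device name and sector count are illustrative, and whether Solaris 'format' can do the same remains the open question:

```shell
# Show the current visible and native max sector counts.
hdparm -N /dev/sdb

# Shrink the visible capacity to ~24 GB (46875000 x 512-byte sectors);
# the 'p' prefix makes the new HPA setting persist across power cycles.
hdparm -N p46875000 /dev/sdb
```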
Re: [zfs-discuss] Should Intel X25-E not be used with a SAS Expander?
On Thu, Jun 02, 2011 at 11:19:25AM -0700, Josh Simon wrote: I don't believe this to be the reason since there are other SATA (single-port) SSD drives listed as approved in that same document. Upon further research I found some interesting links that may point to a potentially different reason for not using the Intel X25-E with a SAS Expander: http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html Update: At a significant account, I can say that we (meaning Nexenta) have verified that SAS/SATA expanders combined with high loads of ZFS activity have proven conclusively to be highly toxic. So, if you're designing an enterprise storage solution, please consider using SAS all the way to the disk drives, and just skip those cheaper SATA options. You may think SATA looks like a bargain, but when your array goes offline during ZFS scrub or resilver operations because the expander is choking on cache sync commands, you'll really wish you had spent the extra cash up front. Really. and http://gdamore.blogspot.com/2010/12/update-on-sata-expanders.html This sounds like it will affect a lot of people since so many are using SATA SSD for their log devices connected to SAS expanders. Thanks, Josh Simon Yup; reset storms affected us as well (we were using the X-25 series for ZIL/L2ARC). Only the ZIL drives were impacted, but it was a large impact :) Our solution was to move the SSD's off of the expander and remount internally attached via one of the LSI SAS ports directly (we also had problems with running the drives directly off the on-board SATA ports on our SuperMicro motherboards -- occasionally the entire zpool would freeze up). Ray On 06/02/2011 01:25 PM, Jim Klimov wrote: 2011-06-02 18:40, Josh Simon пишет: I was just doing some storage research and came across this http://www.nexenta.com/corp/images/stories/pdfs/hardware-supported.pdf. 
In that document for Nexenta (an opensolaris variant) it states that you should not use Intel X25-E SSDSA2SH032G1 SSD with a SAS Expander. Can anyone tell me why? This seems to be a very common drive people deploy in ZFS pools. I believe one reason is that these are single-port devices - and as such do not support failover to another SAS path. //Jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Should Intel X25-E not be used with a SAS Expander?
On Thu, Jun 02, 2011 at 11:39:13AM -0700, Donald Stahl wrote: Yup; reset storms affected us as well (we were using the X-25 series for ZIL/L2ARC). Only the ZIL drives were impacted, but it was a large impact :) What did you see with your reset storm? Were there log errors in /var/adm/messages or did you need to check the controller logs with something like lsiutil? Yep, /var/adm/messages had Unit Attention errors. Ref: http://markmail.org/message/5rmfzvqwlmosh2oh Did the reset workaround in the blog post help? We re-architected before reading the blog post, so I'm unsure if it would have helped or not. In any case, moving the SSD's internal lets us use additional hot-swappable data disks, so it was beneficial in other areas as well. The expanders you were using were SAS/SATA expanders? Or SAS expanders with adapters on the drive to allow the use of SATA disks? The expander was a SuperMicro SAS-846EL1, which is a SAS expander but has SFF-8482 connectors to provide compatibility with SATA drives. I've been using 4 X-25E's with Promise J610sD SAS shelves and the AAMUX adapters and have yet to have a problem. It definitely seemed intermittent, and various suggestions we received indicated we might need to downgrade our backplane/expander's firmware. Never did try that, but it wouldn't surprise me if behavior was better/worse on different backplanes... Our solution was to move the SSD's off of the expander and remount internally attached via one of the LSI SAS ports directly (we also had problems with running the drives directly off the on-board SATA ports on our SuperMicro motherboards -- occasionally the entire zpool would freeze up). I'm surprised you had problems with the internal SATA ports as well -- any idea what was causing the problems there? Nope. I posted this: http://mail.opensolaris.org/pipermail/zfs-discuss/2010-October/045625.html But got no responses. 
We resolved the NFS errors (which I believe were coincidental), but the watchdog port issues kept recurring without rhyme or reason. The box itself wouldn't lock up, but the zpool would become non-responsive and we'd have to hard reset. This was all production stuff, so as soon as we were able to, we ditched using the SATA ports entirely instead of pursuing a fix with Sun. Ray
[zfs-discuss] Tuning disk failure detection?
We recently had a disk fail on one of our whitebox (SuperMicro) ZFS arrays (Solaris 10 U9). The disk began throwing errors like this: May 5 04:33:44 dev-zfs4 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci15d9,400@0 (mpt_sas0): May 5 04:33:44 dev-zfs4mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31110610 And errors for the drive were incrementing in iostat -En output. Nothing was seen in fmdump. Unfortunately, it took about three hours for ZFS (or maybe it was MPT) to decide the drive was actually dead: May 5 07:41:06 dev-zfs4 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c5002cbc76c0 (sd4): May 5 07:41:06 dev-zfs4drive offline During this three hours the I/O performance on this server was pretty bad and caused issues for us. Once the drive failed completely, ZFS pulled in a spare and all was well. My question is -- is there a way to tune the MPT driver or even ZFS itself to be more/less aggressive on what it sees as a failure scenario? I suppose this would have been handled differently / better if we'd been using real Sun hardware? Our other option is to watch better for log entries similar to the above and either alert someone or take some sort of automated action .. I'm hoping there's a better way to tune this via driver or ZFS settings however. Thanks, Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning disk failure detection?
On Tue, May 10, 2011 at 02:42:40PM -0700, Jim Klimov wrote: In a recent post r-mexico wrote that they had to parse system messages and manually fail the drives on a similar, though different, occasion: http://opensolaris.org/jive/message.jspa?messageID=515815#515815 Thanks Jim, good pointer. It sounds like our use of SATA disks is likely the problem and we'd have better error reporting with SAS or some of the nearline SAS drives (SATA drives with a real SAS controller on them). Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning disk failure detection?
On Tue, May 10, 2011 at 03:57:28PM -0700, Brandon High wrote: On Tue, May 10, 2011 at 9:18 AM, Ray Van Dolson rvandol...@esri.com wrote: My question is -- is there a way to tune the MPT driver or even ZFS itself to be more/less aggressive on what it sees as a failure scenario? You didn't mention what drives you had attached, but I'm guessing they were normal desktop drives. I suspect (but can't confirm) that using enterprise drives with TLER / ERC / CCTL would have reported the failure up the stack faster than a consumer drive. The drives will report an error after 7 seconds rather than retry for several minutes. You may be able to enable the feature on your drives, depending on the manufacturer and firmware revision. -B Yup, shoulda included that. These are regular SATA drives -- supposedly Enterprise whatever that gives us (most likely a higher MTBF number). We'll probably look at going with nearline SAS drives (only increases cost slightly) and write a small SEC rule on our syslog server to watch for 0x3000 errors on servers with SATA disks only so we can at least be alerted more quickly. Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
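A hypothetical first cut at the log watching Ray mentions -- the pattern, path, and structure here are illustrative assumptions, not a tested SEC rule:

```shell
# Count mpt_sas IOC error events in a syslog file; a real deployment
# would run this from cron and alert once past some threshold.
count_ioc_errors() {
    grep -c 'IOCStatus=0x8000' "$1"
}

# Demo against a captured snippet from the thread:
cat > /tmp/messages.sample <<'EOF'
May  5 04:33:44 dev-zfs4 scsi: WARNING: /pci@0,0/pci8086,3410@9/pci15d9,400@0 (mpt_sas0):
May  5 04:33:44 dev-zfs4 mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31110610
EOF
count_ioc_errors /tmp/messages.sample   # prints 1
```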
Re: [zfs-discuss] Deduplication Memory Requirements
On Wed, May 04, 2011 at 08:49:03PM -0700, Edward Ned Harvey wrote: From: Tim Cook [mailto:t...@cook.ms] That's patently false. VM images are the absolute best use-case for dedup outside of backup workloads. I'm not sure who told you/where you got the idea that VM images are not ripe for dedup, but it's wrong. Well, I got that idea from this list. I said a little bit about why I believed it was true ... about dedup being ineffective for VM's ... Would you care to describe a use case where dedup would be effective for a VM? Or perhaps cite something specific, instead of just wiping the whole thing and saying patently false? I don't feel like this comment was productive... We use dedupe on our VMware datastores and typically see 50% savings, often times more. We do of course keep like VM's on the same volume (at this point nothing more than groups of Windows VM's, Linux VM's and so on). Note that this isn't on ZFS (yet), but we hope to begin experimenting with it soon (using NexentaStor). Apologies for devolving the conversation too much in the NetApp direction -- simply was a point of reference for me to get a better understanding of things on the ZFS side. :) Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Permanently using hot spare?
Have a failed drive on a ZFS pool (three RAIDZ2 vdevs, one hot spare). The hot spare kicked in and all is well. Is it possible to just make that hot spare disk -- already silvered into the pool -- as a permanent part of the pool? We could then throw in a new disk and mark it as a spare and avoid what would seem to be an unnecessary resilver (twice, once when the spare is brought in and again when we replace the failed disk). This document[1] seems to make it sound like it can be done, but I'm not really seeing how... Can I add the spare disk to the pool when it's already in use? Probably not... Note this is on Solaris 10 U9. Thanks, Ray [1] http://dlc.sun.com/osol/docs/content/ZFSADMIN/gayrd.html#gcvcw ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Permanently using hot spare?
On Thu, May 05, 2011 at 03:13:06PM -0700, TianHong Zhao wrote: Just detach the faulty disk, then the spare will become the normal disk once it's finished resilvering. # zpool detach pool fault_device_name Then you need to add the new spare: # zpool add pool spare new_spare_device There seems to be a new feature in the illumos project to support a zpool property like spare promotion, which would not require the manual detach operation. Tianhong Thanks! Great tip. Ray
[zfs-discuss] Deduplication Memory Requirements
There are a number of threads (this one[1] for example) that describe memory requirements for deduplication. They're pretty high. I'm trying to get a better understanding... on our NetApps we use 4K block sizes with their post-process deduplication and get pretty good dedupe ratios for VM content. Using ZFS we are using 128K record sizes by default, which nets us less impressive savings... however, to drop to a 4K record size would theoretically require that we have nearly 40GB of memory for only 1TB of storage (based on 150 bytes per block for the DDT). This obviously becomes prohibitively higher for 10+ TB file systems. I will note that our NetApps are using only 2TB FlexVols, but would like to better understand ZFS's (apparently) higher memory requirements... or maybe I'm missing something entirely. Thanks, Ray [1] http://markmail.org/message/wile6kawka6qnjdw ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
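The "nearly 40GB per 1TB at a 4K record size" figure can be sanity-checked with shell arithmetic, taking the ~150 bytes per DDT entry from the cited thread as a given:

```shell
# Blocks per TiB at a given recordsize, times ~150 bytes per DDT entry.
blocks_per_tib() {   # $1 = recordsize in bytes
    echo $(( (1 << 40) / $1 ))
}

BLOCKS_4K=$(blocks_per_tib 4096)
BLOCKS_128K=$(blocks_per_tib 131072)

echo "4K records:   $(( BLOCKS_4K * 150 / 1024 / 1024 / 1024 )) GiB DDT per TiB"
echo "128K records: $(( BLOCKS_128K * 150 / 1024 / 1024 )) MiB DDT per TiB"
```

That works out to roughly 37 GiB of DDT per TiB stored at 4K versus around 1.2 GiB at 128K, which lines up with the "nearly 40GB" estimate above.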
Re: [zfs-discuss] Deduplication Memory Requirements
On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote: On 5/4/2011 9:57 AM, Ray Van Dolson wrote: There are a number of threads (this one[1] for example) that describe memory requirements for deduplication. They're pretty high. I'm trying to get a better understanding... on our NetApps we use 4K block sizes with their post-process deduplication and get pretty good dedupe ratios for VM content. Using ZFS we are using 128K record sizes by default, which nets us less impressive savings... however, to drop to a 4K record size would theoretically require that we have nearly 40GB of memory for only 1TB of storage (based on 150 bytes per block for the DDT). This obviously becomes prohibitively higher for 10+ TB file systems. I will note that our NetApps are using only 2TB FlexVols, but would like to better understand ZFS's (apparently) higher memory requirements... or maybe I'm missing something entirely. Thanks, Ray I'm not familiar with NetApp's implementation, so I can't speak to why it might appear to use less resources. However, there are a couple of possible issues here: (1) Pre-write vs Post-write Deduplication. ZFS does pre-write dedup, where it looks for duplicates before it writes anything to disk. In order to do pre-write dedup, you really have to store the ENTIRE deduplication block lookup table in some sort of fast (random) access media, realistically Flash or RAM. The win is that you get significantly lower disk utilization (i.e. better I/O performance), as (potentially) much less data is actually written to disk. Post-write Dedup is done via batch processing - that is, such a design has the system periodically scan the saved data, looking for duplicates. While this method also greatly benefits from being able to store the dedup table in fast random storage, it's not anywhere as critical. 
The downside here is that you see much higher disk utilization - the system must first write all new data to disk (without looking for dedup), and then must also perform significant I/O later on to do the dedup. Makes sense. (2) Block size: a 4k block size will yield better dedup than a 128k block size, presuming reasonable data turnover. This is inherent, as any single bit change in a block will make it non-duplicated. With 32x the block size, there is a much greater chance that a small change in data will require a large loss of dedup ratio. That is, 4k blocks should almost always yield much better dedup ratios than larger ones. Also, remember that the ZFS block size is a SUGGESTION for zfs filesystems (i.e. it will use UP TO that block size, but not always that size), but is FIXED for zvols. (3) Method of storing (and data stored in) the dedup table. ZFS's current design is (IMHO) rather piggy on DDT and L2ARC lookup requirements. Right now, ZFS requires a record in the ARC (RAM) for each L2ARC (cache) entry, PLUS the actual L2ARC entry. So, it boils down to 500+ bytes of combined L2ARC RAM usage per block entry in the DDT. Also, the actual DDT entry itself is perhaps larger than absolutely necessary. So the addition of L2ARC doesn't necessarily reduce the need for memory (at least not much if you're talking about 500 bytes combined)? I was hoping we could slap in 80GB's of SSD L2ARC and get away with only 16GB of RAM for example. I suspect that NetApp does the following to limit their resource usage: they presume the presence of some sort of cache that can be dedicated to the DDT (and, since they also control the hardware, they can make sure there is always one present). Thus, they can make their code completely avoid the need for an equivalent to the ARC-based lookup. In addition, I suspect they have a smaller DDT entry itself. 
Which boils down to probably needing 50% of the total resource consumption of ZFS, and NO (or extremely small, and fixed) RAM requirement. Honestly, ZFS's cache (L2ARC) requirements aren't really a problem. The big issue is the ARC requirements, which, until they can be seriously reduced (or, best case, simply eliminated), really are a significant barrier to adoption of ZFS dedup. Right now, ZFS treats DDT entries like any other data or metadata in how it ages from ARC to L2ARC to gone. IMHO, the better way to do this is simply require the DDT to be entirely stored on the L2ARC (if present), and not ever keep any DDT info in the ARC at all (that is, the ARC should contain a pointer to the DDT in the L2ARC, and that's it, regardless of the amount or frequency of access of the DDT). Frankly, at this point, I'd almost change the design to REQUIRE a L2ARC device in order to turn on Dedup. Thanks for your response, Erik. Very helpful. Ray
Re: [zfs-discuss] Deduplication Memory Requirements
On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote: On Wed, May 4, 2011 at 12:29 PM, Erik Trimble erik.trim...@oracle.com wrote: I suspect that NetApp does the following to limit their resource usage: they presume the presence of some sort of cache that can be dedicated to the DDT (and, since they also control the hardware, they can make sure there is always one present). Thus, they can make their code AFAIK, NetApp has more restrictive requirements about how much data can be dedup'd on each type of hardware. See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller pieces of hardware can only dedup 1TB volumes, and even the big-daddy filers will only dedup up to 16TB per volume, even if the volume size is 32TB (the largest volume available for dedup). NetApp solves the problem by putting rigid constraints around the problem, whereas ZFS lets you enable dedup for any size dataset. Both approaches have limitations, and it sucks when you hit them. -B That is very true, although worth mentioning you can have quite a few of the dedupe/SIS enabled FlexVols on even the lower-end filers (our FAS2050 has a bunch of 2TB SIS enabled FlexVols). The FAS2050 of course has a fairly small memory footprint... I do like the additional flexibility you have with ZFS, just trying to get a handle on the memory requirements. Are any of you out there using dedupe ZFS file systems to store VMware VMDK (or any VM tech. really)? Curious what recordsize you use and what your hardware specs / experiences have been. Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Deduplication Memory Requirements
On Wed, May 04, 2011 at 03:49:12PM -0700, Erik Trimble wrote: On 5/4/2011 2:54 PM, Ray Van Dolson wrote: On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote: (2) Block size: a 4k block size will yield better dedup than a 128k block size, presuming reasonable data turnover. This is inherent, as any single bit change in a block will make it non-duplicated. With 32x the block size, there is a much greater chance that a small change in data will require a large loss of dedup ratio. That is, 4k blocks should almost always yield much better dedup ratios than larger ones. Also, remember that the ZFS block size is a SUGGESTION for zfs filesystems (i.e. it will use UP TO that block size, but not always that size), but is FIXED for zvols. (3) Method of storing (and data stored in) the dedup table. ZFS's current design is (IMHO) rather piggy on DDT and L2ARC lookup requirements. Right now, ZFS requires a record in the ARC (RAM) for each L2ARC (cache) entry, PLUS the actual L2ARC entry. So, it boils down to 500+ bytes of combined L2ARC RAM usage per block entry in the DDT. Also, the actual DDT entry itself is perhaps larger than absolutely necessary. So the addition of L2ARC doesn't necessarily reduce the need for memory (at least not much if you're talking about 500 bytes combined)? I was hoping we could slap in 80GB's of SSD L2ARC and get away with only 16GB of RAM for example. It reduces *somewhat* the need for RAM. Basically, if you have no L2ARC cache device, the DDT must be stored in RAM. That's about 376 bytes per dedup block. If you have an L2ARC cache device, then the ARC must contain a reference to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT entry reference. So, adding a L2ARC reduces the ARC consumption by about 55%. Of course, the other benefit from a L2ARC is the data/metadata caching, which is likely worth it just by itself. Great info. Thanks Erik. 
For dedupe workloads on larger file systems (8TB+), I wonder if it makes sense to use SLC / enterprise-class SSD (or better) devices for L2ARC instead of lower-end MLC stuff? Seems like we'd be seeing more writes to the device than in a non-dedupe scenario. Thanks, Ray
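Erik's per-entry figures make the L2ARC trade-off easy to quantify; the 376- and 176-byte numbers below are the ones quoted in this thread, not values pulled from the ZFS source:

```shell
# ARC bytes consumed per DDT entry, with and without an L2ARC device.
NO_L2ARC=376     # full DDT entry held in RAM
WITH_L2ARC=176   # ARC reference to an entry living on L2ARC

SAVINGS_PCT=$(( (NO_L2ARC - WITH_L2ARC) * 100 / NO_L2ARC ))
echo "ARC footprint per DDT entry drops by ~${SAVINGS_PCT}%"
```

Integer math gives roughly 53%, in the neighborhood of the "about 55%" Erik cites.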
Re: [zfs-discuss] Deduplication Memory Requirements
On Wed, May 04, 2011 at 04:51:36PM -0700, Erik Trimble wrote: On 5/4/2011 4:44 PM, Tim Cook wrote: On Wed, May 4, 2011 at 6:36 PM, Erik Trimble erik.trim...@oracle.com wrote: On 5/4/2011 4:14 PM, Ray Van Dolson wrote: On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote: On Wed, May 4, 2011 at 12:29 PM, Erik Trimble erik.trim...@oracle.com wrote: I suspect that NetApp does the following to limit their resource usage: they presume the presence of some sort of cache that can be dedicated to the DDT (and, since they also control the hardware, they can make sure there is always one present). Thus, they can make their code AFAIK, NetApp has more restrictive requirements about how much data can be dedup'd on each type of hardware. See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller pieces of hardware can only dedup 1TB volumes, and even the big-daddy filers will only dedup up to 16TB per volume, even if the volume size is 32TB (the largest volume available for dedup). NetApp solves the problem by putting rigid constraints around the problem, whereas ZFS lets you enable dedup for any size dataset. Both approaches have limitations, and it sucks when you hit them. -B That is very true, although worth mentioning you can have quite a few of the dedupe/SIS enabled FlexVols on even the lower-end filers (our FAS2050 has a bunch of 2TB SIS enabled FlexVols). Stupid question - can you hit all the various SIS volumes at once, and not get horrid performance penalties? If so, I'm almost certain NetApp is doing post-write dedup. That way, the strictly controlled max FlexVol size helps with keeping the resource limits down, as it will be able to round-robin the post-write dedup to each FlexVol in turn. ZFS's problem is that it needs ALL the resouces for EACH pool ALL the time, and can't really share them well if it expects to keep performance from tanking... (no pun intended) On a 2050? Probably not. It's got a single-core mobile celeron CPU and 2GB/ram. 
You couldn't even run ZFS on that box, much less ZFS+dedup. Can you do it on a model that isn't 4 years old without tanking performance? Absolutely. Outside of those two 2000 series, the reason there are dedup limits isn't performance. --Tim Indirectly, yes, it's performance, since NetApp has plainly chosen post-write dedup as a method to restrict the required hardware capabilities. The dedup limits on Volsize are almost certainly driven by the local RAM requirements for post-write dedup. It also looks like NetApp isn't providing for a dedicated DDT cache, which means that when the NetApp is doing dedup, it's consuming the normal filesystem cache (i.e. chewing through RAM). Frankly, I'd be very surprised if you didn't see a noticeable performance hit during the period that the NetApp appliance is performing the dedup scans. Yep, when the dedupe process runs, there is a drop in performance (hence we usually schedule it to run off-peak hours). Obviously this is a luxury that wouldn't be an option in every environment... During normal operations outside of the dedupe period we haven't noticed a performance hit. I don't think we hit the filer too hard however -- it's acting as a VMware datastore and only a few of the VM's have higher I/O footprints. It is a 2050C however so we spread the load across the two filer heads (although we occasionally run everything on one head when performing maintenance on the other). Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] detach configured log devices?
On Wed, Mar 16, 2011 at 09:33:58AM -0700, Jim Mauro wrote: With ZFS, Solaris 10 Update 9, is it possible to detach configured log devices from a zpool? I have a zpool with 3 F20 mirrors for the ZIL. They're coming up corrupted. I want to detach them, remake the devices and reattach them to the zpool. Yup, as long as your zpool has been updated to the correct version. Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Good SLOG devices?
On Tue, Mar 01, 2011 at 08:03:42AM -0800, Roy Sigurd Karlsbakk wrote: Hi I'm running OpenSolaris 148 on a few boxes, and newer boxes are getting installed as we speak. What would you suggest for a good SLOG device? It seems some new PCI-E-based ones are hitting the market, but will those require special drivers? Cost is obviously also an issue here Vennlige hilsener / Best regards roy What type of workload are you looking to handle? We've had good luck with pairs of Intel X-25E's for VM datastore duty. We also have a DDRdrive X1, which is probably the best option out there currently and will handle workloads the X-25E's can't. I believe a lot of folks here use the SLC-based (SF-1500) OCZ Vertex SSD's also. Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Good SLOG devices?
On Tue, Mar 01, 2011 at 09:56:35AM -0800, Roy Sigurd Karlsbakk wrote: a) do you need an SLOG at all? Some workloads (asynchronous ones) will never benefit from an SLOG. We're planning to use this box for CIFS/NFS, so we'll need an SLOG to speed things up. b) form factor. at least one manufacturer uses a PCIe card which is not compliant with the PCIe form-factor and will not fit in many cases -- especially typical 1U boxes. The box is 4U with some seven 8x PCIe slots, so I think it should do fine. c) driver support. That was why I asked here in the first place... d) do they really just go straight to ram/flash, or do they have an on-device SAS or SATA bus? Some PCIe devices just stick a small flash device on a SAS or SATA controller. I suspect that those devices won't see a lot of benefit relative to an external drive (although they could theoretically drive that private SAS/SATA bus at much higher rates than an external bus -- but I've not checked into it.) The other thing with PCIe based devices is that they consume an IO slot, which may be precious to you depending on your system board and other I/O needs. As I mentioned above, we have sufficient slots. As for the SATA/SAS onboard controller, that was the reason I asked here in the first place. So - does anyone know a good device for this? X25-E is rather old now, so there should be better ones available... I think the OCZ Vertex 2 EX (SLC) is fairly highly regarded: http://www.ocztechnology.com/ocz-vertex-2-ex-series-sata-ii-2-5-ssd.html Note that if you're using an LSI backplane (probably are if you're using SuperMicro hardware), they have tended to certify only against the X-25E. Other drives should work fine, but just an FYI. 
This page (maybe a little dated, I'm not sure) has some pretty good info: http://www.nexenta.org/projects/site/wiki/About_suggested_NAS_SAN_Hardware Vennlige hilsener / Best regards roy Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] multipath used inadvertently?
I'm troubleshooting an existing Solaris 10U9 server (x86 whitebox) and noticed its device names are extremely hairy -- very similar to the multipath device names: c0t5000C50026F8ACAAd0, etc., etc. mpathadm seems to confirm:

# mpathadm list lu
        /dev/rdsk/c0t50015179591CE0C1d0s2
                Total Path Count: 1
                Operational Path Count: 1
# ps -ef | grep mpath
    root   245     1   0   Jan 05 ?        16:38 /usr/lib/inet/in.mpathd -a

The system is SuperMicro based with an LSI SAS2008 controller in it. To my knowledge it has no multipath capabilities (or at least not as it's wired up currently). The mpt_sas driver is in use per prtconf and modinfo. My questions are:

- Under what scenario would the multipath driver get loaded at installation time for this LSI controller? I'm guessing this is what happened?
- If I disabled mpathd would I get the shorter disk device names back again? How would this impact existing zpools that are already on the system tied to these disks? I have a feeling doing this might be a little bit painful. :)

I tried to glean the original device names from stmsboot -L, but it didn't show any mappings... Thanks, Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] multipath used inadvertently?
Thanks Torrey. I definitely see that multipathing is enabled... I mainly want to understand whether or not there are installation scenarios where multipathing is enabled by default (if the mpt driver thinks it can support it will it enable mpathd at install time?) as well as the consequences of disabling it now... It looks to me as if disabling it will result in some pain. :) Ray On Tue, Feb 15, 2011 at 01:24:20PM -0800, Torrey McMahon wrote: in.mpathd is the IP multipath daemon. (Yes, it's a bit confusing that mpathadm is the storage multipath admin tool. ) If scsi_vhci is loaded in the kernel you have storage multipathing enabled. (Check with modinfo.) On 2/15/2011 3:53 PM, Ray Van Dolson wrote: I'm troubleshooting an existing Solaris 10U9 server (x86 whitebox) and noticed its device names are extremely hair -- very similar to the multipath device names: c0t5000C50026F8ACAAd0, etc, etc. mpathadm seems to confirm: # mpathadm list lu /dev/rdsk/c0t50015179591CE0C1d0s2 Total Path Count: 1 Operational Path Count: 1 # ps -ef | grep mpath root 245 1 0 Jan 05 ? 16:38 /usr/lib/inet/in.mpathd -a The system is SuperMicro based with an LSI SAS2008 controller in it. To my knowledge it has no multipath capabilities (or at least not as its wired up currently). The mpt_sas driver is in use per prtconf and modinfo. My questions are: - What scenario would the multipath driver get loaded up at installation time for this LSI controller? I'm guessing this is what happened? - If I disabled mpathd would I get the shorter disk device names back again? How would this impact existing zpools that are already on the system tied to these disks? I have a feeling doing this might be a little bit painful. :) I tried to glean the original device names from stmsboot -L, but it didn't show any mappings... Thanks, Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] multipath used inadvertently?
Thanks Cindy. Are you (or anyone else reading) aware of a way to disable MPxIO at install time? I imagine there's no harm* in leaving MPxIO enabled with single-pathed devices -- we'll likely just keep this in mind for future installs. Thanks, Ray * performance penalty -- we do see errors in our logs from time to time from mpathd letting us know disks have only one path On Tue, Feb 15, 2011 at 01:50:47PM -0800, Cindy Swearingen wrote: Hi Ray, MPxIO is on by default for x86 systems that run the Solaris 10 9/10 release. On my Solaris 10 9/10 SPARC system, I see this: # stmsboot -L stmsboot: MPxIO is not enabled stmsboot: MPxIO disabled You can use the stmsboot CLI to disable multipathing. You are prompted to reboot the system after disabling MPxIO. See stmsboot.1m for more info. With an x86 whitebox, I would export your ZFS storage pools first, but maybe it doesn't matter if the system is rebooted. ZFS should be able to identify the devices by their internal device IDs but I can't speak for unknown hardware. When you make hardware changes, always have current backups. Thanks, Cindy On 02/15/11 14:32, Ray Van Dolson wrote: Thanks Torrey. I definitely see that multipathing is enabled... I mainly want to understand whether or not there are installation scenarios where multipathing is enabled by default (if the mpt driver thinks it can support it will it enable mpathd at install time?) as well as the consequences of disabling it now... It looks to me as if disabling it will result in some pain. :) Ray On Tue, Feb 15, 2011 at 01:24:20PM -0800, Torrey McMahon wrote: in.mpathd is the IP multipath daemon. (Yes, it's a bit confusing that mpathadm is the storage multipath admin tool. ) If scsi_vhci is loaded in the kernel you have storage multipathing enabled. (Check with modinfo.) 
On 2/15/2011 3:53 PM, Ray Van Dolson wrote: I'm troubleshooting an existing Solaris 10U9 server (x86 whitebox) and noticed its device names are extremely hair -- very similar to the multipath device names: c0t5000C50026F8ACAAd0, etc, etc. mpathadm seems to confirm: # mpathadm list lu /dev/rdsk/c0t50015179591CE0C1d0s2 Total Path Count: 1 Operational Path Count: 1 # ps -ef | grep mpath root 245 1 0 Jan 05 ? 16:38 /usr/lib/inet/in.mpathd -a The system is SuperMicro based with an LSI SAS2008 controller in it. To my knowledge it has no multipath capabilities (or at least not as its wired up currently). The mpt_sas driver is in use per prtconf and modinfo. My questions are: - What scenario would the multipath driver get loaded up at installation time for this LSI controller? I'm guessing this is what happened? - If I disabled mpathd would I get the shorter disk device names back again? How would this impact existing zpools that are already on the system tied to these disks? I have a feeling doing this might be a little bit painful. :) I tried to glean the original device names from stmsboot -L, but it didn't show any mappings... Thanks, Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
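To answer the "is MPxIO actually buying us anything here?" question mechanically, one option is to scan a captured `mpathadm list lu` report for LUs that have only a single path behind the vHCI name. A small sketch -- the sample report below is hypothetical, modeled on the output quoted in this thread:

```python
# Parse a captured `mpathadm list lu` report and flag LUs that have only
# one path -- i.e. MPxIO vHCI naming with no actual redundancy behind it.
# SAMPLE is hypothetical text modeled on the output quoted above.
SAMPLE = """\
/dev/rdsk/c0t50015179591CE0C1d0s2
        Total Path Count: 1
        Operational Path Count: 1
/dev/rdsk/c0t5000C50026F8ACAAd0s2
        Total Path Count: 2
        Operational Path Count: 2
"""

def single_pathed_lus(report):
    lus, current = [], None
    for line in report.splitlines():
        stripped = line.strip()
        if stripped.startswith("/dev/"):
            current = stripped
        elif stripped.startswith("Total Path Count:"):
            if int(stripped.split(":")[1]) == 1:
                lus.append(current)
    return lus

print(single_pathed_lus(SAMPLE))  # → ['/dev/rdsk/c0t50015179591CE0C1d0s2']
```

If every LU in the report comes back single-pathed, MPxIO is only contributing the long WWN-style device names, not any failover benefit.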
[zfs-discuss] cfgadm MPxIO aware yet in Solaris 10 U9?
I just replaced a failing disk on one of my servers running Solaris 10 U9. The system was MPxIO enabled and I now have the old device hanging around in the cfgadm list. I understand from searching around that cfgadm may not be MPxIO aware -- at least not in Solaris 10. I see a fix was pushed to OpenSolaris but I'm hoping someone can confirm whether or not this is in Sol10U9 yet or what my other options are (short of rebooting) to clean this old device out. Maybe luxadm can do it... FYI, my zpool replace triggered resilver completed, so the disk is no longer tied to the zpool. Thanks, Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] native ZFS on Linux
On Sat, Feb 12, 2011 at 09:18:26AM -0800, David E. Anderson wrote: I see that Pinguy OS, an uber-Ubuntu o/s, includes native ZFS support. Any pointers to more info on this? Probably using this[1]. Ray [1] http://kqstor.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fwd: native ZFS on Linux
On Sat, Feb 12, 2011 at 09:36:25AM -0800, David E. Anderson wrote: went to IRC for the distro, will post when I have more info. I think this is kqstor-based Both projects would be complementary I would think. kqstor is definitely working on the POSIX layer that the zfsonlinux project lacks currently. Ray -- Forwarded message -- From: C. Bergström codest...@osunix.org Date: 2011/2/12 Subject: Re: [zfs-discuss] native ZFS on Linux To: Cc: zfs-discuss@opensolaris.org Ray Van Dolson wrote: On Sat, Feb 12, 2011 at 09:18:26AM -0800, David E. Anderson wrote: I see that Pinguy OS, an uber-Ubuntu o/s, includes native ZFS support. Any pointers to more info on this? Probably using this[1]. doubtful.. It's more likely based on http://zfsonlinux.org/ Why not post to the distro mailing list or look at the source though? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Looking for 3.5 SSD for ZIL
On Thu, Dec 23, 2010 at 07:35:29AM -0800, Deano wrote: If anybody does know of any source to the secure erase/reformatters, I’ll happily volunteer to do the port and then maintain it. I’m currently in talks with several SSD and driver chip hardware peeps with regard getting datasheets for some SSD products etc. for the purpose of better support under the OI/Solaris driver model but these things can take a while to obtain, so if anybody knows of existing open source versions I’ll jump on it. Thanks, Deano A tool to help the end user know *when* they should run the reformatter tool would be helpful too. I know we can just wait until performance degrades, but it would be nice to see what % of blocks are in use, etc. Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Looking for 3.5 SSD for ZIL
On Wed, Dec 22, 2010 at 05:43:35AM -0800, Jabbar wrote: Hello, I was thinking of buying a couple of SSD's until I found out that TRIM is only supported with SATA drives. I'm not sure if TRIM will work with ZFS. I was concerned that without TRIM support the SSD life and write throughput will get affected. Does anybody have any thoughts on this? Have been using X-25E's as ZIL for over a year. Cheap enough to replace a drive when they last that long... (still not seeing any reason to replace our current batch yet either). Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Moving rpool disks
We need to move the disks comprising our mirrored rpool on a Solaris 10 U9 x86_64 (not SPARC) system. We'll be relocating both drives to a different controller in the same system (should go from c1* to c0*). We're curious as to what the best way is to go about this? We'd love to be able to just relocate the disks and update the system BIOS to boot off the drives in their new location and have everything magically work. However, we're thinking we may need to touch GRUB config files (though maybe not since rpool is referenced in the config file) or at least re-run grub-install or something to update the MBR on both of these drives. Any advice? Thanks, Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4540 RIP
On Mon, Nov 08, 2010 at 11:51:02PM -0800, matthew patton wrote: I have this with 36 2TB drives (and 2 separate boot drives). http://www.colfax-intl.com/jlrid/SpotLight_more_Acc.asp?L=134S=58B=2267 That's just a Supermicro SC847. http://www.supermicro.com/products/chassis/4U/?chs=847 Stay away from the 24 port expander backplanes. I've gone thru several and they still don't work right - timeout and dropped drives under load. The 12-port works just fine connected to a variety of controllers. If you insist on the 24-port expander backplane, use a non-expander equipped LSI controller to drive it. What do you mean by non-expander equipped LSI controller? I got fed up with the 24-port expander board and went with -A1 (all independent) and that's worked much more reliably. Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] NFS/SATA lockups (svc_cots_kdup no slots free sata port time out)
I have a Solaris 10 U8 box (142901-14) running as an NFS server with a 23 disk zpool behind it (three RAIDZ2 vdevs). We have a single Intel X-25E SSD operating as an slog ZIL device attached to a SATA port on this machine's motherboard. The rest of the drives are in a hot-swap enclosure. Infrequently (maybe once every 4-6 weeks), the zpool on the box stops responding and although we can still SSH in and manage the server, there appears to be no way to get the zpool to function again until we hard reset. shutdown -i6 -g0 -y simply hangs forever trying to call 'sync'. The logs show the following:

Oct 19 11:42:42 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
Oct 19 11:42:50 dev-zfs1 last message repeated 189 times
Oct 19 11:42:51 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
Oct 19 11:42:55 dev-zfs1 last message repeated 99 times
Oct 19 11:42:56 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe840f453b68 timed out
Oct 19 11:42:56 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
Oct 19 11:44:00 dev-zfs1 last message repeated 1128 times
Oct 19 11:44:01 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83dffad0e8 timed out
Oct 19 11:44:02 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
Oct 19 11:45:05 dev-zfs1 last message repeated 1108 times
Oct 19 11:45:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xbe00a008 timed out
Oct 19 11:45:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xac7bc7e8 timed out
Oct 19 11:45:06 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
Oct 19 11:46:10 dev-zfs1 last message repeated 1091 times
Oct 19 11:46:11 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xb9438008 timed out
Oct 19 11:47:16 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xb03452a8 timed out
Oct 19 11:48:21 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83dfa5cd20 timed out
Oct 19 11:49:26 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xb6eaf2a0 timed out
Oct 19 11:50:31 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83dfa5c380 timed out
Oct 19 11:51:36 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83ca418b68 timed out
Oct 19 11:52:41 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83fff758c0 timed out
Oct 19 11:53:46 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xb1144548 timed out
Oct 19 11:54:51 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83dffad9a8 timed out
Oct 19 11:55:56 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83e8cd18c0 timed out
Oct 19 11:57:01 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83c43659a8 timed out
Oct 19 11:58:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xb9136468 timed out
Oct 19 11:59:11 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83e9f147e0 timed out
Oct 19 12:00:16 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xb1be7d20 timed out
Oct 19 12:01:21 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83dfa5fee0 timed out
Oct 19 12:02:26 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xbe6f7e08 timed out
Oct 19 12:03:31 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xb903c380 timed out
Oct 19 12:04:36 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83eee6f8c8 timed out
Oct 19 12:05:41 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xb04b7000 timed out
Oct 19 12:06:46 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83fff7dd28 timed out
Oct 19 12:07:51 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xb94389a8 timed out
Oct 19 12:08:56 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xae0ff388 timed out
Oct 19 12:10:01 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe84158032a8 timed out
Oct 19 12:11:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfe83f07f7e00 timed out
Oct 19 12:11:25 dev-zfs1 power: [ID 199196 kern.notice] NOTICE: Power Button pressed 2 times,
Re: [zfs-discuss] Multiple SLOG devices per pool
On Tue, Oct 12, 2010 at 08:49:00PM -0700, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Ray Van Dolson I have a pool with a single SLOG device rated at Y iops. If I add a second (non-mirrored) SLOG device also rated at Y iops will my zpool now theoretically be able to handle 2Y iops? Or close to that? Yes. But we're specifically talking about sync mode writes. Not async, and not read. And we're not comparing apples to oranges etc, not measuring an actual number of IOPS, because of aggregation etc. But I don't think that's what you were asking. I don't think you are trying to quantify the number of IOPS. I think you're trying to confirm the qualitative characteristic, If I have N slogs, I will write N times faster than a single slog. And that's a simple answer. Yes. Thanks. :) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Bursty writes - why?
On Tue, Oct 12, 2010 at 12:09:44PM -0700, Eff Norwood wrote: The NFS client in this case was VMWare ESXi 4.1 release build. What happened is that the file uploader behavior was changed in 4.1 to prevent I/O contention with the VM guests. That means when you go to upload something to the datastore, it only sends chunks of the file instead of streaming it all at once like it did in ESXi 4.0. To end users, something appeared to be broken because file uploads now took 95 seconds instead of 30. Turns out that is by design in 4.1. This is the behavior *only* for the uploader and not for the VM guests. Their I/O is as expected. Interesting. I have to say as a side note, the DDRdrive X1s make a day and night difference with VMWare. If you use VMWare via NFS, I highly recommend the X1s as the ZIL. Otherwise the VMWare O_SYNC (Stable = FSYNC) will kill your performance dead. We also tried SSDs as the ZIL which worked ok until they got full, then performance tanked. As I have posted before, SSDs as your ZIL - don't do it! -- We run SSD's as ZIL here exclusively on what I'd consider fairly busy VMware datastores and have never encountered this. How would one know how full their SSD being used as ZIL is? I was under the impression that even using a full 32GB X-25E was overkill spacewise for typical ZIL functionality... Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
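On the "how full can a ZIL get" question: a common rule of thumb is that the slog only ever holds the transaction groups that have not yet been committed to the main pool, so its worst-case occupancy is bounded by ingest bandwidth, not pool size. A hedged sketch -- the 10-second txg interval and the 2-txg bound are rough rules of thumb for the ZFS of this era, not measured values:

```python
# Rough upper bound on slog occupancy: synchronous writes can only
# accumulate for the txgs that have not yet been flushed to the pool.
# txg_interval_sec=10 and txgs_in_flight=2 are assumptions, not tunables
# read from any real system.
def max_zil_bytes(ingest_bytes_per_sec, txg_interval_sec=10, txgs_in_flight=2):
    return ingest_bytes_per_sec * txg_interval_sec * txgs_in_flight

GbE = 125 * 10**6  # ~1 gigabit/s of sync NFS writes, in bytes/sec
print(max_zil_bytes(GbE) / 10**9)  # → 2.5 (GB -- far below a 32GB X-25E)
```

Under these assumptions the 32GB X-25E never comes close to full as a ZIL on a gigabit link, which matches the "overkill spacewise" impression above; the performance cliff Eff describes would come from the flash controller running out of pre-erased blocks, not from logical fullness.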
[zfs-discuss] Multiple SLOG devices per pool
I have a pool with a single SLOG device rated at Y iops. If I add a second (non-mirrored) SLOG device also rated at Y iops will my zpool now theoretically be able to handle 2Y iops? Or close to that? Thanks, Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS disk space monitoring with SNMP
Hey folks; Running on Solaris 10 U9 here. How do most of you monitor disk usage / capacity on your large zpools remotely via SNMP tools? Net-SNMP seems to be using a 32-bit unsigned integer (based on the MIB) for hrStorageSize and friends, and thus we're not able to get accurate numbers for sizes > 2TB. Looks like potentially later versions of Net-SNMP deal with this (though I'm not sure on that), but the version of Net-SNMP with Solaris 10 is, of course, not bleeding edge. :) Thanks, Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
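For reference, the ceiling falls straight out of the HOST-RESOURCES-MIB data types: hrStorageSize is an Integer32 counted in hrStorageAllocationUnits, so the largest representable size is (2^31 - 1) allocation units, and agents extend the range by reporting a larger allocation unit. A quick sketch of the arithmetic (how any particular Net-SNMP build clamps, wraps, or rescales is version-dependent):

```python
# hrStorageSize is an Integer32 (0..2147483647) measured in
# hrStorageAllocationUnits, per the HOST-RESOURCES-MIB.
INT32_MAX = 2**31 - 1

def max_reportable_bytes(allocation_unit_bytes):
    return INT32_MAX * allocation_unit_bytes

print(max_reportable_bytes(512))   # just under 1 TiB with 512-byte units
print(max_reportable_bytes(4096))  # just under 8 TiB with 4K units
```

So with a small allocation unit, a multi-TB zpool simply cannot be expressed through hrStorageSize; the cure is an agent that scales the allocation unit (or a custom OID reporting bytes as a Counter64 / string).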
Re: [zfs-discuss] ZFS disk space monitoring with SNMP
On Fri, Oct 01, 2010 at 03:00:16PM -0700, Volker A. Brandt wrote: Hello Ray, hello list! Running on Solaris 10 U9 here. How do most of you monitor disk usage / capacity on your large zpools remotely via SNMP tools? Net SNMP seems to be using a 32-bit unsigned integer (based on the MIB) for hrStorageSize and friends, and thus we're not able to get accurate numbers for sizes > 2TB. Looks like potentially later versions of Net-SNMP deal with this (though I'm not sure on that), but the version of Net-SNMP with Solaris 10 is of course, not bleeding edge. :) Sorry to be a lamer, but me too... Has anyone integrated SNMP-based ZFS monitoring with their favorite management tool? I am looking for disk usage warnings, but I am also interested in OFFLINE messages, or nonzero values for READ/WRITE/CKSUM errors. Casual googling did not turn up anything that looked promising. There is an older ex-Sun download of an SNMP kit, but to be candid I haven't really looked at it yet. Note that I'm sure we could extend Net-SNMP and configure a custom OID to gather and present the information we're interested in. Totally willing to go that route and standardize on it here, but am curious if there's more of an out-of-the-box solution -- even if I find out it's only available in later versions of Net-SNMP (at least I could file an RFE with Oracle for this). Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
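One low-effort way to publish such custom values without writing a MIB module is Net-SNMP's `extend` directive, which exposes a command's output under NET-SNMP-EXTEND-MIB. A sketch of what that could look like in snmpd.conf -- the pool name "datapool" and the binary paths are examples, and availability of `extend` depends on the Net-SNMP version in use:

```conf
# snmpd.conf -- publish zpool capacity and health via NET-SNMP-EXTEND-MIB
# (pool name "datapool" and paths are examples, adjust for your system)
extend zpoolcap    /usr/sbin/zpool list -H -o capacity datapool
extend zpoolhealth /usr/sbin/zpool list -H -o health datapool
```

The values would then be queryable with something like `snmpwalk -v2c -c <community> <host> NET-SNMP-EXTEND-MIB::nsExtendOutput1Line`, sidestepping the Integer32 limits of the HOST-RESOURCES tables entirely.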
Re: [zfs-discuss] SCSI write retry errors on ZIL SSD drives...
Just wanted to post a quick follow-up to this. Original thread is here[1] -- not quoted for brevity. Andrew Gabriel suggested[2] that this could possibly be some workload-triggered issue. We wanted to rule out a driver problem, so we tested various configurations under Solaris 10U9 and OpenSolaris with correct 4K block alignment. The Unit Attention errors appeared under all operating environments for any X-25E (we haven't tested other brands) when used as ZIL and attached to one of the LSI port expanders used in Silicon Mechanics hardware. As soon as we move the drives to the onboard SATA controller or directly attach them to the LSI controller (bypassing the expander), the issues go away. Perhaps tweaking the firmware on the port expander would have resolved the issue, but we're not able to test that scenario currently. Of note, a heavy workload wasn't required to trigger the problem. We ran bonnie++ hard on the system -- which appeared to tax the ZIL quite a bit -- but got no errors. However, as soon as we set up an NFS VMware datastore and loaded a couple of VMs on it, the Unit Attention errors began popping up -- even when they weren't particularly busy. In any case, we'll probably stop chasing our tails on this issue and will begin mounting all drives used for ZIL internally, directly attached to the onboard SATA controllers. Thanks, Ray [1] http://mail.opensolaris.org/pipermail/zfs-discuss/2010-August/044362.html [2] http://mail.opensolaris.org/pipermail/zfs-discuss/2010-August/044364.html ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Best practice for Sol10U9 ZIL -- mirrored or not?
Best practice in Solaris 10 U8 and older was to use a mirrored ZIL. With the ability to remove slog devices in Solaris 10 U9, we're thinking we may get more bang for our buck to use two slog devices for improved IOPS performance instead of needing the redundancy so much. Any thoughts on this? If we lost our slog devices and had to reboot, would the system come up (eg could we remove failed slog devices from the zpool so the zpool would come online..) Thanks, Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedicated ZIL/L2ARC
On Tue, Sep 14, 2010 at 06:59:07AM -0700, Wolfraider wrote: We are looking into the possibility of adding a dedicated ZIL and/or L2ARC devices to our pool. We are looking into getting 4 – 32GB Intel X25-E SSD drives. Would this be a good solution to slow write speeds? We are currently sharing out different slices of the pool to windows servers using comstar and fibrechannel. We are currently getting around 300MB/sec performance with 70-100% disk busy. Opensolaris snv_134 Dual 3.2GHz quadcores with hyperthreading 16GB ram Pool_1 – 18 raidz2 groups with 5 drives a piece and 2 hot spares Disks are around 30% full No dedup It'll probably help. I'd get two X-25E's for ZIL (and mirror them) and one or two of Intel's lower end X-25M for L2ARC. There are some SSD devices out there with a super-capacitor and significantly higher IOPs ratings than the X-25E that might be a better choice for a ZIL device, but the X-25E is a solid drive and we have many of them deployed as ZIL devices here. Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance issue
On Wed, Sep 08, 2010 at 01:20:58PM -0700, Dr. Martin Mundschenk wrote: Hi! I searched the web for hours, trying to solve the NFS/ZFS low performance issue on my just setup OSOL box (snv134). The problem is discussed in many threads but I've found no solution. On a nfs shared volume, I get write performance of 3,5M/sec (!!) read performance is about 50M/sec which is ok but on a GBit network, more should be possible, since the servers disk performance reaches up to 120 M/sec. Does anyone have a solution how I can at least speed up the writes? What's the write workload like? You could try disabling the ZIL to see if that makes a difference. If it does, the addition of an SSD-based ZIL / slog device would most certainly help. Maybe you could describe the makeup of your zpool as well? Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 4k block alignment question (X-25E)
On Tue, Aug 31, 2010 at 12:47:49PM -0700, Brandon High wrote: On Mon, Aug 30, 2010 at 3:05 PM, Ray Van Dolson rvandol...@esri.com wrote: I want to fix (as much as is possible) a misalignment issue with an X-25E that I am using for both OS and as an slog device. It's pretty easy to get the alignment right. fdisk uses a default of 63/255/*, which isn't easy to change. This makes each cylinder ( 63 * 255 * 512b ). You want ( $cylinder_offset ) * ( 63 * 255 * 512b ) / ( $block_alignment_size ) to be evenly divisible. For a 4k alignment you want the offset to be 8. With fdisk, create your SOLARIS2 partition that uses the entire disk. The partition will be from cylinder 1 to whatever. Cylinder 0 is used for the MBR, so it's automatically un-aligned. When you create slices in format, the MBR cylinder isn't visible, so you have to subtract 1 from the offset, so your first slice should start on cylinder 7. Each additional slice should start on a multiple of 8, minus 1. eg: 63, 1999, etc. It doesn't matter if the end of a slice is unaligned, other than to make aligning the next slice easier. -B Thanks Brandon. Just a follow-up to my original post... unfortunately I couldn't try aligning the slice on the SSD I was also using for slog/ZIL. The slog/ZIL slice was too small to be added to the ZIL mirror as the disk we'd thrown in the system bypassing the expander was being used completely (via EFI label). Still wanted to test, however, so I pulled one of the drives from my rpool, and added the entire disk to my mirror. This uses the EFI label and aligns everything correctly. Unit Attention errors immediately began showing up. I pulled that drive from the ZIL mirror and then used one of my two L2ARC drives (also X-25E's) in the same fashion. Same problem. So I believe the problem is still expander related moreso than alignment related. Too bad. Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
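Brandon's "offset must be a multiple of 8" rule comes straight out of the geometry: a 63 x 255 x 512-byte cylinder is 8,225,280 bytes, which leaves a 512-byte remainder mod 4096, so only every 8th cylinder boundary lands on a 4K boundary. A small sketch to check candidate starting cylinders (remember the fdisk-absolute cylinder number is one higher than what format shows, since format hides the MBR cylinder):

```python
# Check whether a cylinder boundary on the default fdisk geometry
# (63 sectors/track, 255 heads, 512-byte sectors) is 4KiB-aligned.
SECTOR = 512
CYL_SECTORS = 63 * 255
CYL_BYTES = CYL_SECTORS * SECTOR  # 8,225,280 bytes per cylinder

def is_4k_aligned(fdisk_cylinder):
    """True if an fdisk-absolute cylinder start falls on a 4KiB boundary."""
    return (fdisk_cylinder * CYL_BYTES) % 4096 == 0

# Cylinder 1 (the usual Solaris2 partition start) is misaligned;
# every multiple of 8 is aligned -- hence format-relative slice starts
# of 7, 15, 63, ... as described above.
print(is_4k_aligned(1), is_4k_aligned(8))  # → False True
```

Each cylinder contributes a 512-byte misalignment (8,225,280 mod 4096 = 512), so eight cylinders accumulate exactly one 4096-byte period -- which is where the factor of 8 comes from.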
Re: [zfs-discuss] 4k block alignment question (X-25E)
On Mon, Aug 30, 2010 at 10:11:32PM -0700, Christopher George wrote: I was wondering if anyone had a benchmarking showing this alignment mattered on the latest SSDs. My guess is no, but I have no data. I don't believe there can be any doubt whether a Flash based SSD (tier1 or not) is negatively affected by partition misalignment. It is intrinsic to the required asymmetric erase/program dual operation and the resultant RMW penalty to perform a write if unaligned. This is detailed in the following vendor benchmarking guidelines (SF-1500 controller): http://www.smartm.com/files/salesLiterature/storage/AN001_Benchmark_XceedIOPSSATA_Apr2010_.pdf Highlight from link - Proper partition alignment is one of the most critical attributes that can greatly boost the I/O performance of an SSD due to reduced read modify‐write operations. It should be noted, the above highlight only applies to Flash based SSD as an NVRAM based SSD does *not* suffer the same fate, as its performance is not bound by or vary with partition (mis)alignment. Here's an article with some benchmarks: http://wikis.sun.com/pages/viewpage.action?pageId=186241353 Seems to really impact IOPS. Ray ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 4k block alignment question (X-25E)
On Mon, Aug 30, 2010 at 03:37:52PM -0700, Eric D. Mudama wrote:
> On Mon, Aug 30 at 15:05, Ray Van Dolson wrote:
> > I want to fix (as much as is possible) a misalignment issue with an
> > X-25E that I am using for both OS and as an slog device. This is on
> > x86 hardware running Solaris 10U8. Partition table looks as
> > follows:
> >
> > Part      Tag    Flag     Cylinders        Size            Blocks
> >  0       root    wm       1 - 1306       10.00GB    (1306/0/0) 20980890
> >  1 unassigned    wu       0                  0      (0/0/0)           0
> >  2     backup    wm       0 - 3886       29.78GB    (3887/0/0) 62444655
> >  3 unassigned    wu    1307 - 3886       19.76GB    (2580/0/0) 41447700
> >  4 unassigned    wu       0                  0      (0/0/0)           0
> >  5 unassigned    wu       0                  0      (0/0/0)           0
> >  6 unassigned    wu       0                  0      (0/0/0)           0
> >  7 unassigned    wu       0                  0      (0/0/0)           0
> >  8       boot    wu       0 - 0           7.84MB    (1/0/0)       16065
> >  9 unassigned    wu       0                  0      (0/0/0)           0
> >
> > And here is fdisk:
> >
> >              Total disk size is 3890 cylinders
> >              Cylinder size is 16065 (512 byte) blocks
> >
> >                                               Cylinders
> >      Partition   Status    Type          Start   End   Length    %
> >      =========   ======    ============  =====   ===   ======   ===
> >          1       Active    Solaris           1  3889    3889    100
> >
> > Slice 0 is where the OS lives and slice 3 is our slog. As you can
> > see from the fdisk partition table (and from the slice view), the
> > OS partition starts on cylinder 1 -- which is not 4k aligned.
> >
> > I don't think there is much I can do to fix this without
> > reinstalling. However, I'm most concerned about the slog slice and
> > would like to recreate its partition such that it begins on
> > cylinder 1312. So a few questions:
> >
> > - Would making s3 be 4k block aligned help even though s0 is not?
> > - Do I need to worry about 4k block aligning the *end* of the
> >   slice? E.g., instead of ending s3 on cylinder 3886, end it on
> >   3880 instead?
> >
> > Thanks,
> > Ray
>
> Do you specifically have benchmark data indicating unaligned or
> aligned+offset access on the X25-E is significantly worse than
> aligned access? I'd thought the tier-1 SSDs didn't have problems with
> these workloads.

I've been experiencing heavy Device Not Ready errors with this
configuration, and thought perhaps they could be exacerbated by the
block alignment issue. See this thread[1].

So this would be a troubleshooting step to attempt to further isolate
the problem -- by eliminating the 4k alignment issue as a factor. Just
want to make sure I set up the alignment as optimally as possible.

Ray

[1] http://markmail.org/message/5rmfzvqwlmosh2oh
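As an aside, the aligned starting cylinder for a slice can be computed rather than guessed (a sketch, assuming the default 63/255 geometry -- 16065 sectors of 512 b per cylinder -- and that format(1M) hides fdisk cylinder 0, so a slice's absolute cylinder is its format cylinder plus one):

```shell
# Find the first format(1M) cylinder at or after the current slog
# start (1307) whose absolute on-disk offset is 4 KiB-aligned.
# The offset is aligned when the absolute cylinder number is a
# multiple of 8, since 8 * 16065 * 512 bytes is divisible by 4096.
start=1307
c=$start
while [ $(( (c + 1) % 8 )) -ne 0 ]; do
    c=$(( c + 1 ))
done
echo "$c"    # prints 1311 -- note: not the 1312 suggested above
```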
Re: [zfs-discuss] 4k block alignment question (X-25E)
On Mon, Aug 30, 2010 at 03:56:42PM -0700, Richard Elling wrote:
> comment below...
>
> On Aug 30, 2010, at 3:42 PM, Ray Van Dolson wrote:
> > On Mon, Aug 30, 2010 at 03:37:52PM -0700, Eric D. Mudama wrote:
> > > On Mon, Aug 30 at 15:05, Ray Van Dolson wrote:
> > > > I want to fix (as much as is possible) a misalignment issue
> > > > with an X-25E that I am using for both OS and as an slog
> > > > device. This is on x86 hardware running Solaris 10U8.
> > > > Partition table looks as follows:
> > > >
> > > > Part      Tag    Flag     Cylinders        Size            Blocks
> > > >  0       root    wm       1 - 1306       10.00GB    (1306/0/0) 20980890
> > > >  1 unassigned    wu       0                  0      (0/0/0)           0
> > > >  2     backup    wm       0 - 3886       29.78GB    (3887/0/0) 62444655
> > > >  3 unassigned    wu    1307 - 3886       19.76GB    (2580/0/0) 41447700
> > > >  4 unassigned    wu       0                  0      (0/0/0)           0
> > > >  5 unassigned    wu       0                  0      (0/0/0)           0
> > > >  6 unassigned    wu       0                  0      (0/0/0)           0
> > > >  7 unassigned    wu       0                  0      (0/0/0)           0
> > > >  8       boot    wu       0 - 0           7.84MB    (1/0/0)       16065
> > > >  9 unassigned    wu       0                  0      (0/0/0)           0
> > > >
> > > > And here is fdisk:
> > > >
> > > >              Total disk size is 3890 cylinders
> > > >              Cylinder size is 16065 (512 byte) blocks
> > > >
> > > >                                               Cylinders
> > > >      Partition   Status    Type          Start   End   Length    %
> > > >      =========   ======    ============  =====   ===   ======   ===
> > > >          1       Active    Solaris           1  3889    3889    100
> > > >
> > > > Slice 0 is where the OS lives and slice 3 is our slog. As you
> > > > can see from the fdisk partition table (and from the slice
> > > > view), the OS partition starts on cylinder 1 -- which is not
> > > > 4k aligned.
>
> To get to a fine alignment, you need an EFI label. However, Solaris
> does not (yet) support booting from EFI-labeled disks. The older SMI
> labels are all cylinder aligned, which gives you a 1/4 chance of
> alignment.

Yep... our other boxes similar to this one are using whole disks as
ZIL, so we're able to use EFI. The Device Not Ready errors happen
there too (the SSDs are on an expander), but only at a rate of 5-15
errors per day (vs. the 500 per hour on the split OS/slog setup).

> > > > I don't think there is much I can do to fix this without
> > > > reinstalling. However, I'm most concerned about the slog slice
> > > > and would like to recreate its partition such that it begins
> > > > on cylinder 1312. So a few questions:
> > > >
> > > > - Would making s3 be 4k block aligned help even though s0 is
> > > >   not?
> > > > - Do I need to worry about 4k block aligning the *end* of the
> > > >   slice? E.g., instead of ending s3 on cylinder 3886, end it
> > > >   on 3880 instead?
> > > >
> > > > Thanks,
> > > > Ray
> > >
> > > Do you specifically have benchmark data indicating unaligned or
> > > aligned+offset access on the X25-E is significantly worse than
> > > aligned access? I'd thought the tier-1 SSDs didn't have problems
> > > with these workloads.
> >
> > I've been experiencing heavy Device Not Ready errors with this
> > configuration, and thought perhaps they could be exacerbated by
> > the block alignment issue. See this thread[1]. So this would be a
> > troubleshooting step to attempt to further isolate the problem --
> > by eliminating the 4k alignment issue as a factor.
>
> In my experience, port expanders with SATA drives do not handle the
> high I/O rate that can be generated by a modest server. We are still
> trying to get to the bottom of these issues, but they do not appear
> to be related to the OS, mpt driver, ZIL use, or alignment.
>  -- richard

Very interesting. We've been looking at Nexenta, as we haven't been
able to reproduce our issues on OpenSolaris -- I was hoping this meant
NexentaStor wouldn't have the issue.

In any case -- any thoughts on whether or not I'll be helping anything
if I change my slog slice's starting cylinder to be 4k aligned even
though slice 0 isn't? Just want to make sure I set up the alignment as
optimally as possible.

Thanks,
Ray

[1] http://markmail.org/message/5rmfzvqwlmosh2oh
Re: [zfs-discuss] 4k block alignment question (X-25E)
On Mon, Aug 30, 2010 at 04:12:48PM -0700, Edho P Arief wrote:
> On Tue, Aug 31, 2010 at 6:03 AM, Ray Van Dolson rvandol...@esri.com wrote:
> > In any case -- any thoughts on whether or not I'll be helping
> > anything if I change my slog slice starting cylinder to be 4k
> > aligned even though slice 0 isn't?
>
> Some people claim that, due to how ZFS works, there will be a
> performance hit as long as the reported sector size differs from the
> physical sector size. This thread[1] has a discussion of what
> happened and how to handle such drives on FreeBSD.
>
> [1] http://marc.info/?l=freebsd-fs&m=126976001214266&w=2

Thanks for the pointer -- these posts seem to reference data disks
within the pool rather than disks being used for slog. Perhaps some of
the same issues could arise, but I'm not sure that the variable stripe
sizing in a RAIDZ pool would change how the ZIL / slog devices are
addressed. I'm sure someone will correct me if I'm wrong on that...

Ray
Re: [zfs-discuss] VM's on ZFS - 7210
On Sat, Aug 28, 2010 at 05:50:38AM -0700, Eff Norwood wrote:
> I can't think of an easy way to measure pages that have not been
> consumed, since it's really an SSD controller function which is
> obfuscated from the OS -- and add the variable of over-provisioning
> on top of that. If anyone would like to really get into what's going
> on inside of an SSD that makes it a bad choice for a ZIL, you can
> start here:
>
> http://en.wikipedia.org/wiki/TRIM_%28SSD_command%29
>
> and
>
> http://en.wikipedia.org/wiki/Write_amplification
>
> which will be more than you might have ever wanted to know. :)

So has anyone on this list actually run into this issue? Tons of
people use SSD-backed slog devices... The theory sounds sound, but if
it's not really happening much in practice then I'm not too worried.
Especially when I can replace a drive from my slog mirror for $400 or
so if problems do arise... (the alternative being much more expensive
DRAM-backed devices).

Ray
Re: [zfs-discuss] VM's on ZFS - 7210
On Fri, Aug 27, 2010 at 05:51:38AM -0700, David Magda wrote:
> On Fri, August 27, 2010 08:46, Eff Norwood wrote:
> > Saso is correct - ESX/i always uses F_SYNC for all writes and that
> > is for sure your performance killer. Do a snoop | grep sync and
> > you'll see the sync write calls from VMware. We use DDRdrives in
> > our production VMware storage and they are excellent for solving
> > this problem. Our cluster supports 50,000 users and we've had no
> > issues at all. Do not use an SSD for the ZIL - as soon as it fills
> > up you will be very unhappy.
>
> What do you mean by "fills up"? There is a very limited amount of
> data that is written to a slog device: between 5-30 seconds' worth.
> Furthermore, a log device will at maximum be <= 50% the size of
> physical memory.

I would second this. Excellent results here with small 32GB Intel
X-25Es. Even 32GB is overkill for a ZIL.

Ray
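David's sizing point can be made concrete with back-of-the-envelope arithmetic (a sketch; a single saturated gigabit link is assumed, and 30 s is the old default txg sync interval):

```shell
# The slog only ever holds sync writes that haven't yet been
# committed to the main pool -- at most a few txg intervals' worth.
link_MBps=125        # ~1 GbE line rate, an assumed worst case
txg_seconds=30       # old default transaction group sync interval
zil_mb=$(( link_MBps * txg_seconds ))
echo "${zil_mb} MB"  # ~3750 MB -- why a 32GB X-25E is already ample
```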
Re: [zfs-discuss] VM's on ZFS - 7210
On Fri, Aug 27, 2010 at 11:57:17AM -0700, Marion Hakanson wrote:
> markwo...@yahoo.com said:
> > So the question is: with a proper ZIL SSD from Sun, and a RAID10,
> > would I be able to support all the VMs, or would it still be
> > pushing the limits of a 44-disk pool?
>
> If it weren't a closed 7000-series appliance, I'd suggest running the
> zilstat script. It should make it clear whether (and by how much) you
> would benefit from the Logzilla addition in your current raidz
> configuration. Maybe there's some equivalent in the built-in
> FishWorks analytics which can give you the same information.

To the OP... I'd think turning the write cache on would help, if
that's an option. Does the box have reliable power (UPS, etc.)?

Ray
Re: [zfs-discuss] VM's on ZFS - 7210
On Fri, Aug 27, 2010 at 12:46:42PM -0700, Mark wrote:
> It does, it's on a pair of large APCs. Right now we're using NFS for
> our ESX servers. The only iSCSI LUNs I have are mounted inside a
> couple of Windows VMs. I'd have to migrate all our VMs to iSCSI,
> which I'm willing to do if it would help and not cause other issues.
> So far the 7210 appliance has been very stable.
>
> I like the zilstat script. I emailed a support tech I am working with
> on another issue to ask if one of the built-in Analytics DTrace
> scripts will get that data. I found one called "L2ARC Eligibility":
> 3235 true, 66 false. This makes it sound like we would benefit from a
> Readzilla, not quite what I had expected... I'm sure I don't know
> what I'm looking at anyways :)

Obviously it depends on your workload, and YMMV, but for us (we're
also using NFS and love the flexibility it provides with ESX), things
are pretty dog slow without a ZIL [slog]. My impression is that
synchronous writes are used with iSCSI too, so if your problems stem
from not having a slog with NFS, they could very easily reappear even
with iSCSI. Someone else may correct me on that...

Ray
Re: [zfs-discuss] VM's on ZFS - 7210
On Fri, Aug 27, 2010 at 01:22:15PM -0700, John wrote:
> Wouldn't it be possible to saturate the SSD ZIL with enough
> backlogged sync writes? What I mean is, doesn't the ZIL eventually
> need to make it to the pool? And if the pool as a whole (spinning
> disks) can't keep up with 30+ VMs' worth of write requests, couldn't
> you fill up the ZIL that way?

It depends on the workload of course, but we have 50+ VM server
environments running off of 22x 1TB SATA + 32GB Intel X25-E SSDs with
no problems whatsoever. I don't have the zilstat numbers handy, but
we're not pushing enough I/O for the slog device to even come close to
sweating.

Note that our VMs are in a Lab Manager environment and can be spun up
and down to do compiles mostly -- we're not pushing huge amounts of
non-random I/O.

Ray
Re: [zfs-discuss] VM's on ZFS - 7210
On Fri, Aug 27, 2010 at 03:51:39PM -0700, Eff Norwood wrote:
> By all means please try it to validate it yourself, and post your
> results from hour one, day one and week one. In a ZIL use case,
> although the data set is small, the SSD is always writing a small,
> ever-changing (from the SSD's perspective) data set. The SSD does not
> know to release previously written pages, and without TRIM there is
> no way to tell it to. That means every time a ZIL write happens, new
> SSD pages are consumed. After some amount of time, all of those empty
> pages will have been consumed, and the SSD will have to go into the
> read-erase-write cycle, which is incredibly slow and is the whole
> point of TRIM. I can assure you from my extensive benchmarking with
> all major SSDs in the role of a ZIL that you will eventually not be
> happy. Depending on your use case it might take months, but
> eventually all those free pages will be consumed, and
> read-erase-write is how the SSD world works after that - unless you
> have TRIM, which we don't yet.

Is there a way to measure how many SSD pages are taken up? We've had a
box running for nearly 8 months now -- it's performing well, but I'd
be interested to see whether we're close to (theoretically) hitting
this problem or not.

Ray
Re: [zfs-discuss] SCSI write retry errors on ZIL SSD drives...
On Wed, Aug 25, 2010 at 11:47:38AM -0700, Andreas Grüninger wrote:
> Ray,
>
> Supermicro does not support the use of SSDs behind an expander. You
> must put the SSD in the head or use an interposer card; see here:
>
> http://www.lsi.com/storage_home/products_home/standard_product_ics/sas_sata_protocol_bridge/lsiss9252/index.html
>
> Supermicro offers an interposer card too: AOC-SMP-LSISS9252.

Hmm, interesting. FAQ #3 on this page[1] seems to indicate otherwise
-- at least in the case of the Intel X25-E (SSDSA2SH064G1GC) with
firmware 8860 (which we are running).

Ray

[1] http://www.supermicro.com/support/faqs/results.cfm?id=95
[zfs-discuss] SCSI write retry errors on ZIL SSD drives...
I posted a thread on this once long ago[1] -- but we're still fighting
with this problem and I wanted to throw it out here again.

All of our hardware is from Silicon Mechanics (SuperMicro chassis and
motherboards). Up until now, all of the hardware has had a single
24-disk expander/backplane -- but we recently got one of the new
SC847-based models with 24 disks up front and 12 in the back -- a
dual-backplane setup.

We're using two SSDs in the front backplane as mirrored ZIL/OS (I
don't think we have the 4k alignment set up correctly) and two drives
in the back as L2ARC. The rest of the disks are 1TB SATA disks which
make up a single large zpool via three 8-disk RAIDZ2s. As you can see,
we don't have the server maxed out on drives...

In any case, this new server gets between 400 and 600 of these timeout
errors an hour:

  Aug 21 03:10:17 dev-zfs1 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
  Aug 21 03:10:17 dev-zfs1    Log info 31126000 received for target 8.
  Aug 21 03:10:17 dev-zfs1    scsi_status=0, ioc_status=804b, scsi_state=c
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
  Aug 21 03:10:17 dev-zfs1    Log info 31126000 received for target 8.
  Aug 21 03:10:17 dev-zfs1    scsi_status=0, ioc_status=804b, scsi_state=c
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@8/pci15d9,1...@0/s...@8,0 (sd0):
  Aug 21 03:10:17 dev-zfs1    Error for Command: write(10)    Error Level: Retryable
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]    Requested Block: 21230708    Error Block: 21230708
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]    Vendor: ATA    Serial Number: CVEM002600EW
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]    Sense Key: Unit Attention
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]    ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Aug 21 03:10:21 dev-zfs1 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):

iostat -xnMCez shows that the first of the two ZIL drives receives
about twice the number of errors as the second drive. There are no
other errors on any other drives -- including the L2ARC SSDs -- and
the asvc_t times seem reasonably low and don't indicate a bad drive to
my eyes...

The timeouts above exact a rather large performance penalty on the
system, both in I/O and in general usage from an SSH console: obvious
pauses and glitches when accessing the filesystem.

The problem _follows_ the ZIL and isn't tied to hardware. IOW, if I
switch to using the L2ARC drives as ZIL, those drives suddenly exhibit
the timeout problems...

If we connect the SSD drives directly to the LSI controller instead of
hanging them off the hot-swap backplane, the timeouts go away. If we
use SSDs attached to the SATA controllers as ZIL, there are also no
performance issues or timeout errors.

So the problem only occurs with SSD drives acting as ZIL attached to
the backplane. This is leading me to believe we have a driver issue of
some sort in the mpt subsystem, unable to cope with the longer command
path of multiple backplanes. Someone alluded to this in [1] as well,
and it makes sense to me.

One quick fix would seem to be upping the SCSI timeout values. How do
you do this with the mpt driver?

We haven't yet been able to try OpenSolaris or Nexenta on one of these
systems to see if the problem goes away with later releases of the
kernel or driver, but I'm curious if anyone out there has any bright
ideas as to what we might be running into here and what's involved in
fixing it.

We've swapped out backplanes and drives, and the problem happens on
every single Silicon Mechanics system we have, so at this point I'm
really doubting it's a hardware issue :)

Hardware details are as follows:

  Silicon Mechanics Storform iServ R518
  (based on the SuperMicro SC847E16-R1400 chassis)

  SuperMicro X8DT3 motherboard w/ onboard LSI1068 controller.
  - One LSI port goes to the front backplane (where the bulk of the
    SATA drives are, plus the two SSDs used as ZIL/OS)
  - The other LSI port goes to the rear backplane (where the two L2ARC
    drives are, along with a couple of SATAs)

We've got 6GB of RAM and two quad-core Xeons in the box as well. The
SSDs themselves are all Intel X-25Es (32GB) with firmware 8860, and
the LSI 1068 is a SAS1068E B3 with firmware 011c0200 (1.28.02.00).
We're running Solaris 10U8, mostly up to date, with MPT HBA driver
v1.92.

Thoughts, theories and conjectures would be much appreciated... Sun
these days wants us to be able to reproduce the problem on Sun
hardware to get much support... Silicon Mechanics has been helpful,
but it seems they don't have a large enough inventory on hand to
replicate our hardware setup. :(

Ray

[1] http://markmail.org/message/gfz2cui2iua4dxpy
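One possibility we've been looking at for the timeout question (untested on our systems, and the tunable should be double-checked against the sd(7D)/driver docs for your Solaris release before use -- I haven't found anything mpt-specific) is the sd target driver's global command timeout in /etc/system:

```
* /etc/system fragment (sketch, untested): raise the sd target
* driver's per-command timeout from the default 60s to 120s.
* This is the generic sd knob, applied to all sd devices, not
* something specific to the mpt HBA driver. Requires a reboot.
set sd:sd_io_time = 0x78
```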
Re: [zfs-discuss] SCSI write retry errors on ZIL SSD drives...
On Tue, Aug 24, 2010 at 04:46:23PM -0700, Andrew Gabriel wrote:
> Ray Van Dolson wrote:
> > I posted a thread on this once long ago[1] -- but we're still
> > fighting with this problem and I wanted to throw it out here
> > again.
> >
> > All of our hardware is from Silicon Mechanics (SuperMicro chassis
> > and motherboards). Up until now, all of the hardware has had a
> > single 24-disk expander/backplane -- but we recently got one of
> > the new SC847-based models with 24 disks up front and 12 in the
> > back -- a dual-backplane setup.
> >
> > We're using two SSDs in the front backplane as mirrored ZIL/OS (I
> > don't think we have the 4k alignment set up correctly) and two
> > drives in the back as L2ARC. The rest of the disks are 1TB SATA
> > disks which make up a single large zpool via three 8-disk RAIDZ2s.
> > As you can see, we don't have the server maxed out on drives...
> >
> > In any case, this new server gets between 400 and 600 of these
> > timeout errors an hour:
> >
> >   Aug 21 03:10:17 dev-zfs1 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
> >   Aug 21 03:10:17 dev-zfs1    Log info 31126000 received for target 8.
> >   Aug 21 03:10:17 dev-zfs1    scsi_status=0, ioc_status=804b, scsi_state=c
> >   Aug 21 03:10:17 dev-zfs1 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
> >   Aug 21 03:10:17 dev-zfs1    Log info 31126000 received for target 8.
> >   Aug 21 03:10:17 dev-zfs1    scsi_status=0, ioc_status=804b, scsi_state=c
> >   Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@8/pci15d9,1...@0/s...@8,0 (sd0):
> >   Aug 21 03:10:17 dev-zfs1    Error for Command: write(10)    Error Level: Retryable
> >   Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]    Requested Block: 21230708    Error Block: 21230708
> >   Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]    Vendor: ATA    Serial Number: CVEM002600EW
> >   Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]    Sense Key: Unit Attention
> >   Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]    ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
> >   Aug 21 03:10:21 dev-zfs1 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
> >
> > iostat -xnMCez shows that the first of the two ZIL drives receives
> > about twice the number of errors as the second drive. There are no
> > other errors on any other drives -- including the L2ARC SSDs --
> > and the asvc_t times seem reasonably low and don't indicate a bad
> > drive to my eyes...
> >
> > The timeouts above exact a rather large performance penalty on the
> > system, both in I/O and in general usage from an SSH console:
> > obvious pauses and glitches when accessing the filesystem.
>
> This isn't a timeout. Unit Attention is the drive saying back to the
> computer that it's been reset and has forgotten any negotiation
> which happened with the controller. It's a couple of decades since I
> was working on SCSI at this level, but IIRC, a drive will return a
> Unit Attention error to the first command issued to it after a
> reset/power-up, except for a Test Unit Ready command. As it says,
> this might be caused by "power on, reset, or bus reset occurred".

Interesting. Thanks for the insight.

> > The problem _follows_ the ZIL and isn't tied to hardware. IOW, if
> > I switch to using the L2ARC drives as ZIL, those drives suddenly
> > exhibit the timeout problems...
>
> A possibility is that the problem is related to the nature of the
> load a ZIL drive attracts.
>
> One scenario could be that you are crashing the drive firmware,
> causing it to reset and reinitialize itself, and therefore to return
> Unit Attention to the next command. (I don't know if X25-Es can
> behave this way.)
>
> I would try to correct the 4k alignment on the ZIL at least - that
> does significantly affect the work the drive has to do internally
> (as well as its performance), although I've no idea if that's
> related to the issue you're seeing.

Will definitely give this a go -- certainly can't hurt.

> > If we connect the SSD drives directly to the LSI controller
> > instead of hanging them off the hot-swap backplane, the timeouts
> > go away.
>
> Again, this may be related to some combination of the load type and
> physical characteristics.
>
> > If we use SSDs attached to the SATA controllers as ZIL, there are
> > also no performance issues or timeout errors.
>
> Why not do this then? It also avoids using the SATA tunneling
> protocol across the SAS links and port expanders.

We may -- however, the main reason we'd gone with the port expander
was convenient hot-swappability. Though I guess SATA is technically
hot-swappable, it's not as convenient :)

> > So the problem only occurs with SSD drives acting as ZIL attached
> > to the backplane. This is leading me to believe we have a driver
> > issue of some sort in the mpt subsystem unable to cope with the
> > longer command path of multiple
Re: [zfs-discuss] Opensolaris is apparently dead
On Mon, Aug 16, 2010 at 08:35:05AM -0700, Tim Cook wrote:
> No, no they don't. You're under the misconception that they no
> longer own the code just because they released a copy as GPL. That
> is not true. Anyone ELSE who uses the GPL code must release
> modifications if they wish to distribute it, due to the GPL. The
> original author is free to license the code as many times under as
> many conditions as they like, and to release or not release
> subsequent changes they make to their own code. I absolutely
> guarantee Oracle can, and likely already has, dual-licensed BTRFS.

Well, Oracle obviously would want btrfs to stay as part of the Linux
kernel rather than die a death of anonymity outside of it... As such,
they'll need to continue to comply with GPLv2 requirements.

Ray
Re: [zfs-discuss] Opensolaris is apparently dead
On Mon, Aug 16, 2010 at 08:48:31AM -0700, Joerg Schilling wrote:
> Ray Van Dolson rvandol...@esri.com wrote:
> > > I absolutely guarantee Oracle can and likely already has
> > > dual-licensed BTRFS.
> >
> > Well, Oracle obviously would want btrfs to stay as part of the
> > Linux kernel rather than die a death of anonymity outside of it...
> > As such, they'll need to continue to comply with GPLv2
> > requirements.
>
> No, there is definitely no need for Oracle to comply with the GPL,
> as they own the code.

Maybe not legally, but practically there is. If they're not
GPL-compliant, why would Linus or his lieutenants continue to allow
the code to remain part of the Linux kernel? And what purpose would
btrfs serve Oracle outside of the Linux kernel?

Ray
Re: [zfs-discuss] Opensolaris is apparently dead
On Mon, Aug 16, 2010 at 08:55:49AM -0700, Tim Cook wrote:
> Why would they obviously want that? When the project started, they
> were competing with Sun. They now own Solaris; they no longer have a
> need to produce a competing product. I would be EXTREMELY surprised
> to see Oracle continue to push Linux as hard as they have in the
> past, over the next 5 years.
>
> --Tim

Well, we're getting into the realm of opinion here... but if I'm a
decision maker at Oracle, I'm not abandoning Linux, nor my potential
influence over the future de facto Linux filesystem. Oracle can gear
Solaris towards big-iron, enterprisey, niche solutions, but I'd bet a
lot that they're not abandoning the Linux space by a long shot just
because they own Solaris...

But your opinion is as valid as mine on this topic... :)

Ray
Re: [zfs-discuss] Opensolaris is apparently dead
On Mon, Aug 16, 2010 at 08:58:20AM -0700, Garrett D'Amore wrote:
> On Mon, 2010-08-16 at 08:52 -0700, Ray Van Dolson wrote:
> > On Mon, Aug 16, 2010 at 08:48:31AM -0700, Joerg Schilling wrote:
> > > Ray Van Dolson rvandol...@esri.com wrote:
> > > > > I absolutely guarantee Oracle can and likely already has
> > > > > dual-licensed BTRFS.
> > > >
> > > > Well, Oracle obviously would want btrfs to stay as part of the
> > > > Linux kernel rather than die a death of anonymity outside of
> > > > it... As such, they'll need to continue to comply with GPLv2
> > > > requirements.
> > >
> > > No, there is definitely no need for Oracle to comply with the
> > > GPL, as they own the code.
> >
> > Maybe not legally, but practically there is. If they're not
> > GPL-compliant, why would Linus or his lieutenants continue to
> > allow the code to remain part of the Linux kernel? And what
> > purpose would btrfs serve Oracle outside of the Linux kernel?
>
> If they wanted to port it to Solaris under a different license, they
> could. This may actually be a backup plan in case the NetApp suit
> goes badly. But this is pure conjecture.
>
>  - Garrett

btrfs is often described as the next default Linux filesystem (by Ted
Ts'o and others). It seems odd to me that Oracle wouldn't have an
interest in retaining a controlling interest (as in retaining the
primary engineers) in its development, and in ensuring it stays in the
Linux kernel and meets those expectations... Seems like an excellent
long-term strategy to me anyways!

Anyways, getting a bit off topic here I suppose, though it's an
interesting discussion. :)

Ray
Re: [zfs-discuss] Opensolaris is apparently dead
On Mon, Aug 16, 2010 at 08:57:19AM -0700, Joerg Schilling wrote:
> "C. Bergström" codest...@osunix.org wrote:
> > > I absolutely guarantee Oracle can and likely already has
> > > dual-licensed BTRFS.
> >
> > No.. talk to Chris Mason.. it depends on the Linux kernel too much
> > already to be available under anything but GPLv2.
>
> If he really believes this, then he seems to be misinformed about the
> legal background. The question is: who wrote the btrfs code and who
> owns it? If Oracle pays him for writing the code, then Oracle owns
> the code and can relicense it under any license they like.
>
> Jörg

I don't think anyone is arguing that Oracle can relicense their own
copyrighted code as they see fit. The real question is: WHY would they
do it? What would be the business motivation here?

Chris Mason would most likely leave Oracle, Red Hat would hire him and
fork the last GPL'd version of btrfs, and Oracle would have relegated
itself to being a non-player in the Linux filesystem space...

So yes, they can do it if they want; I just think they're not THAT
stupid. :)

Ray
Re: [zfs-discuss] Opensolaris is apparently dead
On Mon, Aug 16, 2010 at 09:08:52AM -0700, Ray Van Dolson wrote:
> On Mon, Aug 16, 2010 at 08:57:19AM -0700, Joerg Schilling wrote:
> > "C. Bergström" codest...@osunix.org wrote:
> > > > I absolutely guarantee Oracle can and likely already has
> > > > dual-licensed BTRFS.
> > >
> > > No.. talk to Chris Mason.. it depends on the Linux kernel too
> > > much already to be available under anything but GPLv2.
> >
> > If he really believes this, then he seems to be misinformed about
> > the legal background. The question is: who wrote the btrfs code
> > and who owns it? If Oracle pays him for writing the code, then
> > Oracle owns the code and can relicense it under any license they
> > like.
> >
> > Jörg
>
> I don't think anyone is arguing that Oracle can relicense their own
> copyrighted code as they see fit.

s/can/can't/
Re: [zfs-discuss] Opensolaris is apparently dead
On Mon, Aug 16, 2010 at 09:15:12AM -0700, Tim Cook wrote:
> Or, for all you know, Chris Mason's contract has a non-compete that
> states that if he leaves Oracle he's not allowed to work on any
> project he was a part of for five years. The business motivation
> would be to set the competition back a decade.

Could be, though I still feel like there are plenty of great
filesystem people in the Linux kernel community who could pick things
up just fine...

Anyways, way off topic now -- we've both made our points, I think. :)

Ray
Re: [zfs-discuss] ZFS development moving behind closed doors
On Fri, Aug 13, 2010 at 02:01:07PM -0700, C. Bergström wrote:
> Gary Mills wrote:
> > If this information is correct,
> >
> >     http://opensolaris.org/jive/thread.jspa?threadID=133043
> >
> > further development of ZFS will take place behind closed doors.
> > OpenSolaris will become the internal development version of Solaris
> > with no public distributions. The community has been abandoned.
>
> It was a community of system administrators and nearly no developers.
> While this may make big news, the real impact is probably pretty
> small. Source code updates will get tossed over the fence, and
> developer partners (Intel) will still have access to onnv-gate.

I'm interested to see how this plays out in actuality. It almost
sounded like source code wouldn't necessarily be shared until major
releases were made... which would obviously make it hard for
third-party ZFS vendors to keep up in the interim.

I guess most of this is still hearsay at this point, but if you've
read somewhere that Oracle has stated they plan to continuously share
source code and updates throughout their development process (not just
at release time), it'd be good to see...

> In a way I see this as a very good thing. It will now *force* the
> existing (small) community of companies and developers to band
> together to actually work together. From there, real open source
> momentum can happen, instead of everyone depending on Sun/Oracle to
> give them a free lunch. The first step that I've been adamant about
> is making it easier for developers to play and get their hands on
> it... If we can enable that, it'll swing things around regardless of
> what mega-corp does or doesn't do...
>
> Just my 0.02$
Re: [zfs-discuss] Adding ZIL to pool questions
On Sun, Aug 01, 2010 at 12:36:28PM -0700, Gregory Gee wrote:
> Jim, that ACARD looks really nice, but out of the price range for a
> home server. Edward, disabling ZIL might be ok, but let me
> characterize what my home server does and tell me if disabling ZIL is
> ok.
>
> My home OpenSolaris server is only used for storage. I have a separate
> linux box that runs any software I need, such as media servers. I
> export all pools from the OpenSolaris box to the linux box via NFS.
>
> The OpenSolaris box has 2 pools. The first pool stores videos,
> pictures, various files and mail, all exported via NFS to the linux
> box. It is a mirrored zpool. The second mirrored zpool is NFS store
> for VM images. The linux boxes I mentioned are actually VMs running in
> XenServer. The VM vdisks are stored and run from the OpenSolaris NFS
> server mounted in the XenServer box.
>
> Yes, I know that this is not a typical home setup, but I'm sure that
> most here don't have a 'typical home setup'. So the question is, will
> disabling ZIL have negative impacts on the VM vdisks stored in NFS? Or
> any other files on the NFS shares?

You would probably see better performance at the expense of reliability
in the case of an unplanned outage.

Ray
[zfs-discuss] Using a zvol from your rpool as zil for another zpool
We have a server with a couple X-25E's and a bunch of larger SATA
disks. To save space, we want to install Solaris 10 (our install is
only about 1.4GB) to the X-25E's and use the remaining space on the
SSD's for ZIL attached to a zpool created from the SATA drives.

Currently we do this by installing the OS using SVM+UFS (to mirror the
OS between the two SSD's) and then using the remaining space on a slice
as ZIL for the larger SATA-based zpool.

However, SVM+UFS is more annoying to work with as far as LiveUpgrade is
concerned. We'd love to use a ZFS root, but that requires that the
entire SSD be dedicated as an rpool, leaving no space for ZIL. Or does
it? It appears that we could do a:

  # zfs create -V 24G rpool/zil

on our rpool and then:

  # zpool add satapool log /dev/zvol/dsk/rpool/zil

(I realize 24G is probably far more than a ZIL device will ever need.)

As rpool is mirrored, this would also take care of redundancy for the
ZIL. This lets us have a nifty ZFS rpool for simplified LiveUpgrades
and a fast SSD-based ZIL for our SATA zpool as well...

What are the downsides to doing this? Will there be a noticeable
performance hit? I know I've seen this discussed here before, but
wasn't able to come up with the right search terms...

Thanks,
Ray
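For a rough sense of why 24G is far more than a slog will ever need: ZFS
flushes a transaction group every few seconds, so the ZIL only ever holds
a couple of txg windows' worth of synchronous writes, and the incoming
rate is bounded by the network link. A back-of-the-envelope sketch (the
10-second txg interval and two-interval window are my assumptions, not
anything stated in this thread):

```python
def max_zil_bytes(link_gbits, txg_seconds=10, intervals=2):
    """Upper bound on ZIL usage: line-rate writes for a couple of
    transaction-group windows (both parameters are assumptions)."""
    bytes_per_sec = link_gbits * 1e9 / 8  # link throughput in bytes/sec
    return bytes_per_sec * txg_seconds * intervals

# A single GigE link can feed the slog at ~125 MB/s at most, so:
gige = max_zil_bytes(1) / 2**30
tengig = max_zil_bytes(10) / 2**30
print(f"1GbE worst case: {gige:.1f} GiB; 10GbE worst case: {tengig:.1f} GiB")
```

Even at 10GbE line rate the bound comes out near the 24G figure, and at
GigE it's a small fraction of it.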
Re: [zfs-discuss] Using a zvol from your rpool as zil for another zpool
> However, SVM+UFS is more annoying to work with as far as LiveUpgrade
> is concerned. We'd love to use a ZFS root, but that requires that the
> entire SSD be dedicated as an rpool, leaving no space for ZIL. Or does
> it? It appears that we could do a:
>
>   # zfs create -V 24G rpool/zil
>
> on our rpool and then:
>
>   # zpool add satapool log /dev/zvol/dsk/rpool/zil
>
> (I realize 24G is probably far more than a ZIL device will ever need.)
>
> As rpool is mirrored, this would also take care of redundancy for the
> ZIL. This lets us have a nifty ZFS rpool for simplified LiveUpgrades
> and a fast SSD-based ZIL for our SATA zpool as well...
>
> What are the downsides to doing this? Will there be a noticeable
> performance hit? I know I've seen this discussed here before, but
> wasn't able to come up with the right search terms...

Well, after doing a little better on my searches, it sounds like -- at
least for cache/L2ARC on zvol's -- some race conditions can pop up, and
this isn't necessarily the most robust or tested configuration. Doesn't
sound like something I'd want to do in production.

Perhaps the better option is to have multiple Solaris fdisk partitions
set up. This way I could still install my rpool to the first partition
and use the remaining partition as ZIL for the SATA zpool. This
obviously would only work on x86 systems.

Would multiple fdisk partitions be the most robust way to implement
this?

Ray
Re: [zfs-discuss] Using a zvol from your rpool as zil for another zpool
On Fri, Jul 02, 2010 at 03:40:26AM -0700, Ben Taylor wrote:
>> We have a server with a couple X-25E's and a bunch of larger SATA
>> disks. To save space, we want to install Solaris 10 (our install is
>> only about 1.4GB) to the X-25E's and use the remaining space on the
>> SSD's for ZIL attached to a zpool created from the SATA drives.
>> Currently we do this by installing the OS using SVM+UFS (to mirror
>> the OS between the two SSD's) and then using the remaining space on a
>> slice as ZIL for the larger SATA-based zpool. However, SVM+UFS is
>> more annoying to work with as far as LiveUpgrade is concerned. We'd
>> love to use a ZFS root, but that requires that the entire SSD be
>> dedicated as an rpool, leaving no space for ZIL. Or does it?
>
> For every system I have ever done zfs root on, it's always been a
> slice on a disk. As an example, we have an x4500 with 1TB disks. For
> that root config, we are planning on something like 150G on s0, and
> the rest on s3: s0 for the rpool, and s3 for the qpool. We didn't want
> to have to deal with issues around flashing a huge volume, as we found
> out with our other x4500 with 500GB disks.
>
> AFAIK, it's only non-rpool disks that use the whole disk, and I doubt
> there's some sort of specific feature with an SSD, but I could be
> wrong. I like your idea of a reasonably sized rpool and the rest used
> for the ZIL. But if you're going to do LU, you should probably take a
> good look at how much space you need for the clones and snapshots on
> the rpool.

Interesting. For some reason, I coulda sworn the Sol 10 U8 installer
required you to use an entire disk for a ZFS rpool, so using only part
of the disk on a slice and leaving space for other uses wasn't an
option. I'll revisit this though.

Thanks for the reply.

Ray
Re: [zfs-discuss] Using a zvol from your rpool as zil for another zpool
On Fri, Jul 02, 2010 at 08:18:48AM -0700, Erik Ableson wrote:
> On 2 Jul 2010, at 16:30, Ray Van Dolson rvandol...@esri.com wrote:
>> On Fri, Jul 02, 2010 at 03:40:26AM -0700, Ben Taylor wrote:
>>>> We have a server with a couple X-25E's and a bunch of larger SATA
>>>> disks. To save space, we want to install Solaris 10 (our install is
>>>> only about 1.4GB) to the X-25E's and use the remaining space on the
>>>> SSD's for ZIL attached to a zpool created from the SATA drives.
>>>> Currently we do this by installing the OS using SVM+UFS (to mirror
>>>> the OS between the two SSD's) and then using the remaining space on
>>>> a slice as ZIL for the larger SATA-based zpool. However, SVM+UFS is
>>>> more annoying to work with as far as LiveUpgrade is concerned. We'd
>>>> love to use a ZFS root, but that requires that the entire SSD be
>>>> dedicated as an rpool, leaving no space for ZIL. Or does it?
>>>
>>> For every system I have ever done zfs root on, it's always been a
>>> slice on a disk. As an example, we have an x4500 with 1TB disks. For
>>> that root config, we are planning on something like 150G on s0, and
>>> the rest on s3: s0 for the rpool, and s3 for the qpool. We didn't
>>> want to have to deal with issues around flashing a huge volume, as
>>> we found out with our other x4500 with 500GB disks.
>>>
>>> AFAIK, it's only non-rpool disks that use the whole disk, and I
>>> doubt there's some sort of specific feature with an SSD, but I could
>>> be wrong. I like your idea of a reasonably sized rpool and the rest
>>> used for the ZIL. But if you're going to do LU, you should probably
>>> take a good look at how much space you need for the clones and
>>> snapshots on the rpool.
>>
>> Interesting. For some reason, I coulda sworn the Sol 10 U8 installer
>> required you to use an entire disk for a ZFS rpool, so using only
>> part of the disk on a slice and leaving space for other uses wasn't
>> an option. I'll revisit this though.
>
> It certainly works under OpenSolaris, but you might want to look into
> manually partitioning the drive to ensure that it's properly aligned
> on 4k boundaries. Last time I did that, it showed me a tiny space
> before the manually created partition.
>
> Cheers,
> Erik

Well, everything worked fine. ZFS rpool on s0 and ZIL for another pool
on s3. Unfortunately, I didn't end up doing the 4K block alignment.
Doesn't look like the fdisk keyword in JumpStart lets you specify this
sort of thing, but I probably could have pre-partitioned the disk from
the shell before running my JumpStart. Lessons learned.

Thanks all,
Ray
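The arithmetic behind the manual alignment Erik describes is just
rounding the partition's starting sector up to a 4 KiB boundary. A
small sketch (the 512-byte sector size and the example offsets are
assumptions for illustration, not anything from the thread):

```python
# SSDs with 4 KiB pages want partition starts on 4 KiB boundaries; with
# 512-byte sectors that means a starting sector divisible by 8.
SECTOR = 512
ALIGN = 4096
SECTORS_PER_BOUNDARY = ALIGN // SECTOR  # 8

def align_up(start_sector):
    """Round a starting sector up to the next 4 KiB boundary."""
    return -(-start_sector // SECTORS_PER_BOUNDARY) * SECTORS_PER_BOUNDARY

# The classic DOS fdisk default of sector 63 is misaligned; rounding up
# to sector 64 costs only the "tiny space" Erik mentions seeing before
# the manually created partition.
print(align_up(63))  # -> 64
print(align_up(64))  # -> 64 (already aligned)
```

The gap from sector 63 to 64 is that one wasted 512-byte sector, which
is why a manually aligned layout shows a small hole at the front.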
Re: [zfs-discuss] What happens when unmirrored ZIL log device is removed ungracefully
On Wed, Jun 30, 2010 at 09:47:15AM -0700, Edward Ned Harvey wrote:
>> From: Arne Jansen [mailto:sensi...@gmx.net]
>>
>> Edward Ned Harvey wrote:
>>> Due to recent experiences, and discussion on this list, my colleague
>>> and I performed some tests: Using solaris 10, fully upgraded. (zpool
>>> 15 is latest, which does not have log device removal that was
>>> introduced in zpool 19.) In any way possible, you lose an unmirrored
>>> log device, and the OS will crash, and the whole zpool is
>>> permanently gone, even after reboots.
>>
>> I'm a bit confused. I tried hard, but haven't been able to reproduce
>> this using Sol10U8. I have a mirrored slog device. While putting it
>> under load doing synchronous file creations, we pulled the power
>> cords and unplugged the slog devices. After powering on, zfs imported
>> the pool, but prompted to acknowledge the missing slog devices with
>> zpool clear. After that the pool was accessible again. That's exactly
>> how it should be.
>
> Very interesting. I did this test some months ago, so I may not recall
> the relevant details, but here are the details I do remember: I don't
> recall if I did this test on osol2009.06 or sol10. In Sol10u6 (and I
> think Sol10u8) the default zpool version is 10, but if you apply all
> your patches, then 15 becomes available. I am sure that I've never
> upgraded any of my sol10 zpools higher than 10. So it could be that an
> older zpool version might exhibit the problem, and you might be using
> a newer version. In osol2009.06, IIRC, the default is zpool 14, and if
> you upgrade fully, you'll get to something around 24. So again, it's
> possible the bad behavior went away in zpool 15, or any other number
> from 11 to 15. I'll leave it there for now. If that doesn't shed any
> light, I'll try to dust out some more of my mental cobwebs.

Anyone else done any testing with zpool version 15 (on Solaris 10 U8)?
Have a new system coming in shortly and will test myself, but knowing
this is a recoverable scenario would help me rest easier, as I still
have an unmirrored slog setup hanging around.

Ray
Re: [zfs-discuss] OCZ Devena line of enterprise SSD
On Thu, Jun 17, 2010 at 09:42:44AM -0700, F. Wessels wrote:
> I just looked it up again, and as far as I can see the supercap is
> present in the MLC version as well as the SLC.

Very nice. A pair of the 50GB SLC model would be great for ZIL. Might
continue to stick with the X-25M for L2ARC though, based on price.

Ray
Re: [zfs-discuss] Deduplication and ISO files
On Fri, Jun 04, 2010 at 01:10:44PM -0700, Ray Van Dolson wrote:
> On Fri, Jun 04, 2010 at 01:03:32PM -0700, Brandon High wrote:
>> On Fri, Jun 4, 2010 at 12:37 PM, Ray Van Dolson rvandol...@esri.com wrote:
>>> Makes sense. So, as someone else suggested, decreasing my block size
>>> may improve the deduplication ratio.
>>
>> It might. It might make your performance tank, too. Decreasing the
>> block size increases the size of the dedup table (DDT). Every entry
>> in the DDT uses somewhere around 250-270 bytes. If the DDT gets too
>> large to fit in memory, it will have to be read from disk, which will
>> destroy any sort of write performance (although an L2ARC on SSD can
>> help).
>>
>> If you move to 64k blocks, you'll double the DDT size and may not
>> actually increase your ratio. Moving to 8k blocks will increase your
>> DDT by a factor of 16, and still may not help. Changing the
>> recordsize will not affect files that are already in the dataset.
>> You'll have to recopy them to re-write with the smaller block size.
>>
>> -B
>
> Gotcha. Just trying to make sure I understand how all this works, and
> if I _would_ in fact see an improvement in dedupe ratio by tweaking
> the recordsize with our data set. Once we know that, we can decide if
> it's worth the extra costs in RAM/L2ARC.
>
> Thanks all.

FYI: with a 4K recordsize, I am seeing a 1.26x dedupe ratio between the
RHEL 5.4 ISO and the RHEL 5.5 ISO file. However, it took about 33
minutes to copy the 2.9GB ISO file onto the filesystem. :) Definitely
would need more RAM in this setup...

Ray
[zfs-discuss] Deduplication and ISO files
I'm running zpool version 23 (via ZFS-FUSE on Linux) and have a zpool
with deduplication turned on.

I am testing how well deduplication will work for the storage of many
similar ISO files, and so far am seeing unexpected results (or perhaps
my expectations are wrong).

The ISO's I'm testing with are the 32-bit and 64-bit versions of the
RHEL5 DVD ISO's. While both have their differences, they do contain a
lot of similar data as well.

If I explode both ISO files and copy them to my ZFS filesystem, I see
about a 1.24x dedup ratio. However, if I have only the ISO files on the
ZFS filesystem, the ratio is 1.00x -- no savings at all.

Does this make sense? I'm going to experiment with other combinations
of ISO files as well...

Thanks,
Ray
Re: [zfs-discuss] Deduplication and ISO files
On Fri, Jun 04, 2010 at 11:16:40AM -0700, Brandon High wrote:
> On Fri, Jun 4, 2010 at 9:30 AM, Ray Van Dolson rvandol...@esri.com wrote:
>> The ISO's I'm testing with are the 32-bit and 64-bit versions of the
>> RHEL5 DVD ISO's. While both have their differences, they do contain a
>> lot of similar data as well.
>
> Similar != identical. Dedup works on blocks in zfs, so unless the iso
> files have identical data aligned at 128k boundaries you won't see any
> savings.
>
>> If I explode both ISO files and copy them to my ZFS filesystem I see
>> about a 1.24x dedup ratio.
>
> Each file starts a new block, so the identical files can be deduped.
>
> -B

Makes sense. So, as someone else suggested, decreasing my block size
may improve the deduplication ratio. recordsize I presume is the value
to tweak?

Thanks,
Ray
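Brandon's alignment point is easy to demonstrate with a toy model: dedup
keys on whole-block checksums, so shifting otherwise-identical content
by a single byte misaligns every subsequent block. A hedged Python
sketch (the data and block size here are made up; real ZFS dedup also
involves the DDT, verification, etc.):

```python
import hashlib
import random

def dedup_ratio(files, blocksize):
    """Toy model of block-level dedup: total blocks / unique checksums."""
    blocks, unique = 0, set()
    for data in files:
        for i in range(0, len(data), blocksize):
            blocks += 1
            unique.add(hashlib.sha256(data[i:i + blocksize]).digest())
    return blocks / len(unique)

random.seed(0)
payload = random.randbytes(1024 * 1024)      # 1 MiB of shared content
identical = [payload, payload]               # same bytes, same alignment
shifted = [payload, b"\x01" + payload]       # same bytes, off by one

print(dedup_ratio(identical, 128 * 1024))    # -> 2.0 (every block dedups)
print(dedup_ratio(shifted, 128 * 1024))      # -> 1.0 (no block dedups)
```

The one-byte prefix moves every 128k boundary in the second copy, so no
block checksum matches -- the same effect as two ISOs that contain the
same files at different offsets.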
Re: [zfs-discuss] Deduplication and ISO files
On Fri, Jun 04, 2010 at 01:03:32PM -0700, Brandon High wrote:
> On Fri, Jun 4, 2010 at 12:37 PM, Ray Van Dolson rvandol...@esri.com wrote:
>> Makes sense. So, as someone else suggested, decreasing my block size
>> may improve the deduplication ratio.
>
> It might. It might make your performance tank, too. Decreasing the
> block size increases the size of the dedup table (DDT). Every entry in
> the DDT uses somewhere around 250-270 bytes. If the DDT gets too large
> to fit in memory, it will have to be read from disk, which will
> destroy any sort of write performance (although an L2ARC on SSD can
> help).
>
> If you move to 64k blocks, you'll double the DDT size and may not
> actually increase your ratio. Moving to 8k blocks will increase your
> DDT by a factor of 16, and still may not help. Changing the recordsize
> will not affect files that are already in the dataset. You'll have to
> recopy them to re-write with the smaller block size.
>
> -B

Gotcha. Just trying to make sure I understand how all this works, and
if I _would_ in fact see an improvement in dedupe ratio by tweaking the
recordsize with our data set. Once we know that, we can decide if it's
worth the extra costs in RAM/L2ARC.

Thanks all.

Ray
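Brandon's per-entry figure plugs into quick arithmetic for sizing the
RAM/L2ARC cost he mentions. A rough sketch (I use the upper end of his
250-270 byte range; the 1 TiB pool size and 1.0x dedup ratio are made-up
inputs for illustration):

```python
def ddt_bytes(pool_bytes, recordsize, bytes_per_entry=270, dedup_ratio=1.0):
    """Estimate dedup-table size: one DDT entry per unique block."""
    unique_blocks = pool_bytes / recordsize / dedup_ratio
    return unique_blocks * bytes_per_entry

TiB = 2**40
# 1 TiB of unique data at the default 128k recordsize vs. 8k:
at_128k = ddt_bytes(1 * TiB, 128 * 1024) / 2**30
at_8k = ddt_bytes(1 * TiB, 8 * 1024) / 2**30
print(f"128k recordsize: {at_128k:.1f} GiB DDT; 8k: {at_8k:.1f} GiB DDT")
```

The 8k figure comes out exactly 16x the 128k figure -- the factor
Brandon cites -- which is why a small recordsize can push the DDT out
of RAM long before the dedup ratio improves enough to pay for it.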
Re: [zfs-discuss] New SSD options
This thread has grown giant, so apologies for screwing up threading
with an out-of-place reply. :)

So, as far as SF-1500 based SSD's, the only ones currently in existence
are the Vertex 2 LE and Vertex 2 EX, correct (I understand the Vertex 2
Pro was never mass produced)? Both of these are based on MLC and not
SLC -- why isn't that an issue for longevity?

Any other SF-1500 options out there? We continue to use UPS-backed
Intel X-25E's for ZIL.

Ray
Re: [zfs-discuss] New SSD options
On Mon, May 24, 2010 at 11:30:20AM -0700, Ray Van Dolson wrote:
> This thread has grown giant, so apologies for screwing up threading
> with an out-of-place reply. :)
>
> So, as far as SF-1500 based SSD's, the only ones currently in
> existence are the Vertex 2 LE and Vertex 2 EX, correct (I understand
> the Vertex 2 Pro was never mass produced)? Both of these are based on
> MLC and not SLC -- why isn't that an issue for longevity?
>
> Any other SF-1500 options out there? We continue to use UPS-backed
> Intel X-25E's for ZIL.

From earlier in the thread, it sounds like none of the SF-1500 based
drives even have a supercap, so it doesn't seem that they'd necessarily
be a better choice than the SLC-based X-25E at this point, unless you
need more write IOPS...

Ray
Re: [zfs-discuss] Best practice for full stystem backup - equivelent of ufsdump/ufsrestore
On Wed, May 05, 2010 at 04:31:08PM -0700, Bob Friesenhahn wrote:
> On Thu, 6 May 2010, Ian Collins wrote:
>>> Bob and Ian are right. I was trying to remember the last time I
>>> installed Solaris 10, and the best I can recall, it was around late
>>> fall 2007. The fine folks at Oracle have been making improvements to
>>> the product since then, even though no new significant features have
>>> been added since that time :-(
>>
>> ZFS boot?
>
> I think that Richard is referring to the fact that the PowerPC/Cell
> Solaris 10 port for the Sony PlayStation III never emerged. ;-) Other
> than desktop features, as a Solaris 10 user I have seen OpenSolaris
> kernel features continually percolate down to Solaris 10, so I don't
> feel as left out as Richard would like me to feel. From a zfs
> standpoint, Solaris 10 does not seem to be behind the currently
> supported OpenSolaris release.
>
> Bob

Well, being able to remove ZIL devices is one important feature
missing. Hopefully in U9. :)

Ray
Re: [zfs-discuss] Best practice for full stystem backup - equivelent of ufsdump/ufsrestore
On Wed, May 05, 2010 at 05:09:40PM -0700, Erik Trimble wrote:
> On Wed, 2010-05-05 at 19:03 -0500, Bob Friesenhahn wrote:
>> On Wed, 5 May 2010, Ray Van Dolson wrote:
>>>> From a zfs standpoint, Solaris 10 does not seem to be behind the
>>>> currently supported OpenSolaris release.
>>>
>>> Well, being able to remove ZIL devices is one important feature
>>> missing. Hopefully in U9. :)
>>
>> While the development versions of OpenSolaris are clearly well beyond
>> Solaris 10, I don't believe that the supported version of OpenSolaris
>> (a year old already) has this feature yet either, and Solaris 10 has
>> been released several times since then. When the forthcoming
>> OpenSolaris release emerges in 2011, the situation will be far
>> different. Solaris 10 can then play catch-up with the release of U9
>> in 2012.
>>
>> Bob
>
> Pessimist. ;-)
>
> s/2011/2010/
> s/2012/2011/

Yeah, U9 in 2012 would make me very sad. I would really love to see
hot-removable ZIL's this year. Otherwise I'll need to rebuild a few
zpools. :)

Ray
[zfs-discuss] ZFS monitoring - best practices?
We're starting to grow our ZFS environment and really need to start
standardizing our monitoring procedures. OS tools are great for spot
troubleshooting, and sar can be used for some trending, but we'd really
like to tie this into an SNMP-based system that can generate graphs for
us (via RRD or other).

Whether we do this via our standard enterprise monitoring tool or write
some custom scripts, I don't really care... but I do have the following
questions:

- What metrics are you guys tracking? I'm thinking:
  - IOPS
  - ZIL statistics
  - L2ARC hit ratio
  - Throughput
  - IO wait (I know there's probably a better term here)

- How do you gather this information? Some but not all is available via
  SNMP. Has anyone written a ZFS-specific MIB or plugin to make the
  info available via the standard Solaris SNMP daemon? What information
  is available only via zdb/mdb?

- Anyone have any RRD-based setups for monitoring their ZFS
  environments they'd be willing to share or talk about?

Thanks in advance,
Ray
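One custom-script approach for the IOPS/throughput metrics above is to
scrape `zpool iostat` and push the numbers at RRD. A minimal sketch --
the sample output, column positions, and pool name below are my
assumptions based on typical `zpool iostat` formatting, so any real
poller would need to be checked against the actual output on your
release:

```python
# Toy poller: turn `zpool iostat`-style text into numbers an RRD/SNMP
# agent could consume. In practice you'd capture the text with
# subprocess from something like `zpool iostat <pool> 5 2`.
SUFFIX = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

def to_num(field):
    """Convert zpool iostat's human-readable fields ('1.2K', '45M')."""
    if field[-1] in SUFFIX:
        return float(field[:-1]) * SUFFIX[field[-1]]
    return float(field)

def parse_iostat(text, pool):
    """Return (read_ops, write_ops, read_bw, write_bw) for a pool line."""
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == pool:
            return tuple(to_num(p) for p in parts[3:7])
    raise ValueError(f"pool {pool!r} not found")

sample = """\
               capacity     operations    bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
datapool      298T   680T    142  1.2K  4.5M   45M
"""
print(parse_iostat(sample, "datapool"))
```

L2ARC and ZIL statistics aren't in this output at all; those would have
to come from kstat (or mdb), which is a separate scraping exercise.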