Re: [zfs-discuss] ZFS Distro Advice
We do the same for all of our legacy operating system backups. Take a snapshot, then do an rsync — an excellent way of maintaining incremental backups for those. Magic rsync options used: -a --inplace --no-whole-file --delete-excluded. This causes rsync to overwrite the file blocks in place rather than writing to a new temporary file first. As a result, ZFS COW produces primitive deduplication of at least the unchanged blocks (by writing nothing) while writing new COW blocks for the changed blocks.

If I understand your use case correctly (the application overwrites some blocks with the exact same contents), ZFS will ignore these no-op writes only on recent OpenZFS (illumos / FreeBSD / Linux) builds with checksum=sha256 and compression!=off. AFAIK, Solaris ZFS will COW the blocks even if their content is identical to what's already there, causing the snapshots to diverge. See https://www.illumos.org/issues/3236 for details.

I think he meant to rely on rsync here to do in-place updates of files, and only for changed blocks, with the above parameters (using rsync's own delta mechanism). So if you have a file and only one block has changed, rsync will overwrite only that single block on the destination.

This is interesting. I didn't know about it. Is there an option similar to verify=on in dedup, or does it just assume that the checksum is your data?

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
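A minimal sketch of one such backup cycle, assuming the rsync flags above and a hypothetical backup dataset/host name:

    # zfs snapshot pool/backup/legacyhost@$(date +%Y-%m-%d)
    # rsync -a --inplace --no-whole-file --delete-excluded legacyhost:/export/ /backup/legacyhost/

The snapshot preserves the previous run's state, and because --inplace and --no-whole-file make rsync rewrite only the changed blocks of existing files, each snapshot keeps referencing the untouched blocks from earlier runs, so the space cost per run is roughly the size of the changed data.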
Re: [zfs-discuss] ZFS Distro Advice
Solaris 11.1 (free for non-prod use).

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tiernan OToole
Sent: 25 February 2013 14:58
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] ZFS Distro Advice

Good morning all. My home NAS died over the weekend, and it leaves me with a lot of spare drives (five 2TB and three 1TB disks). I have a Dell PowerEdge 2900 server sitting in the house, which has not been doing much over the last while (bought it a few years back with the intent of using it as a storage box, since it has 8 hot-swap drive bays), and I am now looking at building the NAS using ZFS... But now I am confused as to what OS to use... OpenIndiana? Nexenta? FreeNAS/FreeBSD? I need something that will allow me to share files over SMB (3 if possible), NFS, AFP (for Time Machine) and iSCSI. Ideally, I would like something I can manage easily and something that works with the Dell... Any recommendations? Any comparisons between them? Thanks.

-- Tiernan O'Toole blog.lotas-smartman.net www.geekphotographer.com www.tiernanotoole.ie

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Distro Advice
Robert Milkowski wrote: Solaris 11.1 (free for non-prod use). But a ticking bomb if you use a cache device. It's been fixed in SRU (although this is only for customers with a support contract - still, will be in 11.2 as well). Then, I'm sure there are other bugs which are fixed in S11 and not in Illumos (and vice-versa). -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
It also has a lot of performance improvements and general bug fixes in the Solaris 11.1 release.

Performance improvements such as?

Dedup'ed ARC, for one. The all-zero block is automatically dedup'ed in memory. Improvements to ZIL performance. Zero-copy zfs+nfs+iscsi ...

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
From: Richard Elling
Sent: 21 January 2013 03:51

VAAI has 4 features, 3 of which have been in illumos for a long time. The remaining feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor product, but the CEO made a conscious (and unpopular) decision to keep that code from the community. Over the summer, another developer picked up the work in the community, but I've lost track of the progress and haven't seen an RTI yet.

That is one thing that has always bothered me... so it is OK for others, like Nexenta, to keep stuff closed and not in the open, while if Oracle does it they are bad? Isn't that at least a little bit hypocritical? (bashing Oracle while doing sort of the same)

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] poor CIFS and NFS performance
Personally, I'd recommend putting a standard Solaris fdisk partition on the drive and creating the two slices under that. Why? In most cases giving zfs an entire disk is the best option. I wouldn't bother with any manual partitioning. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris 11 System Reboots Continuously Because of a ZFS-Related Panic (7191375)
Illumos is not so good at dealing with huge memory systems, but perhaps it is also more stable as well.

Well, I guess that it depends on your environment, but generally I would expect S11 to be more stable, if only because of the sheer number of bugs reported by paid customers and bug fixes made by Oracle that Illumos is not getting (lack of resources, limited usage, etc.).

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
I am in the market for something newer than that, though. Anyone know what HP's using as a replacement for the DL320s? I have no idea... but they have dl380 Gen8 with a disk plane supporting 25x 2.5 disks (all in front), and it is Sandy Bridge based. Oracle/Sun have X3-2L - 24x 2.5 disks in front, another 2x 2.5 in rear, Sandy Bridge as well. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
So, the only supported (or even possible) way is indeed to use it as a NAS for file or block I/O from another head running the database or application servers?..

Technically speaking you can get access to a standard shell and do whatever you want - this would essentially void the support contract though.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot
No, there isn't another way to do it currently. The SMF approach is probably the best option for the time being. I think that there should be a couple of additional zvol properties where permissions could be stated.

Best regards, Robert Milkowski http://milek.blogspot.com

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Sent: 15 November 2012 19:57
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot

When I google around for anyone else who cares and may have already solved the problem before I came along - it seems we're all doing the same thing for the same reason. If by any chance you are running VirtualBox on a Solaris / OpenSolaris / OpenIndiana / whatever ZFS host, you could of course use .vdi files for the VM virtual disks, but a lot of us are using zvols instead, for various reasons.

To do the zvol, you first create the zvol (sudo zfs create -V), then chown it to the user who runs VBox (sudo chown someuser /dev/zvol/rdsk/...), and then create a rawvmdk that references it (VBoxManage internalcommands createrawvmdk -filename /home/someuser/somedisk.vmdk -rawdisk /dev/zvol/rdsk/...)

The problem is: during boot / reboot, or any time the zpool or zfs filesystem is mounted or remounted, exported, imported... the zvol ownership reverts back to root:root. So you have to repeat your sudo chown before the guest VM can start. And the question is... Obviously I can make an SMF service which will chown those devices automatically, but that's kind of a crappy solution. Is there any good way to assign the access rights, or persistently assign ownership of zvols?

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
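As a rough sketch of the SMF workaround discussed above, the service's method script can be as small as the following (the user name and zvol names are hypothetical, and the script would be wrapped in a transient SMF service that depends on the local filesystems being mounted):

    #!/bin/sh
    # chown-zvols: re-apply ownership of VirtualBox zvol backing devices after boot/import
    for vol in tank/vbox/win7-disk0 tank/vbox/win7-disk1; do
        chown someuser /dev/zvol/rdsk/$vol /dev/zvol/dsk/$vol
    done
    exit 0

It is still the after-the-fact workaround the poster calls crappy - ownership is reapplied rather than stored as a dataset property - but it survives reboots and pool re-imports without manual intervention.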
Re: [zfs-discuss] ARC de-allocation with large ram
Hi,

If after it decreases in size it stays there it might be similar to: 7111576 arc shrinks in the absence of memory pressure. Also, see document: ZFS ARC can shrink down without memory pressure result in slow performance [ID 1404581.1]

Specifically, check if arc_no_grow is set to 1 after the cache size is decreased, and if it stays that way. The fix is in one of the SRUs and I think it should be in 11.1. I don't know if it was fixed in Illumos or even if Illumos was affected by this at all.

-- Robert Milkowski http://milek.blogspot.com

-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Chris Nagele
Sent: 20 October 2012 18:47
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] ARC de-allocation with large ram

Hi. We're running OmniOS as a ZFS storage server. For some reason, our arc cache will grow to a certain point, then suddenly drops. I used arcstat to catch it in action, but I was not able to capture what else was going on in the system at the time. I'll do that next.

    read  hits  miss  hit%  l2read  l2hits  l2miss  l2hit%  arcsz  l2size
     166   166     0   100       0       0       0       0    85G    225G
    5.9K  5.9K     0   100       0       0       0       0    85G    225G
     755   715    40    94      40       0      40       0    84G    225G
     17K   17K     0   100       0       0       0       0    67G    225G
     409   395    14    96      14       0      14       0    49G    225G
     388   364    24    93      24       0      24       0    41G    225G
     37K   37K    20    99      20       6      14      30    40G    225G

For reference, it's a 12TB pool with 512GB SSD L2ARC and 198GB RAM. We have nothing else running on the system except NFS. We are also not using dedupe. Here is the output of memstat at one point:

    # echo ::memstat | mdb -k
    Page Summary          Pages        MB   %Tot
    Kernel             19061902     74460    38%
    ZFS File Data      28237282    110301    56%
    Anon                  43112       168     0%
    Exec and libs          1522         5     0%
    Page cache            13509        52     0%
    Free (cachelist)       6366        24     0%
    Free (freelist)     2958527     11556     6%
    Total            5030196571
    Physical           50322219    196571

According to prstat -s rss nothing else is consuming the memory.

    592 root   33M   26M sleep  59  0  0:00:33 0.0% fmd/27
     12 root   13M   11M sleep  59  0  0:00:08 0.0% svc.configd/21
    641 root   12M   11M sleep  59  0  0:04:48 0.0% snmpd/1
     10 root   14M   10M sleep  59  0  0:00:03 0.0% svc.startd/16
    342 root   12M 9084K sleep  59  0  0:00:15 0.0% hald/5
    321 root   14M 8652K sleep  59  0  0:03:00 0.0% nscd/52

So far I can't figure out what could be causing this. The only other thing I can think of is that we have a bunch of zfs send/receive operations going on as backups across 10 datasets in the pool. I am not sure how snapshots and send/receive affect the arc. Does anyone else have any ideas?

Thanks, Chris

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
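A quick sketch of checking the arc_no_grow flag mentioned above on a live system (run as root, assuming mdb is available):

    # echo "arc_no_grow/D" | mdb -k
    arc_no_grow:
    arc_no_grow:    1

A value of 1 that persists after the ARC has shrunk, while there is no memory pressure, matches the symptom described in 7111576.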
Re: [zfs-discuss] encfs on top of zfs
Once something is written deduped you will always use the memory when you want to read any files that were written while dedup was enabled, so you do not save any memory unless you do not normally access most of your data.

For reads you don't need the DDT. Also, in Solaris 11 (not in Illumos, unfortunately, AFAIK) the in-memory ARC stays deduped on reads as well (so if 10 logical blocks are deduped to 1 and you read all 10 logical copies, only one block in the ARC will be allocated). If there are no further modifications and you only read deduped data then, apart from the disk space savings, there can be a very nice improvement in performance as well (less I/O, more RAM for caching, etc.).

As far as the OP is concerned, unless you have a dataset that will dedup well don't bother with it, use compression instead (don't use both compression and dedup because you will shrink the average record size and balloon the memory usage).

Can you expand a little bit more here? Dedup+compression works pretty well actually (not counting the standard problems with current dedup - compression or not).

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS asynchronous writes being written to ZIL
The client is using async writes that include commits. Sync writes do not need commits. What happens is that the ZFS transaction group commit occurs at more-or-less regular intervals, likely 5 seconds for more modern ZFS systems. When the commit occurs, any data that is in the ARC but not committed in a prior transaction group gets sent to the ZIL.

Are you sure? I don't think this is the case, unless I misunderstood you or this is some recent change to Illumos. Whatever is being committed when a ZFS txg closes goes directly to the pool and not to the ZIL. Only sync writes will go to the ZIL right away (and not always - see logbias, etc.), and to the ARC to be committed later to the pool when the txg closes.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
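For reference, the logbias property mentioned above is the per-dataset knob controlling whether those synchronous writes use a dedicated log device (dataset name is hypothetical):

    # zfs set logbias=latency tank/db       # default: small sync writes go to the slog
    # zfs set logbias=throughput tank/db    # bypass the slog; ZIL blocks go to the main pool

With logbias=throughput a pool-wide slog is left free for datasets that really need low-latency commits.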
Re: [zfs-discuss] ZFS performance on LSI 9240-8i?
Now, if anyone is still reading, I have another question. The new Solaris 11 device naming convention hides the physical tree from me. I got just a list of long disk names all starting with c0 (see below), but I need to know which disk is connected to which controller so that I can create the two parts of my mirrors on two different controllers in order to tolerate a single controller failure. I need a way of figuring out the connection path for each disk. Hope I managed to explain what I want?

See diskinfo(1M), for example:

    $ diskinfo -T bay -o Rc -h
    HDD00  -
    HDD01  -
    HDD02  c0t5000CCA00AC87F54d0
    HDD03  c0t5000CCA00AA95838d0
    HDD04  c0t5000CCA01510ECC0d0
    HDD05  c0t5000CCA01515EE78d0
    HDD06  c0t5000CCA01512DA3Cd0
    HDD07  c0t5000CCA00AB3E1C8d0
    HDD08  c0t5000CCA0151C1D18d0
    HDD09  c0t5000CCA0151F7E08d0
    HDD10  c0t5000CCA0151C7CA8d0
    HDD11  c0t5000CCA00AA9D570d0
    HDD12  c0t5000CCA0151CB180d0
    HDD13  c0t5000CCA015208C98d0
    HDD14  c0t5000CCA00AA97F04d0
    HDD15  c0t5000CCA0151A287Cd0
    HDD16  c0t5000CCA00AAA1544d0
    HDD17  c0t5000CCA01521070Cd0
    HDD18  c0t5000CCA00AA97EF4d0
    HDD19  c0t5000CCA015214F84d0
    HDD20  c0t5000CCA015214844d0
    HDD21  c0t5000CCA00AAAD154d0
    HDD22  c0t5000CCA00AA95558d0
    HDD23  c0t5000CCA00AAA0D1Cd0

In your case you will probably have to put a configuration in place for your disk slots (on Oracle's HW it works out of the box) - go to support.oracle.com and look for the document: How To : Selecting a Physical Slot for a SAS Device with a WWN for an Oracle Solaris 11 Installation [ID 1411444.1]

ps. there is also the zpool status -l option which is cool:

    $ zpool status -l cwafseng3-0
      pool: pool-0
     state: ONLINE
      scan: scrub canceled on Thu Apr 12 13:52:13 2012
    config:

        NAME                                                         STATE  READ WRITE CKSUM
        pool-0                                                       ONLINE    0     0     0
          raidz1-0                                                   ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD02/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD23/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD22/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD21/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD20/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD19/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD17/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD15/disk ONLINE    0     0     0

    errors: No known data errors

Best regards, Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Resilver restarting several times
-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
Sent: 12 May 2012 01:27
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Resilver restarting several times

2012-05-11 17:18, Bob Friesenhahn wrote:

On Fri, 11 May 2012, Jim Klimov wrote: Hello all, SHORT VERSION: What conditions can cause the reset of the resilvering process? My lost-and-found disk can't get back into the pool because of resilvers restarting...

I recall that with sufficiently old vintage zfs, resilver would restart if a snapshot was taken. What sort of zfs is being used here? Bob

Well, for the night I rebooted the machine into single-user mode, to rule out zones, crontabs and networked abusers, but I still get resilvering resets every now and then, about once an hour. I'm now trying a run with all zfs datasets unmounted, hope that helps somewhat... I'm growing puzzled now.

To double check that no snapshots, etc. are being created, run:

    zpool history -il pond

-- Best regards, Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
And he will still need an underlying filesystem like ZFS for them :) -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Nico Williams Sent: 25 April 2012 20:32 To: Paul Archer Cc: ZFS-Discuss mailing list Subject: Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD) I agree, you need something like AFS, Lustre, or pNFS. And/or an NFS proxy to those. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
Citing yourself: "The average block size for a given data block should be used as the metric to map all other datablock sizes to. For example, the ZFS recordsize is 128kb by default. If the average block (or page) size of a directory server is 2k, then the mismatch in size will result in degraded throughput for both read and write operations. One of the benefits of ZFS is that you can change the recordsize of all write operations from the time you set the new value going forward."

And the above is not even entirely correct, as if a file is bigger than the current value of the recordsize property, reducing the recordsize won't change the block size for that file (it will continue to use the previous size, for example 128K). This is why you need to set recordsize to the desired value for large files *before* you create them (or you will have to copy them later on).

From the performance point of view it really depends on the workload, but as you described in your blog, the default recordsize of 128K with an average write/read of 2K will for many workloads negatively impact performance, and lowering recordsize can potentially improve it. Nevertheless, I was referring to dedup efficiency, which with lower recordsize values should improve dedup ratios (although it will require more memory for the DDT).

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Diggs
Sent: 29 December 2011 15:55
To: Robert Milkowski
Cc: 'zfs-discuss discussion list'
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

Reducing the record size would negatively impact performance. For the rationale why, see the section titled Match Average I/O Block Sizes in my blog post on filesystem caching: http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

Brad

Brad Diggs | Principal Sales Consultant | 972.814.3698 eMail: brad.di...@oracle.com Tech Blog: http://TheZoneManager.com/ LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 8:08 AM, Robert Milkowski wrote:

Try reducing recordsize to 8K or even less *before* you put any data. This can potentially improve your dedup ratio and keep it higher after you start modifying data.

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Diggs
Sent: 28 December 2011 21:15
To: zfs-discuss discussion list
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

As promised, here are the findings from my testing. I created 6 directory server instances where the first instance has roughly 8.5GB of data. Then I initialized the remaining 5 instances from a binary backup of the first instance. Then, I rebooted the server to start off with an empty ZFS cache. The following table shows the increased L1ARC size, increased search rate performance, and increased CPU% busy when starting and applying load to each successive directory server instance. The L1ARC cache grew a little bit with each additional instance but largely stayed the same size. Likewise, the ZFS dedup ratio remained the same because no data on the directory server instances was changing.

[image001.png - table not included in the plain-text archive]

However, once I started modifying the data of the replicated directory server topology, the caching efficiency quickly diminished. The following table shows that the delta for each instance increased by roughly 2GB after only 300k of changes.
[image002.png - table not included in the plain-text archive]

I suspect the divergence in data as seen by ZFS deduplication most likely occurs because deduplication occurs at the block level rather than at the byte level. When a write is sent to one directory server instance, the exact same write is propagated to the other 5 instances and therefore should be considered a duplicate. However, this was not the case. There could be other reasons for the divergence as well.

The two key takeaways from this exercise were as follows. There is tremendous caching potential through the use of ZFS deduplication. However, the current block-level deduplication does not benefit directory servers as much as it perhaps could if deduplication occurred at the byte level rather than the block level. It could very well be that byte-level deduplication wouldn't work much better either. Until that option is available, we won't know for sure.

Regards, Brad

Brad Diggs | Principal Sales Consultant Tech Blog: http://TheZoneManager.com/ LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 12, 2011, at 10:05 AM, Brad Diggs wrote:

Thanks everyone for your input on this thread. It sounds like there is sufficient weight behind the affirmative that I will include this methodology into my performance analysis test plan. If the performance goes well, I will share some of the results when we conclude in the January/February timeframe. Regarding the great dd use
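To make the recordsize point above concrete, the property has to be set before the data is loaded (dataset name and value are hypothetical):

    # zfs create -o recordsize=8k tank/ldap
    ... restore / initialize the directory server instances here ...

Files written afterwards use 8K blocks; files that already exist with 128K blocks keep them until they are rewritten (for example by copying them to a new file).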
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Pawel Jakub Dawidek
Sent: 10 December 2011 14:05
To: Mertol Ozyoney
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:

Unfortunately the answer is no. Neither L1 nor L2 cache is dedup aware. The only vendor I know that can do this is NetApp.

And you really work at Oracle?:)

The answer is definitely yes. ARC caches on-disk blocks and dedup just references those blocks. When you read, dedup code is not involved at all. Let me show it to you with a simple test:

Create a file (dedup is on):

    # dd if=/dev/random of=/foo/a bs=1m count=1024

Copy this file so that it is deduped:

    # dd if=/foo/a of=/foo/b bs=1m

Export the pool so all cache is removed and reimport it:

    # zpool export foo
    # zpool import foo

Now let's read one file:

    # dd if=/foo/a of=/dev/null bs=1m
    1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)

We read file 'a' and all its blocks are in cache now. The 'b' file shares all the same blocks, so if ARC caches blocks only once, reading 'b' should be much faster:

    # dd if=/foo/b of=/dev/null bs=1m
    1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)

Now look at it, 'b' was read 12.5 times faster than 'a' with no disk activity. Magic?:)

Yep, however in pre-Solaris 11 GA (and in Illumos) you would end up with 2x copies of the blocks in the ARC cache, while in S11 GA the ARC will keep only 1 copy of all blocks. This can make a big difference if there are more than just 2x files being deduped and you need ARC memory to cache other data as well.

-- Robert Milkowski

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs sync=disabled property
disk. This behavior is what makes NFS over ZFS slow without a slog: NFS does everything O_SYNC by default,

No, it doesn't. However, VMware by default issues all writes as SYNC.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
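If those synchronous writes become a bottleneck, recent builds let you override the behaviour per dataset - a sketch with a hypothetical dataset name, and with the usual caveat that sync=disabled means acknowledged writes can be lost on power failure:

    # zfs set sync=disabled tank/vmware
    # zfs get sync tank/vmware
    NAME         PROPERTY  VALUE     SOURCE
    tank/vmware  sync      disabled  local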
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On 01/ 8/11 05:59 PM, Edward Ned Harvey wrote: Has anybody measured the cost of enabling or disabling verification? The cost of disabling verification is an infinitesimally small number multiplied by possibly all your data. Basically lim-0 times lim-infinity. This can only be evaluated on a case-by-case basis and there's no use in making any more generalizations in favor or against it. The benefit of disabling verification would presumably be faster performance. Has anybody got any measurements, or even calculations or vague estimates or clueless guesses, to indicate how significant this is? How much is there to gain by disabling verification? Exactly my point and there isn't one answer which fits all environments. In the testing I'm doing so far enabling/disabling verification doesn't make any noticeable difference so I'm sticking to verify. But I have enough memory and such a workload that I see little physical reads going on. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On 01/ 7/11 09:02 PM, Pawel Jakub Dawidek wrote:

On Fri, Jan 07, 2011 at 07:33:53PM +0000, Robert Milkowski wrote: Now what if block B is a meta-data block?

Metadata is not deduplicated.

Good point, but then it depends on the perspective. What if you are storing lots of VMDKs? One corrupted block which is shared among hundreds of VMDKs will affect all of them. And it might be a block containing meta-data information within a vmdk...

Anyway, green or not, IMHO if in a given environment turning verification on still delivers acceptable performance then I would basically turn it on. In other environments it is about risk assessment.

Best regards, Robert Milkowski

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On 01/ 7/11 02:13 PM, David Magda wrote:

Given the above: most people are content enough to trust Fletcher to not have data corruption, but are worried about SHA-256 giving 'data corruption' when it comes to de-dupe? The entire rest of the computing world is content to live with 10^-15 (for SAS disks), and yet one wouldn't be prepared to have 10^-30 (or better) for dedupe?

I think you do not entirely understand the problem. Let's say two different blocks A and B have the same sha256 checksum; A is already stored in a pool, B is being written. With dedup enabled and without verify, B won't be written. The next time you ask for block B you will actually end up with block A. Now if B is relatively common in your data set, you have a relatively big impact on many files because of one corrupted block (additionally, from a filesystem point of view this is a silent data corruption). Without dedup, if you get a single block corrupted silently the impact will usually be relatively limited. Now what if block B is a meta-data block?

The point is that the potential impact of a hash collision is much bigger than a single silent data corruption of a block, not to mention that, dedup or not, all the other possible cases of data corruption are there anyway; adding yet another one might or might not be acceptable.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On 01/ 6/11 07:44 PM, Peter Taps wrote:

Folks, I have been told that the checksum value returned by Sha256 is almost guaranteed to be unique. In fact, if Sha256 fails in some case, we have a bigger problem such as memory corruption, etc. Essentially, adding verification to sha256 is an overkill. Perhaps (Sha256+NoVerification) would work 99.99% of the time. But (Fletcher+Verification) would work 100% of the time. Which one of the two is a better deduplication strategy? If we do not use verification with Sha256, what is the worst case scenario? Is it just more disk space occupied (because of failure to detect duplicate blocks) or is there a chance of actual data corruption (because two blocks were assumed to be duplicates although they are not)?

Yes, there is a possibility of data corruption.

Or, if I go with (Sha256+Verification), how much is the overhead of verification on the overall process?

It really depends on your specific workload. If your application is mostly reading data then it might well be that you won't even notice verify. Sha256 is supposed to be almost bullet-proof, but... at the end of the day it is all about how much you value your data. As I wrote before, try with verify and see if performance is acceptable. It might well be the case. You can always disable verify at any time.

If I do go with verification, it seems (Fletcher+Verification) is more efficient than (Sha256+Verification). And both are 100% accurate in detecting duplicate blocks.

I don't believe that fletcher is still allowed for dedup - right now it is only sha256.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
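For illustration, switching between the two strategies is just a property change (pool name is hypothetical):

    # zfs set dedup=verify tank     # sha256 match is confirmed by a byte-for-byte comparison
    # zfs set dedup=sha256 tank     # checksum match alone decides that blocks are duplicates

With dedup=verify a checksum hit only nominates a candidate block; the existing block is read back and compared before the write is deduped, which removes the collision risk at the cost of extra reads.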
Re: [zfs-discuss] A few questions
On 01/ 3/11 04:28 PM, Richard Elling wrote:

On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:

On 12/26/10 05:40 AM, Tim Cook wrote:

On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote: There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.

Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.

Exactly my observation as well. I haven't seen any ZFS related development happening at Illumos or Nexenta, at least not yet.

I am quite sure you understand how pipelines work :-)

Are you suggesting that Nexenta is developing new ZFS features behind closed doors (like Oracle...) and then will share the code later on? Somehow I don't think so... but I would love to be proved wrong :)

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On 01/ 4/11 11:35 PM, Robert Milkowski wrote: On 01/ 3/11 04:28 PM, Richard Elling wrote: On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote: On 12/26/10 05:40 AM, Tim Cook wrote: On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com mailto:richard.ell...@gmail.com wrote: There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now. Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing out of outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS. Exactly my observation as well. I haven't seen any ZFS related development happening at Ilumos or Nexenta, at least not yet. I am quite sure you understand how pipelines work :-) Are you suggesting that Nexenta is developing new ZFS features behind closed doors (like Oracle...) and then will share code later-on? Somehow I don't think so... but I would love to be proved wrong :) I mean I would love to see Nexenta start delivering real innovation in Solaris/Illumos kernel (zfs, networking, ...), not that I would love to see it happening behind a closed doors :) -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On 12/26/10 05:40 AM, Tim Cook wrote:

On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote: There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.

Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.

Exactly my observation as well. I haven't seen any ZFS related development happening at Illumos or Nexenta, at least not yet.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
On 11/12/2010 00:07, Erik Trimble wrote: The last update I see to the ZFS public tree is 29 Oct 2010. Which, I *think*, is about the time that the fork for the Solaris 11 Express snapshot was taken. I don't think this is the case. Although all the files show modification date of 29 Oct 2010 at src.opensolaris.org they are still old versions from August, at least the ones I checked. See http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/uts/common/fs/zfs/ the mercurial gate doesn't have any updates either. Best regards, Robert Milkowski ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Increase Volume Size
On 07/12/2010 23:54, Tony MacDoodle wrote:

Is it possible to expand the size of a ZFS volume? It was created with the following command: zfs create -V 20G ldomspool/test

See the zfs man page, section about the volsize property.

Best regards, Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
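For example, a sketch of growing the volume created above (the 40G figure is illustrative; any filesystem inside the volume still has to be grown by its own tools afterwards):

    # zfs set volsize=40G ldomspool/test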
Re: [zfs-discuss] RAID-Z/mirror hybrid allocator
On 18/11/2010 17:53, Cindy Swearingen wrote: Markus, Let me correct/expand this: 1. If you create a RAIDZ pool on OS 11 Express (b151a), you will have some mirrored metadata. This feature integrated into b148 and the pool version is 29. This is the part I mixed up. 2. If you have an existing RAIDZ pool and upgrade to b151a, you would need to upgrade the pool version to use this feature. In this case, newly written metadata would be mirrored. Hi, And if one creates raid-z3 pool would meta-data be a 3-way mirror as well? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs send|recv and inherited recordsize
Hi,

I thought that if I use zfs send snap | zfs recv, then if on the receiving side the recordsize property is set to a different value it will be honored. But it doesn't seem to be the case, at least on snv_130.

    $ zfs get recordsize test/m1
    NAME     PROPERTY    VALUE    SOURCE
    test/m1  recordsize  128K     default

    $ ls -nil /test/m1/f1
    5 -rw-r--r--   1 0        1        1048576 Oct  4 10:31 /test/m1/f1

    $ zdb -vv test/m1 5
    Dataset test/m1 [ZPL], ID 1082, cr_txg 33413, 1.02M, 5 objects
        Object  lvl   iblk   dblk  dsize  lsize   %full  type
             5    2    16K   128K  1.00M     1M  100.00  ZFS plain file

    $ zfs snapshot test/m...@s1
    $ zfs create -o recordsize=32k test/m2
    $ zfs send test/m...@s1 | zfs recv test/m2/m1

    $ zfs get recordsize test/m2/m1
    NAME        PROPERTY    VALUE    SOURCE
    test/m2/m1  recordsize  32K      inherited from test/m2

    $ ls -lni /test/m2/m1/f1
    5 -rw-r--r--   1 0        1        1048576 Oct  4 10:31 /test/m2/m1/f1

    $ zdb -vv test/m2/m1 5
    Dataset test/m2/m1 [ZPL], ID 1110, cr_txg 33537, 1.02M, 5 objects
        Object  lvl   iblk   dblk  dsize  lsize   %full  type
             5    2    16K   128K  1.00M     1M  100.00  ZFS plain file

Well, dblk is 128KB - I would expect it to be 32K. Let's see what happens if I use cp instead:

    $ cp /test/m2/m1/f1 /test/m2/m1/f2
    $ ls -lni /test/m2/m1/f2
    6 -rw-r--r--   1 0        1        1048576 Oct  4 11:15 /test/m2/m1/f2

    $ zdb -vv test/m2/m1 6
    Dataset test/m2/m1 [ZPL], ID 1110, cr_txg 33537, 2.03M, 6 objects
        Object  lvl   iblk   dblk  dsize  lsize   %full  type
             6    2    16K    32K  1.00M     1M  100.00  ZFS plain file

Now it is fine.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send|recv and inherited recordsize
Thank you.

On 04/10/2010 19:55, Matthew Ahrens wrote:

That's correct. This behavior is because the send|recv operates on the DMU objects, whereas the recordsize property is interpreted by the ZPL. The ZPL checks the recordsize property when a file grows. But the recv doesn't grow any files, it just dumps data into the underlying objects. --matt

On Mon, Oct 4, 2010 at 11:20 AM, Robert Milkowski mi...@task.gda.pl wrote:

Hi,

I thought that if I use zfs send snap | zfs recv, then if on the receiving side the recordsize property is set to a different value it will be honored. But it doesn't seem to be the case, at least on snv_130.

    $ zfs get recordsize test/m1
    NAME     PROPERTY    VALUE    SOURCE
    test/m1  recordsize  128K     default

    $ ls -nil /test/m1/f1
    5 -rw-r--r--   1 0        1        1048576 Oct  4 10:31 /test/m1/f1

    $ zdb -vv test/m1 5
    Dataset test/m1 [ZPL], ID 1082, cr_txg 33413, 1.02M, 5 objects
        Object  lvl   iblk   dblk  dsize  lsize   %full  type
             5    2    16K   128K  1.00M     1M  100.00  ZFS plain file

    $ zfs snapshot test/m...@s1
    $ zfs create -o recordsize=32k test/m2
    $ zfs send test/m...@s1 | zfs recv test/m2/m1

    $ zfs get recordsize test/m2/m1
    NAME        PROPERTY    VALUE    SOURCE
    test/m2/m1  recordsize  32K      inherited from test/m2

    $ ls -lni /test/m2/m1/f1
    5 -rw-r--r--   1 0        1        1048576 Oct  4 10:31 /test/m2/m1/f1

    $ zdb -vv test/m2/m1 5
    Dataset test/m2/m1 [ZPL], ID 1110, cr_txg 33537, 1.02M, 5 objects
        Object  lvl   iblk   dblk  dsize  lsize   %full  type
             5    2    16K   128K  1.00M     1M  100.00  ZFS plain file

Well, dblk is 128KB - I would expect it to be 32K. Let's see what happens if I use cp instead:

    $ cp /test/m2/m1/f1 /test/m2/m1/f2
    $ ls -lni /test/m2/m1/f2
    6 -rw-r--r--   1 0        1        1048576 Oct  4 11:15 /test/m2/m1/f2

    $ zdb -vv test/m2/m1 6
    Dataset test/m2/m1 [ZPL], ID 1110, cr_txg 33537, 2.03M, 6 objects
        Object  lvl   iblk   dblk  dsize  lsize   %full  type
             6    2    16K    32K  1.00M     1M  100.00  ZFS plain file

Now it is fine.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] file level clones
Hi, FYI: http://lwn.net/Articles/399148/

copyfile()

The reflink() (http://lwn.net/Articles/333783/) system call was originally proposed as a sort of fast copy operation; it would create a new copy of a file which shared all of the data blocks. If one of the files were subsequently written to, a copy-on-write operation would be performed so that the other file would not change. LWN readers last heard about this patch last September, when Linus refused to pull it (http://lwn.net/Articles/353048/) for 2.6.32. Among other things, he didn't like the name. So now reflink() is back as copyfile(), with some proposed additional features. It would make the same copy-on-write copies on filesystems that support it, but copyfile() would also be able to delegate the actual copy work to the underlying storage device when it makes sense. For example, if a file is being copied on a network-mounted filesystem, it may well make sense to have the server do the actual copy work, eliminating the need to move the data over the network twice. The system call might also do ordinary copies within the kernel if nothing faster is available.

The first question that was asked is: should copyfile() perhaps be an asynchronous interface? It could return a file descriptor which could be polled for the status of the operation. Then, graphical utilities could start a copy, then present a progress bar showing how things were going. Christoph Hellwig was adamant, though, that copyfile() should be a synchronous operation like almost all other Linux system calls; there is no need to create something weird and different here. Progress bars neither justify nor require the creation of asynchronous interfaces. There was also opposition to the mixing of the old reflink() idea with that of copying a file. There is little perceived value in creating a bad version of cp within the kernel. The two ideas were mixed because it seems that Linus wants it that way, but, after this discussion, they may yet be split apart again.

From http://en.wikipedia.org/wiki/Btrfs: Btrfs provides a clone operation which atomically creates a copy-on-write snapshot of a file; support for this was added to GNU coreutils (http://en.wikipedia.org/wiki/Coreutils) 7.5. Cloning from byte ranges in one file to another is also supported, allowing large files to be more efficiently manipulated like standard rope (http://en.wikipedia.org/wiki/Rope_%28computer_science%29) data structures.

Also see http://www.symantec.com/connect/virtualstoreserver

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
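For comparison, the btrfs clone operation mentioned above is already reachable from the shell with the coreutils release cited (7.5 or later):

    $ cp --reflink=always disk.img disk-clone.img

The copy completes almost instantly and consumes no extra space until one of the two files is modified, at which point only the changed extents get their own blocks.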
Re: [zfs-discuss] 'sync' properties and write operations.
On 28/08/2010 09:55, eXeC001er wrote:

Hi. Can you explain to me: 1. A dataset has 'sync=always'. I start writing to a file on this dataset in non-sync mode: does the system write the file in sync or async mode?

Sync.

2. A dataset has 'sync=disabled'. I start writing to a file on this dataset in sync mode: does the system write the file in sync or async mode?

Async.

The sync property takes effect immediately for all new writes, even if a file was opened before the property was changed.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs set readonly=on does not entirely go into read-only mode
Hi,

When I set readonly=on on a dataset, no new files are allowed to be created. However, writes to already opened files are allowed. This is rather counterintuitive - if I set a filesystem as read-only I would expect it not to allow any modifications to it. I think it shouldn't behave this way and it should be considered a bug. What do you think?

ps. I tested it on S10u8 and snv_134.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
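A minimal sketch that reproduces the behaviour described (dataset and file names are hypothetical):

    # zfs create tank/ro-test
    # exec 3>/tank/ro-test/f1            # create and keep a file open for writing
    # zfs set readonly=on tank/ro-test
    # touch /tank/ro-test/f2             # refused: the filesystem is read-only
    # echo "still writable" >&3          # succeeds: the already-open file keeps accepting writes
    # exec 3>&-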
Re: [zfs-discuss] iScsi slow
On 03/08/2010 23:20, Ross Walker wrote:

Nothing has been violated here. Look for the WCE flag in COMSTAR, where you can control how a given zvol should behave (synchronous or asynchronous). Additionally, in recent builds you have zfs set sync={disabled|default|always}, which also works with zvols. So you do have control over how it is supposed to behave, and to make it nice it is even on a per-zvol basis. It is just that the default is synchronous.

Ah, ok, my experience has been with Solaris and the iscsitgt which, correct me if I am wrong, is still synchronous only.

I don't remember whether it offered an ability to manipulate the zvol's WCE flag, but if it didn't you can do it anyway as it is a zvol property. For an example see http://milek.blogspot.com/2010/02/zvols-write-cache.html

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Fwd: zpool import despite missing log [PSARC/2010/292 Self Review]
fyi

-- Robert Milkowski http://milek.blogspot.com

-------- Original Message --------
Subject: zpool import despite missing log [PSARC/2010/292 Self Review]
Date: Mon, 26 Jul 2010 08:38:22 -0600
From: Tim Haley tim.ha...@oracle.com
To: psarc-...@sun.com
CC: zfs-t...@sun.com

I am sponsoring the following case for George Wilson. Requested binding is micro/patch. Since this is a straight-forward addition of a command line option, I think it qualifies for self review. If an ARC member disagrees, let me know and I'll convert to a fast-track.

Template Version: @(#)sac_nextcase 1.70 03/30/10 SMI
This information is Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.

1. Introduction
1.1. Project/Component Working Name: zpool import despite missing log
1.2. Name of Document Author/Supplier: Author: George Wilson
1.3 Date of This Document: 26 July, 2010

4. Technical Description

OVERVIEW:

ZFS maintains a GUID (global unique identifier) on each device, and the sum of all GUIDs of a pool is stored in the ZFS uberblock. This sum is used to determine the availability of all vdevs within a pool when a pool is imported or opened. Pools which contain a separate intent log device (e.g. a slog) will fail to import when that device is removed or is otherwise unavailable. This proposal aims to address this particular issue.

PROPOSED SOLUTION:

This fast-track introduces a new command line flag to the 'zpool import' sub-command. This new option, '-m', allows pools to import even when a log device is missing. The contents of that log device are obviously discarded and the pool will operate as if the log device were offlined.

MANPAGE DIFFS:

     zpool import [-o mntopts] [-p property=value] ... [-d dir | -c cachefile]
    -             [-D] [-f] [-R root] [-n] [-F] -a
    +             [-D] [-f] [-m] [-R root] [-n] [-F] -a

     zpool import [-o mntopts] [-o property=value] ... [-d dir | -c cachefile]
    -             [-D] [-f] [-R root] [-n] [-F] pool |id [newpool]
    +             [-D] [-f] [-m] [-R root] [-n] [-F] pool |id [newpool]

     zpool import [-o mntopts] [ -o property=value] ... [-d dir |
    -             -c cachefile] [-D] [-f] [-n] [-F] [-R root] -a
    +             -c cachefile] [-D] [-f] [-m] [-n] [-F] [-R root] -a

         Imports all pools found in the search directories. Identical to the previous command, except that all pools

    +    -m
    +
    +        Allows a pool to import when there is a missing log device

EXAMPLES:

1). Configuration with a single intent log device:

    # zpool status tank
      pool: tank
     state: ONLINE
      scan: none requested
    config:

        NAME      STATE   READ WRITE CKSUM
        tank      ONLINE     0     0     0
          c7t0d0  ONLINE     0     0     0
        logs
          c5t0d0  ONLINE     0     0     0

    errors: No known data errors

    # zpool import tank
    The devices below are missing, use '-m' to import the pool anyway:
            c5t0d0 [log]
    cannot import 'tank': one or more devices is currently unavailable

    # zpool import -m tank
    # zpool status tank
      pool: tank
     state: DEGRADED
    status: One or more devices could not be opened. Sufficient replicas exist for
            the pool to continue functioning in a degraded state.
    action: Attach the missing device and online it using 'zpool online'.
       see: http://www.sun.com/msg/ZFS-8000-2Q
      scan: none requested
    config:

        NAME                    STATE     READ WRITE CKSUM
        tank                    DEGRADED     0     0     0
          c7t0d0                ONLINE       0     0     0
        logs
          1693927398582730352   UNAVAIL      0     0     0  was /dev/dsk/c5t0d0

    errors: No known data errors

2). Configuration with mirrored intent log device:

    # zpool add tank log mirror c5t0d0 c5t1d0
    zr...@diskmonster:/dev/dsk# zpool status tank
      pool: tank
     state: ONLINE
      scan: none requested
    config:

        NAME        STATE   READ WRITE CKSUM
        tank        ONLINE     0     0     0
          c7t0d0    ONLINE     0     0     0
        logs
          mirror-1  ONLINE     0     0     0
            c5t0d0  ONLINE     0     0     0
            c5t1d0  ONLINE     0     0     0

    errors: No known data errors

    # zpool import 429789444028972405
    The devices below are missing, use '-m' to import the pool anyway:
            mirror-1 [log]
              c5t0d0
              c5t1d0

    # zpool import -m tank
    # zpool status tank
      pool: tank
     state: DEGRADED
    status: One or more devices could not be opened. Sufficient replicas exist for
            the pool to continue functioning in a degraded state.
    action: Attach the missing device and online it using 'zpool online'.
       see: http://www.sun.com/msg/ZFS-8000-2Q
      scan: none requested
    config:

        NAME
Re: [zfs-discuss] zfs raidz1 and traditional raid 5 perfomrance comparision
On 22/07/2010 03:25, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Robert Milkowski

I had a quick look at your results a moment ago. The problem is that you used a server with 4GB of RAM + a raid card with 256MB of cache. Then your filesize for iozone was set to 4GB - so random or not you probably had a relatively good cache hit ratio for random reads. And

Look again in the raw_results. I ran it with 4G, and also with 12G. There was no significant difference between the two, so I only compiled the 4G results into a spreadsheet PDF.

The only tests with a 12GB file size in the raw files are a mirror and a single disk configuration. There are no results for raid-z there.

even then a random read from 8 threads gave you only about 40% more IOPS for a RAID-Z made out of 5 disks than a single drive. The poor result for HW-R5 is surprising though, but it might be that the stripe size was not matched to the ZFS recordsize and iozone block size in this case.

I think what you're saying is With 5 disks performing well, you should expect 4x higher iops than a single disk, and the measured result was only 40% higher, which is a poor result. I agree. I guess the 128k recordsize used in iozone is probably large enough that it frequently causes blocks to span disks? I don't know.

Probably - but it would also depend on how you configured hw-r5 (mainly its stripe size). The other thing is that you might have had some bottleneck somewhere else, as your results for N-way mirrors aren't that good either.

The issue with raid-z and random reads is that as the cache hit ratio goes down to 0, the IOPS approach the IOPS of a single drive. For a little bit more information see http://blogs.sun.com/roch/entry/when_to_and_not_to

I don't think that's correct, unless you're using a single thread. As long as multiple threads are issuing random reads on raidz, and those reads are small enough that each one is entirely written on a single disk, then you should be able to get n-1 disks operating simultaneously, to achieve (n-1)x performance of a single disk. Even if blocks are large enough to span disks, you should be able to get (n-1)x performance of a single disk for large sequential operations.

While that is true to some degree for hw raid-5, raid-z doesn't work that way. The issue is that each zfs filesystem block is basically spread across n-1 devices. So every time you want to read back a single fs block you need to wait for all n-1 devices to provide you with a part of it - and keep in mind that in zfs you can't get a partial block even if that's what you are asking for, as zfs has to check the checksum of the entire fs block. Now multiple readers actually make it worse for raid-z (assuming a very poor cache hit ratio) - because each read from each reader involves all disk drives, basically others can't read anything until it is done. It gets really bad for random reads.

With HW raid-5, if your stripe size matches the block you are reading back, for random reads it is probable that while reader-X1 is reading from disk-Y1, reader-X2 is reading from disk-Y2, so you should end up with all disk drives (-1) contributing to better overall IOPS. Read Roch's blog entry carefully for more information.

btw: even in your results 6x disks in raid-z provided over 3x fewer IOPS than a zfs raid-10 configuration for random reads. It is a big difference if one needs performance.
-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs raidz1 and traditional raid 5 perfomrance comparision
On 21/07/2010 15:40, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of v for zfs raidz1, I know for random io, iops of a raidz1 vdev eqaul to one physical disk iops, since raidz1 is like raid5 , so is raid5 has same performance like raidz1? ie. random iops equal to one physical disk's ipos. I tested this extensively about 6 months ago. Please see http://www.nedharvey.com for more details. I disagree with the assumptions you've made above, and I'll say this instead: Look at http://nedharvey.com/iozone_weezer/bobs%20method/iozone%20results%20summary. pdf Go down to the 2nd section, Compared to a single disk Look at single-disk and raidz-5disks and raid5-5disks-hardware You'll see that both raidz and raid5 are significantly faster than a single disk in all types of operations. In all cases, raidz is approximately equal to, or significantly faster than hardware raid5. I had a quick look at your results a moment ago. The problem is that you used a server with 4GB of RAM + a raid card with a 256MB of cache. Then your filesize for iozone was set to 4GB - so random or not you probably had a relatively good cache hit ratio for random reads. And even then a random read from 8 threads gave you only about 40% more IOPS than for a RAID-Z made out of 5 disks than a single drive. The poor result for HW-R5 is surprising though but it might be that a stripe size was not matched to ZFS recordsize and iozone block size in this case. The issue with raid-z and random reads is that as cache hit ratio goes down to 0 the IOPS approaches IOPS of a single drive. For a little bit more information see http://blogs.sun.com/roch/entry/when_to_and_not_to -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
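A rough worked example of that point, with illustrative numbers only: if a single disk delivers about 100 random-read IOPS, a 5-disk raidz1 vdev with a cold cache also delivers roughly 100 IOPS for small random reads, because every filesystem block has to be fetched from all the data disks at once, whereas a pool of mirrors built from the same disks can keep each spindle answering independent reads and approach 400-500 IOPS. So sizing raidz for a random-read target means counting vdevs, not disks: roughly 500 IOPS calls for about five raidz vdevs.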
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
On 20/07/2010 07:59, Chad Cantwell wrote: I've just compiled and booted into snv_142, and I experienced the same slow dd and scrubbing as I did with my 142 and 143 compilations and with the Nexanta 3 RC2 CD. So, this would seem to indicate a build environment/process flaw rather than a regression. Are you sure it is not a debug vs. non-debug issue? -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Debunking the dedup memory myth
On 20/07/2010 04:41, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Richard L. Hamilton

I would imagine that if it's read-mostly, it's a win, but otherwise it costs more than it saves. Even more conventional compression tends to be more resource intensive than decompression...

I would imagine it's *easier* to have a win when it's read-mostly, but the expense of computing checksums is going to be done either way, with or without dedup. The only extra cost dedup adds is to maintain a hash tree of some kind, to see if some block has already been stored on disk. So ... of course I'm speaking hypothetically and haven't been proven ... I think dedup will accelerate the system in nearly all use cases. The main exception is whenever you have highly non-duplicated data. I think the cost of dedup CPU power is tiny little small, but in the case of highly non-duplicated data, even that little expense is a waste.

Please note that by default ZFS uses fletcher4 checksums, but dedup currently allows only sha256, which is more CPU intensive. Also, from a performance point of view there will be a sudden drop in write performance the moment the DDT can't fit entirely in memory. L2ARC could mitigate the impact, though. Then there will be less memory available for data caching due to the extra memory requirements of the DDT. (However, please note that IIRC the DDT is treated as metadata and by default there is a limit on the metadata cache size to be no bigger than 20% of the ARC - there is a bug open for it, I haven't checked if it's been fixed yet or not.)

What I'm wondering is when dedup is a better value than compression. Whenever files have internal repetition, compression will be better. Whenever the repetition crosses file barriers, dedup will be better.

Not necessarily. Compression in ZFS works only within a single fs block scope. So for example if you have a large file with most of its blocks identical, dedup should compress the file much better than compression would. Also please note that you can use both compression and dedup at the same time.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
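One practical way to judge up front whether the DDT will fit in memory is to let zdb simulate dedup on existing data (pool name is hypothetical and the ratios shown are illustrative output; this walks the whole pool, so it can take a while):

    # zdb -S tank
    Simulated DDT histogram:
    ...
    dedup = 1.64, compress = 1.22, copies = 1.01, dedup * compress / copies = 1.98

The histogram gives the number of unique blocks; at very roughly a few hundred bytes of core per DDT entry, that count times the per-entry size is an estimate of the RAM (or L2ARC) the table will need, and the ratio line tells you whether dedup is worth it at all for that data.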
Re: [zfs-discuss] Legality and the future of zfs...
On 12/07/2010 16:32, Erik Trimble wrote: ZFS is NOT automatically ACID. There is no guarantee of commits for async write operations. You would have to use synchronous writes to guarantee commits. And, furthermore, I think that there is a strong

# zfs set sync=always pool

will force all I/O (async or sync) to be written synchronously.

ps. still, I'm not saying it would make ZFS ACID.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] carrying on [was: Legality and the future of zfs...]
On 16/07/2010 23:57, Richard Elling wrote: On Jul 15, 2010, at 4:48 AM, BM wrote: 2. No community = stale outdated code. But there is a community. What is lacking is that Oracle, in their infinite wisdom, has stopped producing OpenSolaris developer binary releases. Not to be outdone, they've stopped other OS releases as well. Surely, this is a temporary situation. AFAIK the dev OSOL releases are still being produced - they haven't been made public since b134 though. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raid-z - not even iops distribution
On 23/06/2010 18:50, Adam Leventhal wrote: Does it mean that for dataset used for databases and similar environments where basically all blocks have fixed size and there is no other data all parity information will end-up on one (z1) or two (z2) specific disks? No. There are always smaller writes to metadata that will distribute parity. What is the total width of your raidz1 stripe? 4x disks, 16KB recordsize, 128GB file, random read with 16KB block. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raid-z - not even iops distribution
On 23/06/2010 19:29, Ross Walker wrote: On Jun 23, 2010, at 1:48 PM, Robert Milkowskimi...@task.gda.pl wrote: 128GB. Does it mean that for dataset used for databases and similar environments where basically all blocks have fixed size and there is no other data all parity information will end-up on one (z1) or two (z2) specific disks? What's the record size on those datasets? 8k? 16K ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raid-z - not even iops distribution
On 24/06/2010 14:32, Ross Walker wrote: On Jun 24, 2010, at 5:40 AM, Robert Milkowskimi...@task.gda.pl wrote: On 23/06/2010 18:50, Adam Leventhal wrote: Does it mean that for dataset used for databases and similar environments where basically all blocks have fixed size and there is no other data all parity information will end-up on one (z1) or two (z2) specific disks? No. There are always smaller writes to metadata that will distribute parity. What is the total width of your raidz1 stripe? 4x disks, 16KB recordsize, 128GB file, random read with 16KB block. From what I gather each 16KB record (plus parity) is spread across the raidz disks. This causes the total random IOPS (write AND read) of the raidz to be that of the slowest disk in the raidz. Raidz is definitely made for sequential IO patterns not random. To get good random IO with raidz you need a zpool with X raidz vdevs where X = desired IOPS/IOPS of single drive.

I know that, and it wasn't my question.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raid-z - not even iops distribution
On 24/06/2010 15:54, Bob Friesenhahn wrote: On Thu, 24 Jun 2010, Ross Walker wrote: Raidz is definitely made for sequential IO patterns not random. To get good random IO with raidz you need a zpool with X raidz vdevs where X = desired IOPS/IOPS of single drive. Remarkably, I have yet to see mention of someone testing a raidz which is comprised entirely of FLASH SSDs. This should help with the IOPS, particularly when reading.

I have. Briefly:

X4270, 2x quad-core 2.93GHz, 72GB RAM
Open Solaris 2009.06 (snv_111b), ARC limited to 4GB
44x SSD in a F5100. 4x SAS HBAs, 4x physical SAS connections to the F5100 (16x SAS channels in total), each to a different domain.

1. RAID-10 pool: 22x mirrors across domains
ZFS: 16KB recordsize, atime=off
randomread filebench benchmark with a 16KB block size with 1, 16, ..., 128 threads, 128GB working set.
maximum performance at 128 threads: ~137,000 ops/s

2. RAID-Z pool: 11x 4-way RAID-Z, each raid-z vdev across domains
ZFS: recordsize=16k, atime=off
randomread filebench benchmark with a 16KB block size with 1, 16, ..., 128 threads, 128GB working set.
maximum performance at 64-128 threads: ~34,000 ops/s
With a ZFS recordsize of 32KB it got up to ~41,000 ops/s. Larger ZFS record sizes produced worse results.

RAID-Z delivered about 3.3x fewer ops/s compared to RAID-10 here. SSDs do not make any fundamental change here and RAID-Z characteristics are basically the same whether it is configured out of SSDs or HDDs. However, SSDs could of course provide good-enough performance even with RAID-Z, as at the end of the day it is not about benchmarks but about your environment's requirements. A given number of SSDs in a RAID-Z configuration is able to deliver the same performance as a much greater number of disk drives in a RAID-10 configuration, and if you don't need much space it could make sense.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
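For reference, a rough sketch of how two such pools could be laid out (device names are made up; the real setup spread the 44 F5100 devices across four SAS domains):

(RAID-10 pool: 22 two-way mirrors, each mirror across two domains)
# zpool create t10 mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0 ... [22 mirrors in total]
# zfs set recordsize=16k t10
# zfs set atime=off t10

(RAID-Z pool: 11 four-way raidz groups, each group spread across the four domains)
# zpool create tz raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 raidz c1t1d0 c2t1d0 c3t1d0 c4t1d0 ... [11 groups in total]
# zfs set recordsize=16k tz
# zfs set atime=off tz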
Re: [zfs-discuss] raid-z - not even iops distribution
On 24/06/2010 20:52, Arne Jansen wrote: Ross Walker wrote: Raidz is definitely made for sequential IO patterns not random. To get good random IO with raidz you need a zpool with X raidz vdevs where X = desired IOPS/IOPS of single drive. I have seen statements like this repeated several times, though I haven't been able to find an in-depth discussion of why this is the case. From what I've gathered every block (what is the correct term for this? zio block?) written is spread across the whole raid-z. But in what units? will a 4k write be split into 512 byte writes? And in the opposite direction, every block needs to be read fully, even if only parts of it are being requested, because the checksum needs to be checked? Will the parity be read, too? If this is all the case, I can see why raid-z reduces the performance of an array effectively to one device w.r.t. random reads. http://blogs.sun.com/roch/entry/when_to_and_not_to -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raid-z - not even iops distribution
128GB. Does it mean that for a dataset used for databases and similar environments, where basically all blocks have a fixed size and there is no other data, all parity information will end up on one (z1) or two (z2) specific disks?

On 23/06/2010 17:51, Adam Leventhal wrote: Hey Robert, How big of a file are you making? RAID-Z does not explicitly do the parity distribution that RAID-5 does. Instead, it relies on non-uniform stripe widths to distribute IOPS. Adam

On Jun 18, 2010, at 7:26 AM, Robert Milkowski wrote: Hi,

zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
    raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
    raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
    raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
    [...]
    raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0
zfs set atime=off test
zfs set recordsize=16k test (I know...)

Now if I create one large file with filebench and simulate a randomread workload with 1 or more threads, then disks on the c2 and c3 controllers are getting about 80% more reads. This happens both on 111b and snv_134. I would rather expect all of them to get about the same number of iops. Any idea why?

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hot detach of disks, ZFS and FMA integration
On 18/06/2010 00:18, Garrett D'Amore wrote: On Thu, 2010-06-17 at 18:38 -0400, Eric Schrock wrote: On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON. [...] I guess the fact that the SS7000 code isn't kept up to date in ON means that we may wind up having to do our own thing here... it's a bit unfortunate, but ok.

Eric - is it a business decision that the discussed code is not in ON, or do you actually intend to get it integrated into ON? Because if you do, then I think that getting the Nexenta guys to expand on it would be better for everyone instead of having them reinvent the wheel...

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] raid-z - not even iops distribution
Hi,

zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
    raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
    raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
    raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
    [...]
    raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0
zfs set atime=off test
zfs set recordsize=16k test (I know...)

Now if I create one large file with filebench and simulate a randomread workload with 1 or more threads, then disks on the c2 and c3 controllers are getting about 80% more reads. This happens both on 111b and snv_134. I would rather expect all of them to get about the same number of iops. Any idea why?

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
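For anyone wanting to reproduce this, the filebench invocation looks roughly like the sketch below (quoted from memory - the randomread personality and its variable names may differ slightly between filebench releases), with iostat in a second terminal to watch the per-controller distribution:

# filebench
filebench> load randomread
filebench> set $dir=/test
filebench> set $filesize=128g
filebench> set $iosize=16k
filebench> set $nthreads=16
filebench> run 60

(in another terminal)
# iostat -xnz 5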
Re: [zfs-discuss] Question : Sun Storage 7000 dedup ratio per share
On 18/06/2010 14:47, ??? wrote: Dear All: Under the Sun Storage 7000 system, can we see a per-share ratio after enabling the dedup function? We would like to see each share's dedup ratio. The Web GUI only shows the dedup ratio for the entire storage pool.

Since dedup works across all datasets with dedup enabled in a pool, you can't really get a dedup ratio per share.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] At what level does the “zfs ” directory exist?
On 17/06/2010 09:18, MichaelHoy wrote: First thing, it’s simply not practical to have so many file systems. I’d already tested 5k and boot time was unacceptable, never mind the other inherent implications of such a strategy. Therefore, access to Previous Versions via Windows is out.

Previous Versions should work even if you have one large filesystem with all user homes as directories within it. What Solaris/OpenSolaris version did you try for the 5k test?

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OCZ Devena line of enterprise SSD
On 15/06/2010 18:46, Brandon High wrote: On Mon, Jun 14, 2010 at 2:07 PM, Roger Hernandezrhvar...@gmail.com wrote: OCZ has a new line of enterprise SSDs, based on the SandForce 1500 controller. The SLC based drive should be great as a ZIL, and the MLC drives should be a close second. Neither is cost effective as a L2ARC, since the cache device doesn't require resiliency or high random iops. A previous generation drive (such as the Vertex or X25-M) is probably sufficient.

If you don't need high random iops from your l2arc, then perhaps you don't need an l2arc at all? The whole point of having L2ARC is to serve high random read iops from RAM and the L2ARC device instead of from the disk drives in the main pool.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] At what level does the “zfs ” directory exist?
On 16/06/2010 09:11, Arne Jansen wrote: MichaelHoy wrote: I’ve posted a query regarding the visibility of snapshots via CIFS here (http://opensolaris.org/jive/thread.jspa?threadID=130577tstart=0) however, I’m beginning to suspect that it may be a more fundamental ZFS question so I’m asking the same question here. At what level does the “zfs” directory exist? If the “.zfs” subdirectory only exists as the direct child of the mount point then can someone suggest how I can make it visible lower down without requiring me (even if it were possible for 50k users) to make each users’ home folder a file system? By way of a background, I’m looking at the possibility of hosting our students personal file space on OpenSolaris since the capacities required go well beyond my budget to keep investing in our NetApp kit. So far I’ve managed to implement the same functionality however, the visibility of the snapshots to allow self-service file restores is a real issue which may prevent me for going forward on this platform. I’d appreciate any suggestions. Do you only want to share the filesystem via CIFS? Have you had a look at the shadow_copy2 extension for samba? It maps the snapshots so windows can access them via previous versions from the explorers context menu. btw: the CIFS service supports Windows Shadow Copies out-of-the-box. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
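One related knob worth knowing about: the .zfs directory only exists at the root of each filesystem, but it can at least be made visible there instead of hidden (dataset name is just an example):

# zfs get snapdir tank/home
# zfs set snapdir=visible tank/home

Users can then browse tank/home/.zfs/snapshot/<snapshot-name>/ directly, in addition to whatever Previous Versions support the CIFS service exposes.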
Re: [zfs-discuss] Scrub issues
On 14/06/2010 22:12, Roy Sigurd Karlsbakk wrote: Hi all It seems zfs scrub is taking a big bit out of I/O when running. During a scrub, sync I/O, such as NFS and iSCSI is mostly useless. Attaching an SLOG and some L2ARC helps this, but still, the problem remains in that the scrub is given full priority. Is this problem known to the developers? Will it be addressed? http://sparcv9.blogspot.com/2010/06/slower-zfs-scrubsresilver-on-way.html http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473 -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On 10/06/2010 20:43, Andrey Kuzmin wrote: As to your results, it sounds almost too good to be true. As Bob has pointed out, h/w design targeted hundreds IOPS, and it was hard to believe it can scale 100x. Fantastic. But it actually can do over 100k. Also several thousand IOPS on a single FC port is nothing unusual and has been the case for at least several years. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On 11/06/2010 09:22, sensille wrote: Andrey Kuzmin wrote: On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling richard.ell...@gmail.commailto:richard.ell...@gmail.com wrote: On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote: Andrey Kuzmin wrote: Well, I'm more accustomed to sequential vs. random, but YMMW. As to 67000 512 byte writes (this sounds suspiciously close to 32Mb fitting into cache), did you have write-back enabled? It's a sustained number, so it shouldn't matter. That is only 34 MB/sec. The disk can do better for sequential writes. Note: in ZFS, such writes will be coalesced into 128KB chunks. So this is just 256 IOPS in the controller, not 64K. No, it's 67k ops, it was a completely ZFS-free test setup. iostat also confirmed the numbers. It's a really simple test that everyone can do:

# dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512

I did a test on my workstation a moment ago and got about 21k IOPS from my sata drive (iostat). The trick here, of course, is that this is a sequential write with no other workload going on, and a drive should be able to nicely coalesce these IOs and do sequential writes with large blocks.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On 11/06/2010 10:58, Andrey Kuzmin wrote: On Fri, Jun 11, 2010 at 1:26 PM, Robert Milkowski mi...@task.gda.pl mailto:mi...@task.gda.pl wrote: On 11/06/2010 09:22, sensille wrote: Andrey Kuzmin wrote: On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling richard.ell...@gmail.com mailto:richard.ell...@gmail.commailto:richard.ell...@gmail.com mailto:richard.ell...@gmail.com wrote: On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote: Andrey Kuzmin wrote: Well, I'm more accustomed to sequential vs. random, but YMMW. As to 67000 512 byte writes (this sounds suspiciously close to 32Mb fitting into cache), did you have write-back enabled? It's a sustained number, so it shouldn't matter. That is only 34 MB/sec. The disk can do better for sequential writes. Note: in ZFS, such writes will be coalesced into 128KB chunks. So this is just 256 IOPS in the controller, not 64K. No, it's 67k ops, it was a completely ZFS-free test setup. iostat also confirmed the numbers. It's a really simple test everyone can do it. # dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512 I did a test on my workstation a moment ago and got about 21k IOPS from my sata drive (iostat). The trick here of course is that this is sequentail write with no other workload going on and a drive should be able to nicely coalesce these IOs and do a sequential writes with large blocks. Exactly, though one might still wonder where the coalescing actually happens, in the respective OS layer or in the controller. Nonetheless, this is hardly a common use-case one would design h/w for. in the above example it happens inside a disk drive. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On 21/10/2009 03:54, Bob Friesenhahn wrote: I would be interested to know how many IOPS an OS like Solaris is able to push through a single device interface. The normal driver stack is likely limited as to how many IOPS it can sustain for a given LUN since the driver stack is optimized for high latency devices like disk drives. If you are creating a driver stack, the design decisions you make when requests will be satisfied in about 12ms would be much different than if requests are satisfied in 50us. Limitations of existing software stacks are likely reasons why Sun is designing hardware with more device interfaces and more independent devices.

Open Solaris 2009.06, 1KB READ I/O:

# dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
# iostat -xnzCM 1 | egrep "device|c[0123]$"
[...]
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17497.3    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17498.8    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17277.6    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17441.3    0.0   17.0    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17333.9    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0

Now let's see how it looks for a single SAS connection, but with dd to 11x SSDs:

# dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t1d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t2d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t4d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t5d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t6d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t7d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t8d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t9d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t10d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t11d0p0
# iostat -xnzCM 1 | egrep "device|c[0123]$"
[...]
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104243.3    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104249.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104208.1    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104245.8    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 966 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104221.9    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104212.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0

It looks like a single CPU core still hasn't been saturated, and the bottleneck is in the device rather than the OS/CPU. So the MPT driver in Solaris 2009.06 can do at least 100,000 IOPS to a single SAS port. It also scales well - I ran the above dd's over 4x SAS ports at the same time and it scaled linearly, achieving well over 400k IOPS.

hw used: x4270, 2x Intel X5570 2.93GHz, 4x SAS SG-PCIE8SAS-E-Z (fw. 1.27.3.0), connected to F5100.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On 10/06/2010 15:39, Andrey Kuzmin wrote: On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl mailto:mi...@task.gda.pl wrote: On 21/10/2009 03:54, Bob Friesenhahn wrote: I would be interested to know how many IOPS an OS like Solaris is able to push through a single device interface. The normal driver stack is likely limited as to how many IOPS it can sustain for a given LUN since the driver stack is optimized for high latency devices like disk drives. If you are creating a driver stack, the design decisions you make when requests will be satisfied in about 12ms would be much different than if requests are satisfied in 50us. Limitations of existing software stacks are likely reasons why Sun is designing hardware with more device interfaces and more independent devices. Open Solaris 2009.06, 1KB READ I/O: # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0 /dev/null is usually a poor choice for a test like this. Just to be on the safe side, I'd rerun it with /dev/random.

That wouldn't work, would it? Please notice that I'm reading *from* an ssd and writing *to* /dev/null

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] ZFS ARC cache issue
On 04/06/2010 15:46, James Carlson wrote: Petr Benes wrote: add to /etc/system something like (value depends on your needs):

* limit greedy ZFS to 4 GiB
set zfs:zfs_arc_max = 4294967296

And yes, this has nothing to do with zones :-). That leaves unanswered the underlying question: why do you need to do this at all? Isn't the ZFS ARC supposed to release memory when the system is under pressure? Is that mechanism not working well in some cases ... ?

My understanding is that if kmem gets heavily fragmented, ZFS won't be able to give back much memory.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Odd dump volume panic
On 12/05/2010 22:19, Ian Collins wrote: On 05/13/10 03:27 AM, Lori Alt wrote: On 05/12/10 04:29 AM, Ian Collins wrote: I just tried moving a dump volume from rpool into another pool, so I used zfs send/receive to copy the volume (to keep some older dumps) then ran dumpadm -d to use the new location. This caused a panic. Nothing ended up in messages and needless to say, there isn't a dump! Creating a new volume and using that worked fine. This was on Solaris 10 update 8. Has anyone else seen anything like this? The fact that a panic occurred is some kind of bug, but I'm also not surprised that this didn't work. Dump volumes have specialized behavior and characteristics and using send/receive to move them (or any other way to move them) is probably not going to work. You need to extract the dump from the dump zvol using savecore and then move the resulting file. I'm surprised. I thought the volume used for dump is just a normal zvol or other block device. I didn't realise there was any relationship between a zvol and its contents. One odd thing I did notice was the device size was reported differently on the new pool:

zfs get all space/dump
NAME        PROPERTY              VALUE                  SOURCE
space/dump  type                  volume                 -
space/dump  creation              Wed May 12 20:56 2010  -
space/dump  used                  12.9G                  -
space/dump  available             201G                   -
space/dump  referenced            12.9G                  -
space/dump  compressratio         1.01x                  -
space/dump  reservation           none                   default
space/dump  volsize               16G                    -
space/dump  volblocksize          128K                   -
space/dump  checksum              on                     default
space/dump  compression           on                     inherited from space
space/dump  readonly              off                    default
space/dump  shareiscsi            off                    default
space/dump  copies                1                      default
space/dump  refreservation        none                   default
space/dump  primarycache          all                    default
space/dump  secondarycache        all                    default
space/dump  usedbysnapshots       0                      -
space/dump  usedbydataset         12.9G                  -
space/dump  usedbychildren        0                      -
space/dump  usedbyrefreservation  0                      -

zfs get all rpool/dump
NAME        PROPERTY        VALUE                  SOURCE
rpool/dump  type            volume                 -
rpool/dump  creation        Thu Jun 25 19:40 2009  -
rpool/dump  used            16.0G                  -
rpool/dump  available       10.4G                  -
rpool/dump  referenced      16K                    -
rpool/dump  compressratio   1.00x                  -
rpool/dump  reservation     none                   default
rpool/dump  volsize         16G                    -
rpool/dump  volblocksize    8K                     -
rpool/dump  checksum        off                    local
rpool/dump  compression     off                    local
rpool/dump  readonly        off                    default
rpool/dump  shareiscsi      off                    default
rpool/dump  copies          1                      default
rpool/dump  refreservation  none                   default
rpool/dump  primarycache    all                    default
rpool/dump  secondarycache  all                    default

A zvol used as a dump device has some constraints in regard to its settings like checksum, compression, etc. For more details see: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zvol.c#1683

See that space/dump has checksums turned on, compression turned on, etc. while rpool/dump doesn't. Additionally, all blocks need to be pre-allocated (http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zvol.c#1785) - but zfs send|recv should replicate that, I think.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
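In other words, the safe way to relocate a dump device is to create a fresh zvol in the target pool and point dumpadm at it, rather than replicating the old dump zvol - roughly like this (sizes and names are only an example):

# savecore
(extract any saved crash dump from the current dump device first)
# zfs create -V 16G space/dump
# dumpadm -d /dev/zvol/dsk/space/dump
# zfs destroy rpool/dump
(only once the new dump device is active)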
[zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...
With the put back of:

[PSARC/2010/108] zil synchronicity

zfs datasets now have a new 'sync' property to control synchronous behaviour. The zil_disable tunable to turn synchronous requests into asynchronous requests (disable the ZIL) has been removed. For systems that use that switch, on upgrade you will now see a message on booting:

sorry, variable 'zil_disable' is not defined in the 'zfs' module

Please update your system to use the new sync property. Here is a summary of the property:

---
The options and semantics for the zfs sync property:

sync=standard
This is the default option. Synchronous file system transactions (fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log) and then secondly all devices written are flushed to ensure the data is stable (not cached by device controllers).

sync=always
For the ultra-cautious, every file system transaction is written and flushed to stable storage by system call return. This obviously has a big performance penalty.

sync=disabled
Synchronous requests are disabled. File system transactions only commit to stable storage on the next DMU transaction group commit, which can be many seconds. This option gives the highest performance, with no risk of corrupting the pool. However, it is very dangerous as ZFS is ignoring the synchronous transaction demands of applications such as databases or NFS. Setting sync=disabled on the currently active root or /var file system may result in out-of-spec behavior or application data loss and increased vulnerability to replay attacks. Administrators should only use this when these risks are understood.

The property can be set when the dataset is created, or dynamically, and will take effect immediately. To change the property, an administrator can use the standard 'zfs' command. For example:

# zfs create -o sync=disabled whirlpool/milek
# zfs set sync=always whirlpool/perrin

-- Team ZIL.

It should be in build 140. For a little bit more information on it you might look at http://milek.blogspot.com/2010/05/zfs-synchronous-vs-asynchronous-io.html

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...
On 06/05/2010 12:24, Pawel Jakub Dawidek wrote: I read that this property is not inherited and I can't see why. If what I read is up-to-date, could you tell why?

It is inherited. Sorry for the confusion, but there was a discussion about whether it should or should not be inherited; we proposed that it shouldn't, but it was changed again during the PSARC review so that it should. And I did a copy'n'paste here. Again, sorry for the confusion.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
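A quick way to see the inheritance in action, reusing the dataset name from the heads-up mail (any child datasets below it are hypothetical):

# zfs set sync=disabled whirlpool/milek
# zfs get -r sync whirlpool/milek

The SOURCE column shows 'local' on whirlpool/milek and 'inherited from whirlpool/milek' on its descendants.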
Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...
On 06/05/2010 13:12, Robert Milkowski wrote: On 06/05/2010 12:24, Pawel Jakub Dawidek wrote: I read that this property is not inherited and I can't see why. If what I read is up-to-date, could you tell why? It is inherited. Sorry for the confusion but there was a discussion if it should or should not be inherited, then we propose that it shouldn't but it was changed again during a PSARC review that it should. And I did a copy'n'paste here. Again, sorry for the confusion. Well, actually I did copy'n'paste a proper page as it doesn't say anything about inheritance. Nevertheless, yes it is inherited. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Loss of L2ARC SSD Behaviour
On 06/05/2010 15:31, Tomas Ögren wrote: On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes: On Wed, 5 May 2010, Edward Ned Harvey wrote: In the L2ARC (cache) there is no ability to mirror, because cache device removal has always been supported. You can't mirror a cache device, because you don't need it. How do you know that I don't need it? The ability seems useful to me. The gain is quite minimal.. If the first device fails (which doesn't happen too often I hope), then it will be read from the normal pool once and then stored in ARC/L2ARC again. It just behaves like a cache miss for that specific block... If this happens often enough to become a performance problem, then you should throw away that L2ARC device because it's broken beyond usability.

Well, if an L2ARC device fails there might be an unacceptable drop in delivered performance. If it were mirrored, then the drop usually would be much smaller, or there could be no drop at all if a mirror had an option to read from only one side.

Being able to mirror L2ARC might be especially useful once a persistent L2ARC is implemented, as after a node restart or a resource failover in a cluster the L2ARC will be kept warm. Then the only thing which might affect L2 performance considerably would be an L2ARC device failure...

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Loss of L2ARC SSD Behaviour
On 06/05/2010 19:08, Michael Sullivan wrote: Hi Marc, Well, if you are striping over multiple devices then your I/O should be spread over the devices and you should be reading them all simultaneously rather than just accessing a single device. Traditional striping would give 1/n performance improvement rather than 1/1 where n is the number of disks the stripe is spread across. The round-robin access I am referring to, is the way the L2ARC vdevs appear to be accessed. So, any given object will be taken from a single device rather than from several devices simultaneously, thereby increasing the I/O throughput. So, theoretically, a stripe spread over 4 disks would give 4 times the performance as opposed to reading from a single disk. This also assumes the controller can handle multiple I/O as well or that you are striped over different disk controllers for each disk in the stripe. SSD's are fast, but if I can read a block from more devices simultaneously, it will cut the latency of the overall read.

Keep in mind that the largest block is currently 128KB and you always need to read an entire block. Splitting a block across several L2ARC devices would probably decrease performance, and it would invalidate all blocks if even a single l2arc device died. Additionally, having each block on only one l2arc device still allows reading from all of the l2arc devices at the same time.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...
On 06/05/2010 21:45, Nicolas Williams wrote: On Thu, May 06, 2010 at 03:30:05PM -0500, Wes Felter wrote: On 5/6/10 5:28 AM, Robert Milkowski wrote: sync=disabled Synchronous requests are disabled. File system transactions only commit to stable storage on the next DMU transaction group commit which can be many seconds. Is there a way (short of DTrace) to write() some data and get notified when the corresponding txg is committed? Think of it as a poor man's group commit. fsync(2) is it. Of course, if you disable sync writes then there's no way to find out for sure. If you need to know when a write is durable, then don't disable sync writes. Nico

There is one way - issue a sync(2) - even with sync=disabled it will sync all filesystems and then return. Another workaround would be to create a snapshot... However, I agree with Nico - if you don't need sync=disabled then don't use it.

Someone else mentioned that yet another option like sync=fsync-only would be useful, so everything would be async except fsync() - but frankly I'm not convinced, as it would require support in your application, and at that point you already have full control of the behavior without the need for sync=disabled.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL behavior on import
On 05/05/2010 20:45, Steven Stallion wrote: All, I had a question regarding how the ZIL interacts with zpool import: Given that the intent log is replayed in the event of a system failure, does the replay behavior differ if -f is passed to zpool import? For example, if I have a system which fails prior to completing a series of writes and I reboot using a failsafe (i.e. install disc), will the log be replayed after a zpool import -f ? yes -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool import with failed ZIL device now possible ?
On 16/02/2010 21:54, Jeff Bonwick wrote: People used fastfs for years in specific environments (hopefully understanding the risks), and disabling the ZIL is safer than fastfs. Seems like it would be a useful ZFS dataset parameter. We agree. There's an open RFE for this: 6280630 zil synchronicity No promise on date, but it will bubble to the top eventually. So everyone knows - it has been integrated into snv_140 :) -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of the ZIL
On 04/05/2010 18:19, Tony MacDoodle wrote: How would one determine if I should have a separate ZIL disk? We are using ZFS as the backend of our Guest Domains boot drives using LDom's. And we are seeing bad/very slow write performance?

If you can disable the ZIL and compare the performance to when it is enabled, it will give you an estimate of the absolute maximum performance increase (if any) you could get by having a dedicated ZIL device.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
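For completeness, the two ways of doing that comparison, depending on the build (dataset name is just an example - and remember to switch it back afterwards):

(builds before snv_140: system-wide only, via /etc/system plus a reboot)
set zfs:zil_disable = 1

(snv_140 and later: per dataset, takes effect immediately)
# zfs set sync=disabled tank/ldoms
# zfs set sync=standard tank/ldoms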
Re: [zfs-discuss] Performance drop during scrub?
On 28/04/2010 21:39, David Dyer-Bennet wrote: The situations being mentioned are much worse than what seem reasonable tradeoffs to me. Maybe that's because my intuition is misleading me about what's available. But if the normal workload of a system uses 25% of its sustained IOPS, and a scrub is run at low priority, I'd like to think that during a scrub I'd see a little degradation in performance, and that the scrub would take 25% or so longer than it would on an idle system. There's presumably some inefficiency, so the two loads don't just add perfectly; so maybe another 5% lost to that? That's the big uncertainty. I have a hard time believing in 20% lost to that.

Well, it's not that easy, as there are many other factors you need to take into account. For example, how many IOs are you allowing to be queued per device? This might affect latency for your application. Or if you have a disk array with its own cache - just by doing a scrub you might be pushing other entries out of the cache, which might impact the performance of your application. Then there might be a SAN, and so on.

I'm not saying there is no room for improvement here. All I'm saying is that it is not as easy a problem as it seems.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Compellant announces zNAS
On 29/04/2010 07:57, Phil Harman wrote: That screen shot looks very much like Nexenta 3.0 with a different branding. Elsewhere, The Register confirms it's OpenSolaris. Well it looks like it is running Nexenta which is based on Open Solaris. But it is not the Open Solaris *distribution*. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re-attaching zpools after machine termination [amazon ebs ec2]
On 26/04/2010 09:27, Phillip Oldham wrote: Then perhaps you should do zpool import -R / pool *after* you attach EBS. That way Solaris won't automatically try to import the pool and your scripts will do it once disks are available. zpool import doesn't work as there was no previous export. I'm trying to solve the case where the instance terminates unexpectedly; think of someone just pulling the plug. There's no way to do the export operation before it goes down, but I still need to bring it back up, attach the EBS drives and continue as previous. The start/attach/reboot/available cycle is interesting, however. I may be able to init a reboot after attaching the drives, but it's not optimal - there's always a chance the instance might not come back up after the reboot. And it still doesn't answer *why* the drives aren't showing any data after they're initially attached.

You don't have to do exports, as I suggested using 'zpool import -R / pool' (notice the -R). If you do so, the pool won't be added to zpool.cache, and therefore after a reboot (unexpected or not) you will be able to import it again (and do so with -R). That way you can easily script it so the import happens after your disks are available.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
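A minimal sketch of such a script (pool name is hypothetical; the EBS attach itself is whatever EC2 tooling you already use):

#!/bin/sh
# wait until the pool on the newly attached EBS volumes becomes visible
until zpool import 2>/dev/null | grep "pool: data" > /dev/null; do
        sleep 10
done
# import with an altroot so the pool is not recorded in zpool.cache
zpool import -R / data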
Re: [zfs-discuss] Re-attaching zpools after machine termination [amazon ebs ec2]
On 26/04/2010 11:14, Phillip Oldham wrote: You don't have to do exports as I suggested to use 'zpool -R / pool' (notice -R). I tried this after your suggestion (including the -R switch) but it failed, saying the pool I was trying to import didn't exist. which means it couldn't discover it. does 'zpool import' (no other options) list the pool? -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pool, what happen when disk failure
On 25/04/2010 13:08, Edward Ned Harvey wrote: The system should boot-up properly even if some pools are not accessible (except rpool of course). If it is not the case then there is a bug - last time I checked it worked perfectly fine. This may be different in the latest opensolaris, but in the latest solaris, this is what I know: If a pool fails, and forces an ungraceful shutdown, then during the next bootup, the pool is treated as currently in use by another system. The OS doesn't come up all the way; you have to power cycle again, and go into failsafe mode. Then you can zpool import I think requiring the -f or -F, and reboot again normal. I just did a test on Solaris 10/09 - and system came up properly, entirely on its own, with a failed pool. zpool status showed the pool as unavailable (as I removed an underlying device) which is fine. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Benchmarking Methodologies
On 21/04/2010 18:37, Ben Rockwood wrote: You've made an excellent case for benchmarking and where it's useful, but what I'm asking for on this thread is for folks to share the research they've done with as much specificity as possible for research purposes. :)

However, you can also find some benchmarks with sysbench + mysql or oracle. I don't remember whether or not I posted some of my results, but I'm pretty sure you can find others.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
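As a concrete example of the sysbench + MySQL approach (sysbench 0.4.x-style flags; database and connection details are obviously site-specific):

# sysbench --test=oltp --mysql-user=bench --mysql-db=sbtest \
    --oltp-table-size=10000000 prepare
# sysbench --test=oltp --mysql-user=bench --mysql-db=sbtest \
    --oltp-table-size=10000000 --num-threads=16 \
    --max-time=300 --max-requests=0 run

Repeating the run while changing exactly one thing at a time (recordsize, a separate slog, another filesystem, ...) is what makes the numbers comparable.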
Re: [zfs-discuss] ZFS Pool, what happen when disk failure
On 24/04/2010 13:51, Edward Ned Harvey wrote: But what you might not know: If any pool fails, the system will crash.

This actually depends on the failmode property setting on your pools. The default is wait, but it can also be set to continue or panic - see the zpool(1M) man page for more details.

You will need to power cycle. The system won't boot up again; you'll have to

The system should boot-up properly even if some pools are not accessible (except rpool of course). If it is not the case then there is a bug - last time I checked it worked perfectly fine.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
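Checking and changing it is a one-liner (pool name is just an example); wait blocks I/O until the devices come back, continue returns errors for new I/O, and panic brings the box down:

# zpool get failmode tank
# zpool set failmode=continue tank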
Re: [zfs-discuss] Re-attaching zpools after machine termination [amazon ebs ec2]
On 23/04/2010 13:38, Phillip Oldham wrote: The instances are ephemeral; once terminated they cease to exist, as do all their settings. Rebooting an image keeps any EBS volumes attached, but this isn't the case I'm dealing with - its when the instance terminates unexpectedly. For instance, if a reboot operation doesn't succeed or if there's an issue with the data-centre. There isn't any way (yet, AFACT) to attach an EBS during the boot process, so they must be attached after boot. Then perhaps you should do zpool import -R / pool *after* you attach EBS. That way Solaris won't automatically try to import the pool and your scripts will do it once disks are available. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is file cloning anywhere on ZFS roadmap
On 21/04/2010 07:41, Schachar Levin wrote: Hi, We are currently using the NetApp file clone option to clone multiple VMs on our FS. The ZFS dedup feature is great storage-space wise, but when we need to clone a lot of VMs it just takes a lot of time. Is there a way (or a planned way) to clone a file without going through the process of actually copying the blocks, but just duplicating its metadata like NetApp does?

I don't know about file cloning, but why not put each VM on top of a zvol - then you can clone the zvol?

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
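A rough sketch of that approach (sizes and names made up): prepare one golden image on a zvol, snapshot it, then clone the snapshot once per VM - clones share blocks with their origin, so they are created in seconds and take almost no additional space up front:

# zfs create -V 20G tank/vm-gold
(install and prepare the golden VM image on /dev/zvol/rdsk/tank/vm-gold)
# zfs snapshot tank/vm-gold@golden
# zfs clone tank/vm-gold@golden tank/vm01
# zfs clone tank/vm-gold@golden tank/vm02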
Re: [zfs-discuss] Double slash in mountpoint
But it suggests that it had nothing to do with a double slash - rather some process (your shell?) had an open file within the mountpoint. But by supplying -f you forced zfs to unmount it anyway.

-- Robert Milkowski http://milek.blogspot.com

On 21/04/2010 06:16, Ryan John wrote: Thanks. That was it -Original Message- From: Brandon High [mailto:bh...@freaks.com] Sent: Wednesday, 21 April 2010 6:57 AM To: Ryan John Cc: zfs-discuss Subject: Re: [zfs-discuss] Double slash in mountpoint On Tue, Apr 20, 2010 at 7:38 PM, Ryan Johnjohn.r...@bsse.ethz.ch wrote: Anyone know how to fix it? I can't even do a zfs destroy zfs unmount -a -f -B
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Benchmarking Methodologies
On 21/04/2010 04:43, Ben Rockwood wrote: I'm doing a little research study on ZFS benchmarking and performance profiling. Like most, I've had my favorite methods, but I'm re-evaluating my choices and trying to be a bit more scientific than I have in the past. To that end, I'm curious if folks wouldn't mind sharing their work on the subject? What tool(s) to you prefer in what situations? Do you have a standard method of running them (tool args; block sizes, thread counts, ...) or procedures between runs (zpool import/export, new dataset creation,...)? etc. Any feedback is appreciated. I want to get a good sampling of opinions.

I haven't heard from you in a while! Good to see you here again :)

Sorry for stating the obvious, but at the end of the day it depends on what your goals are. Are you interested in micro-benchmarks and comparison to other file systems? I think the most relevant filesystem benchmarks for users are ones which benchmark a specific application and present results from the application's point of view. For example, given a workload for Oracle, MySQL, LDAP, ... how quickly does it complete? How much benefit is there from using SSDs? What about other filesystems? Micro-benchmarks are fine but very hard for most users to interpret properly.

Additionally, most benchmarks are almost useless if they are not compared to some other configuration with only the benchmarked component changed. For example, knowing that some MySQL load completes in 1h on ZFS is basically useless. But knowing that on the same HW with Linux/ext3 and under the same load it completes in 2h would be interesting to users. Another interesting thing would be to see the impact of different ZFS settings on benchmark results (recordsize aligned for a database vs. the default, atime off vs. on, lzjb, gzip, SSDs). Also interesting would be a comparison of results with all-default zfs settings against whatever settings gave you the best result.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
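As an illustration of the kind of settings worth varying between runs (names and values are only an example):

# zfs create -o recordsize=8k -o atime=off tank/db
# zfs set compression=lzjb tank/db
# zfs set compression=gzip tank/db
# zpool add tank log c5t0d0
(dedicated slog device, e.g. an SSD)
# zpool add tank cache c5t1d0
(L2ARC device)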
Re: [zfs-discuss] casesensitivity mixed and CIFS
On 14/04/2010 16:04, John wrote: Hello, we set our ZFS filesystems to casesensitivity=mixed when we created them. However, CIFS access to these files is still case sensitive. Here is the configuration:

# zfs get casesensitivity pool003/arch
NAME          PROPERTY         VALUE      SOURCE
pool003/arch  casesensitivity  mixed      -
#

At the pool level it's set as follows:

# zfs get casesensitivity pool003
NAME     PROPERTY         VALUE      SOURCE
pool003  casesensitivity  sensitive  -
#

From a Windows client, accessing \\filer\arch\MYFOLDER\myfile.txt fails, while accessing \\filer\arch\myfolder\myfile.txt works. Any ideas? We are running snv_130.

You are not using the Samba daemon, are you?

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
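Also worth remembering: casesensitivity is a create-time-only property, so it cannot be changed on an existing dataset - if a dataset needs a different value it has to be recreated and the data copied over, roughly (names are just an example):

# zfs create -o casesensitivity=mixed pool003/arch_new
(copy the data across, e.g. with zfs send/recv or rsync, then move the mountpoint and SMB share over to the new dataset)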
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 07/04/2010 13:58, Ragnar Sundblad wrote: Rather: ... >= 19 would be ... if you don't mind losing data written in the ~30 seconds before the crash, you don't have to mirror your log device. For a file server, mail server, etc etc, where things are stored and supposed to be available later, you almost certainly want redundancy on your slog too. (There may be file servers where this doesn't apply, but they are special cases that should not be mentioned in the general documentation.)

While I agree with you, I want to mention that it is all about understanding the risk. In this case, not only does your server have to crash in such a way that data has not been synced (sudden power loss, for example), but there would also have to be some data committed to the slog device(s) which was not yet written to the main pool, and when your server restarts the slog device would have to have completely died as well. Other than that, you are fine even with an unmirrored slog device.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 07/04/2010 15:35, Bob Friesenhahn wrote: On Wed, 7 Apr 2010, Ragnar Sundblad wrote: So the recommendation for zpool < 19 would be *strongly* recommended: Mirror your log device if you care about using your pool. And the recommendation for zpool >= 19 would be ... don't mirror your log device. If you have more than one, just add them both unmirrored. Rather: ... >= 19 would be ... if you don't mind losing data written in the ~30 seconds before the crash, you don't have to mirror your log device. It is also worth pointing out that in normal operation the slog is essentially a write-only device which is only read at boot time. The writes are assumed to work if the device claims success. If the log device fails to read (oops!), then a mirror would be quite useful.

It is only read at boot if there is uncommitted data on it - during normal reboots zfs won't read data from the slog.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Question about large pools
On 02/04/2010 05:45, Roy Sigurd Karlsbakk wrote: Hi all From http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide I read Avoid creating a RAIDZ, RAIDZ-2, RAIDZ-3, or a mirrored configuration with one logical device of 40+ devices. See the sections below for examples of redundant configurations. What do they mean by this? 40+ devices in a single raidz[123] set or 40+ devices in a pool regardless of raidz[123] sets?

It means: try to avoid a single RAID-Z group with 40+ disk drives. Creating several smaller groups in one pool is perfectly fine. So, for example, on x4540 servers try to avoid creating a pool with a single RAID-Z3 group made of 44 disks; rather, create 4 RAID-Z2 groups, each made of 11 disks, all of them in a single pool.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
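A sketch of that layout (device names are illustrative; the remaining groups follow the same pattern on the other controllers):

# zpool create tank \
    raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
           c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0 \
    raidz2 c1t0d0 c1t1d0 ... c1t10d0 \
    raidz2 c2t0d0 c2t1d0 ... c2t10d0 \
    raidz2 c3t0d0 c3t1d0 ... c3t10d0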
Re: [zfs-discuss] To slice, or not to slice
On 03/04/2010 19:24, Tim Cook wrote: On Fri, Apr 2, 2010 at 4:05 PM, Edward Ned Harvey guacam...@nedharvey.com mailto:guacam...@nedharvey.com wrote: Momentarily, I will begin scouring the omniscient interweb for information, but I’d like to know a little bit of what people would say here. The question is to slice, or not to slice, disks before using them in a zpool. One reason to slice comes from recent personal experience. One disk of a mirror dies. Replaced under contract with an identical disk. Same model number, same firmware. Yet when it’s plugged into the system, for an unknown reason, it appears 0.001 Gb smaller than the old disk, and therefore unable to attach and un-degrade the mirror. It seems logical this problem could have been avoided if the device added to the pool originally had been a slice somewhat smaller than the whole physical device. Say, a slice of 28G out of the 29G physical disk. Because later when I get the infinitesimally smaller disk, I can always slice 28G out of it to use as the mirror device. There is some question about performance. Is there any additional overhead caused by using a slice instead of the whole physical device? There is another question about performance. One of my colleagues said he saw some literature on the internet somewhere, saying ZFS behaves differently for slices than it does on physical devices, because it doesn’t assume it has exclusive access to that physical device, and therefore caches or buffers differently … or something like that. Any other pros/cons people can think of? And finally, if anyone has experience doing this, and process recommendations? That is … My next task is to go read documentation again, to refresh my memory from years ago, about the difference between “format,” “partition,” “label,” “fdisk,” because those terms don’t have the same meaning that they do in other OSes… And I don’t know clearly right now, which one(s) I want to do, in order to create the large slice of my disks. Your experience is exactly why I suggested ZFS start doing some right sizing if you will. Chop off a bit from the end of any disk so that we're guaranteed to be able to replace drives from different manufacturers. The excuse being no reason to, Sun drives are always of identical size. If your drives did indeed come from Sun, their response is clearly not true. Regardless, I guess I still think it should be done. Figure out what the greatest variation we've seen from drives that are supposedly of the exact same size, and chop it off the end of every disk. I'm betting it's no more than 1GB, and probably less than that. When we're talking about a 2TB drive, I'm willing to give up a gig to be guaranteed I won't have any issues when it comes time to swap it out. that's what open solaris is doing more or less for some time now. look in the archives of this mailing list for more information. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 02/04/2010 16:04, casper@sun.com wrote: sync() is actually *async* and returning from sync() says nothing about to clarify - in case of ZFS sync() is actually synchronous. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 01/04/2010 13:01, Edward Ned Harvey wrote: Is that what sync means in Linux? A sync write is one in which the application blocks until the OS acks that the write has been committed to disk. An async write is given to the OS, and the OS is permitted to buffer the write to disk at its own discretion. Meaning the async write function call returns sooner, and the application is free to continue doing other stuff, including issuing more writes. Async writes are faster from the point of view of the application. But sync writes are done by applications which need to satisfy a race condition for the sake of internal consistency. Applications which need to know their next commands will not begin until after the previous sync write was committed to disk. ROTFL!!! I think you should explain it even further for Casper :) :) :) :) :) :) :) -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 01/04/2010 20:58, Jeroen Roodhart wrote: I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers. The interesting thing is that apparently the defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export...

Which is to be expected, as it is not the nfs client which requests this behavior but rather the nfs server. Currently on Linux you can export a share as sync (the default) or async, while on Solaris you can't really force the NFS server to start working in an async mode.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
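For comparison, the Linux-side knob lives in /etc/exports, e.g. (paths and options are only an example):

/export/data  *(rw,async,no_subtree_check)
/export/logs  *(rw,sync,no_subtree_check)

With async the Linux server acknowledges writes before they reach stable storage, which is roughly the behaviour you get from a ZFS-backed NFS server with the ZIL disabled.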
Re: [zfs-discuss] can't destroy snapshot
On 01/04/2010 15:24, Richard Elling wrote: On Mar 31, 2010, at 7:57 PM, Charles Hedrick wrote: So that eliminates one of my concerns. However, the other one is still an issue. Presumably Solaris Cluster shouldn't import a pool that's still active on the other system. We'll be looking more carefully into that.

Older releases of Solaris Cluster used SCSI reservations to help prevent such things. However, that is now tunable :-( Did you tune it?

SCSI reservations are used only if a node has left the cluster. So, for example, in a two-node cluster, while both nodes are part of the cluster both of them have full access to the shared storage, and you can force a zpool import on both nodes at the same time. When you think about it, you actually need such behaviour for RAC to work on raw devices or real cluster volumes or filesystems, etc.

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't destroy snapshot
On 01/04/2010 02:01, Charles Hedrick wrote: So we tried recreating the pool and sending the data again. 1) Compression wasn't set on the copy, even though I did send -R, which is supposed to send all properties. 2) I tried killing the send | receive pipe. Receive couldn't be killed. It hung. 3) This is Solaris Cluster. We tried forcing a failover. The pool mounted on the other server without dismounting on the first. zpool list showed it mounted on both machines. zpool iostat showed I/O actually occurring on both systems. Altogether this does not give me a good feeling about ZFS. I'm hoping the problem is just with receive and Cluster, and that it works properly on a single system, because I'm running a critical database on ZFS on another system.

1. You shouldn't allow a pool to be imported on more than one node at a time; if you do, you will probably lose the entire pool.

2. If you have a pool under cluster control and you want to import it manually, make sure you do it in this order:
- disable the HAStoragePlus resource which manages the pool
- suspend the resource group so the cluster won't start the storage resource under any circumstances
- manually import the pool and do whatever you need to do with it; however, to be on the safe side, import it with the -R / option so that if your node reboots for some reason the pool won't be automatically imported
- after you are done, make sure you export the pool, resume the resource group and re-enable the storage resource

The other approach is to keep the pool under cluster management but temporarily suspend the resource group so there won't be any unexpected failovers (it really depends on the circumstances and what you are trying to do).

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
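A rough sketch of that manual-import sequence using the Solaris Cluster and ZFS CLIs. The resource, group and pool names are placeholders, so adapt them to your configuration:

    clresource disable hasp-rs        # stop the HAStoragePlus resource managing the pool
    clresourcegroup suspend data-rg   # no automatic starts/failovers while you work
    zpool import -R / mypool          # altroot keeps the pool from auto-importing after a reboot
    # ... do the maintenance work ...
    zpool export mypool
    clresourcegroup resume data-rg
    clresource enable hasp-rs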
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss:

Use something other than Open/Solaris with ZFS as an NFS server? :) I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL.

Well, for lots of environments disabling the ZIL is perfectly acceptable. And frankly, the reason you get better performance out of the box on Linux as an NFS server is that it behaves as if the ZIL were disabled - so disabling the ZIL on ZFS for NFS shares is no worse than using Linux here, or any other OS which behaves in the same manner. Actually it makes things better: even with the ZIL disabled, the ZFS filesystem is always consistent on disk, and you still get all the other benefits of ZFS.

What would be useful, though, is to be able to easily disable the ZIL per dataset instead of via an OS-wide switch. This feature has already been coded and tested and awaits a formal process to be completed in order to get integrated. That should happen sooner rather than later.

You'd be better off getting NetApp

Well, spend some extra money on a really fast NVRAM solution for the ZIL and you will get a much faster ZFS environment than NetApp, and still spend much less money. Not to mention all the extra flexibility compared to NetApp.

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
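For the NVRAM/SSD log-device suggestion, the commands are the usual ones; the per-dataset switch mentioned above eventually integrated as the sync property, shown here only as a hedged illustration of where things were headed (pool and dataset names are made up, and the property was not present in builds at the time of this thread):

    zpool add tank log c4t0d0          # dedicate a fast SSD/NVRAM device as the slog
    zpool status tank                  # the device appears under a separate "logs" section

    # the per-dataset control that later shipped:
    zfs set sync=disabled tank/nfs_scratch
    zfs set sync=standard tank/nfs_scratch   # restore normal semantics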
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Just to make sure you know ... if you disable the ZIL altogether, and you have a power interruption, failed CPU, or kernel halt, then you're likely to have a corrupt unusable zpool, or at least data corruption. If that is indeed acceptable to you, go nuts. ;-)

I believe that the above is wrong information, as long as the devices involved do flush their caches when requested to. ZFS still writes data in order (at the TXG level) and advances to the next transaction group when the devices written to affirm that they have flushed their caches. Without the ZIL, data claimed to be synchronously written since the previous transaction group may be entirely lost. If the devices don't flush their caches appropriately, the ZIL is irrelevant to pool corruption.

I stand corrected. You don't lose your pool. You don't have a corrupted filesystem. But you lose whatever writes were not yet completed, so if those writes happen to be things like database transactions, you could have corrupted databases or files, or missing files if you were creating them at the time, and stuff like that. AKA data corruption. But not pool corruption, and not filesystem corruption.

Which is expected behaviour when you break NFS requirements, as Linux does out of the box. Disabling the ZIL on an NFS server makes it no worse than the standard Linux behaviour: you get decent performance at the cost of some data possibly being corrupted from the NFS client's point of view. But then there are environments where this is perfectly acceptable, because you are not running critical databases there but rather, say, user home directories, and ZFS currently flushes a transaction group after at most 30s, so users can't lose more than the last 30s of work if the NFS server suddenly loses power.

To clarify: if the ZIL is disabled, it makes no difference at all to pool/filesystem-level consistency.

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
standard ZIL:         7m40s  (ZFS default)
1x SSD ZIL:           4m07s  (Flash Accelerator F20)
2x SSD ZIL:           2m42s  (Flash Accelerator F20)
2x SSD mirrored ZIL:  3m59s  (Flash Accelerator F20)
3x SSD ZIL:           2m47s  (Flash Accelerator F20)
4x SSD ZIL:           2m57s  (Flash Accelerator F20)
disabled ZIL:         0m15s  (local extraction 0m0.269s)

I was not so much interested in the absolute numbers but rather in the relative performance differences between the standard ZIL, the SSD ZIL and the disabled ZIL cases.

Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD goes bad, you lose your whole pool. Or at least suffer data corruption.

This is not true. If the ZIL device dies while the pool is imported, ZFS starts using a ZIL within the pool and continues to operate. On the other hand, if your server suddenly loses power and, when you power it up later, ZFS detects that the ZIL device is broken/gone, it will require sysadmin intervention to force the pool import and, yes, you may lose some data. But how is that different from any other solution where your log is put on a separate device? Well, it actually is different: with ZFS you can still guarantee that the pool is consistent on disk, while others generally can't, and you will often have to run fsck just to mount a filesystem read/write...

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
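If losing a separate log device to a power failure is a concern, adding it as a mirror is straightforward. A small sketch with made-up device names:

    zpool add tank log mirror c4t0d0 c4t1d0   # mirrored slog from the start
    zpool status tank                         # both devices listed under "logs"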
Re: [zfs-discuss] Simultaneous failure recovery
I have a pool (on an X4540 running S10U8) in which a disk failed, and the hot spare kicked in. That's perfect. I'm happy. Then a second disk fails. Now, I've replaced the first failed disk, it's resilvered, and I have my hot spare back. But: why hasn't it used the spare to cover the other failed drive? And can I hot-spare it manually? I could do a straight replace, but that isn't quite the same thing.

It seems like it is event-driven. Hmmm... perhaps it shouldn't be. Anyway, you can do a zpool replace and it is the same thing - why wouldn't it be?

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Simultaneous failure recovery
On Tue, Mar 30, 2010 at 10:42 PM, Eric Schrock eric.schr...@oracle.com wrote: On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote: I have a pool (on an X4540 running S10U8) in which a disk failed, and the hot spare kicked in. That's perfect. I'm happy. Then a second disk fails. Now, I've replaced the first failed disk, and it's resilvered and I have my hot spare back. But: why hasn't it used the spare to cover the other failed drive? And can I hotspare it manually? I could do a straight replace, but that isn't quite the same thing.

Hot spares are only activated in response to a fault received by the zfs-retire FMA agent. There is no notion that the spares should be re-evaluated when they become available at a later point in time. Certainly a reasonable RFE, but not something ZFS does today.

Definitely an RFE I would like.

You can 'zpool attach' the spare like a normal device - that's all that the retire agent is doing under the hood.

So, given:

        NAME        STATE     READ WRITE CKSUM
        images      DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c2t0d0  FAULTED      4     0     0  too many errors
            c3t0d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
        spares
          c5t7d0    AVAIL

then it would be this? zpool attach images c2t0d0 c5t7d0

Which I had considered, but the man page for attach says "The existing device cannot be part of a raidz configuration." If I try that, it fails, saying:

invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c5t7d0s0 is reserved as a hot spare for ZFS pool images. Please see zpool(1M).

Thanks!

You need to use zpool replace. Once you fix the failed drive and it re-synchronizes, the hot spare will detach automatically (regardless of whether you forced it to kick in via zpool replace or it did so due to FMA). For more details see http://blogs.sun.com/eschrock/entry/zfs_hot_spares

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
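In other words, for the degraded raidz1 shown above, the manual activation and the later cleanup would look roughly like this (a sketch; the spare detaches on its own once the original slot is healthy again):

    zpool replace images c2t0d0 c5t7d0   # press the hot spare into service by hand
    # after physically swapping a good disk into the c2t0d0 slot:
    zpool replace images c2t0d0          # resilver the new disk; the spare detaches afterwards
    zpool status images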
Re: [zfs-discuss] bit-flipping in RAM...
On 31/03/2010 10:27, Erik Trimble wrote: Orvar's post over in opensol-discuss has me thinking: after reading the paper and looking at the design docs, I'm wondering if there is some facility for comparing data in the ARC to its corresponding checksum. That is, if I've got the data I want in the ARC, how can I be sure it's correct (and free of hardware memory errors)? I'd assume the way is to also store absolutely all the checksums for all blocks/metadata being read/written in the ARC (which, of course, means that only so much RAM corruption can be compensated for), and do a validation every time that block is used/written from the ARC. You'd likely have to do constant metadata consistency checking, and likely have to hold multiple copies of metadata in-ARC to compensate for possible corruption. I'm assuming that this has at least been explored, right?

A subset of this is already done. The ARC keeps its own in-memory checksum (because some buffers in the ARC are not yet on stable storage and so don't have a block-pointer checksum yet). See http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c - arc_buf_freeze(), arc_buf_thaw(), arc_cksum_verify(), arc_cksum_compute(). It isn't done on every access, but it can detect in-memory corruption - I've seen it happen on several occasions, though all due to errors in my code rather than bad physical memory. Doing it more frequently could cause a significant performance problem.

Or there might be an extra zpool-level (or system-wide) property to enable checking checksums on every access from the ARC - there would be a significant performance impact, but it might be acceptable for really paranoid folks, especially with modern hardware.

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 31/03/2010 17:31, Bob Friesenhahn wrote: On Wed, 31 Mar 2010, Edward Ned Harvey wrote: Would your users be concerned if there was a possibility that, after extracting a 50 MB tarball, files are incomplete, whole subdirectories are missing, or file permissions are incorrect?

Correction: would your users be concerned if there was a possibility that after extracting a 50 MB tarball *and having a server crash* the files could be corrupted as described above? If you disable the ZIL, the filesystem still stays correct in RAM, and the only way you lose any data such as you've described is to have an ungraceful power-down or reboot.

Yes, of course. Suppose that you are a system administrator. The server spontaneously reboots. A corporate VP (CFO) comes to you and says that he had just saved the critical presentation to be given to the board of the company (and all shareholders) later that day, and now it is gone due to your spontaneous server reboot. Due to a delayed financial statement, the corporate stock plummets. What are you to do? Do you expect that your employment will continue? Reliable NFS synchronous writes are good for system administrators.

Well, it really depends on your environment. There is a place for an Oracle database and there is a place for MySQL; you don't really need to cluster everything, and there are environments where disabling the ZIL is perfectly acceptable. One such case is when you need to re-import a database or recover lots of files over NFS - your service is down anyway, and disabling the ZIL makes the recovery MUCH faster. Then there are cases where leaving the ZIL disabled is acceptable as well.

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
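For completeness, the OS-wide switch under discussion in this era was the zil_disable tunable; shown here only as a hedged sketch of the bulk-restore trick, not as a recommended permanent setting:

    # /etc/system -- affects every pool and dataset on the host and takes
    # effect at the next boot; remove it again once the restore is done
    set zfs:zil_disable = 1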
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 31/03/2010 17:22, Edward Ned Harvey wrote: The advice I would give is: do zfs autosnapshots frequently (say ... every 5 minutes, keeping the most recent 2 hours of snaps) and then run with no ZIL. If you have an ungraceful shutdown or reboot, roll back to the latest snapshot ... and roll back once more for good measure. As long as you can afford to risk 5-10 minutes of the most recent work after a crash, then you can get a 10x performance boost most of the time, and no risk of the aforementioned data corruption.

I don't really get it - rolling back to the last snapshot doesn't improve anything here; it actually makes things worse, as now you are going to lose even more data. Keep in mind that currently the maximum time after which ZFS commits a transaction group is 30s - ZIL or not. So with the ZIL disabled, in the worst case you should lose no more than the last 30-60s. You can tune that down if you want. Rolling back to a snapshot will only make it worse.

Also keep in mind that this is the worst-case scenario - it may well be that there were no outstanding transactions at all. Basically it all comes down to a risk assessment, an impact assessment and a cost. Unless you are talking about taking regular snapshots and making sure the application is consistent while doing so - for example, putting all Oracle tablespaces into hot backup mode and taking a snapshot - otherwise it doesn't really make sense.

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
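A sketch of the application-consistent variant alluded to at the end, plus the transaction-group tunable mentioned above. The dataset name, snapshot name and tunable value are illustrative only, and the Oracle statements are assumptions about a typical hot-backup workflow rather than anything prescribed in this thread:

    # put the database into hot backup mode first (from sqlplus):
    #   ALTER DATABASE BEGIN BACKUP;
    zfs snapshot tank/oradata@pre-change
    #   ALTER DATABASE END BACKUP;

    # later, if you need to fall back to that point in time:
    zfs rollback tank/oradata@pre-change

    # the transaction-group commit interval (seconds) can be lowered in
    # /etc/system if you want to shrink the exposure window:
    #   set zfs:zfs_txg_timeout = 5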