Re: [zfs-discuss] best way to configure raidz groups
Rather than hacking something like that, he could use a Disk on Module (http://en.wikipedia.org/wiki/Disk_on_module) or something like http://www.tomshardware.com/news/nanoSSD-Drive-Elecom-Japan-SATA,8538.html (which I suspect may be a DOM but I've not poked around sufficiently to see). Paul -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
Let me sum up my thoughts on this topic. To Richard [relling]: I agree with you that this topic gets even more confusing if we are not careful to specify exactly what we are talking about. Thin provisioning can be done at multiple layers, and though you said you like it to be closer to the app than closer to the dumb disks (if you were referring to SAN), my opinion is that each and every scenario has its own pros/cons. I learned a long time ago not to declare a technology good/bad; there are technologies which are used properly (usually declared as good tech) and others which are not (usually declared as bad). -- Let me clarify my case, and why I mentioned thin devices on SAN specifically. Many people replied with the thin device support of ZFS (which is called sparse volumes, if I'm correct), but what I was talking about is something else. It's thin device awareness on the SAN. In this case you configure your LUN in the SAN as a thin device, a virtual LUN (or LUNs) backed by a pool of physical disks in the SAN. From the OS it's transparent, as it is from the volume manager/filesystem point of view. That is the basic definition of my scenario with thin devices on SAN. High-end SAN frames like the HDS USP-V (a feature called Hitachi Dynamic Provisioning) and the EMC Symmetrix V-Max (a feature called Virtual Provisioning) support this (and I'm sure many others as well). Once you've discovered the LUN in the OS, you start to use it: put it under a volume manager, create a filesystem, copy files. But the SAN only allocates physical blocks (more precisely, groups of blocks called extents) as you write them, which means you'll use only as much physical disk (or a bit more, rounded up to the next extent) as you actually use. From this standpoint we can define two terms, thin-friendly and thin-hostile environments.
Thin-friendly would be any environment where the OS/VM/FS doesn't write to blocks it doesn't really use (for example, during initialization it doesn't fill up the LUN with a pattern or 0s). That's why Veritas' SmartMove is a nice feature: when you move from fat to thin devices (from the OS both LUNs look exactly the same), it will copy only the blocks which are used by the VxFS files. That is still just the basics of having thin devices on SAN and hoping for a thin-friendly environment. The next level is the management of the thin devices and of the physical pool from which the thin devices allocate their extents. Even if you get migrated to thin device LUNs, your thin devices will become fat again: once you have filled up your filesystem, the thin device on the SAN will remain fat, as no space reclamation happens by default. The reason is pretty simple: the SAN storage has no knowledge of the filesystem structure, so it can't decide whether a block should be released back to the pool or is really still in use. Then came Veritas with the brilliant idea of building a bridge between the FS and the SAN frame (this became the Thin Reclamation API), so they can communicate which blocks are indeed not in use. I really would like you to read this Quick Note from Veritas about this feature; it explains the concept far better than I can: http://ftp.support.veritas.com/pub/support/products/Foundation_Suite/338546.pdf Btw, with this concept VxVM can even detect (via ASL) whether a LUN is thin device/thin reclamation capable or not. Honestly, I have mixed feelings about ZFS. I feel that this is obviously the future's VM/filesystem, but at the same time I realize that the roles of the individual parts in the big picture are getting mixed up. Am I the only one with the impression that ZFS sooner or later will evolve into a SAN OS, and the zfs and zpool commands will become merely lightweight interfaces to control the SAN frame?
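The allocate-on-write behavior and the "thin devices become fat" problem described above can be sketched in a toy model. This is purely illustrative (the class and names are invented, not any vendor's implementation), but it shows why deletes alone never shrink a thin LUN without a reclamation API:

```python
# Toy model of a thin-provisioned LUN: physical extents are allocated
# from a shared pool only when a block inside them is first written.
EXTENT_SIZE = 8  # blocks per extent (real arrays use much larger extents)

class ThinLUN:
    def __init__(self, size_blocks):
        self.size_blocks = size_blocks      # capacity the OS sees
        self.extents = {}                   # extent index -> backing store

    def write(self, block, data):
        ext = block // EXTENT_SIZE
        # Physical space is consumed only on first touch of an extent.
        store = self.extents.setdefault(ext, [None] * EXTENT_SIZE)
        store[block % EXTENT_SIZE] = data

    def delete(self, block):
        # A plain filesystem delete is invisible to the array: the FS only
        # updates its own metadata, so the extent stays allocated ("fat").
        pass

    def physical_blocks_used(self):
        return len(self.extents) * EXTENT_SIZE

lun = ThinLUN(size_blocks=1000)       # the OS sees a 1000-block LUN
for b in range(16):
    lun.write(b, "x")                 # touch just the first two extents
print(lun.physical_blocks_used())     # 16, not 1000
for b in range(16):
    lun.delete(b)
print(lun.physical_blocks_used())     # still 16: no reclamation without an API
```

The Thin Reclamation API closes exactly this gap: it gives the filesystem a channel to tell the array which extents may be returned to the pool.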
:-) (like Solution Enabler for EMC) If you ask me, the pool concept always works more efficiently if (1) you have more capacity in the pool and (2) you have more systems sharing the pool; that's why I see the thin device pool as more rational in a SAN frame. Anyway, I'm sorry if you were already aware of what I explained above, and I also hope I didn't offend anyone with my views. Regards, sendai
Re: [zfs-discuss] Thin device support in ZFS?
On 31 dec 2009, at 06.01, Richard Elling wrote: On Dec 30, 2009, at 2:24 PM, Ragnar Sundblad wrote: On 30 dec 2009, at 22.45, Richard Elling wrote: On Dec 30, 2009, at 12:25 PM, Andras Spitzer wrote: Richard, That's an interesting question, whether it's worth it or not. I guess the question is always who the targets are for ZFS (I assume everyone, though in reality priorities have to be set, as developer resources are limited). For a home office, no doubt thin provisioning is not much of a use; for an enterprise company the numbers might really make a difference if we look at space used vs. space allocated. There are some studies showing that thin provisioning can reduce physical space used by up to 30%, which is huge. (Even though I understand studies are not real life, and thin provisioning is not viable in every environment.) Btw, I would like to discuss scenarios where, though we have an over-subscribed pool in the SAN (meaning the overall space allocated to the systems is more than the physical space in the pool), with proper monitoring and proactive physical drive adds we won't let any systems/applications attached to the SAN realize that we have thin devices. Actually, that's why I believe configuring thin devices without periodically reclaiming space is just a timebomb; if you do have the option to periodically reclaim space, you can maintain the pool in the SAN in a really efficient way. That's why I consider Veritas' Thin Reclamation API a milestone in the thin device field. Anyway, only the future can tell whether thin provisioning will or won't be a major feature in the storage world, though as I saw Veritas has already added this feature, I was wondering if ZFS has it at least on its roadmap. Thin provisioning is absolutely, positively a wonderful, good thing! The question is, how does the industry handle the multitude of thin provisioning models, each layered on top of another? For example, here at the ranch I use VMware and Xen, which thinly provision virtual disks.
I do this over iSCSI to a server running ZFS, which thinly provisions the iSCSI target. If I had a virtual RAID array, I would probably use that, too. Personally, I think being thinner closer to the application wins over being thinner closer to dumb storage devices (disk drives). I don't get it - why do we need anything more magic (or complicated) than support for TRIM from the filesystems and the storage systems? TRIM is just one part of the problem (or solution, depending on your point of view). The TRIM command is part of the ATA (T13) command set (the SCSI analogue is UNMAP, from T10) and allows a host to tell a block device that the data in a set of blocks is no longer of any value, so the block device can destroy the data without adverse consequence. In a world with copy-on-write and without snapshots, it is obvious that there will be a lot of blocks running around that are no longer in use. Snapshots (and their clones) change that use case. So in a world of snapshots, there will be fewer blocks which are not used. Remember, the TRIM command is very important to OSes like Windows or OS X, which do not have file systems that are copy-on-write or have decent snapshots. OTOH, ZFS does copy-on-write, and lots of ZFS folks use snapshots. I don't believe that there is such a big difference between those cases. Sure, snapshots may keep more data on disk, but only as much as the user chooses to keep. There have been other ways to keep old data on disk before (RCS, Solaris patch backout blurbs, logs, caches, what have you), so there is not really a brand new world there. (BTW, once upon a time, real operating systems had (optional) file versioning built into the operating system or file system itself.) If there were a mechanism that always tended to keep all of the disk full, that would be another case.
Snapshots may do that with the autosnapshot and warn-and-clean-when-getting-full features of OpenSolaris, but servers especially will probably not be managed that way; they will probably have a much more controlled snapshot policy. (Especially if you want to save every possible bit of disk space, as those guys with the big, fantastic and ridiculously expensive storage systems always want to do - maybe that will change in the future though.) That said, adding TRIM support is not hard in ZFS. But it depends on lower-level drivers to pass the TRIM commands down the stack. These ducks are lining up now. Good. I don't see why TRIM would be hard to implement for ZFS either, except that you may want to keep data from a few txgs back just for safety, which would probably call for some two-stage freeing of data blocks (those free blocks that are to be TRIMmed, and those that already are). Once a block is freed in ZFS, ZFS no longer needs it. So the problem of TRIM in ZFS is not related to the recent txg commit history. It may be that you want to save a few txgs back, so if you get a failure where
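The "two-stage freeing" idea mentioned above can be sketched as a toy deferred-free queue. This is a sketch of the concept only (class names and the 3-txg policy are invented for illustration, not ZFS code): blocks freed in a txg are not handed to the device for TRIM until a few txgs have passed, so a rollback to a recent txg still finds their contents intact.

```python
# Sketch of deferring TRIM of freed blocks for a few txgs, so that
# recovery to a recent txg still finds old block contents intact.
from collections import deque

TXG_DEFER = 3  # assumed policy: keep the last 3 txgs' frees un-TRIMmed

class DeferredFree:
    def __init__(self):
        self.pending = deque()   # (txg, blocks) not yet safe to TRIM
        self.trimmed = []        # blocks actually handed to the device

    def free(self, txg, blocks):
        # Stage 1: the block is free for ZFS, but not yet TRIMmed.
        self.pending.append((txg, list(blocks)))

    def sync(self, current_txg):
        # Stage 2: TRIM only blocks freed more than TXG_DEFER txgs ago.
        while self.pending and self.pending[0][0] <= current_txg - TXG_DEFER:
            _, blocks = self.pending.popleft()
            self.trimmed.extend(blocks)   # a real system issues TRIM/UNMAP here

df = DeferredFree()
df.free(txg=100, blocks=[10, 11])
df.free(txg=101, blocks=[42])
df.sync(current_txg=102)
print(df.trimmed)        # [] - too recent, still recoverable
df.sync(current_txg=103)
print(df.trimmed)        # [10, 11] - txg 100 is now old enough
```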
Re: [zfs-discuss] Thin device support in ZFS?
On 31 dec 2009, at 00.31, Bob Friesenhahn wrote: On Wed, 30 Dec 2009, Mike Gerdts wrote: Should the block size be a tunable, to match the page size of SSDs (typically 4K, right?) and upcoming hard disks that sport a sector size larger than 512 bytes? Enterprise SSDs are still in their infancy. The actual page size of an SSD could be almost anything. Due to the lack of seek time concerns and the high cost of erasing a page, an SSD could be designed with a level of indirection so that multiple logical writes to disjoint offsets could be combined into a single SSD physical page. Likewise, a large logical block could be subdivided into multiple SSD pages, which are allocated on demand. Logic is cheap and SSDs are full of logic, so it seems reasonable that future SSDs will do this, if they don't already, since similar logic enables wear-leveling. I believe that almost all flash devices are already doing this, and only the first-generation SD cards or something like that are not doing it, leaving it to the host. But I could be wrong, of course. /ragge s
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
Thanks, sounds like it should handle all but the worst faults OK then; I believe the maximum retry timeout is typically set to about 60 seconds in consumer drives. Are you sure about this? I thought these consumer-level drives would retry indefinitely to carry out the operation. Even Samsung's white paper on CCTL RAID error recovery says it could take a minute or longer (see the Desktop Unsuccessful Error Recovery diagram): http://www.samsung.com/global/business/hdd/learningresource/whitepapers/LearningResource_CCTL.html
Re: [zfs-discuss] Changing ZFS drive pathing
Mike wrote: Just thought I would let you all know that I followed what Alex suggested along with what many of you pointed out and it worked! Here are the steps I followed:
1. Break the root drive mirror
2. zpool export the filesystem
3. Run the command to start MPxIO and reboot the machine
4. zpool import the filesystem
5. Check the system
6. Recreate the mirror
Thank you all for the help! I feel much better and it worked without a single problem! I'm very impressed with MPxIO and wish I had known about it before spending thousands of dollars on PowerPath. As somebody who's done a bunch of work on stmsboot [a], and who has at least a passing knowledge of devids [b] (which are what ZFS and MPxIO use to identify devices), I am disappointed that you believed it was necessary to follow the above steps. Assuming that your devices do not have devids which change, all that should have been required was: [setup your root mirror] # /usr/sbin/stmsboot -e [reboot when prompted] [twiddle thumbs] [ login ] No ZFS export and import required. No breaking and recreating of the mirror required. [a] http://blogs.sun.com/jmcp/entry/on_stmsboot_1m [b] http://www.jmcp.homeunix.com/~jmcp/WhatIsAGuid.pdf James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Re: [zfs-discuss] best way to configure raidz groups
For the OS, I'd drop the adapter/compact-flash combo and use the stripped-down Kingston version of the Intel X25-M MLC SSD. If you're not familiar with it, the basic scoop is that this drive contains half the flash memory (40GB) *and* half the controller channels (5 versus 10) of the Intel drive - and so performance is basically a little less than half, although read performance is still very good. For more info, google for hardware reviews. This product is still a little hard to find; froogle for the following part numbers: Desktop Bundle - SNV125-S2BD/40GB Bare drive - SNV125-S2/40GB Currently you can find the bare drive for under $100. This is bound to give you better performance and guaranteed compatibility compared to adapters and compact flash. The problem with adapters is that, although the price is great, compatibility and build quality are all over the map, and YMMV considerably. You would not be happy if you saved $20 on the adapter/flash combo and ended up with nightmare reliability. The great thing about 2.5" SSDs is that mounting is simply a question of duct tape or velcro! [ well, almost ... but you can velcro them onto the sidewall of your case ] So you can use all your available 3.5" disk drive bays for ZFS disks. I was able to find some of the 64GB SNV125-S2 drives for a decent price. Do these also work well for L2ARC? This brings up more questions, actually. I know it's not recommended to use partitions for ZFS, but does this still apply for SSDs and the root pool? I was thinking about using maybe half of the SSD for the root pool and putting the ZIL on the other half. Or would I just be better off leaving the ZIL on the raidz drives?
[zfs-discuss] Help on Mailing List
Hello there, is there any possibility to retrieve all the old mailings from the list? I would like to search those for know-how, so that I don't double-post too often :-) Thanks, Florian
Re: [zfs-discuss] Help on Mailing List
http://mail.opensolaris.org/pipermail/zfs-discuss/ Henrik http://sparcv9.blogspot.com
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
Yeah, still no joy on getting my pool back. I think I might have to try grabbing another server with a lot more memory and slapping the HBA and the drives into that. Can ZFS deal with a controller change? Just some more info that 'may' help: after I upgraded to 8GB of RAM, I did not limit the amount of RAM ZFS can take. So if you are doing any kind of limiting in /etc/system, you may want to take that out.
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
On Thu, Dec 31 at 2:14, Willy wrote: Thanks, sounds like it should handle all but the worst faults OK then; I believe the maximum retry timeout is typically set to about 60 seconds in consumer drives. Are you sure about this? I thought these consumer-level drives would retry indefinitely to carry out the operation. Even Samsung's white paper on CCTL RAID error recovery says it could take a minute or longer (see the Desktop Unsuccessful Error Recovery diagram): http://www.samsung.com/global/business/hdd/learningresource/whitepapers/LearningResource_CCTL.html It depends very much on the firmware and the error type. Each vendor will have their own trade-secret approaches to solving this issue based on their own failure rates and expected usages. --eric -- Eric D. Mudama edmud...@mail.bounceswoosh.org
Re: [zfs-discuss] Thin device support in ZFS?
On Thu, 31 Dec 2009, Ragnar Sundblad wrote: Also, currently, when SSDs for some very strange reason are constructed from flash chips designed for firmware and slowly changing configuration data, and can only erase in very large chunks, TRIMming is good for the housekeeping in the SSD drive. A typical use case for this would be a laptop. I have heard quite a few times that TRIM is good for SSD drives, but I don't see much actual use for it. Every responsible SSD drive maintains a reserve of unused space (20-50%), since it is needed for wear leveling and to repair failing spots. This means that even when an SSD is 100% full, it still has considerable space remaining. A very simple SSD design solution is that when an SSD block is overwritten, it is replaced with an already-erased block from the free pool, and the old block is submitted to the free pool for eventual erasure and re-use. This approach avoids adding erase times to the write latency as long as the device can erase as fast as the average data write rate. There are of course SSDs with hardly any (or no) reserve space, but while we might be willing to sacrifice an image or two to SSD block failure in our digital camera, that is just not acceptable for serious computer use. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
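Bob's "very simple SSD design" above can be sketched as a toy flash translation layer. This is only an illustration of the remap-on-overwrite idea (all names invented, no real controller works exactly like this): an overwrite grabs a pre-erased block, and the stale block is erased in the background, off the write path.

```python
# Toy flash translation layer: an overwrite is redirected to a
# pre-erased block from the free pool, and the old block is queued
# for background erase, keeping erase latency off the write path.
class ToyFTL:
    def __init__(self, total_blocks):
        self.mapping = {}                        # logical -> physical block
        self.erased = list(range(total_blocks))  # pre-erased free pool
        self.dirty = []                          # stale blocks awaiting erase

    def write(self, logical):
        # (payload omitted in this sketch; only block bookkeeping is shown)
        old = self.mapping.get(logical)
        self.mapping[logical] = self.erased.pop()  # use an erased block now
        if old is not None:
            self.dirty.append(old)                 # erase the old one later

    def background_erase(self):
        self.erased.extend(self.dirty)  # pretend the slow erases completed
        self.dirty.clear()

ftl = ToyFTL(total_blocks=8)
ftl.write(0)
ftl.write(0)               # overwrite: remapped, no in-line erase
print(len(ftl.dirty))      # 1 stale block queued for erase
ftl.background_erase()
print(len(ftl.dirty))      # 0
```

TRIM fits into this picture by adding blocks to `dirty` when the filesystem frees them, enlarging the pool the controller has to play with.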
Re: [zfs-discuss] Thin device support in ZFS?
Just an update: I finally found some technical details about this Thin Reclamation API (http://blogs.hds.com/claus/2009/12/i-love-it-when-a-plan-comes-together.html): This week (December 7th), Symantec announced their “completing the thin provisioning ecosystem”, which includes the necessary API calls for the file system to “notify” the storage array when space is “deleted”. The interface is a previously disused and now revised/reused/repurposed SCSI command (called Write Same), which was jointly worked out by Symantec, Hitachi, and 3PAR. This command allows the file system (in this case Veritas VxFS) to notify the storage system that space is no longer occupied. How cool is that! There is also a subcommittee of INCITS T10 studying the standardization of this, and SNIA is studying it as well. It won't be long before most file system, database, and storage vendors adopt this technology. So it's based on the SCSI Write Same/UNMAP command (and if I understand correctly, SATA TRIM is similar to this from the FS point of view), a standard which is not yet ratified. Also, happy new year to everyone! Regards, sendai
Re: [zfs-discuss] what happens to the deduptable (DDT) when you set dedup=off ???
On 30/12/2009 22:57, ono wrote: Will I be able to see which files were affected by dedup, or can I do a zfs send/receive to another filesystem to clean it up? send|recv will be enough.
Re: [zfs-discuss] ZFS extremely slow performance
On Dec 31, 2009, at 2:49 AM, Robert Milkowski wrote: Judging by a *very* quick glance, it looks like you have an issue with the c3t0d0 device, which is responding very slowly. Yes, there is an I/O stuck on the device which is not getting serviced. See below... -- Robert Milkowski http://milek.blogspot.com On 31/12/2009 09:10, Emily Grettel wrote: Hi, I'm using OpenSolaris 127 from my previous posts to address CIFS problems. I have a few zpools, but lately (with an uptime of 32 days) we've started to get CIFS issues and really bad I/O performance. I've been running scrubs on a nightly basis. I'm not sure why it's happening either - I'm new to OpenSolaris. I ran fsstat whilst trying to unrar an 8.4GB file with an ISO inside it:

fsstat zfs 1
  new  name   name  attr  attr lookup rddir  read  read  write write
 file remov   chng   get   set    ops   ops   ops bytes    ops bytes
3.29K   466    367  633K 1.50K  1.66M 8.66K  964K 15.6G   314K 9.38G zfs
    0     0      0     4     0      8     0   135 5.13M     93 5.02M zfs
    0     0      0     7     0     18     0   205 5.63M    137 5.64M zfs
    0     0      0     4     0      8     0    90 3.92K     49 14.6K zfs
    0     0      0     4     0      8     0   115 16.4K     65 27.5K zfs
    0     0      0     8     0     13     0   153 8.36M    113 8.38M zfs
    0     0      0     4     0      8     0    94 3.96K     53 19.1K zfs
    0     0      0     7     0     18     0    80   800     42 1.13K zfs
    0     0      0     4     0      8     0    90 3.92K     48 7.62K zfs
    0     0      0     4     0      8     0    99  132K     53 7.14K zfs
    0     0      0     4     0      8     0   188 5.99K     96 5.62K zfs
    0     0      0     4     0      8     0    95  664K     52  420K zfs
    0     0      0     9     0     22     0   164 7.97K     92 12.2K zfs
  new  name   name  attr  attr lookup rddir  read  read  write write
 file remov   chng   get   set    ops   ops   ops bytes    ops bytes
    0     0      0     4     0      8     0   111 2.63M     70 2.63M zfs
    0     0      0     4     0      8     0   262 6.63M    153 6.63M zfs
    0     0      0     4     0      8     0    80   800     44 1.70K zfs
    0     0      0     4     0      8     0   337 18.1M    247 18.1M zfs
    0     0      0     7     0     18     0   127 5.75M     89 5.63M zfs
    0     0      0     4     0      8     0    80   800     50 25.6K zfs

My iostat appears below this message (it's quite long! to give you an idea). I'm really not sure why the performance has dropped all of a sudden, or how to diagnose it. CIFS shares occasionally drop out too. It's a bit of a downer to be experiencing on the 31st of December. I hope everyone has a Safe Happy New Year :-) I'm unable to upgrade to the latest release because of an issue with python:

pfexec pkg image-update
Creating Plan /
pkg: Cannot remove 'pkg://opensolaris.org/sunwipkg-gui-l...@0.5.11,5.11-0.127:2009T075414Z' due to the following packages that depend on it: pkg://opensolaris.org/SUNWipkg-g...@0.5.11,5.11-0.127:2009T075333Z

So I'm stuck on 127 until I can rebuild this machine :( Cheers, Em

                 extended device statistics
  r/s   w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  0.0   1.0    0.0    0.5  2.8  1.0 2815.9 1000.0 100 100 c7t3d0
                 extended device statistics
  r/s   w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  0.0   0.0    0.0    0.0  2.0  1.0    0.0    0.0 100 100 c7t3d0
                 extended device statistics
  r/s   w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  0.0   6.0    0.0   81.5  1.2  1.0  198.6  166.6  60 100 c7t3d0
                 extended device statistics
  r/s   w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  0.0   0.0    0.0    0.0  0.0  1.0    0.0    0.0   0 100 c7t3d0
                 extended device statistics
  r/s   w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  0.0   6.0    0.0    4.0  0.0  0.0    0.0    0.2   0   0 c7t1d0
  0.0   6.0    0.0    4.0  0.0  0.0    0.0    0.2   0   0 c7t2d0
  0.0   9.0    0.0   31.5  0.0  0.4    0.0   41.7   0  38 c7t3d0
  0.0   6.0    0.0    4.0  0.0  0.0    0.0    0.3   0   0 c7t4d0
  0.0   6.0    0.0    4.0  0.0  0.0    0.0    0.2   0   0 c7t5d0
  0.0   6.0    0.0    4.0  0.0  0.0    0.0    0.1   0   0 c0t1d0
                 extended device statistics
  r/s   w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
                 extended device statistics
  r/s   w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 56.0 122.0 5972.3 8592.2  0.0  0.5    0.0    2.9   0  25 c7t1d0
 55.0 136.0 5998.3 8590.2  0.0  0.7    0.0    3.8   0  29 c7t2d0
  0.0 111.0    0.0 4342.9  0.0  2.2    0.0   20.2   0  57 c7t3d0
103.0 153.0 5868.3 8590.7  0.0  0.4    0.0    1.7   0  21 c7t4d0
 96.0 130.0 5946.8 8591.2  0.0  0.7    0.0    3.2   0
Re: [zfs-discuss] ZFS extremely slow performance
On Thu, 31 Dec 2009, Emily Grettel wrote: I'm using OpenSolaris 127 from my previous posts to address CIFS problems. I have a few zpools, but lately (with an uptime of 32 days) we've started to get CIFS issues and really bad I/O performance. I've been running scrubs on a nightly basis. I'm not sure why it's happening either - I'm new to OpenSolaris. Without knowing anything about your pool, your c7t3d0 device seems possibly suspect. Notice that it often posts a very high asvc_t. What is the output from 'zpool status' for this pool? Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Thin device support in ZFS?
On 31 dec 2009, at 17.18, Bob Friesenhahn wrote: On Thu, 31 Dec 2009, Ragnar Sundblad wrote: Also, currently, when SSDs for some very strange reason are constructed from flash chips designed for firmware and slowly changing configuration data, and can only erase in very large chunks, TRIMming is good for the housekeeping in the SSD drive. A typical use case for this would be a laptop. I have heard quite a few times that TRIM is good for SSD drives, but I don't see much actual use for it. Every responsible SSD drive maintains a reserve of unused space (20-50%), since it is needed for wear leveling and to repair failing spots. This means that even when an SSD is 100% full, it still has considerable space remaining. (At least as long as those blocks aren't used up in place of bad/worn-out blocks...) A very simple SSD design solution is that when an SSD block is overwritten, it is replaced with an already-erased block from the free pool, and the old block is submitted to the free pool for eventual erasure and re-use. This approach avoids adding erase times to the write latency as long as the device can erase as fast as the average data write rate. This is what they do, as far as I have understood, but more free space to play with makes the job easier and therefore faster, and gives you a larger burst headroom before you hit the erase-speed limit of the disk. There are of course SSDs with hardly any (or no) reserve space, but while we might be willing to sacrifice an image or two to SSD block failure in our digital camera, that is just not acceptable for serious computer use. I think the idea is that with TRIM you can also use the file system's unused space for wear leveling and flash block filling. If your disk is completely full, there is of course no gain. /ragge s
Re: [zfs-discuss] best way to configure raidz groups
Thomas Burgess wrote: For the OS, I'd drop the adapter/compact-flash combo and use the stripped-down Kingston version of the Intel X25-M MLC SSD. If you're not familiar with it, the basic scoop is that this drive contains half the flash memory (40GB) *and* half the controller channels (5 versus 10) of the Intel drive - and so performance is basically a little less than half, although read performance is still very good. For more info, google for hardware reviews. This product is still a little hard to find; froogle for the following part numbers: Desktop Bundle - SNV125-S2BD/40GB Bare drive - SNV125-S2/40GB Currently you can find the bare drive for under $100. This is bound to give you better performance and guaranteed compatibility compared to adapters and compact flash. The problem with adapters is that, although the price is great, compatibility and build quality are all over the map, and YMMV considerably. You would not be happy if you saved $20 on the adapter/flash combo and ended up with nightmare reliability. The great thing about 2.5" SSDs is that mounting is simply a question of duct tape or velcro! [ well, almost ... but you can velcro them onto the sidewall of your case ] So you can use all your available 3.5" disk drive bays for ZFS disks. I was able to find some of the 64GB SNV125-S2 drives for a decent price. Do these also work well for L2ARC? This brings up more questions, actually. I know it's not recommended to use partitions for ZFS, but does this still apply for SSDs and the root pool? I was thinking about using maybe half of the SSD for the root pool and putting the ZIL on the other half. Or would I just be better off leaving the ZIL on the raidz drives? It's OK to use partitions on SSDs, so long as you realize that using an SSD for multiple purposes splits the bandwidth to the SSD across the multiple uses. In your case, using an SSD as both an L2ARC and a root pool device is reasonable, as the rpool traffic should not be heavy.
I would NOT recommend using an X25-M, or especially the SNV125-S2, as a ZIL device. Write performance isn't going to be very good at all - in fact, I think it would be not much different from using the bare drives. As an L2ARC cache device, however, it's a good choice. Oh, and there are plenty of bay adapters out there for cheap - use one. My favorite is a two-SSDs-in-one-floppy-drive-bay model like this: http://www.startech.com/item/HSB220SAT25B-35-Tray-Less-Dual-25-SATA-HD-Hot-Swap-Bay.aspx (I see them for under $40 at local stores.) 20GB for an rpool is sufficient, so the rest can go to L2ARC. I would disable any swap volume on the SSDs, however. If you need swap, put it somewhere else. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Thin device support in ZFS?
On Dec 31, 2009, at 1:43 AM, Andras Spitzer wrote: Let me sum up my thoughts on this topic. To Richard [relling]: I agree with you that this topic gets even more confusing if we are not careful to specify exactly what we are talking about. Thin provisioning can be done at multiple layers, and though you said you like it to be closer to the app than closer to the dumb disks (if you were referring to SAN), my opinion is that each and every scenario has its own pros/cons. I learned a long time ago not to declare a technology good/bad; there are technologies which are used properly (usually declared as good tech) and others which are not (usually declared as bad). I hear you. But you are trapped thinking about 20th century designs, and ZFS is a 21st century design. More below... Let me clarify my case, and why I mentioned thin devices on SAN specifically. Many people replied with the thin device support of ZFS (which is called sparse volumes, if I'm correct), but what I was talking about is something else. It's thin device awareness on the SAN. In this case you configure your LUN in the SAN as a thin device, a virtual LUN (or LUNs) backed by a pool of physical disks in the SAN. From the OS it's transparent, as it is from the volume manager/filesystem point of view. That is the basic definition of my scenario with thin devices on SAN. High-end SAN frames like the HDS USP-V (a feature called Hitachi Dynamic Provisioning) and the EMC Symmetrix V-Max (a feature called Virtual Provisioning) support this (and I'm sure many others as well). Once you've discovered the LUN in the OS, you start to use it: put it under a volume manager, create a filesystem, copy files. But the SAN only allocates physical blocks (more precisely, groups of blocks called extents) as you write them, which means you'll use only as much physical disk (or a bit more, rounded up to the next extent) as you actually use. From this standpoint we can define two terms, thin-friendly and thin-hostile environments.
Thin-friendly would be any environment where the OS/VM/FS doesn't write to blocks it doesn't really use (for example, during initialization it doesn't fill up the LUN with a pattern or 0s). That's why Veritas' SmartMove is a nice feature: when you move from fat to thin devices (from the OS both LUNs look exactly the same), it will copy only the blocks which are used by the VxFS files. ZFS does this by design. There is no way in ZFS to not do this. I suppose it could be touted as a feature :-) Maybe we should brand ZFS as THINbyDESIGN(TM). Or perhaps we can rebrand SMARTMOVE(TM) as TRYINGTOCATCHUPWITHZFS(TM) :-) That is still just the basics of having thin devices on SAN and hoping for a thin-friendly environment. The next level is the management of the thin devices and of the physical pool from which the thin devices allocate their extents. Even if you get migrated to thin device LUNs, your thin devices will become fat again: once you have filled up your filesystem, the thin device on the SAN will remain fat, as no space reclamation happens by default. The reason is pretty simple: the SAN storage has no knowledge of the filesystem structure, so it can't decide whether a block should be released back to the pool or is really still in use. Then came Veritas with the brilliant idea of building a bridge between the FS and the SAN frame (this became the Thin Reclamation API), so they can communicate which blocks are indeed not in use. I really would like you to read this Quick Note from Veritas about this feature; it explains the concept far better than I can: http://ftp.support.veritas.com/pub/support/products/Foundation_Suite/338546.pdf Btw, with this concept VxVM can even detect (via ASL) whether a LUN is thin device/thin reclamation capable or not. Correct. Since VxVM and VxFS are separate software, they have expanded the interface between them. Consider adding a mirror or replacing a drive.
Prior to SMARTMOVE, VxVM had no idea what part of the volume was data and what was unused. So VxVM would silver the mirror by copying all of the blocks from one side to the other. Clearly this is uncool when your SAN storage is virtualized. With SMARTMOVE, VxFS has a method to tell VxVM that portions of the volume are unused. Now when you silver the mirror, VxVM knows that some bits are unused and it won't bother to copy them. This is a bona fide good thing for virtualized SAN arrays. ZFS was designed with the knowledge that the limited interface between file systems and volume managers was a severe limitation that leads to all sorts of complexity and angst. So a different design is needed. ZFS has fully integrated RAID with the file system, so there is no need, by design, to create a new interface between these layers. In other words, the only way to silver a disk in ZFS is to silver the data. You can't silver unused space.
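[Editorial aside: the contrast described above can be made concrete with a toy sketch. This is hypothetical Python of my own, not VxVM or ZFS internals: a blind block-level silver copies every block of the volume, while a usage-aware one copies only blocks the filesystem reports as in use.]

```python
# Toy contrast between a blind resilver and a usage-aware one.
def silver_blind(volume_size):
    """Copy every block: the volume manager has no idea what is data."""
    return list(range(volume_size))

def silver_aware(volume_size, used_blocks):
    """Copy only blocks the filesystem marks as used (SMARTMOVE-style)."""
    return [b for b in range(volume_size) if b in used_blocks]

used = {0, 1, 2, 500, 501}            # filesystem-reported used blocks
print(len(silver_blind(1000)))        # -> 1000 blocks copied
print(len(silver_aware(1000, used)))  # -> 5 blocks copied
```

On a thin-provisioned array, the blind copy is worse than slow: every block it touches forces the array to allocate a physical extent, fattening the thin LUN.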
Re: [zfs-discuss] Thin device support in ZFS?
[I TRIMmed the thread a bit ;-)] On Dec 31, 2009, at 1:43 AM, Ragnar Sundblad wrote: On 31 dec 2009, at 06.01, Richard Elling wrote: In a world with copy-on-write and without snapshots, it is obvious that there will be a lot of blocks running around that are no longer in use. Snapshots (and their clones) change that use case. So in a world of snapshots, there will be fewer blocks which are not used. Remember, the TRIM command is very important to OSes like Windows or OSX which do not have file systems that are copy-on-write or have decent snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use snapshots. I don't believe that there is such a big difference between those cases. The reason you want TRIM for SSDs is to recover the write speed. A freshly cleaned page can be written faster than a dirty page. But in COW, you are writing to new pages and not rewriting old pages. This is fundamentally different from FAT, NTFS, or HFS+, but it is those markets which are driving TRIM adoption. [TRIMmed] Once a block is freed in ZFS, ZFS no longer needs it. So the problem of TRIM in ZFS is not related to the recent txg commit history. It may be that you want to save a few txgs back, so if you get a failure where parts of the last txg get lost, you will still be able to get an old (few seconds/minutes) version of your data back. This is already implemented. Blocks freed in the past few txgs are not returned to the freelist immediately. This was needed to enable uberblock recovery in b128. So TRIMming from the freelist is safe. This could happen if the sync commands aren't correctly implemented all the way (as we have seen some stories about on this list). Maybe someone disabled syncing somewhere to improve performance. It could also happen if a non-volatile caching device, such as a storage controller, breaks in some bad way. Or maybe you just had a bad/old battery/supercap in a device that implements NV storage with batteries/supercaps. 
The issue is that traversing the free block list has to be protected by locks, so that the file system does not allocate a block when it is also TRIMming the block. Not so difficult, as long as the TRIM occurs relatively quickly. I think that any TRIM implementation should be an administration command, like scrub. It probably doesn't make sense to have it running all of the time. But on occasion, it might make sense. I am not sure why it shouldn't run at all times, except for the fact that it seems to be badly implemented in some SATA devices with high latencies, so that it will interrupt any data streaming to/from the disks. I don't see how it would not have negative performance impacts. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
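[Editorial aside: the deferred-free behavior Richard describes (blocks freed in the last few txgs are held back from the freelist) can be sketched in a toy model. Hypothetical Python, mine, not the actual ZFS implementation; the point is that a TRIM working from the freelist never touches a block that might still be needed for txg rollback.]

```python
# Toy sketch of deferred frees: blocks freed in the most recent txgs are
# held in per-txg sets and only aged onto the freelist, so anything on the
# freelist is safe to reallocate *and* to TRIM.
from collections import deque

DEFER_TXGS = 3   # how many recent txgs of frees to hold back (illustrative)

class FreeList:
    def __init__(self):
        self.deferred = deque()    # one set of freed blocks per recent txg
        self.free = set()          # aged out: safe to reallocate or TRIM

    def free_blocks(self, blocks):
        """Called once per txg with that txg's freed blocks."""
        self.deferred.append(set(blocks))
        while len(self.deferred) > DEFER_TXGS:
            self.free |= self.deferred.popleft()   # old enough: release

    def trimmable(self):
        return set(self.free)

fl = FreeList()
for blocks in [[1, 2], [3], [4, 5], [6]]:   # four txgs of frees
    fl.free_blocks(blocks)
print(sorted(fl.trimmable()))               # -> [1, 2]
```

Only the frees from the oldest txg have aged out; the last three txgs' worth are still recoverable, matching the uberblock-recovery constraint described above.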
Re: [zfs-discuss] Thin device support in ZFS?
Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: I have heard quite a few times that TRIM is good for SSD drives, but I don't see much actual use for it. Every responsible SSD drive maintains a reserve of unused space (20-50%) since it is needed for wear leveling and to repair failing spots. This means that even when an SSD is 100% full it still has considerable space remaining. A very simple SSD design solution is that when an SSD block is overwritten it is replaced with an already-erased block from the free pool, and the old block is submitted to the free pool for eventual erasure and re-use. This approach avoids adding erase times to the write latency as long as the device can erase as fast as the average data write rate. The question in the case of SSDs is: ZFS is COW, but does the SSD know which block is in use and which is not? If the SSD did know whether a block is in use, it could erase unused blocks in advance. But what is an unused block on a filesystem that supports snapshots? From the perspective of the SSD, I see only the following difference between a COW filesystem and a conventional filesystem: a conventional filesystem may write more often to the same block number than a COW filesystem does. But even for the non-COW case, I would expect that the SSD frequently remaps overwritten blocks to previously erased spares. My conclusion is that ZFS on an SSD works fine in the case that the primary used blocks plus all active snapshots use less space than the official size minus the SSD's spare reserve. If you however fill up the medium, I expect a performance degradation. Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de (uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
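[Editorial aside: the overwrite path Bob describes can be sketched in a toy model. Hypothetical Python, mine, not real firmware: a logical-block overwrite is redirected to a pre-erased physical block, and the old physical block is queued for background erasure, so the erase latency stays off the write path.]

```python
# Toy SSD flash translation layer: overwrites never erase inline; they
# take a clean block from the spare pool and queue the old one for erase.
class SimpleSSD:
    def __init__(self, spare_blocks):
        self.erased = list(spare_blocks)   # pre-erased spare pool
        self.pending_erase = []            # dirty blocks awaiting erasure
        self.l2p = {}                      # logical -> physical mapping

    def write(self, lba):
        phys = self.erased.pop()           # clean block: fast write
        old = self.l2p.get(lba)
        if old is not None:                # overwrite: retire old copy
            self.pending_erase.append(old)
        self.l2p[lba] = phys

ssd = SimpleSSD(spare_blocks=range(100))
ssd.write(7)
ssd.write(7)                   # overwrite remaps; no inline erase happens
print(len(ssd.pending_erase))  # -> 1
```

The failure mode Bob predicts follows directly: once writes outpace background erasure, `erased` runs dry and the device must stall on erases, which is the performance degradation seen on a full SSD.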
Re: [zfs-discuss] Thin device support in ZFS?
Richard Elling richard.ell...@gmail.com wrote: The reason you want TRIM for SSDs is to recover the write speed. A freshly cleaned page can be written faster than a dirty page. But in COW, you are writing to new pages and not rewriting old pages. This is fundamentally different from FAT, NTFS, or HFS+, but it is those markets which are driving TRIM adoption. Your mistake is to assume a maiden SSD and not to think about what's happening after the SSD has been in use for a while. Even in the COW case, blocks are reused after some time, and the disk has no way to know in advance which blocks are still in use and which blocks are no longer used and may be prepared for being overwritten. Jörg
Re: [zfs-discuss] Zpool creation best practices
mijoh...@gmail.com said: I've never had a lun go bad, but bad things do happen. Does anyone else use ZFS in this way? Is this an unrecommended setup? We used ZFS like this on a Hitachi array for 3 years. Worked fine; not one bad block/checksum error detected. Still using it on an old Sun 6120 array, too. It's too late to change my setup, but in the future when I'm planning new systems, should I consider the effort to let zfs fully control all the disks? Well, you should certainly consider all the alternatives you can afford. Our customers happen to like cheap bulk storage, so we have a Thumper and a few SAS-connected Sun J4000 SATA JBODs. But our grant-funded researchers may not be a typical customer mix. Regards, Marion
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
I'm in full overthink/overresearch mode on this issue, preparatory to ordering disks for my OS/zfs NAS build. So bear with me. I've been reading manuals and code, but it's hard for me to come up to speed on a new OS quickly. The question(s) underlying this thread seem to be: (1) Does zfs raidz/raidz2/etc have the same issue with long recovery times as RAID5? That being dropping a drive from the array because it experiences an error and a recovery that lasts longer than the controller (the zfs/OS/device-driver stack in this case) waits for an error message? and (2) Can non-raid-edition drives be set to have shorter error recovery for RAID use? On (1), I pick out the following answers: == From Miles Nordin: Does this happen in ZFS? No. Any timeouts in ZFS are annoyingly based on the ``desktop'' storage stack underneath it, which is unaware of redundancy and of the possibility of reading data from elsewhere in a redundant stripe rather than waiting 7, 30, or 180 seconds for it. ZFS will bang away on a slow drive for hours, bringing the whole system down with it, rather than read redundant data from elsewhere in the stripe, so you don't have to worry about drives dropping out randomly. Every last bit will be squeezed from the first place ZFS tried to read it, even if this takes years. == From Darren J Moffat: A combination of ZFS and FMA on OpenSolaris means it will recover. How long the timeouts actually are will depend on many factors - not just the hard drive and its firmware. == From Erik Trimble: The issue is excessive error recovery times INTERNAL to the hard drive. So, the worst-case scenario is that ZFS marks the drive as bad during a write, causing the zpool to be degraded. It's not going to lose your data. It just may cause a premature marking of a drive as bad. None of this kills a RAID (ZFS, traditional SW RAID, or HW RAID). It doesn't cause data corruption. The issue is sub-optimal disk fault determination. 
== From Richard Elling: For the Solaris sd(7d) driver, the default timeout is 60 seconds with 3 or 5 retries, depending on the hardware. Whether you notice this at the application level depends on other factors: reads vs writes, etc. You can tune this, of course, and you have access to the source. == From dubslick: Are you sure about this? I thought these consumer-level drives would try indefinitely to carry out their operation. Even Samsung's white paper on CCTL RAID error recovery says it could take a minute or longer. == From Bob Friesen: For a complete newbie, can someone simply answer the following: will using non-enterprise-level drives affect ZFS like it affects hardware RAID? Yes. == So from a group of knowledgeable people I get answers all the way from no problem, it'll just work, may take a while though to ...using non-enterprise raid drives will affect zfs just like it does hardware RAID, that being to unnecessarily drop out a disk, and thereby expose the array to failure from a second read/write fault on another disk. Most of the votes seem to be in the no problem range. But short of me trying to learn all the source code, is there any way to tell how it will really react? My issue is this: I *want* the attributes of consumer-level drives other than the infinite retries. I want slow spin speed for low vibration and low power consumption, and am willing to deal with the slower transfer/access speeds to get it. I can pay for (but resent being forced to!) raid-rated drives, but I don't like the extra power consumption needed to make them very fast in access and transfers. I'm fine with whipping in a new drive when one of the existing ones gets flaky. I find that I may be in the curious position of being forced to pay twice the price and expend twice the power to get drives that have many features I don't want or need and lack what I do need, except for the one issue which may (infrequently!) tear up whatever data I have built. ... maybe... 
On question (2), I believe my research has led to the following: drives which support the SMART Command Transport (SCT) spec, which includes many newer disks, appear to allow setting timeouts on read/write operations completing. However, this setting appears not to persist beyond a power cycle. Is there any good reason there can't be a driver added to the boot sequence that reads a file listing which drives need to be SCT-set to have timeouts which are shorter than infinite (one of the issues from above), and also short enough to meet the needs of returning errors in a timely manner, so that there is not a huge window for a second fault to corrupt a zfs array? Forgive me if I'm being too literal here. Think
Re: [zfs-discuss] Thin device support in ZFS?
On 31 dec 2009, at 19.26, Richard Elling wrote: [I TRIMmed the thread a bit ;-)] On Dec 31, 2009, at 1:43 AM, Ragnar Sundblad wrote: On 31 dec 2009, at 06.01, Richard Elling wrote: In a world with copy-on-write and without snapshots, it is obvious that there will be a lot of blocks running around that are no longer in use. Snapshots (and their clones) change that use case. So in a world of snapshots, there will be fewer blocks which are not used. Remember, the TRIM command is very important to OSes like Windows or OSX which do not have file systems that are copy-on-write or have decent snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use snapshots. I don't believe that there is such a big difference between those cases. The reason you want TRIM for SSDs is to recover the write speed. A freshly cleaned page can be written faster than a dirty page. But in COW, you are writing to new pages and not rewriting old pages. This is fundamentally different from FAT, NTFS, or HFS+, but it is those markets which are driving TRIM adoption. Flash SSDs actually always remap new writes in an append-only-to-new-pages style, pretty much as ZFS does itself. So for an SSD there is no big difference between ZFS and filesystems such as UFS, NTFS, HFS+ et al; at the flash level they all work the same. The reason is that there is no way for the SSD to rewrite single disk blocks; it can only fill up already-erased pages of 512K (for example). When the old blocks get mixed with unused blocks (because of block rewrites, TRIM, or Write Many/UNMAP), it needs to compact the data by copying all active blocks from those pages into previously erased pages, writing the active data out compacted/contiguous. (When this happens, things tend to get really slow.) So TRIM is just as applicable to ZFS as to any other file system on a flash SSD; there is no real difference. [TRIMmed] Once a block is freed in ZFS, ZFS no longer needs it. 
So the problem of TRIM in ZFS is not related to the recent txg commit history. It may be that you want to save a few txgs back, so if you get a failure where parts of the last txg get lost, you will still be able to get an old (few seconds/minutes) version of your data back. This is already implemented. Blocks freed in the past few txgs are not returned to the freelist immediately. This was needed to enable uberblock recovery in b128. So TRIMming from the freelist is safe. I see, very good! This could happen if the sync commands aren't correctly implemented all the way (as we have seen some stories about on this list). Maybe someone disabled syncing somewhere to improve performance. It could also happen if a non-volatile caching device, such as a storage controller, breaks in some bad way. Or maybe you just had a bad/old battery/supercap in a device that implements NV storage with batteries/supercaps. The issue is that traversing the free block list has to be protected by locks, so that the file system does not allocate a block while it is also TRIMming the block. Not so difficult, as long as the TRIM occurs relatively quickly. I think that any TRIM implementation should be an administration command, like scrub. It probably doesn't make sense to have it running all of the time. But on occasion, it might make sense. I am not sure why it shouldn't run at all times, except for the fact that it seems to be badly implemented in some SATA devices, with high latencies, so that it will interrupt any data streaming to/from the disks. I don't see how it would not have negative performance impacts. It will, I am sure! But *if* the user for one reason or another wants TRIM, it cannot be assumed that TRIMming big batches at certain times is any better than trimming small amounts all the time. 
Both behaviors may be useful, but I find it hard to see a really good use case where you want batch trimming, and easy to see cases where continuous trimming could be useful and, thanks to filesystem caching, hopefully hardly noticeable. /ragge s
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
On Thu, 31 Dec 2009, R.G. Keen wrote: I'm in full overthink/overresearch mode on this issue, preparatory to ordering disks for my OS/zfs NAS build. So bear with me. I've been reading manuals and code, but it's hard for me to come up to speed on a new OS quickly. The question(s) underlying this thread seem to be: (1) Does zfs raidz/raidz2/etc have the same issue with long recovery times as RAID5? That being dropping a drive from the array because it experiences an error and a recovery that lasts longer than the controller (the zfs/OS/device-driver stack in this case) waits for an error message? and (2) Can non-raid-edition drives be set to have shorter error recovery for RAID use? I like the nice and short answer from this Bob Friesen fellow the best. :-) I have heard that some vendors' drives can be re-flashed or set to use short timeouts. Some vendors don't like this, so they are trying to prohibit it, or doing so may invalidate the warranty. Unless things have changed (since a couple of years ago when I last looked), there are some vendors (e.g. Seagate) who offer enterprise SATA drives at only a small surcharge over astonishingly similar desktop SATA drives. The only actual difference seems to be the firmware which is loaded on the drive. Check out the Barracuda ES.2 series. It does not really matter what Solaris or ZFS does if the drive essentially locks up when it is trying to recover a bad sector. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Thin device support in ZFS?
On Dec 31, 2009, at 13:44, Joerg Schilling wrote: ZFS is COW, but does the SSD know which block is in use and which is not? If the SSD did know whether a block is in use, it could erase unused blocks in advance. But what is an unused block on a filesystem that supports snapshots? Personally, I think that at some point in the future there will need to be a command telling SSDs that the file system will take care of handling blocks, as new FS designs will be COW. ZFS is the first mainstream one to do it, but Btrfs is there as well, and it looks like Apple will be making its own FS. Just as the first 4096-byte-block disks are silently emulating 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the future there will be a setting to say no really, I'm talking about the /actual/ LBA 123456.
Re: [zfs-discuss] ZFS extremely slow performance
Hello! This could be a broken disk, or it could be some other hardware/software/firmware issue. Check the errors on the device with iostat -En Here's the output:

c7t1d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: WDC WD10EADS-00L Revision: 1A01 Serial No:
Size: 1000.20GB 1000204886016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 4 Predictive Failure Analysis: 0
c7t2d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: WDC WD10EADS-00P Revision: 0A01 Serial No:
Size: 1000.20GB 1000204886016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 4 Predictive Failure Analysis: 0
c7t3d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: WDC WD10EADS-00P Revision: 0A01 Serial No:
Size: 1000.20GB 1000204886016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 4 Predictive Failure Analysis: 0
c7t4d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: WDC WD10EADS-00P Revision: 0A01 Serial No:
Size: 1000.20GB 1000204886016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 4 Predictive Failure Analysis: 0
c7t5d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: WDC WD10EADS-00P Revision: 0A01 Serial No:
Size: 1000.20GB 1000204886016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 4 Predictive Failure Analysis: 0
c7t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: WDC WD740GD-00FL Revision: 8F33 Serial No:
Size: 74.36GB 74355769344 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 6 Predictive Failure Analysis: 0
c0t1d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: WDC WD10EADS-00P Revision: 0A01 Serial No:
Size: 1000.20GB 1000204886016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 5 Predictive Failure Analysis: 0
c3t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: WDC WD7500AAKS-0 Revision: 4G30 Serial No:
Size: 750.16GB 750156374016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 5 Predictive Failure Analysis: 0
c3t1d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: WDC WD7500AAKS-0 Revision: 4G30 Serial No:
Size: 750.16GB 750156374016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 5 Predictive Failure Analysis: 0

You should also check the fma logs: fmadm faulty Empty fmdump -eV This turned out to be huge. But they're mostly something like this:

Nov 13 2009 10:15:41.883716494 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0x7cfde552fd100401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xda1d003c03abad23
                vdev = 0x4389ee65271b9187
        (end detector)
        pool = tank
        pool_guid = 0xda1d003c03abad23
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x4389ee65271b9187
        vdev_type = replacing
        parent_guid = 0x79c2f2cf0b81ae5a
        parent_type = raidz
        zio_err = 0
        zio_offset = 0xae9b3fa00
        zio_size = 0x6600
        zio_objset = 0x24
        zio_object = 0x1b2
        zio_level = 0
        zio_blkid = 0x635
        __ttl = 0x1
        __tod = 0x4afc971d 0x34ac718e

Thanks for helping and telling me about those commands :-) The scrub I started last night is still running; it usually takes about 8 hours. Will post the results. - Em

From: richard.ell...@gmail.com To: mi...@task.gda.pl Date: Thu, 31 Dec 2009 08:37:03 -0800 CC: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] ZFS extremely slow performance

On Dec 31, 2009, at 2:49 AM, Robert Milkowski wrote: judging by a *very* quick glance it looks like you have an issue with the c3t0d0 device, which is responding very slowly. Yes, there is an I/O stuck on the device which is not getting serviced. See below... 
-- Robert Milkowski http://milek.blogspot.com
[zfs-discuss] raidz1 pool import failed with missing slog
In osol 2009.06 - the rpool vdev was dying, but I was able to do a clean export of the data pool. The data pool's ZIL (slog) was on a slice of the failed HDD, along with the slog's GUID. As a result I have 4 out of 4 healthy raidz data drives but cannot import the zpool to access the data. This is obviously a disaster for me. I found two discussions about similar issues: http://opensolaris.org/jive/thread.jspa?messageID=233666 and http://opensolaris.org/jive/thread.jspa?messageID=420073 But I don't think the recipes in these threads can help to import the inconsistent pool. Is there any way to ignore missing ZIL devices during the import? I expected that this would be the case, since http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide says: [b]Hybrid Storage Pools (or Pools with SSDs)[/b] ... If a separate log device is not mirrored and the device that contains the log fails, storing log blocks reverts to the storage pool. If not - can I somehow reassemble the pool using the [b]zpool import -D[/b] option, or do anything else to get my data back? Please help! -- This message posted from opensolaris.org
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
On Thu, 31 Dec 2009, Bob Friesenhahn wrote: I like the nice and short answer from this Bob Friesen fellow the best. :-) It was succinct, wasn't it? 8-) Sorry - I pulled the attribution from the ID, not the signature which was waiting below. DOH! When you say: It does not really matter what Solaris or ZFS does if the drive essentially locks up when it is trying to recover a bad sector. I'd have to say that it depends. If Solaris/zfs/etc. is restricted to actions which consist of marking the disk semi-permanently bad and continuing, yes, it amounts to the same thing: it opens a yawning chasm of one more error and you're dead, until the array can be serviced and un-degraded. At least I think it does, based on what I've read, anyway. However, if OS/S/zfs/etc. performs an appropriate fire drill up to and including logging the issues, quiescing the array, and annoying the operator then it closes up the sudden-death window. This gives the operator of the array a chance to do something about it, such as swapping in a spare and starting rebuilding/resilvering/etc. Given the largish aggregate monetary value to RAIDZ builders of sidestepping the doubled-cost of raid specialized drives, it occurs to me that having a special set of actions for desktop-ish drives might be a good idea. Something like a fix-the-failed repair mode which pulls all recoverable data off the purportedly failing drive and onto a new spare to avoid a monster resilvering and the associated vulnerable time to a second or third failure. Viewed in that light, exactly what OS/S/zfs does on a long extended reply from a disk and exactly what can be done to minimize the time when the array runs in a degraded mode where the next step loses the data seems to be a really important issue. Well, OK, it does to me because my purpose here is getting to background scrubbing of errors in the disks. Other things might be more important to others. 
8-) And the question might be moot if the SMART SCT architecture in desktop drives lets you do a power-on hack to shorten the reply-failed time for better RAID operation. That's actually the solution I'd like to see in a perfect world - I get back to a redundant array of INEXPENSIVE disks, and I can pick those disks to be big and slow/low-power instead of fast/high-power. I'd welcome any enlightened speculation on this. I do recognize that I'm an idiot on these matters compared to people with actual experience. 8-)
[zfs-discuss] $100 SSD = 5x faster dedupe
I've written about my slow-to-dedupe RAIDZ. After a week of waiting, I finally bought a little $100 30G OCZ Vertex and plugged it in as a cache. After 2 hours of warmup, my zfs send/receive rate on the pool is 16MB/sec (reading and writing each at 16MB/sec as measured by zpool iostat). That's up from 3MB/sec with a RAM-only cache on a 6GB machine. The SSD has about 8GB utilized right now, and the L2ARC benefit is amazing. Quite an amazing improvement for $100... I recommend you don't dedupe without one. mike
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
On Thu, 31 Dec 2009, R.G. Keen wrote: Given the largish aggregate monetary value to RAIDZ builders of sidestepping the doubled cost of raid-specialized drives, it occurs to me that having a special set of actions for desktop-ish drives might be a good idea. Something like a fix-the-failed repair mode which pulls all recoverable data off the purportedly failing drive and onto a new spare to avoid a monster resilvering and the associated vulnerable time to a second or third failure. The problem is that a desktop-ish drive may single-mindedly focus on reading the bad data while otherwise responding as if it is alive. So everything just waits a long time while the OS sends new requests to the drive (which are received) but the OS does not get the requested data back. To make matters worse, the OS might send another request for the same data, the drive gives up on the last request, and then proceeds with the new request for the same bad data. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] $100 SSD = 5x faster dedupe
Make that 25MB/sec, and rising... So it's 8x faster now. mike
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
On Dec 31, 2009, at 6:14 PM, R.G. Keen wrote: On Thu, 31 Dec 2009, Bob Friesenhahn wrote: I like the nice and short answer from this Bob Friesen fellow the best. :-) It was succinct, wasn't it? 8-) Sorry - I pulled the attribution from the ID, not the signature which was waiting below. DOH! When you say: It does not really matter what Solaris or ZFS does if the drive essentially locks up when it is trying to recover a bad sector. I'd have to say that it depends. If Solaris/zfs/etc. is restricted to actions which consist of marking the disk semi-permanently bad and continuing, yes, it amounts to the same thing: it opens a yawning chasm of one more error and you're dead, until the array can be serviced and un-degraded. At least I think it does, based on what I've read, anyway. Some nits: disks aren't marked as semi-bad, but if ZFS has trouble with a block, it will try to not use the block again. So there are two levels of recovery at work: whole device and block. The one more and you're dead is really N errors in T time. For disks which don't return when there is an error, you can reasonably expect that T will be a long time (multiples of 60 seconds) and therefore the N-in-T threshold will not be triggered. The term degraded does not have a consistent definition across the industry. See the zpool man page for the definition used for ZFS. In particular, DEGRADED != FAULTED However, if OS/S/zfs/etc. performs an appropriate fire drill up to and including logging the issues, quiescing the array, and annoying the operator, then it closes up the sudden-death window. This gives the operator of the array a chance to do something about it, such as swapping in a spare and starting rebuilding/resilvering/etc. Issues are logged, for sure. If you want to monitor them proactively, you need to configure SNMP traps for FMA. 
Given the largish aggregate monetary value to RAIDZ builders of sidestepping the doubled-cost of raid specialized drives, it occurs to me that having a special set of actions for desktop-ish drives might be a good idea. Something like a fix-the-failed repair mode which pulls all recoverable data off the purportedly failing drive and onto a new spare to avoid a monster resilvering and the associated vulnerable time to a second or third failure. It already does this, as long as there are N errors in T time. There is room for improvement here, but I'm not sure how one can set a rule that would explicitly take care of the I/O never returning from a disk while a different I/O to the same disk returns. More research required here... Viewed in that light, exactly what OS/S/zfs does on a long extended reply from a disk and exactly what can be done to minimize the time when the array runs in a degraded mode where the next step loses the data seems to be a really important issue. Once the state changes to DEGRADED, the admin must zpool clear the errors to return the state to normal. Make sure your definition of degraded matches. Well, OK, it does to me because my purpose here is getting to background scrubbing of errors in the disks. Other things might be more important to others. 8-) And the question might be moot if the SMART SCT architecture in desktop drives lets you do a power-on hack to shorten the reply-failed time for better raid operation. That's actually the solution I'd like to see in a perfect world - I get back to a redundant array of INEXPENSIVE disks, and I can pick those disks to be big and slow/low power instead of fast/high power. In my experience, disk drive firmware quality and feature sets vary widely. I've got a bunch of scars from shaky firmware and I even got a new one a few months ago. So perhaps one day the disk vendors will perfect their firmware? :-) I'd welcome any enlightened speculation on this. 
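[Editorial aside: the "N errors in T time" diagnosis rule Richard describes can be sketched in a toy model. Hypothetical Python, mine, not the actual FMA SERD engine: a device is declared faulty only when N errors land inside a sliding window of T seconds, which is exactly why a drive that stalls for minutes per error can evade the threshold.]

```python
# Toy "N errors in T time" fault discriminator.
from collections import deque

class ErrorWindow:
    def __init__(self, n, t_seconds):
        self.n, self.t = n, t_seconds
        self.times = deque()         # timestamps of recent errors

    def report(self, now):
        """Record an error at time `now`; return True if fault declared."""
        self.times.append(now)
        while self.times and now - self.times[0] > self.t:
            self.times.popleft()     # drop errors older than the window
        return len(self.times) >= self.n

# A drive that stalls so long per error that errors land far apart
# never accumulates N of them inside the window:
slow = ErrorWindow(n=3, t_seconds=600)
print(slow.report(0), slow.report(700), slow.report(1400))   # all False

# Three quick errors inside the window do trip the threshold:
fast = ErrorWindow(n=3, t_seconds=600)
print(fast.report(0), fast.report(10), fast.report(20))      # False False True
```

This makes Richard's point concrete: long internal drive recovery stretches the spacing between reported errors, so the N-in-T rule stays quiet even while the array crawls.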
I do recognize that I'm an idiot on these matters compared to people with actual experience. 8-) So you want some scars too? :-) -- richard