Re: [zfs-discuss] How to grow ZFS on growing pool?
* On 02 Feb 2010, Darren J Moffat wrote:
> zpool get autoexpand test

This seems to be a new property -- it's not in my Solaris 10 or OpenSolaris 2009.06 systems, and they have always expanded immediately upon replacement. In what build number or official release does autoexpand appear, and does it always default to off? This will be important to know for upgrades. Thanks.

--
-D. d...@uchicago.edu, NSIT, University of Chicago
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
* On 02 Feb 2010, Orvar Korvar wrote:
> Ok, I see that the chassis contains a motherboard. So never mind that
> question. Another q: Is it possible to have a large chassis with lots of
> drives, and the opensolaris in another chassis, how do you connect them
> both?

The J4500 and most other storage products being discussed are not servers: they are SATA concentrators with SAS uplinks. You plug in a bunch of cheap SATA disks, and you connect the chassis to a server with SAS. The logic board on the storage tray just converts the SAS signalling to SATA; it is not a computer in the usual sense. In many cases such products also have SAS expander ports, so that you can link multiple storage trays to a single SAS host bus adapter on your server by daisy-chaining them. So you need at least one SAS HBA in your OpenSolaris box, plus SAS cables to hook up the trays containing the SATA drives.

To the original question: you can purchase a J4x00 with a limited number of drives (empty is generally not an option), but there is no officially sanctioned way to obtain the drive adapters except to buy Sun disks. You need either a SAS or a SATA drive bracket to adapt the drive to the J4x00 backplane, but they are not sold separately: one ships with each drive. As mentioned, there are companies that sell remanufactured or discarded components, or machine their own substitutes; (re)marketing Sun or compatible drive brackets has always been a lively business for a few small outfits. But Sun has no involvement with this, and may be unwilling to support a frankenstein server.

Sun states that their OEM drives are of higher quality than off-the-shelf drives from manufacturers or retailers, and that they have custom firmware that improves their performance and reliability in Sun storage trays/arrays. I see no reason to disbelieve that, but it is quite a steep price to pay for that premium edge. When cost is a bigger concern than performance or reliability, I have generally bought the StorEdge product with the smallest drives I can (250 GB or 500 GB) and upgraded them myself to the size I really want. It's cheaper to buy 20 drives from CDW than 10 from Sun even when you account for the tiny throwaway drives, and you can keep the 10 extras as cold spares. At low enough scale the financial savings are worth the time to replace them as they fail.

(I wish I could say the same of the StorEdge arrays themselves. Fully half of my 2540 controllers have failed, costing me huge amounts of time in both direct and contractual service, and I've given up on them completely as a product line. I'll be thrilled to switch to JBOD.)

For larger and less fault-tolerant systems, when money is available, I'm happy to pay Sun's premium. However, as others say, the other brands sometimes offer decent enough products to use instead of Sun's enterprise line. As always, it depends on your site's requirements and budget. I assume that a home NAS is comparatively low on both, so I wouldn't even shop with Sun unless you have a line on cheap castoffs from an enterprise shop.

--
-D. d...@uchicago.edu, NSIT, University of Chicago
Re: [zfs-discuss] How to grow ZFS on growing pool?
* On 02 Feb 2010, Richard Elling wrote:
> This behaviour has changed twice. Long ago, the pools would autoexpand.
> This is a bad thing, by default, so it was changed such that the expansion
> would only occur on pool import (around 3-4 years ago). The autoexpand
> property allows you to expand without an export/import (and arrived around
> 18 months ago). It is not surprising that various Solaris 10
> releases/patches would have one of the three behaviours.

Well well, I guess it's been a while since I actually tested this. :) Thanks, Richard. I'll watch for autoexpand in the next releases of s10/osol.

--
-D. d...@uchicago.edu, NSIT, University of Chicago
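For the record, on releases that have the property, picking up the new space after swapping larger devices into a vdev looks roughly like this -- a minimal sketch, assuming a hypothetical pool named 'tank':

  # Older behaviour: expansion happens at import time, so cycle the pool.
  zpool export tank
  zpool import tank

  # Newer releases: let the pool grow automatically, no export needed.
  zpool set autoexpand=on tank

  # Either way, verify the new size.
  zpool list tank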
Re: [zfs-discuss] Can zfs snapshot nfs mounts
> guess the upshot is that if one were to daily rsync data to a zfs
> filesystem, the changes wrought there by rsync would be reflected in zfs
> snapshots, maybe timed to happen right after the rsync runs, as these new
> blocks covering only the deltas... I don't really know what deltas are...
> but I guess it would be only the changed parts.

I do this (roughly) for Linux backups. My ZFS server exports a backup dataset via NFS to a Linux machine. Twice a day (4am and 4pm) Linux rsyncs to the NFS mountpoint. Once a day (at midnight) the ZFS server snapshots the dataset.

> And I'm guessing further that one would be able to recover each change
> from the snapshots somehow.

Yes. My ZFS backup dataset has snapdir=hidden, but it's still available over the NFS mount. My Linux users can do this kind of thing:

  cd /nfs/backup/.zfs/snapshot/auto-d20090312
  more somefile

to read somefile from the 12 March 2009 backup.

> In my OP, I mentioned the rsync and rsnapshot backup system on linux as
> being in some way comparable. I do understand how rsnapshot works but
> still not seeing exactly how the zfs snapshots work. Maybe a concrete
> example would be a bit easier to understand if you can give one. I'm still
> not really understanding COW.

Copy on write means that two objects (files) referring to identical data get pointers to the data instead of duplicate copies. As long as these are only read, and not written, the pointer to the same data is fine. When a write occurs, the data is copied and one of the referrers gets a pointer to the new copy. This prevents the write from affecting both referring files. "Copy on write" describes how the technique is used in virtual memory; for disk storage, "copy" isn't necessarily accurate, since the entire data block is rewritten anyway and a separate copy step can be optimized away.

Here's a simple illustration of COW in action. It's not necessarily an accurate depiction of ZFS, but of the general concept in terms of a filesystem.

1. When a file (file A) is written to disk, blocks are allocated for the file and data is stored in those blocks. The blocks each have a reference count, and the ref counts are set to 1 because only one file refers to the blocks.

2. I copy file A to file B. The new file simply refers to all the same blocks. The ref counts are raised to 2.

3. I snapshot the filesystem. This is essentially like copying every file in it, as in #2. No blocks are copied because no new data was written, but ref counts are raised. I'm not sure about ZFS's implementation, but in principle I guess an immutable snapshot should only need to raise the ref count by 1 in total, whereas a mutable snapshot (i.e., a clone) would increment once for every reference in the filesystem.

4. I rsync to the file in step #1. Let's suppose this leaves blocks 1 and 2 alone, but updates block 3. The new data for block 3 is written to a new block (call it 3bis), and block 3 is left on the disk as it is. Block 3's ref count is decremented, and 3bis's ref count is set to 1.

   File A: blocks 1, 2, 3bis
   File B: blocks 1, 2, 3

   Block 1:    ref ct 3 (file A, file B, snapshot)
   Block 2:    ref ct 3 (file A, file B, snapshot)
   Block 3:    ref ct 2 (file B, snapshot)
   Block 3bis: ref ct 1 (file A)

5. I remove file B. Ref counts for its blocks are decremented, but since all its blocks still have ref counts above zero, they persist. No blocks are removed from the dataset.

   File A: blocks 1, 2, 3bis

   Block 1:    ref ct 2 (file A, snapshot)
   Block 2:    ref ct 2 (file A, snapshot)
   Block 3:    ref ct 1 (snapshot)
   Block 3bis: ref ct 1 (file A)

6. I remove file A. Ref counts again decrement.

   Block 1:    ref ct 1 (snapshot)
   Block 2:    ref ct 1 (snapshot)
   Block 3:    ref ct 1 (snapshot)
   Block 3bis: ref ct 0

   Since 3bis no longer has any referrers, it is deallocated. Blocks 1, 2, and 3 are still used by the snapshot, even though the original files A and B are no longer present.

This is a pretty simplistic view. In practice, the COW methodology applies not only to the files' data blocks but also to their metadata, the filesystem's directories, and so on. This ensures that directory information as well as files persist in snapshots. It also explains why snapshots are virtually instantaneous: you only make a new set of pointers to all the existing data, but you don't replace any of the existing data.

> So if I wanted to find a specific change in a file... that would be
> somewhere in the zfs snapshots... say to retrieve a certain formulation in
> some kind of `rc' file that worked better than a later formulation. How
> would I do that?

Using the .zfs/snapshot directory (see above) you can diff two different generations of a file at the same path.

--
-D. d...@uchicago.edu, NSIT, University of Chicago
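To make that last point concrete, comparing or restoring an old generation of a file is just ordinary path manipulation under .zfs/snapshot -- a rough sketch, where the second snapshot name and the path to the rc file are hypothetical:

  # Compare the 12 March formulation against a later one.
  diff /nfs/backup/.zfs/snapshot/auto-d20090312/home/alice/.somerc \
       /nfs/backup/.zfs/snapshot/auto-d20090319/home/alice/.somerc

  # Restore the older formulation by copying it back into the live dataset.
  cp /nfs/backup/.zfs/snapshot/auto-d20090312/home/alice/.somerc \
     /nfs/backup/home/alice/.somerc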
Re: [zfs-discuss] Can this be done?
* On 07 Apr 2009, Michael Shadle wrote:
> Now quick question - if I have a raidz2 named 'tank' already I can expand
> the pool by doing:
>
>   zpool attach tank raidz2 device1 device2 device3 ... device7
>
> It will make 'tank' larger and each group of disks (vdev? or zdev?) will
> be dual parity. It won't create a mirror, will it?

That's correct, although the command for adding a new top-level vdev is 'zpool add', not 'zpool attach'. Anything you're unsure about, you can test. Just create a zpool using files instead of devices:

  for i in 1 2 3 4; do
      mkfile 256m /tmp/file$i
  done
  zpool create testpool raidz /tmp/file1 /tmp/file2 /tmp/file3 /tmp/file4

...and experiment on that. No data risk this way.

--
-D. d...@uchicago.edu, NSIT, University of Chicago
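Continuing the file-backed experiment, a rough sketch of the expansion being discussed, with hypothetical file names:

  # Make four more backing files and add them as a second raidz vdev.
  for i in 5 6 7 8; do
      mkfile 256m /tmp/file$i
  done
  zpool add testpool raidz /tmp/file5 /tmp/file6 /tmp/file7 /tmp/file8

  # 'zpool status' should now show two raidz vdevs striped together,
  # and 'zpool list' a roughly doubled pool size.
  zpool status testpool
  zpool list testpool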
Re: [zfs-discuss] Can this be done?
* On 28 Mar 2009, Peter Tribble wrote:
> The choice of raidz1 versus raidz2 is another matter. Given that you've
> already got raidz1, and you can't (yet) grow that or expand it to raidz2,
> then there doesn't seem to be much point to having the second half of your
> storage being more protected. If you were starting from scratch, then you
> have a choice between a single raidz2 vdev and a pair of raidz1 vdevs.
> (Lots of other choices too, but that is really what you're asking here I
> guess.)

I've had too many joint failures in my life to put much faith in raidz1, especially with 7 disks that likely come from the same manufacturing batch and might exhibit the same flaws. A single-redundancy set of 7 disks (gross) has too low an MTTDL for my taste.

If you can sell yourself on raidz2 and the loss of two more disks' worth of capacity -- a loss which IMO is more than made up for by the gain in security -- consider this technique:

1. build a new zpool of a single raidz2;
2. migrate your data from the old zpool to the new one;
3. destroy the old zpool, releasing its volumes;
4. use 'zpool add' to add those old volumes to the new zpool as a second raidz2 vdev (see Richard Elling's previous post).

Now you have a single zpool consisting of two raidz2 vdevs, as sketched below. The migration in step 2 can be done either by 'zfs send'ing each zfs in the zpool, or by constructing analogous zfs in the new zpool and rsyncing the files across in one go.

--
-D. d...@uchicago.edu, NSIT, University of Chicago
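Roughly, the four steps above might look like this on a ZFS version recent enough to support recursive send/receive ('zfs send -R'); the pool, snapshot, and device names are all hypothetical:

  # 1. Build the new pool as a single raidz2 vdev on the new disks.
  zpool create newpool raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0

  # 2. Migrate the data: snapshot everything and send it across.
  zfs snapshot -r oldpool@migrate
  zfs send -R oldpool@migrate | zfs receive -d newpool

  # 3. Destroy the old pool, releasing its disks.
  zpool destroy oldpool

  # 4. Add the freed disks to the new pool as a second raidz2 vdev.
  zpool add newpool raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0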
Re: [zfs-discuss] ZFS: unreliable for professional usage?
> too many words wasted, but not a single word, how to restore the data. I
> have read the man pages carefully. But again: there's nothing said, that
> on USB drives zfs umount pool is not allowed.

You misunderstand. This particular point has nothing to do with USB; it's the same for any ZFS environment. You're allowed to do a zfs umount on a filesystem, there's no problem with that. But remember that ZFS is not just a filesystem, in the way that reiserfs and UFS are filesystems. It's an integrated storage pooling system and filesystem. When you umount a filesystem, you're not taking any storage offline, you're just removing the filesystem's presence from the VFS hierarchy.

You umounted a zfs filesystem, not touching the pool, then removed the device. This is analogous to preparing an external hardware RAID and creating one or more filesystems, using them a while, umounting one of them, and powering down the RAID. You did nothing to protect the other filesystems or the RAID's r/w cache. Everything on the RAID is now inconsistent and suspect. But since your RAID was a single striped volume, there's no mirror or parity information with which to reconstruct the data.

ZFS is capable of detecting these problems, where other filesystems are often not. But no filesystem can tell what the data should have been when the only copy of the data is damaged. This is documented in ZFS. It's not about USB; it's just that USB devices can be more vulnerable to this kind of treatment than other kinds of storage are.

> And again: Why should a 2 weeks old Seagate HDD suddenly be damaged, if
> there was no shock, hit or any other event like that?

It happens all the time. We just don't always know about it.

--
-D. d...@uchicago.edu, NSIT, University of Chicago
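To make the distinction concrete -- a minimal sketch, with hypothetical pool and dataset names -- the safe sequence before unplugging a removable device is to export the pool, not merely unmount a filesystem in it:

  # Removes the filesystem from the VFS namespace only; the pool and its
  # in-flight state remain active on the device.
  zfs umount usbpool/backup

  # Quiesces the whole pool and flushes it to disk -- do this before
  # pulling the cable.
  zpool export usbpool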
Re: [zfs-discuss] using USB memory keys for l2arc and zil
> Would there be an advantage to using 4GB USB memory sticks on a home
> system for zil and l2arc?

Probably not. Most USB devices are slower than SATA disks; moreover, all USB devices are slower than most SATA disks.

--
-D. d...@uchicago.edu, NSIT, University of Chicago
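For reference, cache (L2ARC) and log (ZIL) devices are attached with 'zpool add', so experimenting is cheap if you do have fast devices to spare -- a rough sketch, with hypothetical pool and device names:

  # Add a separate intent-log device and a level-2 ARC cache device.
  zpool add tank log c2t1d0
  zpool add tank cache c2t2d0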
Re: [zfs-discuss] j4200 drive carriers
> nevermind, i will just get a Promise array.

Don't. I don't normally like to badmouth vendors, but my experience with Promise was one of the worst in my career, for reasons that should be relevant to other ZFS-oriented customers.

We ordered a Promise array because their tech sheet said Solaris was supported. We received it and set it up, and from the start got scsi errors from the array when configuring devices. (This is before even touching ZFS; at this stage we just wanted to run fdisk.) It took a while to find someone at Promise, and when we did they wouldn't open a case ticket because, they said, Solaris was unsupported. When I went back to their web site -- a horrible site, by the way -- the tech sheet had been replaced with one that did NOT list Solaris among the supported OSes, although the author and date of the PDF file were the same. I wrote to my contact at Promise, but they stuck to their guns on the non-support even after I sent them copies of both PDFs.

I cajoled my Sun account manager into connecting us with someone who might be able to figure it out, but no one could. It took several months to get Promise to agree to refund our unit, and only because our retailer (CDW) took the reins and held on tight. Promise stopped returning my e-mail long before that.

Others may have different fortune with them; we were using the dual-controller FC Vtrak, whatever the model number is, and maybe other interfaces work better. But after the support issue, I wouldn't dare touch them again for use on Solaris.

--
-D. d...@uchicago.edu, NSIT, University of Chicago
Re: [zfs-discuss] after controller crash, 'one or more devices is currently unavailable'
I have a feeling I pushed people away with a long message. Let me reduce my problem to one question.

  # zpool import -f z
  cannot import 'z': one or more devices is currently unavailable

'zdb -l' shows four valid labels for each of these disks except for the new one. Is this what unavailable means, in this case? I have now faked up a label for the disk that didn't have one and applied it with dd. Can anyone say what unavailable means, given that all eight disks are registered devices at the correct paths, are readable, and have labels?

--
-D. [EMAIL PROTECTED], NSIT, University of Chicago
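As a rough sketch of that label check, run per device against one of the pool's device paths (the path below is taken from the follow-up post and stands in for any of the eight):

  # Print the four ZFS labels that should exist on the device; a missing or
  # stale label is one common reason a device is reported as unavailable.
  zdb -l /dev/rdsk/c6t600A0B800049F9E1031748B3E020d0s0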
[zfs-discuss] after controller crash, 'one or more devices is currently unavailable'
There are a lot of hits for this error in google, but I've had trouble identifying any that resemble my situation. I apologize if you've answered it before. If it's better for me to open a case with Sun Support, I can do that, but I'm hoping to cheat my way around the system so that I don't have to send somebody Explorer output before they escalate it. Seems more efficient in the long run. :)

Most of my tale of woe is background: I have a pool running under Solaris 10 5/08. It's an 8-member raidz2 whose volumes are on a 2540 array with two controllers. Volumes are mapped 1:1 with physical disks. I didn't really want a 2540, but I couldn't get anyone to swear to me that any other fiber-channel product would work with Solaris. I'm using fiber multipathing.

I've had two disk failures in the past two weeks. Last week I replaced the first. No problems with ZFS initially; a 'zpool replace' did the right thing. Yesterday I replaced the second. But while investigating the problem I noticed that two of my paths had gone down, so that 6 disks had both paths attached and 2 disks had only one path. At this time, 'zpool status' showed:

  pool: z
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist
        for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Fri Oct 24 20:04:51 2008
config:

        NAME                                     STATE     READ WRITE CKSUM
        z                                        DEGRADED     0     0     0
          raidz2                                 DEGRADED     0     0     0
            c6t600A0B800049F9E1030548B3DF1Ed0s0  ONLINE       0     0     0
            c6t600A0B800049F9E1030848B3DF52d0s0  ONLINE       0     0     0
            c6t600A0B800049F9E1030B48B3DF7Ed0s0  ONLINE       0     0     0
            c6t600A0B800049F9E1030E48B3DFA6d0s0  ONLINE       0     0     0
            c6t600A0B800049F9E1031148B3DFD2d0s0  ONLINE       0     0     0
            c6t600A0B800049F9E1031448B3DFFAd0s0  ONLINE       0     0     0
            c6t600A0B800049F9E1031748B3E020d0s0  UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E1031A48B3E04Cd0s0  ONLINE       0     0     0

(At the time I hadn't figured it out, but I believe now that the one disk was UNAVAIL because the disk had not been properly partitioned yet, so s0 was undefined.)

Solaris 10's mpath support seems so far to be fairly intolerant of reconfiguration without a reboot, and I wasn't ready to reboot yet, but I thought I'd try resetting the controller that wasn't attached to all of the disks. But it appears that for some reason the CAM software reset both controllers simultaneously. The whole pool went into an error state, and all disks became unavailable. Very annoying, but not a problem for zfs-discuss. At this time, 'zpool status' showed:

  pool: z
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        z                                        FAULTED      0     0     0  corrupted data
          raidz2                                 DEGRADED     0     0     0
            c6t600A0B800049F9E1030548B3DF1Ed0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E1030848B3DF52d0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E1030B48B3CF7Ed0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E1030E48B3DFA6d0s0  UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E1031148B3DFD2d0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E1031448B3DFFAd0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E1031748B3E020d0s0  UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E1031A48B3E04Cd0s0  UNAVAIL      0     0     0  corrupted data

I don't know whether there's any chance of recovering this, but I wanted to try. I reset the 2540 again, but still no communication with Solaris. I rebooted the server, and communications resumed. I had to do some further repair/reconfig on the 2540 for the two disks marked 'cannot open', but it was a minor issue and worked fine. Solaris was then able to see all my disks.

Now we come to the main point. I still hadn't figured out the partitioning problem on E020d0s0 yet. It didn't occur to me because I believed that to be a spare disk which I had already