Re: [zfs-discuss] non-ECC Systems and ZFS for home users
On 09/23/10 19:08, Peter Jeremy wrote: The downsides are generally that it'll be slower and less power- efficient that a current generation server and the I/O interfaces will be also be last generation (so you are more likely to be stuck with parallel SCSI and PCI or PCIx rather than SAS/SATA and PCIe). And when something fails (fan, PSU, ...), it's more likely to be customised in some way that makes it more difficult/expensive to repair/replace. Sometimes the bargains on E-Bay are such that you can afford to get 2 or even a 3rd machine for spares, and a PCI-X SAS card has more than adequate performance for SOHO use. But, I agree, repair is probably impossible unless you can simply swap in a spare part from another box. However server class machines are pretty tough. My used Sun hardware has yet to drop a beat and they've been running 24*7 for years - well, I cycle the spares since they were never needed for parts, so it's less than that. But they are noisy... Surely the issue about repairs extends to current generation hardware. It gets obsolete so quickly that finding certain parts (especially mobos) may be next to impossible. So what's the difference other than lots of $$$? Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] non-ECC Systems and ZFS for home users
On 09/23/10 03:01, Ian Collins wrote: So, I wonder - what's the recommendation, or rather, experience as far as home users are concerned? Is it safe enough now do use ZFS on non-ECC-RAM systems (if backups are around)? It's as safe as running any other OS. The big difference is ZFS will tell you when there's a corruption. Most users of other systems are blissfully unaware of data corruption! This runs you into the possibility of perfectly good files becoming inaccessible due to bad checksums being written to all the mirrors. As Richard Elling wrote some time ago in [zfs-discuss] You really do need ECC RAM, see http://www.cs.toronto.edu/%7Ebianca/papers/sigmetrics09.pdf. There were a couple of zfs-discuss threads quite recently about memory problems causing serious issues. Personally, I wouldn't trust any valuable data to any system without ECC, regardless of OS and file systems. For home use, used Suns are available at ridiculously low prices and they seem to be much better engineered than your typical PC. Memory failures are much more likely than winning the pick 6 lotto... FWIW Richard helped me diagnose a problem with checksum failures on mirrored drives a while back and it turned out to be the CPU itself getting the actual checksum wrong /only on one particular file/, and even then only when the ambient temperature was high. So ZFS is good at ferreting out obscure hardware problems :-). Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver of older root pool disk
Bumping this because no one responded. Could this be because it's such a stupid question no one wants to stoop to answering it, or because no one knows the answer? Trying to picture, say, what could happen in /var (say /var/adm/messages), let alone a swap zvol, is giving me a headache... On 07/09/10 17:00, Frank Middleton wrote: This is a hypothetical question that could actually happen: Suppose a root pool is a mirror of c0t0d0s0 and c0t1d0s0 and for some reason c0t0d0s0 goes off line, but comes back on line after a shutdown. The primary boot disk would then be c0t0d0s0 which would have much older data than c0t1d0s0. Under normal circumstances ZFS would know that c0t0d0s0 needs to be resilvered. But in this case c0t0d0s0 is the boot disk. Would ZFS still be able to correctly resilver the correct disk under these circumstances? I suppose it might depend on which files, if any, had actually changed... Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] carrying on [was: Legality and the future of zfs...]
On 07/19/10 07:26, Andrej Podzimek wrote: I run ArchLinux with Btrfs and OpenSolaris with ZFS. I haven't had a serious issue with any of them so far. Moblin/Meego ships with btrfs by default. COW file system on a cell phone :-). Unsurprisingly for a read-mostly file system it seems pretty stable. There's an interesting discussion about btrfs on Meego at http://lwn.net/Articles/387196/ Undoubtedly, ZFS is currently much more mature and usable than Btrfs. Agreed, but it's not just ZFS, though. It's the packaging system, beadm, stmf, the whole works. A simple yum update can be a terrifying experience and almost impossible to undo. And updating to a major new Linux release? Almost as bad as updating MSWindows. Open Solaris as an administerable system is simply years ahead of anything else. However, Btrfs can evolve very quickly, considering the huge community around Linux. For example, EXT4 was first released in late 2006 and I first deployed it (with a stable on-disk format) in early 2009. But the infrastructure to make use of a ZFS-like manager simply isn't there. As a Linux and Solaris developer and user of both, I'd take Solaris any day and so would everyone I know. But going back to the original topic, the tea leaves seem to be saying that Oracle is interested primarily in Solaris as a robust server OS and probably not so much for the desktop where there realistically isn't going to be much revenue. But it would be a bad gamble if they lose a lot of mind-share. Legal issues over ZFS make it even worse. I get calls for help converting MSWindows applications and servers to Linux. ZFS and all the other goodies make a compelling case for Solaris (and Sun/Oracle hardware) instead but the uncertainties make it a hard sell. Oracle are you listening? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Move Fedora or Windows disk image to ZFS (iScsi Boot)
On 07/18/10 17:39, Packet Boy wrote: What I can not find is how to take an existing Fedora image and copy the it's contents into a ZFS volume so that I can migrate this image from my existing Fedora iScsi target to a Solaris iScsi target (and of course get the advantages of having that disk image hosted on ZFS). Do I just zfs create -V and then somehow dd the Fedora .img file on top of the newly created volume? Well, you could simply mount the iscsi devices and choose any method that is suitable to copy the existing volume. For example Fedora will create /dev/sd* for each iscisi device it knows about, so you see an empty drive at that point and the problem simply devolves to whatever you would do if you wanted to use a new physical drive. nftsclone works for MSWindows, I suppose dd might work for Linux, although the disk geometries would have to be identical and you'd have to copy the entire disk. It might be safer to create new file systems on the new disk, and use cpio or even tar to copy everything. Shame it's so hard to do mirroring with Fedora, so the ZFS mirror trick might be too difficult. I've spent hours and have not been able to find any example on how to do this. Making the new drive bootable is the real problem since it will probably not have the same identifier. For sure you'd have to edit grub ion the new drive and perhaps run grub interactively to install a boot loader. Hope this helps -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] resilver of older root pool disk
This is a hypothetical question that could actually happen: Suppose a root pool is a mirror of c0t0d0s0 and c0t1d0s0 and for some reason c0t0d0s0 goes off line, but comes back on line after a shutdown. The primary boot disk would then be c0t0d0s0 which would have much older data than c0t1d0s0. Under normal circumstances ZFS would know that c0t0d0s0 needs to be resilvered. But in this case c0t0d0s0 is the boot disk. Would ZFS still be able to correctly resilver the correct disk under these circumstances? I suppose it might depend on which files, if any, had actually changed... Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs/lofi/share panic
On 05/27/10 05:16 PM, Dennis Clarke wrote: I just tried this with a UFS based filesystem just for a lark. It never failed on UFS, regardless of the contents of /etc/dfs/dfstab. Guess I must now try this with a ZFS fs under that iso file. Just tried it again with b134 *with* share /mnt in /etc/dfs/dfstab. # mount -O -F hsfs /export/iso_images/moblin-2.1-PR-Final-ivi-201002090924.img /mnt # ls /mnt isolinux LiveOS # unshare /mnt /mnt: path doesn't exist # share /mnt # unshare /mnt # share /mnt Panic ensues (the following observed on the serial console); note that the dataset is not UFS! # May 30 13:35:44 host5 ufs: NOTICE: mount: not a UFS magic number (0x0) panic[cpu1]/thread=30001f5f560: BAD TRAP: type=31 rp=2a1014769a0 addr=218 mmu_fsr=0 occurred in module nfssrv due to a NULL pointer dereference Tried again after it rebooted Edited /etc/dfs/dfstab to remove the share /mnt # unshare /mnt # mount -O -F hsfs /backups/icon/moblin-2.1-PR-Final-ivi-201002090924.img /mnt # ls /mnt isolinux LiveOS # unshare /mnt /mnt: bad path # share /mnt # unshare /mnt # share /mnt No panic. So the problem all along appears to be what happens if you mount -O to an already shared mountpoint. Deliberately sharing before mounting (but with nothing in /etc/dfs/dfstab) resulted in a slightly different panic (more like the ones documented in the CR): panic[cpu1]/thread=30002345e0: BAD TRAP: type=34 rp=2a100f84460 addr=ff6f6c2f5267 mmu_fsr=0 unshare: alignment error: So CR6798273 should be amended to show the following: To reproduce, share (say) /mnt mount -O some-image-file /mnt share /mnt unshare /mnt share/mnt unshare ./mnt Highly reproducible panic ensues. Workaround - make sure mountpoints are not shared before mounting iso images stored on a ZFS dataset. So the problem, now seen to be relatively trivial, isn't fixed. at least in b134. For all of you who responded both off and on the list and motivated this experiment, much thanks. Perhaps someone with access to a more recent build could try this, and if it still happens, update and reopen CR6798273, although it doesn't seem very important now. Regards -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs/lofi/share panic
Many many moons ago, I submitted a CR into bugs about a highly reproducible panic that occurs if you try to re-share a lofi mounted image. That CR has AFAIK long since disappeared - I even forget what it was called. This server is used for doing network installs. Let's say you have a 64 bit iso lofi-mounted and shared. You do the install, and then wish to switch to a 32 bit iso. You unshare, umount, delete the loopback, and then lofiadm the new iso, mount it and then share it. Panic, every time. Is this such a rare use-case that no one is interested? I have the backtrace and cores if anyone wants them, although such were submitted with the original CR. This is pretty frustrating since you start to run out of ideas for mountpoint names after a while unless you forget and get the panic. FWIW (even on a freshly booted system after a panic) # lofiadm zyzzy.iso /dev/lofi/1 # mount -F hsfs /dev/lofi/1 /mnt mount: /dev/lofi/1 is already mounted or /mnt is busy # mount -O -F hsfs /dev/lofi/1 /mnt # share /mnt # If you unshare /mnt and then do this again, it will panic. This has been a bug since before Open Solaris came out. It doesn't happen if the iso is originally on UFS, but UFS really isn't an option any more. FWIW the dataset containing the isos has the sharenfs attribute set, although it doesn;t have to be actually mounted by any remote NFS for this panic to occur. Suggestions for a workaround most welcome! Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sharing with zfs
On 05/ 4/10 05:37 PM, Vadim Comanescu wrote: Im wondering is there a way to actually delete a zvol ignoring the fact that it has attached LU? You didn't say what version of what OS you are running. As of b134 or so it seems to be impossible to delete a zfs iscsi target. You might look at the thread: [zfs-discuss] How to destroy iscsi dataset?, however it never really came to any really satisfying conclusion. AFAIK the only way to delete a zfs iscsi target is to boot b132 or earlier in single user mode. IIRC there are iscsigt and COMSTAR changes coming in later releases so it night be worth trying again when we eventually get to go past b134. HTH -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD best practices
On 04/20/10 11:06 AM, Don wrote: Who else, besides STEC, is making write optimized drives and what kind of IOP performance can be expected? Just got a distributor email about Texas Memory Systems' RamSan-630, one of a range of huge non-volatile SAN products they make. Other than that this has a capacity of 4-10TB, looks like a 4U, and consumes an amazing 450W, I don't know anything about them. The iops are pretty impressive, but power-wise, at 45W/TB even mirrored disks use quite a bit less power. But 500K random iops and 8GB/s might be worth it if the specs are to be believed... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Making an rpool smaller?
On 04/16/10 07:41 PM, Brandon High wrote: 1. Attach the new drives. 2. Reboot from LiveCD. 3. zpool create new_rpool on the ssd Is step 2 actually necessary? Couldn't you create a new BE # beadm create old_rpool # beadm activate old_rpool # reboot # beadm delete rpool It's the same number of steps but saves the bother of making a zpool version compatible live cd. Also, how attached are you the pool name rpool? I have systems with root pools called spool, tpool, etc., even one rpool-1 (because the text installer detected an earlier rpool on an iscsi volume I was overwriting) and they all seem to work fine. Actually. my preferred method (if you really want the new pool to be called rpool) would be to do the 4 step rename on the ssd after all the other steps are done and you've sucessfully booted it. Then you always have the untouched old disk in case you mess up. Also, (gurus please correct here), you might need to change step 3 to something like # zpool create -f -o failmode=continue -R /mnt -m legacy rpool ssd in which case you can recv to it without rebooting at all, and #zpool set bootfs =... You might also consider where you want swap to be and make sure that vfstab is correct on the old disk now that the root pool has a different name. There was detailed documentation on how to zfs send/recv root pools on the Sun ZFS documentation site, but right now it doesn't seem to be Googleable. I'm not sure your original set of steps will work without at least doing the above two. You might need to check to be sure the ssd has an SMI label. AFAIK the official syntax for installing the MBR is # installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/ssd Finally, you should check or delete /etc/zfs/zpool.cache because it will likely be incorrect on the ssd after recv'ing the snapshot. HTH -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Making an rpool smaller?
On 04/16/10 08:57 PM, Frank Middleton wrote: AFAIK the official syntax for installing the MBR is # installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/ssd Sorry, that's for SPARC. You had the installgrub down correctly... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Making an rpool smaller?
On 04/16/10 09:53 PM, Brandon High wrote: Right now, my boot environments are named after the build it's running. I'm guessing that by 'rpool' you mean the current BE above. No, I didn't :-(. Please ignore that part - too much caffeine :-). I figure that by booting to a live cd / live usb, the pool will not be in use, so there shouldn't be any special steps involved. Might be the easiest way. But I've never found having a different name for the root pool to be a problem. The lack, until recently, of a bootable CD for SPARC may have something to do with living with different names. Makes it easier to recv snapshots from different hosts and architectures, too. I'll try out a few variations on a VM and see how it goes. You'll need to do the zfs create with legacy mount option, and set the bootfs property. Otherwise it looks like you are on the right path. Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS RaidZ recommendation
On 04/ 7/10 03:09 PM, Jason S wrote: I was actually already planning to get another 4 gigs of ram for the box right away anyway, but thank you for mentioning it! As there appears to be a couple ways to skin the cat here i think i am going to try both a 14 spindle RaidZ2 and 2 X 7 RaidZ2 configuration and see what the performance is like. I have a fews days of grace before i need to have this server ready for duty. Just curious, what are you planning to boot from? AFAIK you can't boot ZFS from anything much more complicated than a mirror. Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Diagnosing Permanent Errors
On 04/ 4/10 10:00 AM, Willard Korfhage wrote: What should I make of this? All the disks are bad? That seems unlikely. I found another thread http://opensolaris.org/jive/thread.jspa?messageID=399988 where it finally came down to bad memory, so I'll test that. Any other suggestions? It could be the cpu. I had a very bizarre case where the cpu would sometimes miscalculate the checksums of certain files and mostly when the cpu was also busy doing other things. Probably the cache. Days of running memtest and SUNWvts didn't result in any errors because this was a weirdly pattern sensitive problem. However, I too am of the opinion that you shouldn't even think of running zfs without ECC memory (lots of threads about that!) and that this is far, far more likely to be your problem, but I wouldn't count on diagnostics finding it, either. Of course it could be the controller too. For laughs, the cpu calculating bad checksums was discussed in http://opensolaris.org/jive/message.jspa?messageID=469108 (see last message in the thread). If you are seriously contemplating using a system with non-ECC RAM, check out the Google research mentioned in http://opensolaris.org/jive/thread.jspa?messageID=423770 http://www.cs.toronto.edu/%7Ebianca/papers/sigmetrics09.pdf Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool split problem?
On 03/31/10 12:21 PM, lori.alt wrote: The problem with splitting a root pool goes beyond the issue of the zpool.cache file. If you look at the comments for 6939334 http://monaco.sfbay.sun.com/detail.jsf?cr=6939334, you will see other files whose content is not correct when a root pool is renamed or split. 6939334 seems to be inaccessible outside of Sun. Could you list the comments here? Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] How to destroy iscsi dataset?
Our backup system has a couple of datasets used for iscsi that have somehow lost their baseline snapshots with the live system. In fact zfs list -t snapshots doesn't show any snapshots at all for them. We rotate backup and live every now and then, so these datasets have been shared at some time. Therefore an incremental zfs send/recv will fail for these datasets. The send script automatically uses a non-incremental send if the target dataset is missing, so all I need to do is somehow destroy them. # svcs -a | grep iscsi disabled 18:50:21 svc:/network/iscsi_initiator:default disabled 18:50:34 svc:/network/iscsi/target:default disabled 18:50:38 svc:/system/iscsitgt:default disabled 18:50:39 svc:/network/iscsi/initiator:default # zfs list space/os-vdisks/osolx86 NAME USED AVAIL REFER MOUNTPOINT space/os-vdisks/osolx8620G 657G 14.9G - # zfs get shareiscsi space/os-vdisks/osolx86 NAME PROPERTYVALUE SOURCE space/os-vdisks/osolx86 shareiscsi off local # zfs destroy -f space/os-vdisks/osolx86 cannot destroy 'space/os-vdisks/osolx86': dataset is busy AFAIK they aren't shared in any way now. How to delete these datasets, or find out why they are busy? Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] CR 6880994 and pkg fix
Thanks to everyone who made suggestions! This machine has run memtest for a week and VTS for several days with no errors. It does seem that the problem is probably in the CPU cache. On 03/24/10 10:07 AM, Damon Atkins wrote: You could try copying the file to /tmp (ie swap/ram) and do a continues loop of checksums On a variation of your suggestion, I implemented a bash script that applies sha1sum 10,000 times with a pause of 0.1S between each attempt, and tests the result against what seemed to be the correct result. sha1sum on /lib/libdlpi.so.1 resulted in 11% of incorrect results sha1sum on /tmp/libdlpi.so.1 resulted in 5 failures out of 10,000 sha1sum on /lib/libpam.so.1 resulted in zero errors in 10,000 sha1sum on /tmp/libpam.so.1ditto. So what we have is a pattern sensitive failure that is also sensitive to how busy the cpu is (and doesn't fail running VTS). md5sum and sha256sum produced similar results, and presumably so would fletcher2. To get really meaningful results, the machine should be otherwise idle (but then, maybe it wouldn't fail). Is anyone willing to speculate (or have any suggestions for further experiments) about what failure mode could cause a checksum calculation to be pattern sensitive and also thousands of times more likely to fail if read from disk vs. tmpfs? FWIW the failures are pretty consistent, mostly but not always producing the same bad checksum. So at boot, the cpu is busy, increasing the probability of this pattern sensitive failure, and this one time it failed on every read of /lib/libdlpi.so.1. With copies=1 this was twice as likely to happen, and when it did ZFS returned an error on any attempt to read the file. With copies=2 in this case it doesn't return an error when attempting to read. Also there were no set-bit errors this time, but then I have no idea what a set-bit error is. On 03/24/10 12:32 PM, Richard Elling wrote: Clearly, fletcher2 identified the problem. Ironically, on this hardware it seems it created the problem :-). However you have been vindicated - it was a pattern sensitive problem as you have long suggested it might be. So: that the file is still readable is a mystery, but how it became to be flagged as bad in ZFS isn't, any more. Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zpool split problem?
Zpool split is a wonderful feature and it seems to work well, and the choice of which disk got which name was perfect! But there seems to be an odd anomaly (at least with b132) . Started with c0t1d0s0 running b132 (root pool is called rpool) Attached c0t0d0s0 and waited for it to resilver Rebooted from c0t0d0s0 zpool split rpool spool Rebooted from c0t0d0s0, both rpool and spool were mounted Rebooted from c0t1d0s0, only rpool was mounted It seems to me for consistency rpool should not have been mounted when booting from c0t0d0s0; however that's pretty harmless. But: Rebooted from c0t0d0s0 - a couple of verbose errors on the console... # zpool status rpool pool: rpool state: UNAVAIL status: One or more devices could not be used because the label is missing or invalid. There are insufficient replicas for the pool to continue functioning. action: Destroy and re-create the pool from a backup source. see: http://www.sun.com/msg/ZFS-8000-5E scrub: none requested config: NAME STATE READ WRITE CKSUM rpool UNAVAIL 0 0 0 insufficient replicas mirror-0UNAVAIL 0 0 0 insufficient replicas c0t1d0s0 FAULTED 0 0 0 corrupted data c0t0d0s0 FAULTED 0 0 0 corrupted data # zpool status spool pool: spool state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM spool ONLINE 0 0 0 c0t0d0s0 ONLINE 0 0 0 It seems that ZFS thinks c0t0d0s0 is still part of rpool as well as being a separate pool (spool). # zpool export rpool cannot open 'rpool': I/O error This worked since zpool list doesn't show rpool any more. Reboot c0t1d0s0 - no problem (no spool) Reboot c0t0d0s0 - no problem (no rpool) The workaround seems to be to export rpool the first time you boot c0t0d0s0. No big deal but it's a bit scary when it happens. Has this been fixed in a later release? Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] CR 6880994 and pkg fix
On 03/22/10 11:50 PM, Richard Elling wrote: Look again, the checksums are different. Whoops, you are correct, as usual. Just 6 bits out of 256 different... Last year expected 4a027c11b3ba4cec bf274565d5615b7b 3ef5fe61b2ed672e ec8692f7fd33094a actual 4a027c11b3ba4cec bf274567d5615b7b 3ef5fe61b2ed672e ec86a5b3fd33094a Last Month (obviously a different file) expected 4b454eec8aebddb5 3b74c5235e1963ee c4489bdb2b475e76 fda3474dd1b6b63f actual 4b454eec8aebddb5 3b74c5255e1963ee c4489bdb2b475e76 fda354c1d1b6b63f Look which bits are different - digits 24, 53-56 in both cases. But comparing the bits, there's no discernible pattern. Is this an artifact of the algorithm made by one erring bit always being at the same offset? don't forget the -V flag :-) I didn't. As mentioned there are subsequent set-bit errors, (14 minutes later) but none for this particular incident. I'll send you the results separately since they are so puzzling. These 16 checksum failures on libdlpi.so.1 were the only fmdump -eV entries for the entire boot sequence except that it started out with one ereport.fs.zfs.data, whatever that is, for a total of exactly 17 records, 9 in 1 uS, then 8 more 40 mS later, also in 1uS. Then nothing for 4 minutes, one more checksum failure (bad_range_sets =) then 10 minutes later, two with the set-bits error, one for each disk. That's it. o Why is the file flagged by ZFS as fatally corrupted still accessible? This is the part I was hoping to get answers for since AFAIK this should be impossible. Since none of this is having any operational impact, all of these issues are of interest only, but this is a bit scary! Broken CPU, HBA, bus, memory, or power supply. No argument there. Doesn't leave much, does it :-). Since the file itself appears to be uncorrupted, and the metadata is consistent for all 16 entries, it would seem that the checksum calculation itself is failing because it would appear in this case that everything else is OK. Is there a way to apply the fletcher2 algorithm interactively as in sum(1) or cksum(1) (i.e., outside the scope of ZFS) to see if it is in some way pattern sensitive with this CPU? Since only a small subset of files is affected, this should be easy to verify. Start a scrub to heat things up and then in parallel do checksums in a tight loop... Transient failures are some of the most difficult to track down. Not all transient failures are random. Indeed, although this doesn't seem to be random. The hits to libdlpi.so.1 seems to be quite reproducible as you've seen from the fmdump log, although I doubt this particular scenario will happen again. Can you think of any tools to investigate this? I suppose I could extract the checksum code from ZFS itself to build one, but that would take quite a lot of time. Is there any documentation that explains the output of fmdump -eV? What are set-bits, for example? I guess not... from man fmdump(1m) The error log file contains /Private/ telemetry informa- tion used by Sun's automated diagnosis software. .. Each problem recorded in the fault log is identified by: oThe time of its diagnosis So did ZFS really read 8 copies of libdlpi.so.1 within 1uS, wait 40mS and then read another 8 copies in 1uS again? I doubt it :-). I bet it took 1uS just to (mis)calculate the checksum (1.6GHz 16 bit cpu). Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] CR 6880994 and pkg fix
On 03/21/10 03:24 PM, Richard Elling wrote: I feel confident we are not seeing a b0rken drive here. But something is clearly amiss and we cannot rule out the processor, memory, or controller. Absolutely no question of that, otherwise this list would be flooded :-). However, the purpose of the post wasn't really to diagnose the hardware but to ask about the behavior of ZFS under certain error conditions. Frank reports that he sees this on the same file, /lib/libdlpi.so.1, so I'll go out on a limb and speculate that there is something in the bit pattern for that file that intermittently triggers a bit flip on this system. I'll also speculate that this error will not be reproducible on another system. Hopefully not, but you never know :-). However, this instance is different. The example you quote shows both expected and actual checksums to be the same. This time the expected and actual checksums are different and fmdump isn't flagging any bad_ranges or set-bits (the behavior you observed is still happening, but orthogonal to this instance at different times and not always on this file). Since file itself is OK, and the expected checksums are always the same, neither the file nor the metatdata appear to be corrupted, so it appears that both are making it into memory without error. It would seem therefore that it is the actual checksum calculation that is failing. But, only at boot time, the calculated (bad) checksums differ (out of 16, 10, 3, and 3 are the same [1]) so it's not consistent. At this point it would seem to be cpu or memory, but why only at boot? IMO it's an old and feeble power supply under strain pushing cpu or memory to a margin not seen during normal operation, which could be why diagnostics never see anything amiss (and the importance of a good power supply). FWIW the machine passed everything vts could throw at it for a couple of days. Anyone got any suggestions for more targeted diagnostics? There were several questions embedded in the original post, and I'm not sure any of them have really been answered: o Why is the file flagged by ZFS as fatally corrupted still accessiible? [is this new behavior from b111b vs b125?]. o What possible mechanism could there be for the /calculated/ checksums of /four/ copies of just one specific file to be bad and no others? o Why did this only happen at boot to just this one file which also is peculiarly subject to the bitflips you observed, also mostly at boot (sometimes at scrub)? I like the feeble power supply answer, but why just this one file? Bizarre... # zpool get failmode rpool NAME PROPERTY VALUE SOURCE rpool failmode wait default This machine is extremely memory limited, so I suspect that libdlpi.so.1 is not in a cache. Certainly, a brand new copy wouldn't be, and there's no problem writing and (much later) reading the new copy (or the old one, for that matter). It remains to be seen if the brand new copy gets clobbered at boot (the machine, for all it's faults, remains busily up and operational for months at a time). Maybe I should schedule a reboot out of curiosity :-). This sort of specific error analysis is possible after b125. See CR6867188 for more details. Wasn't this in b125? IIRC we upgraded to b125 for this very reason. There certainly seems to be an overwhelming amount of data in the various logs! Cheers -- Frank [1] This could be (3+1) * 4 where in one instance all 3+1 happen to be the same. Does ZFS really read all 4 copies 4 times (by fmdump timestamp, 8 within 1uS, 40mS later, another 8, again within 1uS)? Not sure what the fmdump timestamps mean, so it's hard to find any pattern. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] CR 6880994 and pkg fix
On 03/15/10 01:01 PM, David Dyer-Bennet wrote: This sounds really bizarre. Yes, it is. ButCR 6880994 is bizarre too. One detail suggestion on checking what's going on (since I don't have a clue towards a real root-cause determination): Get an md5sum on a clean copy of the file, say from a new install or something, and check the allegedly-corrupted copy against that. This can fairly easily give you a pretty reliable indication if the file is truly corrupted or not. With many thanks to Danek Duvall, I got a new copy of libdlpi.so.1 # md5sum /lib/libdlpi.so.1 2468392ff87b5810571572eb572d0a41 /lib/libdlpi.so.1 # md5sum /lib/libdlpi.so.1.orig 2468392ff87b5810571572eb572d0a41 /lib/libdlpi.so.1.orig # zpool status -v errors: Permanent errors have been detected in the following files: //lib/libdlpi.so.1.orig So here we seem to have an example of a ZFS false positive, the first I've see or heard of. The good news is that it is still possible to read the file, so this augers well for the ability to boot under this circumstance. FWIW fmdump does seem to show show actual checksum errors on all four copies in 16 attempts to read them. There were 3 groups of different bad checksums; within each group the checksum was the same but differed from the expected. Perhaps someone who can could add this to CR 6880994 in the hopes that it might help lead to a better understanding. For the casual reader, CR 6880994 is about a pathological PC that gets checksum errors on the same set of files at boot, even though the root pool is mirrored. With copies=2, usually ZFS can repair them. But after a recent power cycle, all 4 copies reported bad checksums but in reality the the file seems to be uncorrupted. The machine has no ECC and flaky bus parity, so there are plenty of ways for the data to get messed up. It's a mystery why this only happens at boot, though. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] CR 6880994 and pkg fix
Can anyone say what the status of CR 6880994 (kernel/zfs Checksum failures on mirrored drives) might be? Setting copies=2 has mitigated the problem, which manifests itself consistently at boot by flagging libdlpi.so.1, but two recent power cycles in a row with no normal shutdown has resulted in a permanent error even with copies=2 on all of the root pool (and specifically having duplicated /lib to make sure there are 2 copies). How can it even be remotely possible to get a checksum failure on mirrored drives with copies=2? That means all four copies were corrupted? Admittedly this is on a grotty PC with no ECC and flaky bus parity, but how come the same file always gets flagged as being clobbered (even though apparently it isn't). The oddest part is that libdlpi.so.1 doesn't actually seem to be corrupted. nm lists it with no problem and you can copy it to /tmp, rename it, and then copy it back. objdump and readelf can all process this library with no problem. But pkg fix flags an error in it's own inscrutable way. CCing pkg-discuss in case a pkg guru can shed any light on what the output of pkg fix (below) means. Presumably libc is OK, or it wouldn't boot :-). This with b125 on X86. # zpool status -v pool: rpool state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: none requested config: NAMESTATE READ WRITE CKSUM rpool ONLINE 0 0 0 mirror-0 ONLINE 0 0 2 c3d1s0 ONLINE 0 0 2 c3d0s0 ONLINE 0 0 2 errors: Permanent errors have been detected in the following files: //lib/libdlpi.so.1 # pkg fix SUNWcsl Verifying: pkg://opensolarisdev/SUNWcsl ERROR file: lib/libc.so.1 Elfhash: cbb55a2ea24db9e03d9cd08c25b20406896c2fef should be 0e73a56d6ea0753f3721988ccbd716e370e57c4e Created ZFS snapshot: 2010-03-13-23:39:17 . || Repairing: pkg://opensolarisdev/SUNWcsl pkg: Requested fix operation would affect files that cannot be modified in live image. Please retry this operation on an alternate boot environment # nm /lib/libdlpi.so.1 00015562 b Bbss.bss 00015562 b Bbss.bss 00015240 d Ddata.data 00015240 d Ddata.data 000152f8 d Dpicdata.picdata 3ca8 r Drodata.rodata 3ca0 r Drodata.rodata A SUNW_1.1 A SUNWprivate 000150ac D _DYNAMIC 00015562 b _END_ 00015000 D _GLOBAL_OFFSET_TABLE_ 16c0 T _PROCEDURE_LINKAGE_TABLE_ r _START_ U ___errno U __ctype U __div64 00015562 D _edata 00015562 B _end 43d7 R _etext 3c84 t _fini U _fxstat 3c68 t _init 3ca0 r _lib_version U _lxstat U _xmknod U _xstat U abs U calloc U close U closedir U dgettext U dladm_close U dladm_dev2linkid U dladm_open U dladm_parselink U dladm_phys_info U dladm_walk 2d5c T dlpi_arptype 222c T dlpi_bind 1d6c T dlpi_close 24d0 T dlpi_disabmulti 2c78 T dlpi_disabnotify 24b0 T dlpi_enabmulti 2af4 T dlpi_enabnotify 00015288 d dlpi_errlist 2ce4 T dlpi_fd 25b8 T dlpi_get_physaddr 2e50 T dlpi_iftype 1dc8 T dlpi_info 2d2c T dlpi_linkname 39a8 T dlpi_mactype 000152f8 d dlpi_mactypes 21a0 T dlpi_makelink 1b00 T dlpi_open 2158 T dlpi_parselink 3ca8 r dlpi_primsizes 2598 T dlpi_promiscoff 2578 T dlpi_promiscon 28fc T dlpi_recv 27a4 T dlpi_send 26d4 T dlpi_set_physaddr 2d04 T dlpi_set_timeout 3908 T dlpi_strerror 2d48 T dlpi_style 2384 T dlpi_unbind 1a20 T dlpi_walk U free 1998 t fstat U getenv U gethrtime U getmsg 32fc t i_dlpi_attach 3a28 t i_dlpi_buildsap 32ac t i_dlpi_checkstyle 3bfc t i_dlpi_deletenotifyid 39e8 t i_dlpi_getprimsize 3868 t i_dlpi_msg_common 23f4 t i_dlpi_multi 3bd4 t i_dlpi_notifyidexists 3ac8 t i_dlpi_notifyind_process 2f28 t i_dlpi_open 3384 t i_dlpi_passive 24e8 t i_dlpi_promisc 3460 t i_dlpi_strgetmsg 33e4 t i_dlpi_strputmsg 316c t i_dlpi_style1_open 31f0 t i_dlpi_style2_open 19f0 t i_dlpi_walk_link 3a9c t i_dlpi_writesap U ifparse_ifspec U ioctl 00015240 d libdlpi_errlist 196c t lstat U memcpy U memset 19c4 t mknod U open U opendir U poll U putmsg U readdir U snprintf 1940 t stat U strchr U strerror U strlcpy U strlen ___ zfs-discuss mailing list zfs-discuss@opensolaris.org
Re: [zfs-discuss] Proposed idea for enhancement - damage control
On 02/17/10 02:38 PM, Miles Nordin wrote: copies=2 has proven to be mostly useless in practice. Not true. Take an ancient PC with a mirrored root pool, no bus error checking and non-ECC memory, that flawlessly passes every known diagnostic (SMC included). Reboot with copies=1 and the same files in /usr/lib will get trashed every time and you'll have to reboot from some other media to repair it. Set copies=2 (copy all of /usr/lib, of course) and it will reboot every time with no problem, albeit with a varying number of repaired checksum errors, almost always on the same set of files. Without copies=2 this hardware would be useless (well, it ran Linux just fine), but with it, it has a new lease of life. There is an ancient CR about this, but AFAIK no one has any idea what the problem is or how to fix it. IMO it proves that copies=2 can help avoid data loss in the face of flaky buses and perhaps memory. I don't think you should be able to lose data on mirrored drives unless both drives fail simultaneously, but with ZFS you can. Certainly, on any machine without ECC memory, or buses without ECC (is parity good enough?) my suggestion would be to set copies=2, and I have it set for critical datasets even on machines with ECC on both. Just waiting for the bus that those SAS controllers are on to burp at the wrong moment... Is one counter-example enough? Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] most of my space is gone
On 02/ 6/10 11:21 AM, Thorsten Hirsch wrote: I wonder where ~10G have gone. All the subdirs in / use ~4.5G only (that might be the size of REFER in opensolaris-7), and my $HOME uses 38.5M, that's correct. But since rpool has a size of 15G there must be more than 10G somewhere. Do you have any old Boot Environments (BEs) around? In order to *really* empty /var/pkg/downloads, you have to delete every old BE because /var/pkg/downloads is protected by BE snapshots. Each new BE seems to take 5GB or so in /var/pkg/downloads, so it adds up fast! AFAIK there is no way to get around this. You can set a flag so that pkg tries to empty /var/pkg/downloads, but even though it looks empty, it won't actually become empty until you delete the snapshots, and IIRC you still have to manually delete the contents. I understand that you can try creating a separate dataset and mounting it on /var/pkg, but I haven't tried it yet, and I have no idea if doing so gets around the BE snapshot problem. Sadly this renders the whole concept of BEs rather useless if you boot from smallish SSDs or HDs - my workaround is to keep the old BEs on a backup disks, just like the old UFS days :-) (snapshots work, too). HTH -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] most of my space is gone
On 02/ 6/10 11:50 AM, Thorsten Hirsch wrote: Uhmm... well, no, but there might be something left over. When I was doing an image-update last time, my / ran out of space. I even couldn't beadm destroy any old boot environment, because beadm told me that there's no space left. So what I did was zfs destroy /rpool/ROOT/opensolaris-6. After that opensolaris-6 didn't show up anymore in beadm list. When something similar happened to me when updating to snv111b, I successfully snapshotted the current BE and zfs send/recv it to a different disk, and it freed up around 5GB. No one commented on this (a long time ago now), but it would be interesting to hear from the experts about the possible aftermath of running out of space. Presumably zfs list -t snapshots doesn't show any snapshots at all? If it does, it might be worth while deleting them to see if there are still any uneeded files in /var/pkg. On 02/ 6/10 12:33 PM, Bill Sommerfeld wrote: You can set the environment variable PKG_CACHEDIR to place the cache in an alternate filesystem. Cool! Would you know when this feature became available? Thanks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Home ZFS NAS - 2 drives or 3?
On 01/30/10 05:33 PM, Ross Walker wrote: On Jan 30, 2010, at 2:53 PM, Mark white...@gmail.com wrote: I have a 1U server that supports 2 SATA drives in the chassis. I have 2 750 GB SATA drives. When I install opensolaris, I assume it will want to use all or part of one of those drives for the install. That leaves me with the remaining part of disk 1, and all of disk 2. Question is, how do I best install OS to maximize my ability to use ZFS snapshots and recover if one drive fails? Where were you planning to send the snapshots? There's been a lot of discussion about this on this list, but my solution is to mirror the entire system and zfs send/recv to it periodically to keep a live backup. Alternatively, I guess I could add a small USB drive to use solely for the OS and then have all of the 2 750 drives for ZFS. Is that a bad idea since the OS drive will be standalone? Just install the OS on the first drive and add the second drive to form a mirror. There are wikis and blogs on how to add the second drive to form an rpool mirror. After more than a year or so of experience with ZFS on drive constrained systems, I am convinced that it is a really good idea to keep the root pool and the data pools separate. AFAIK you could set up two slices on each disk and mirror the results. But actually I'm not sure why you shouldn't use your USB drive for root pool idea. If it breaks you simply reinstall (or restore it from a snapshot on your data pool after booting from a CD). I suppose you could mirror the USB drive, too, but if you can stand the downtime after a failure, that probably isn't necessary. Of course, SSDs are getting pretty cheap in bootable sizes and will probably last forever if you don't swap to them, and that would be an even better solution. USB SSD thumb drives seem to be quite cheap these days. The you'd have a full-disk mirrored data pool and a fast bootable OS pool; if you go the SSD route I'd go for at least 32GB. Of course you could get a 1TB USB drive to boot from, and use it to keep a backup of the data pool, but if it failed, you'd be SOL until you replaced it. IMO that would be the best 3-disk solution. Should be interesting to hear from the gurus about this... Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Panic running a scrub
On 01/20/10 04:27 PM, Cindy Swearingen wrote: Hi Frank, I couldn't reproduce this problem on SXCE build 130 by failing a disk in mirrored pool and then immediately running a scrub on the pool. It works as expected. The disk has to fail whilst the scrub is running. It has happened twice now, once with the bottom half of the mirror, and again with the top half. Any other symptoms (like a power failure?) before the disk went offline? It is possible that both disks went offline? Neither. The system is on a pretty beefy UPS, and one half of the mirror was definitely online (zpool status just before panic showed one disk offline and the pool as degraded). We would like to review the crash dump if you still have it, just let me know when its uploaded. Do you need the unix.0, vmcore.0 or both? I'll add either or both as attachments to newly created Bug 14012, Panic running a scrub, when you let me know which one(s) you want. Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Panic running a scrub
On 01/20/10 05:55 PM, Cindy Swearingen wrote: Hi Frank, We need both files. The vmcore is 1.4GB. An http upload is never going to complete. Is there an ftp-able place to send it, or can you download it if I post it somewhere? Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Panic running a scrub
On 01/20/10 04:27 PM, Cindy Swearingen wrote: Hi Frank, I couldn't reproduce this problem on SXCE build 130 by failing a disk in mirrored pool and then immediately running a scrub on the pool. It works as expected. As noted, the disk mustn't go offline until well after the scrub has started. There's another wrinkle. There are some COMSTAR iscsi targets on this pool. If there are no initiators accessing any of them, the scrub completes with no errors after 6 hours. If one specific target is active, the panic ensues reproducibly at about 5h30m or so. The precise configuration has 2 disks on one LSI controller as a mirrored pool (whole disks - no slices). Around 750GB of 1.3TB was in use when the most recent iscsi target was created. The pool is read-mostly, so it probably isn't fragmented. The zvol has copies=1; compression off (no dedupe with snv124). The initiator is VirtualBox running on Fedora C10 on AMD64 and the target disk has 32 bit Fedora C12 installed as whole disk, which I believe is EFI. To reproduce this might require setting up a COMSTAR iscsi target on a mirrored pool, formatting it with an EFI label, and then running a scrub. Another, similar, target has OpenSolaris installed on it, and it doesn't seem to cause a panic on a scrub if it is running; AFAIK it doesn't use EFI, but I have not run a scrub with it active since converting to COMSTAR either. This wouldn't explain why one or the other disk randomly goes offline and it may be a red herring. But the scrub now runs to completion just as it always has. Since I can't get FC12 to boot from the EFI disk in VirtualBox, I may reinstall FC12 without EFI and see if that makes a difference, but it is an extremely slow process since it takes almost 6 hours for the panic to occur each time and there's no practical way to relocate the zvol to the start of the pool. HTH -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Panic running a scrub
This is probably unreproducible, but I just got a panic whilst scrubbing a simple mirrored pool on scxe snv124. Evidently on of the disks went offline for some reason and shortly thereafter the panic happened. I have the dump and the /var/adm/messages containing the trace. Is there any point in submitting a bug report? The panic starts with: Jan 19 13:27:13 host6 ^Mpanic[cpu1]/thread=2a1009f5c80: Jan 19 13:27:13 host6 unix: [ID 403854 kern.notice] assertion failed: 0 == zap_update(dp-dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_SCRUB_BOOKMARK, sizeof (uint64_t), 4, dp-dp_scrub_bookmark, tx), file: ../../common/fs/zfs/dsl_scrub.c, line: 853 FWIW when the system came back up, it resilvered with no problem and now I'm rerunning the scrub. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fwd: The 100, 000th beginner question about a zfs server
On 11/23/09 10:10 AM, David Dyer-Bennet wrote: Is there enough information available from system configuration utilities to make an automatic HCL (or unofficial HCL competitor) feasible? Someone could write an application people could run which would report their opinion on how well it works, plus the self-reported identity of all key components? (It could report uptime, too, as one very small objective rating of stability.) IIRC, the HCL doesn't really talk about applications. We have some really flaky PCs that run Open Solaris beautifully and their uptime is measured in months (basically only new releases or long power cuts make them come down). Would I recommend them for a ZFS based server? Not a chance! But they make super reliable X-Terminals... As Richard Elling has pointed out so eloquently, a reliable storage system has to be engineered to minimize or eliminate SPoFS, and I doubt you'll ever find that on an HCL, which really serves a different purpose, IMO. Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedupe question
Got some out-of-curiosity questions for the gurus if they have time to answer: Isn't dedupe in some ways the antithesis of setting copies 1? We go to a lot of trouble to create redundancy (n-way mirroring, raidz-n, copies=n, etc) to make things as robust as possible and then we reduce redundancy with dedupe and compression :-). What would be the difference in MTTDL between a scenario where dedupe ratio is exactly two and you've set copies=2 vs. no dedupe and copies=1? Intuitively MTTDL would be better because of the copies=2, but you'd lose twice the data when DL eventually happens. Similarly, if hypothetically dedupe ratio = 1.5 and you have a two-way mirror, vs. no dedupe and a 3 disk raidz1, which would be more reliable? Again intuition says the mirror because there's one less device to fail, but device failure isn't the only consideration. In both cases it sounds like you might gain a bit in performance, especially if the dedupe ratio is high because you don't have to write the actual duplicated blocks on a write and on a read you are more likely to have the data blocks in cache. Does this make sense? Maybe there are too many variables, but it would be so interesting to hear of possible decision making algorithms. A similar discussion applies to compression, although that seems to defeat redundancy more directly. This analysis requires good statistical maths skills! Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs code and fishworks fork
On 10/28/09 10:18 AM, Tim Cook wrote: If Nexenta was too expensive, there's nothing Sun will ever offer that will fit your price profile. Home electronics is not their business model and never will be. True, but this was discussed that on a different thread some time ago. Sun's prices on X86s are actually quite competitive if you can even find a comparable machine (i.e, with ECC on buses and memory). Given the Google report on memory failures that Richard Elling dug up a while ago, surely no one in their right mind would want to run anything the least bit important on a machine without such ECC, and I doubt you could configure a decent file server /new/ for less than $2K. If you can, I'm sure we'd all like to hear about it! However, you are certainly correct that Sun's business model isn't aimed at retail, although one wonders about the size of the market for robust SOHO/Home file/media servers that no one seems to be addressing right now (well, Apple, maybe, although they are not explicit about it and they don't offer ZFS...). Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] iscsi/comstar performance
On 10/13/09 18:35, Albert Chin wrote: Maybe this will help: http://mail.opensolaris.org/pipermail/storage-discuss/2009-September/007118.html Well, it does seem to explain the scrub problem. I think it might also explain the slow boot and startup problem - the VM only has 564M available, and it is paging a bit. Doing synchronous i/o for swap makes no sense. Is there an official way to disable this behavior? Does anyone know if the old iscsi system is going to stay around, or will COMSTAR replace it at some point? The 64K metadata block at the start of each volume is a bit awkward, too. - it seems to throw VBox into a tizzy when (failing to) boot MSWXP. The options seem to be a) stay with the old method and hope it remains supported b) figure out a way around the COMSTAR limitations c) give up and use NFS Using ZFS as an iscsi backing store for VirtualBox images seemed like a great idea, so simple to maintain and robust, but COMSTAR seems to have sand-bagged it a bit. The performance was quite acceptable before but it is pretty much unusable this way. Any ideas would be much appreciated Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mount ZFS on Dual Boot Machine (open)solaris
On 10/15/09 23:31, Cameron Jones wrote: by cross-mounting do you mean mounting the drives on 2 running OS's? that wasn't really what i was looking for but nice to know the option is there, even tho not recommended! No, since you really can't run two OSs at the same time unless you use zones. Maybe someone more expert than I could comment on the idea of running OpenSolaris on a Solaris 10 or sxce host - e.g., in the case of sxce, if they were both, say snv124? my only real aim was to have the 3 disks accessible when booting into either OS so i could share archived data between them. That's what you should do (and I do it all the time). Put your user data a separate pool and import only that on both OS instances. So in your case, install OpenSolaris in a 32GB or more slice 0 partition of the mirror and /export on (say) slice 1. My data pool is called space and it has a number of file systems most of which are mounted on /export (e.g., /export/home/userz for user userz. You could do this by zfs snap of the OpenSolaris rpool from Solaris, and then zfs recv after running format (follow the guide for restoring a zfs rpool at http://docs.sun.com/app/docs/doc/819-5461/ghzur?a=view). it sounds like i shouldn't have any problem cold-cross-mounting :) although does bug 11358 only apply to opensolaris or would it also be possible to apply to solaris 10 too? Not sure. sxce and Open Solaris both do the dreaded archive update, so AFAIK Solaris 10 would do it too, possibly with bad consequences. A workaround would be to make sure the other rpool is not mounted when you reboot, but one whoops and you might be toast. Better to keep data and OS separate. Then you can do zfs snaps for rpool backups and something different if you like for user data backups. also i thought i read in the doco that ZFS assigns an id to each drive which is unique to the OS - if i try to mount it into another OS would this id keep changing each time i switch? AFAIK it doesn't. I have sxce and OpenSolaris running alternately on one host and they mount the data pool with no problems at all. I no longer even try to cross mount the rpools because my OpenSolaris installs kept getting trashed by 11358, but at that time sxce was on UFS. I believe the ids are assigned when the pool is created, so if you zfs recv an rpool from another host with an otherwise identical configuration, it will try (and correctly fail) to mount a zombie data pool when you boot it. I assume the id is ignored on the root pool at boot time or it wouldn't be able to boot at all. Undoubtedly a guru will chip in here if this is incorrect :-) HTH -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mount ZFS on Dual Boot Machine (open)solaris
On 10/16/09 09:29, I wrote: I assume the id is ignored on the root pool at boot time or it wouldn't be able to boot at all. Undoubtedly a guru will chip in here if this is incorrect :-) Of course this was hogwash. You create the pool before receiving the snapshot, so the ID is local. One of the many nice things about ZFS is that it is so logically consistent. I'd never want to go back! Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] primarycache and secondarycache properties on Solaris 10 u8
IIRC the trigger for this thread was the suggestion that primarycache=none be set on datasets used for swap. Presumably swap only gets used when memory is low or exhausted, so it would it be correct to say that it wouldn't make any sense for swap to be in /any/ cache? If this isn't what primarycache=none means, shouldn't there be a disable-cache-entirely flag for datasets used for swap? I guess reads from swap must be buffered somewhere, so it would be an optimization to have such reads buffered in a read cache. But wouldn't the read cache be real small at this point? It's enough to make your head spin :-) -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mount ZFS on Dual Boot Machine (open)solaris
On 10/15/09 20:36, Cameron Jones wrote: My question is tho, since I can boot into either OpenSolaris or Solaris (but not both at the same time obviousvly :) i'd like to be able to mount the other disks into whatever host OS i boot into. Is this possible recommended? Definitely possible. Where do you keep your user data? It isn't clear that there is much utility in cross mounting rpools from Solaris/sxce to Open Solaris; better to keep your user data in one or more separate data pools and to just mount them. That simplifies backups, too. Is there any scope for inconsistency if say i upgrade OpenSolaris with new ZFS versions but continue mounting a mirror in Solaris with old versions? You have to watch out for the gratuitous update-archive problem http://defect.opensolaris.org/bz/show_bug.cgi?id=11358 at reboot. Otherwise AFAIK you just have to be careful. So far ZFS seems to have kept backwards compatibility. Just don't accidentally do a zpool upgrade :-). Because of 11358, I would not recommend cross mounting the rpools. But it isn't clear that that is what you really want to achieve... Many thanks, cam ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] iscsi/comstar performance
After a recent upgrade to b124, decided to switch to COMSTAR for iscsi targets for VirtualBox hosted on AMD64 Fedora C10. Both target and initiator are running zfs under b124. This combination seems unbelievably slow compared to the old iscsi subsystem. A scrub of a local 20GB disk on the target took 16 minutes. A scrub of a 20GB iscsi disk took 106 minutes! It seems to take much longer to boot from iscsi, so it seems to be reading more slowly too. There are a lot of variables - switching to Comstar, snv124, VBox 3.08, etc., but such a dramatic loss of performance probably has a single cause. Is anyone willing to speculate? Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] MPT questions
In an attempt to recycle some old PATA disks, we bought some really cheap PATA/SATA adapters, some of which actually work to the point where it is possible to boot from a ZFS installation (e.g., c1t2d0s0). Not all PATA disks work, just Seagates, it would seem, but not Maxstors. I wonder why? probe-scsi-all sees Seagate but not Maxstor disks plugged into the same adapter. Such disks have proven invaluable as a substitute for rescue CDs until such CDs become possible. The odd thing is that booting from another disk, ZFS can't see the adapted disk even though it is bootable. Could the reason be that there's no /dev/rdsk/c1t2d0, but there are c1t0d0, etc.? Format sees the disk but zpool import doesn't (this is on SPARC sun4u). This isn't at all important, just curious as to why this might be and why zpool import can't see it at all, but zpool create can. Gotta say how happy we are with the MPT driver and the LSI SAS controller - fast and reliable - petabytes of i/o and not a single zfs checksum error! This has little to do with ZFS, but should it be possible to see a PATA CD or DVD connected to an MPT (LSI) SAS controller via one of these adapters? Though I'd ask before forking out for a SATA DVD drive - just hate to put perfectly good drives out for recycling. Maybe someone can recommend a writable BlueRay SAS drive that is known to work with the MPT driver instead... Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best way to convert checksums
On 10/01/09 05:08 AM, Darren J Moffat wrote: In the future there will be a distinction between the local and the received values see the recently (yesterday) approved case PSARC/2009/510: http://arc.opensolaris.org/caselog/PSARC/2009/510/20090924_tom.erickson Currently non-recursive incremental streams send properties and full streams don't. Will the p flag reverse its meaning for incremental streams? For my purposes the current behavior is the exact opposite of what I need and it isn't obvious that the case addresses this peculiar inconsistency without going through a lot of hoops. I suppose the new properties can be sent initially so that subsequent incremental streams won't override the possibly changed local properties, but that seems so complicated :-). If I understand the case correctly, we can now set a flag that says ignore properties sent by any future incremental non-recursive stream. This instead of having a flag for incremental streams that says don't send properties. What happens if sometimes we do and sometimes we don't? Sounds like a static property when a dynamic flag is really what is wanted and this is a complicated way of working around a design inconsistency. But maybe I missed something :-) So what would the semantics of the new p flag be for non-recursive incremental streams? Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
On 09/29/09 10:23 PM, Marc Bevand wrote: If I were you I would format every 1.5TB drive like this: * 6GB slice for the root fs As noted in another thread, 6GB is way too small. Based on actual experience, an upgradable rpool must be more than 20GB. I would suggest at least 32GB; out of 1.5TB that's still negligible. Recent release notes for image-update say that at least 8GB free is required for an update. snv111b as upgraded from a CD installed image takes 11GB without any user applications like Firefox. Note also that a nominal 1.5TB drive really only has 1.36TB of actual space as reported by zfs. Can't speak to the 12-way mirror idea, but if you go this route you might keep some slices for rpool backups. I have found having a disk with such a backup invaluable... How do you plan to do backups in general? Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Would ZFS work for a high-bandwidth video SAN?
On 09/30/09 12:59 PM, Marc Bevand wrote: It depends on how minimal your install is. Absolutely minimalist install from live CD subsequently updated via pkg to snv111b. This machine is an old 32 bit PC used now as an X-terminal, so doesn't need any additional software. It now has a bigger slice of a larger pair of disks :-). snv122 also takes around 11GB after emptying /var/pkg/download. # uname -a SunOS host8 5.11 snv_111b i86pc i386 i86pc Solaris # df -h FilesystemSize Used Avail Use% Mounted on rpool/ROOT/opensolaris-2 34G 13G 22G 37% / There's around 765GB in /var/pkg/download that could be deleted, and 1GB's worth of snapshots left by previous image-updates, bringing it down to around 11GB. consistent with a minimalist SPARC snv122 install with /var/pkg/download emptied and all but the current BE and all snapshots deleted. The OpenSolaris install instructions recommend 8GB minimum, I have It actually says 8GB free space required. This is on top of the space used by the base installation. This 8GB makes perfect sense when you consider that the baseline has to be snapshotted, and new code has to be downloaded and installed in a way that can be rolled back. I can't explain why the snv111b baseline is 11GB vs. the 6GB of the initial install, but this was a default install followed by default image-updates. one OpenSolaris 2009.06 server using about 4GB, so I thought 6GB would be sufficient. That said I have never upgraded the rpool of this server, but based on your commends I would recommend an rpool of 15GB to the original poster. The absolute minimum for an upgradable rpool is 20GB, for both SPARC and X86. This assumes you religiously purge all unnecessary files (such as /var/pkg/download) and keep swap, /var/dump, /var/crash and /opt on another disk. You *really* don't want to run out of space doing an image-update. The result is likely to require a restore from backup of the rpool, or at best, loss of some space that seems to vanish down a black hole. Technically, the rpool was recovered from a baseline snapshot several times onto a 20GB disk until I figured out empirically that 8GB of free space was required for the image-update. I really doubt your mileage will vary. Prudence says that 32GB is much safer... Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OS install question
On 09/28/09 12:40 AM, Ron Watkins wrote: Thus, im at a loss as to how to get the root pool setup as a 20Gb slice 20GB is too small. You'll be fighting for space every time you use pkg. From my considerable experience installing to a 20GB mirrored rpool, I would go for 32GB if you can. Assuming this is X86, couldn't you simply use fdisk to create whatever partitions you want and then install to one of them? Than you should be able to create the data pool using another partition. You might need to use a weird partition type temporarily. On SPARC there doesn't seem to be a problem using slices for different zpools, in fact it insists on using a slice for the root pool. Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fixing Wikipedia tmpfs article (was Re: Which directories must be part of rpool?)
Trying to move this to a new thread, although I don't think it has anything to do with ZFS :-) On 09/28/09 08:54 AM, Chris Gerhard wrote: TMPFS was not in the first release of 4.0. It was introduced to boost the performance of diskless clients which no longer had the old network disk for their root file systems and hence /tmp was now over NFS. Whether there was a patch that brought it back into 4.0 I don't recall but I don't think so. 4.0.1 would have been the first release that actually had it. --chris On 09/28/09 03:00 AM, Joerg Schilling wrote: I am not sure whether my changes will be kept as wikipedia prefers to keep badly quoted wrong information before correct information supplied by people who have first hand information. They actually disallow first hand information. Everything on Wikipedia is supposed to be confirmed by secondary or tertiary sources. That's why I asked if there was any supporting documentation - papers, manuals, proceedings, whatever, that describe the introduction of tmpfs before 1990. If you were to write a personal page (in Wikipedia if you like) that describes the history of tmpfs, then you could refer to it in the tmpfs page as a secondary source. Actually, I suppose if it was in the source code itself, that would be pretty irrefutable! http://en.wikipedia.org/wiki/Wikipedia:Reliable_sources Wikipedia also has a lofi page (http://en.wikipedia.org/wiki/Lofi) that redirects to loop mount. It has no historical section at all... There is no fbk (file system) page. Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OS install question
On 09/28/09 01:22 PM, David Dyer-Bennet wrote: That seems truly bizarre. Virtualbox recommends 16GB, and after doing an install there's about 12GB free. There's no way Solaris will install in 4GB if I understand what you are saying. Maybe fresh off a CD when it doesn't have to download a copy first, but the reality is 16GB is not possible unless you don't want ever to to an image update. What version are you running? Have you ever tried pkg image-update? # uname -a SunOS host8 5.11 snv_111b i86pc i386 i86pc Solaris # df -h Filesystem Size Used Avail Use% Mounted on rpool/ROOT/opensolaris-2 34G 13G 22G 37% / # du -sh /var/pkg/download/ 762M/var/pkg/download/ this after deleting all old BEs and all snapshots but not emptying /var/pkg/download; swap/boot are on different slices. SPARC is similar; snv122 takes 11Gb after deleting old BEs, all snapshots, *and* /var/pkg/downloads; *without* /opt, swap, /var/crash, /var/dump, /var/tmp, /var/run and /export... AFAIK It is absolutely impossible to do a pkg image-update (say) from snv111b to snv122 without at least 9GB free (it says 8GB in the documentation). If the baseline is 11GB, you need 20GB for an install, and that leaves you zip to spare. Obvious reasons include before and after snaps, download before install, and total rollback capability. This is all going to cost some space. I believe there is a CR about this, but IMO when you can get 2TB of disk for $200 it's hard to complain. 32GB of SSD is not unreasonable and 16GB simply won't hack it. All the above is based on actual and sometimes painful experience. You *really* don't want to run out of space during an update. You'll almost certainly end up restoring your boot disk if you do and if you don't, you'll never get back all the space. Been there, done that... Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
On 09/27/09 03:05 AM, Joerg Schilling wrote: BTW: Solaris has tmpfs since late 1987. Could you fix the Wikipedia article? http://en.wikipedia.org/wiki/TMPFS it first appeared in SunOS 4.1, released in March 1990 It is a de-facto standard since then as it e.g. helps to reduce compile times. You bet! Provided the compiler doesn't use /var/tmp as IIRC early versions of gcc once did... -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Fixing Wikipedia tmpfs article (was Re: Which directories must be part of rpool?)
On 09/27/09 11:25 AM, Joerg Schilling wrote: Frank Middletonf.middle...@apogeect.com wrote: Could you fix the Wikipedia article? http://en.wikipedia.org/wiki/TMPFS it first appeared in SunOS 4.1, released in March 1990 It appeared with SunOS-4.0. The official release was probably Februars 1987, but there have been betas before IIRC. Do you have any references one could quote so that the Wikipedia article can be corrected? The section on Solaris is rather skimpy and could do with some work... AFAIK this has nothing to do with ZFS. I wonder if we should move it to another discussion. Apologies to the OP for hijacking your thread, although I think the original question has been answered only too thoroughly :-) Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
On 09/25/09 09:58 PM, David Magda wrote: The contents of /var/tmp can be expected to survive between boots (e.g., /var/tmp/vi.recover); /tmp is nuked on power cycles (because it's just memory/swap): Yes, but does mapping it to /tmp have any issues regarding booting or image-update in the context of this thread? IMO nuking is a good thing - /tmp and /var/tmp get really cluttered up after a few months, the downside of robust hardware and software :-). Not sure I really care about recovering vi edits in the case of UPS failure... If a program is creating and deleting large numbers of files, and those files aren't needed between reboots, then it really should be using /tmp. Quite. But some lazy programmer of 3rd party software decided to use the default tmpnam() function and I don't have access to the code :-(. tmpnam() The tmpnam() function always generates a file name using the path prefix defined as P_tmpdir in the stdio.h header. On Solaris systems, the default value for P_tmpdir is /var/tmp. Similar definition for [/tmp] Linux FWIW: Yes, but unless they fixed it recently (=RHFC11), Linux doesn't actually nuke /tmp, which seems to be mapped to disk. One side effect is that (like MSWindows) AFAIK there isn't a native tmpfs, so programs that create and destroy large numbers of files run orders of magnitude slower there than on Solaris - assuming the application doesn't use /var/tmp for them :-). Compilers and code generators are typical of applications that do this, though they don't usually do synchronous i/o as said programmer appears to have done. I suppose /var/tmp on zfs would never actually write these files unless they were written synchronously. In the context of this thread, for those of us with space constrained boot disks/ssds, is it OK to map /var/tmp to /tmp, and /var/crash, /var/dump, and swap to a separate data pool in the context of being able to reboot and install new images? I've been doing so for a long time now with no problems that I know of. Just wondering what the gurus think... Havn't seen any definitive response regrading /opt, which IMO should be a good candidate since the installer makes it a separate fs anyway. /usr/local can definitely be kept on a separate pool. I wouldn't move /root. I keep a separate /export/home/root and have root cd to it via a script in /root that also sets HOME, although I noticed on snv123 that logging on as root succeeded even though it couldn't find bash (defaulted to using sh). This may be a snv123 bug, but it is a huge improvement on past behavior. I daresay logging on as root might also work if root's home directory was awol. Haven't tried it... Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
On 09/26/09 12:11 PM, Toby Thain wrote: Yes, but unless they fixed it recently (=RHFC11), Linux doesn't actually nuke /tmp, which seems to be mapped to disk. One side effect is that (like MSWindows) AFAIK there isn't a native tmpfs, ... Are you sure about that? My Linux systems do. http://lxr.linux.no/linux+v2.6.31/Documentation/filesystems/tmpfs.txt OK, so you can mount /dev/shm on /tmp and /var/tmp, but that's not the default, at least as of RHFC10. I have files in /tmp going back to Feb 2008 :-). Evidently, quoting Wikipedia, tmpfs is supported by the Linux kernel from version 2.4 and up. http://en.wikipedia.org/wiki/TMPFS, FC1 6 years ago. Solaris /tmp has been a tmpfs since 1990... Now back to the thread... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
On 09/26/09 05:25 PM, Ian Collins wrote: Most of /opt can be relocated There isn't much in there on a vanilla install (X86 snv111b) # ls /opt DTT SUNWmlib http://www.sun.com/bigadmin/features/articles/nvm_boot.jsp You pretty much answered the OP with this link. Thanks for posting it! Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] White box server for OpenSolaris
On 09/25/09 11:08 AM, Travis Tabbal wrote: ... haven't heard if it's a known bug or if it will be fixed in the next version... Out of courtesy to our host, Sun makes some quite competitive X86 hardware. I have absolutely no idea how difficult it is to buy Sun machines retail, but it seems they might be missing out on an interesting market - robust and scalable SOHO servers for the DYI gang - certainly OEMS like us recommend them, although there doesn't seem to be a single-box file+application server in the lineup which might be a disadvantage to some. Also, assuming Oracle keeps the product line going, we plan to give them a serious look when we finally have to replace those sturdy old SPARCS. Unfortunately there aren't entry level SPARCs in the lineup, but sadly there probably isn't a big enough market to justify them and small developers don't need the big iron. It would be interesting to hear from Sun if they have any specific recommendations for the use of Suns for the DYI SOHO market; AFAIK it is the profits from hardware that are going a long way to support Sun's support of FOSS that we are all benefiting from, and there's a good bet that OpenSolaris will run well on Sun hardware :-) Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
On 09/25/09 04:44 PM, Lori Alt wrote: rpool rpool/ROOT rpool/ROOT/snv_124 (or whatever version you're running) rpool/ROOT/snv_124/var (you might not have this) rpool/ROOT/snv_121 (or whatever other BEs you still have) rpool/dump rpool/export rpool/export/home rpool/swap Unless you machine is so starved for physical memory that you couldn't possibly install anything, AFAIK you can always boot without dump and swap, so even if your data pool can't be mounted, you should be OK. I've done many a reboot and pkg image-update with dump and swap inaccessible. Of course with no dump, you won't get, well, a dump, after a panic... Having /usr/local (IIRC this doesn't even exist in a straight OpenSolaris install) in a shared space on your data pool is quite useful if you have more than one machine unless you have multiple architectures. Then it turns into the /opt problem. Hiving off /opt does not seem to prevent booting, and having it on a data pool doesn't seem to prevent upgrade installs. The big problem with putting /opt on a shared pool is when multiple hosts have different /opts. Using legacy mounts seems to be the only way around this. Do the gurus have a technical explanation why putting /opt in a different pool shouldn't work? /var/tmp is a strange beast. It can get quite large, and be a serious bottleneck if mapped to a physical disk and used by any program that synchronously creates and deletes large numbers of files. I have had no problems mapping /var/tmp to /tmp. Hopefully a guru will step in here and explain why this is a bad idea, but so far no problems... A 32GB SSD is marginal for a root pool, so shrinking it as much as possible makes a lot of sense until bigger SSDS become cost effective (not long from now I imagine). But if you already have a 16GB or 32GB SSD, or a dedicated boot disk = 32GB than you can be SOL unless you are very careful to empty /var/pkg/download, which doesn't seem to get emptied even if you set the magic flag. HTH -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] backup disk of rpool on solaris
On 09/20/09 03:20 AM, dick hoogendijk wrote: On Sat, 2009-09-19 at 22:03 -0400, Jeremy Kister wrote: I added a disk to the rpool of my zfs root: # zpool attach rpool c1t0d0s0 c1t1d0s0 # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0 I waited for the resilver to complete, then i shut the system down. then i physically removed c1t0d0 and put c1t1d0 in it's place. I tried to boot the system, but it panics: Afaik you can't remove the first disk. You've created a mirror of two disks from either which you may boot the system. BUT the second disk must remain where it is. You can set the bios to boot from it if the first disk fails, but you may not *swap* them. That's my experience also. If you are trying to make a bootable disk to keep on the shelf, there's an excellent example here: http://forums.sun.com/thread.jspa?threadID=5345546 IMO this should go on the wiki. I think it's a great example of the power of ZFS. I can't imagine doing anything like this with so easily with any legacy file system... Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Incremental backup via zfs send / zfs receive
A while back I posted a script that does individual send/recvs for each file system, sending incremental streams if the remote file system exists, and regular streams if not. The reason for doing it this way rather than a full recursive stream is that there's no way to avoid sending certain file systems such as swap, and it would be nice not to always send certain properties such as mountpoint, and there might be file systems you want to keep on the receiving end. The problem with the regular stream is that most of the file system properties (such as mountpoint) are not copied as they are with a recursive stream. This may seem an advantage to some, (e.g., if the remote mountpoint is already in use, the mountpoint seems to default to legacy). However, did I miss anything in the documentation, or would it be worth submitting an RFE for an option to send/recv properties in a non-recursive stream? Oddly, incremental non-recursive streams do seem to override properties, such as mountpoint, hence the /opt problem. Am I missing something, or is this really an inconsistency? IMO non-recursive regular and incremental streams should behave the same way and both have options to send or not send properties. For my purposes the default behavior is reversed for what I would like to do... Thanks -- Frank Latest version of the script follows; suggestions for improvements most welcome, especially the /opt problem where source and destination hosts have different /opts (host6-opt and host5-opt here) - see ugly hack below (/opt is on the data pool because the boot disks - soon to be SSDs - are filling up): #!/bin/bash # # backup is the alias for the host receiving the stream # To start, do a full recursive send/receive and put the # name of the initial snapshot in cur_snap, In case of # disasters, the older snap name is saved in cur_snap_prev # and there's an option not to delete any snapshots when done. # if test ! -e cur_snap; then echo cur_snap not found; exit; fi P=`cat cur_snap` mv -f cur_snap cur_snap_prev T=`date +%Y-%m-%d:%H:%M:%S` echo $T cur_snap echo snapping to sp...@$t echo Starting backup from sp...@$p to sp...@$t at `date` snap_time zfs snapshot -r sp...@$t echo snapshot done for FS in `zfs list -H | cut -f 1` do RFS=`ssh backup zfs list -H $FS 2/dev/null | cut -f 1` case $FS in space/file system to skip here) echo skipping $FS ;; *) if test $RFS; then if [ $FS = space/swap ]; then echo skipping $FS else echo do zfs send -i $...@$p $...@$t I ssh backup zfs recv -vF $RFS zfs send -i $...@$p $...@$t | ssh backup zfs recv -vF $RFS fi else echo do zfs send $...@$t I ssh backup zfs recv -v $FS zfs send $...@$t | ssh backup zfs recv -v $FS fi if [ $FS = space/host5-opt ]; then echo do ssh backup zfs set mountpoint=legacy space/host5-opt ssh backup zfs set mountpoint=legacy space/host5-opt fi ;; esac done echo --Ending backup from sp...@$p to sp...@$t at `date` snap_time DOIT=1 while [ $DOIT -eq 1 ] do read -p Delete old snapshot y/n REPLY REPLY=`echo $REPLY | tr '[:upper:]' '[:lower:]'` case $REPLY in y) ssh backup zfs destroy -r sp...@$p echo Remote sp...@$p destroyed zfs destroy -r sp...@$p echo Local sp...@$p destroyed DOIT=0 ;; n) echo Skipping: echossh backup zfs destroy -r sp...@$p echozfs destroy -r sp...@$p DOIT=0 ;; *) echo Please enter y or n ;; esac done ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Reboot seems to mess up all rpools
[Originally posted to indiana-discuss] On certain X86 machines there's a hardware/software glitch that causes odd transient checksum failures that always seem to affect the same files even if you replace them. This has been submitted as a bug: Bug 11201 - Checksum failures on mirrored drives - now CR 6880994 P4 kernel/zfs Checksum failures on mirrored drives We have SPARC based ZFS servers where we keep a copy of this rpool so we can more easily replace the damaged files (usually system libraries). In addition, to check the validity of the zfs send stream of the ZFS server rpool, there's a copy of that as well. For good reasons there might be several rpools in this data pool at any given time. When the ZFS server is rebooted, it tries to update the boot archive of every rpool it can find, including the X86 archive, which fails because it's the wrong architecture. The ZFS server is currently at snv103, but the backup server has an additional disk with snv111b on it, which was recently updated to snv122. However, if you boot snv103 and then reboot, it will also update the snv122 boot archive, rendering snv122 unbootable. All versions up to and including snv122 exhibit this behavior. I'm not sure why updating the boot archive would do this, but surely this is a bug. Reboot should only update it's own archive, and not any ZFS archives at all if it is running from UFS. Before submitting a bug report, I thought I'd check here to see if a) if this is has already been reported, and b) if I have the terminology right. Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Reboot seems to mess up all rpools
Absent any replies to the list, submitted as a bug: http://defect.opensolaris.org/bz/show_bug.cgi?id=11358 Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Raid-Z Issue
On 09/11/09 03:20 PM, Brandon Mercer wrote: They are so well known that simply by asking if you were using them suggests that they suck. :) There are actually pretty hit or miss issues with all 1.5TB drives but that particular manufacturer has had a few more than others. FWIW I have a few of them in mirrored pools and they have been working flawlessly for several months now with LSI controllers. The workload is bursty - mostly MDA driven code generation and compilation of 1M KLoC applications and they work well enough for that. Also by now probably a PetaByte of zfs send/recvs and many scrubs, never a timeout and never a checksum error. They are all rev CC1H. So your mileage may vary, as they say... Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Using ZFS iscsi disk as ZFS disk
Is there any reason why an iscsi disk could not be used to extend an rpool? It would be pretty amazing if it could but I thought I'd try it anyway :-) The 20GB disk I am using to try ZFS booting on SPARC ran out of space doing an image update to snv122, so I thought I'd try extending it with an iscsi disk on the data pool (same machine, different disks). After formatting the disk with an SMI label, trying to add the new disk results in # zpool add rpool c4t600144F04AA7AA68d0 cannot label 'c4t600144F04AA7AA68d0': EFI labeled devices are not supported on root pools. # Should it be possible to do this (SPARC snv103), and if so, how to make it work? Use a different iscsi host maybe? Perhaps I should have used a plain file, or could it be impossible? Maybe I should split the UFS boot mirror and try this on one of those disks instead :-( Separately, I have succeeded in using an iscsi disk (same hardware) as a ZFS disk in an AMD64 Virtualbox, so it is possible, although /var/adm/messages is full of messages like this: Corrupt label; wrong magic number even though the disk works just fine in the VM. Any hints much appreciated Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Incremental backup via zfs send / zfs receive
On 09/07/09 07:29 PM, David Dyer-Bennet wrote: Is anybody doing this [zfs send/recv] routinely now on 2009-6 OpenSolaris, and if so can I see your commands? Wouldn't a simple recursive send/recv work in your case? I imagine all kinds of folks are doing it already. The only problem with it, AFAIK, is when a new fs is created locally without also being created on the backup disk (unless this now works with zfs V3). The following works with snv103. If it works there, it should work with 2009-6. The script method may have the advantage of not destroying file systems on the backup that don't exist on the source, but I have not tested that. ZFS send/recv is pretty cool, but at least with older versions, it takes some tweaking to get right. Rather than send to a local drive, I'm sending to a live remote system, which is some ways is more complicated since there might be things like /opt and xxx/swap that you might not want to even send. Finally, at least with ZFS version 3, an incremental send of a filesystem that doesn't exist on the far side doesn't work either, so one needs to test for that. Given this, a simple send of a recursive snapshot AFAIK isn't going to work. I am no bash expert, so this script probably can do with lots of improvements, but it seems to do what I need it to do. You would have to extensively modify it for your local needs; you would have to remove the ssh backup and fix it to receive to your local disk. I include it here in response to your request in the hope that it might be useful. Note, as written, it will create space/swap but it won't send updates. The pool I'm backing up is called space and the target host is called backup, an alias in /etc/hosts. When the machines switch roles, I edit both /etc/hosts so the stream can go the other way. This script probably won't work for rpools; there is lots of documentation about that in previous posts to this list. My solution to the rpool problem is to receive it locally to an alternate root and then send that, but this works here if the rpool isn't your only pool, of course. If any zfs/bash gurus out there can suggest improvements, they would be much appreciated, especially ways to deal with the /opt problem (which probably relates to the general rpool question). Currently the /opts for each host are set mountpoint=legacy, but that is not a great solution :-(. Cheers -- Frank #!/bin/bash P=`cat cur_snap` rm -f cur_snap T=`date +%Y-%m-%d:%H:%M:%S` echo $T cur_snap echo snapping to sp...@$t zfs snapshot -r sp...@$t echo snapshot done for FS in `zfs list -H | cut -f 1` do RFS=`ssh backup zfs list -H $FS 2/dev/null | cut -f 1` if test $RFS; then if [ $FS = space/swap ]; then echo skipping $FS else echo do zfs send -i $...@$p $...@$t I ssh backup zfs recv -vF $RFS zfs send -i $...@$p $...@$t | ssh backup zfs recv -vF $RFS fi else echo do zfs send $...@$t I ssh backup zfs recv -v $FS zfs send $...@$t | ssh backup zfs recv -v $FS fi done ssh backup zfs destroy -r sp...@$p zfs destroy -r sp...@$p ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Yet another where did the space go question
An attempt to pkg image-update from snv111b to snv122 failed miserably for a number of reasons which are probably out of scope here. Suffice it to say that it ran out of disk space after the third attempt. Before starting, I was careful to make a baseline snapshot, but rolling back to that snapshot has not freed up all the space - this on a small disk dedicated to experimenting with ZFS booting on SPARC. The disk is nominally 20GB. After zfs rollback -rR rpool/ROOT/opensola...@baseline from a different BE (snv103 booted from UFS) # zpool list NAMESIZE USED AVAILCAP HEALTH ALTROOT rpool 17.5G 10.1G 7.39G57% ONLINE - space 1.36T 314G 1.05T22% ONLINE - # zfs list -r -o space rpool NAMEAVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD rpool 7.11G 10.1G 0 20K 0 10.1G rpool/ROOT 7.11G 10.1G 0 18K 0 10.1G rpool/ROOT/opensolaris 7.11G 10.1G 942K 10.0G 0 68.6M rpool/ROOT/opensolaris/opt 7.11G 68.6M 0 68.6M 0 0 Before the aborted pkg image-updates, the rpool took around 6GB, so 4GB has vanished somewhere. Even if pkg put it's updates in a well hidden place (there are no hidden directories in / ), surely the rollback should have deleted them. # zfs list -t snapshot NAME USED AVAIL REFER MOUNTPOINT rp...@baseline 0 -20K - rpool/r...@baseline 0 -18K - rpool/ROOT/opensola...@baseline 718K - 10.0G - rpool/ROOT/opensolaris/o...@baseline 0 - 68.6M - The rollback obviously worked because afterwards even the pkg set-publisher changes were gone, and other post-snapshot files were deleted. If the worst come to the worst I could obviously save the snapshot to a file and then restore it, but it sure would be nice to know where the 4GB went. BTW one image-update failure occurred because there was an X86 rpool mounted to an alternate root, and pkg somehow found it and seemed to get confused about X86 vs. SPARC, insisting on trying to create a menu.lst in /rpool/boot, which, of course, doesn't exist on SPARC. I suppose this should be a bug... Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yet another where did the space go question
Correction On 09/06/09 12:00 PM, I wrote: (there are no hidden directories in / ), Well, there is .zfs, of course, but it is normally hidden, apparently by default on SPARC rpool, but not on X86 rpool or non-rpool pools on either. Hmmm. I don't recollect setting the snapdir property on any pools, ever. - Arrg! It just failed again! # pkg image-update --be-name=snv122 DOWNLOADPKGS FILES XFER (MB) Completed1486/1486 73091/73091 1520.59/1520.59 WARNING: menu.lst file /rpool/boot/menu.lst does not exist, generating a new menu.lst file pkg: Unable to clone the current boot environment. # BE_PRINT_ERR=true beadm create newbe be_get_uuid: failed to get uuid property from BE root dataset user properties. be_get_uuid: failed to get uuid property from BE root dataset user properties. # zfs list -t snapshot | grep newbe rpool/ROOT/opensola...@newbe 30K - 11.9G - rpool/ROOT/opensolaris/o...@newbe0 - 68.6M - So it can create a new BE. So what happened this time? I guess I'll try again with BE_PRINT_ERR=true... Is the get uuid property failure fatal to pkg but not to beadm? Has anyone managed to go from snv111b to snv122 on SPARC? Thanks -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yet another where did the space go question
Near Success! After 5 (yes, five) attempts, managed to do an update of snv111b to snv122, until it ran out of space again. Looks like I need to get a bigger disk... Sorry about the monolog, but there might be someone on this list trying to use pkg on SPARC who, like me, has been unable to subscribe to the indiana list, so an update might be useful to any such person... Perhaps someone who can might forward this to the appropriate list -- the issues are known CR's, but don't seem to be mentioned in the release notes. On 09/06/09 04:55 PM, I wrote: WARNING: menu.lst file /rpool/boot/menu.lst does not exist, generating a new menu.lst file pkg: Unable to clone the current boot environment. 1) If there isn't a directory /rpool/boot, pkg will fail 2) If you try again after mkdir /rpool/boot, it will create menu.1st. If it fails for any reason and you have to restart then: 3) If there is a menu.lst containing opensolaris-1 it will fail again even if you had used be-name=. 4) If you delete menu.lst it will fail - touch it after deleting it (the CRs are ambiguous about this). So to do this upgrade, you must do mkdir /rpool/boot and touch /rpool/boot/menu.lst before you start. It might just work if you do this, but only if you have at least 11GB of space to spare (Google says 8GB). BTW pkg always says /rpool/boot/menu.lst does not exist even if it does. http://defect.opensolaris.org/bz/show_bug.cgi?id=6744 says Fixed in source http://defect.opensolaris.org/bz/show_bug.cgi?id=7880 says accepted. But the fix for 6744 messes up 7880. This is making a SPARC upgrade really painful, especially annoying since SPARC doesn't even use grub (or menu.lst). Cheers -- Frank PS My hat's off to the ZFS and pkg teams! An amazing accomplishment and a few glitches are to be expected. I'm sure there are fixes in the works, but it would seem upgrading to snv122 isn't in the cards unless I get a bigger 3rd boot disk... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
It was someone from Sun that recently asked me to repost here about the checksum problem on mirrored drives. I was reluctant to do so because you and Bob might start flames again, and you did! You both sound very defensive, but of course I would never make an unsubstantiated speculation that you might have vulnerable hardware :-). But in case you do, please don't shoot the messenger... Instead of being negative, how about some conjectures of your own about this?. here's a summary of what is happening: An old machine with mirrored drives and a suspect mobo (maybe not checking PCI parity) gets checksum errors on reboot and scrub. With copies=1 it fails to repair them. With copies=2 it apparently fixes them, but zcksummon shows quite clearly that on a scrub, zfs finds and repairs them again on every scrub, even though scrub shows no errors. Typically these files are system libraries and unless you actually replace them, they are never truly repaired. Although I really don't think this is caused by cosmic rays, are you also saying that PCs without ECC on memory and/or buses will *never* experience a glitch? You obviously don't play the lottery :-) [ZFS errors due to memory hits seem far more likely than winning a 6 ball lottery for typical retail consumer loads] On 09/02/09 06:54 PM, Tim Cook wrote: Define more systems. How many people do you think are on 121? And of Absolutely no idea. Enough, though. those, how many are on the zfs mailing list? And of those, how many Probably - all of them (yes, this is an unsubstantiated speculation). have done a scrub recently to see the checksum errors? Do you have some proof to validate your beliefs? If you had read the thread carefully, you would note that a scrub actually clears the errors (but zcksummon shows that they really aren't cleared). And doesn't the guide tell us to run scrubs frequently? I am sure we all dutifully do so :-). I'd be quite happy to send you the proof. REGARDLESS, had you read all the posts to this thread, you'd know you've already been proven wrong: Wrong about what? Reading posts before they are posted? I have read every post most carefully. Having experienced checksum failures on mirrored drives for 4 months now (and there's a CR against snv115 for a similar problem), what exactly do you think I am trying to prove, or what beliefs? After 4 months of hearing the hardware being blamed for the checksum problem (which is easy to reproduce against snv111b), all I'm doing is agreeing that it is likely triggered by some kind of soft hardware glitch, we just don't know what the glitch might be. The SPoFs on this machine are the disk controller, the PCI bus, and memory, (and cpu, of course). Take your pick. FWIW it always picks on SUNWcsl (libdlpi.so.1) - 3 or 4 times now, and more recently, /usr/share/doc/SUNWmusicbrainz/COPYING.bz2. I am skeptical that the disk controller is picking on certain files, so that leaves memory and the bus. Take your pick. New files get added to the list quite infrequently. But it could also be a pure software bug - some kind of race condition, perhaps. On Wed, Sep 2, 2009 at 11:15 AM, Brent Jones br...@servuhome.net mailto:br...@servuhome.net wrote: I see this issue on each of my X4540's, 64GB of ECC memory, 1TB drives. Rolling back to snv_118 does not reveal any checksum errors, only snc_121 So, the commodity hardware here doesn't hold up, unless Sun isn't validating their equipment (not likely, as these servers have had no hardware issues prior to this build) Exactly. My whole point. Glad to hear that Sun hardware is as reliable as ever! I hope Richard's new and improved zcksummon will shed more light on this... Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
On 09/02/09 05:40 AM, Henrik Johansson wrote: For those of us which have already upgraded and written data to our raidz pools, are there any risks of inconsistency, wrong checksums in the pool? Is there a bug id? This may not be a new problem insofar as it may also affect mirrors. As part of the ancient mirrored drives should not have checksum errors thread, I used Richard Elling's amazing zcksummon script http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon to help diagnose this (thanks, Richard, for all your help). The bottom line is that hardware glitches (as found on cheap PCs without ECC on buses and memory) can put ZFS into a mode where it detects bogus checksum errors. If you set copies=2, it seems to always be able to repair them, but they are never actually repaired. Every time you scrub, it finds a checksum error on the affected file(s) and it pretends to repair it (or may fail if you have copies=1 set). Note: I have not tried this on raidz, only mirrors, where it is highly reproducible. It would be really interesting to see if raidz gets results similar to the mirror case when running zcksummon. Note I have NEVER had this problem on SPARC, only on certain bargain-basement PCs (used as X-Terminals) which as it turns out have mobos notorious for not detecting bus parity errors. If this is the same problem, you can certainly mitigate it by setting copies=2 and actually copying the files (e.g., by promoting a snapshot, which I believe will do this - can someone confirm?). My guess is that snv121 has done something to make the problem more likely to occur, but the problem itself is quite old (predates snv100). Could you share with us some details of your hardware, especially how much memory and if it has ECC orbus parity? Cheers -- Frank On 09/02/09 05:40 AM, Henrik Johansson wrote: Hi Adam, On Sep 2, 2009, at 1:54 AM, Adam Leventhal wrote: Hi James, After investigating this problem a bit I'd suggest avoiding deploying RAID-Z until this issue is resolved. I anticipate having it fixed in build 124. Regards Henrik http://sparcv9.blogspot.com http://sparcv9.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
On 09/02/09 10:01 AM, Gaëtan Lehmann wrote: I see the same problem on a workstation with ECC RAM and disks in mirror. The host is a Dell T5500 with 2 cpus and 24 GB of RAM. Would you know if it has ECC on the buses? I have no idea if or what Solaris does on X86 to check or correct bus errors, but I vaguely remember seeing a thread about it. Asking, because it really does seem to require a hardware problem to make this happen. Did you try zcksummon? Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
On 09/02/09 10:34 AM, Simon Breden wrote: I too see checksum errors ocurring for the first time using OpenSolaris 2009.06 on the /dev package repository at version snv_121. I see the problem occur within a mirrored boot pool (rpool) using SSDs. Hardware is AMD BE-2350 (ECC) processor with 4GB ECC memory on MCP55 chipset, although SATA is using mpt driver on a SuperMicro AOC-USAS-L8i controller card. More here: http://breden.org.uk/2009/09/02/home-fileserver-handling-pool-errors/ Boy, that looks familiar. Did you try zcksummon to see if the checksums are really being fixed? If it is the same problem I encountered, then they are not, even though the scrub says no errors (and the problem goes back before snv100). Your hardware seems pretty beefy, though. Note that iostat -Ene never reported any hard errors in my case even though the mobo was known to have problems, so hard errors do not explain the problem. Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Expanding a raidz pool?
On Sep 2, 2009, at 7:14 PM, rarok wrote: I'm just a casual at ZFS but you want something that now don't exists. The most of the consumers want this but Sun is not interested in that market. To grow a existing RAIDZ just adding more disk to the RAIDZ would be great but at this moment there isn't anything like that. Out of curiosity, what do the folks who want to grow their raidzs do for backups? Is restoring a backup to a newly created enlarged raidz any more dangerous than the rewriting involved in doing it on the fly? Hardware is so cheap these days, why not make a backup raidz server (power it up only to do backups, or better yet, switch to it periodically to make sure it works), and when the time comes to make the raidz bigger, just do it, one server at a time? You can run off the backup whilst the new, larger server is resilvering and have negligable downtime that way. If you are really cheap, get a couple of huge USB drives and do the backups there. Either way, they are important, and zfs send/recv is such a great way of making verifiable backups. Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
On 09/02/09 12:31 PM, Richard Elling wrote: I believe this is a different problem. Adam, was this introduced in b120? Doubtless you are correct as usual. However, if this is a new problem, how did it get through Sun's legendary testing process unless it is (as you have always maintained) triggered by a hardware problem? If so, I believe that any new CR would be regarded as a duplicate of any CR that described the problem you and I researched, even if they have different root causes. Of course this seems to be new as of snv121, so one can only speculate that it might be a latent problem or a new one. Do you think that there are separate mirror vs. raidz issues? There is more work that can be leveraged from zcksummon, perhaps I'll get a few spare moments to test and update the procedure in the next few days. If you think it would be relevant, you know I can reproduce this at will. I wonder if any Sun hardware users have experienced this problem. So far IIRC the only reports are Asus and Dell. Does anyone else recollect the thread about how Solaris does (or does not) do bus error checking on x86? Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How Virtual Box handles the IO
Great to hear a few success stories! We have been experimentally running ZFS on really crappy hardware and it has never lost a pool. Running on VB with ZFS/iscsi raw disks we have yet to see any errors at all. On sun4u with lsi sas/sata it is really rock solid. And we've been going out of our way to break it because of bad experiences with ntfs, ext2 and UFS as well as many disk failures (ever had fsck run amok?). On 07/31/09 12:11 PM, Richard Elling wrote: Making flush be a nop destroys the ability to check for errors thus breaking the trust between ZFS and the data on medium. -- richard Can you comment on the issue that the underlying disks were, as far as we know, never powered down? My understanding is that disks usually try to flush their caches as quickly as possible to make room for more data, so in this scenario things were probably quiet after the guest crash, so likely what ever was in the cache would have been flushed anyway, certainly by the time the OP restarted VB and the guest. Could you also comment on CR 6667683. which I believe is proposed as a solution for recovery in this very rare case? I understand that the ZILs are allocated out of the general pool. Is there a ZIL for the ZILs, or does this make no sense? As the one who started the whole ECC discussion, I don't think anyone has ever claimed that lack of ECC caused this loss of a pool or that it could. AFAIK lack of ECC can't be a problem at all on RAIDZ vdevs, only with single drives or plain mirrors. I've suggested an RFE for the mirrored case to double buffer the writes in this case, but disabling checksums pretty much fixes the problem if you don't have ECC, so it isn't worth pursuing. You can disable checksum per file system, so this is an elegant solution if you don't have ECC memory but you do mirror. No mirror IMO is suicidal with any file system. Has anyone ever actually lost a pool on Sun hardware other than by losing too many replicas or operator error? As you have so eloquently pointed out, building a reliable storage system is an engineering problem. There are a lot of folks out there who are very happy with ZFS on decent hardware. On crappy hardware you get what you pay for... Cheers -- Frank (happy ZFS evangelist) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/27/09 01:27 PM, Eric D. Mudama wrote: Everyone on this list seems to blame lying hardware for ignoring commands, but disks are relatively mature and I can't believe that major OEMs would qualify disks or other hardware that willingly ignore commands. You are absolutely correct, but if the cache flush command never makes it to the disk, then it won't see it. The contention is that by not relaying the cache flush to the disk, VirtualBox caused the OP to lose his pool. IMO this argument is bogus because AFAIK the OP didn't actually power his system down, so the data would still have been in the cache, and presumably have eventually have been written. The out-of-order writes theory is also somewhat dubious, since he was able to write 10TB without VB relaying the cache flushes. This is all highly hardware dependant, and AFAIK no one ever asked the OP what hardware he had, instead, blasting him for running VB on MSWindows. Since IIRC he was using raw disk access, it is questionable whether or not MS was to blame, but in general it simply shouldn't be possible to lose a pool under any conditions. It does raise the question of what happens in general if a cache flush doesn't happen if, for example, a system crashes in such a way that it requires a power cycle to restart, and the cache never gets flushed. Do disks with volatile caches attempt to flush the cache by themselves if they detect power down? It seems that the ZFS team recognizes this as a problem, hence the CR to address it. It turns out that (at least on this almost 4 year old blog) http://blogs.sun.com/perrin/entry/the_lumberjack that the ZILs /are/ allocated recursively from the main pool. Unless there is a ZIL for the ZILs, ZFS really isn't fully journalled, and this could be the real explanation for all lost pools and/or file systems. It would be great to hear from the ZFS team that writing a ZIL, presumably a transaction in it's own right, is protected somehow (by a ZIL for the ZILs?). Of course the ZIL isn't a journal in the traditional sense, and AFAIK it has no undo capability the way that a DBMS usually has, but it needs to be structured so that bizarre things that happen when something as robust as Solaris crashes don't cause data loss. The nightmare scenario is when one disk of a mirror begins to fail and the system comes to a grinding halt where even stop-a doesn't respond, and a power cycle is the only way out. Who knows what writes may or may not have been issued or what the state of the disk cache might be at such a time. -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/25/09 04:30 PM, Carson Gaspar wrote: No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. Why is this so difficult for people to understand? Let me create a simple example for you. Are you sure about this example? AFAIK metadata refers to things like the file's name, atime, ACLs, etc., etc. Your example seems to be more about how a journal works, which has little to do with metatdata other than to manage it. Now if you were too lazy to bother to follow the instructions properly, we could end up with bizarre things. This is what happens when storage lies and re-orders writes across boundaries. On 07/25/09 07:34 PM, Toby Thain wrote: The problem is assumed *ordering*. In this respect VB ignoring flushes and real hardware are not going to behave the same. Why? An ignored flush is ignored. It may be more likely in VB, but it can always happen. It mystifies me that VB would in some way alter the ordering. I wonder if the OP could tell us what actual disks and controller he used to see if the hardware might actually have done out-of-order writes despite the fact that ZFS already does write optimization. Maybe the disk didn't like the physical location of the log relative to the data so it wrote the data first? Even then it isn't onvious why this would cause the pool to be lost. A traditional journalling file system should survive the loss pf a flush. Either the log entry was written or it wasn't. Even if the disk, for some bizarre reason, writes some of the actual data before writing the log, the repair process should undo that, If written properly, it will use the information in the most current complete journal entry to repair the file system. Doing synchs are devastating to performance so usually there's an option to disable them, at the known risk of losing a lot more data. I've been using SPARCs and Solaris from the beginning. Ever since UFS supported journalling, I've never lost a file unless the disk went totally bad, and none since mirroring. Didn't miss fsck either :-) Doesn't ZIL effectively make ZFS into a journalled file system (in another thread, Bob Friesenhahn says it isn't, but I would submit that the general opinion is correct that it is; log and journal have similar semantics). The evil tuning guide is pretty emphatic about not disabling it! My intuition (and this is entirely speculative) is that the ZFS ZIL either doesn't contain everything needed to restore the superstructure, or that if it does, the recovery process is ignoring it. I think I read that the ZIL is per-file system, but one hopes it doesn't rely on the superstructure recursively, or this would be impossible to fix (maybe there's a ZIL for the ZILs :) ). On 07/21/09 11:53 AM, George Wilson wrote: We are working on the pool rollback mechanism and hope to have that soon. The ZFS team recognizes that not all hardware is created equal and thus the need for this mechanism. We are using the following CR as the tracker for this work: 6667683 need a way to rollback to an uberblock from a previous txg so maybe this discussion is moot :-) -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/25/09 02:50 PM, David Magda wrote: Yes, it can be affected. If the snapshot's data structure / record is underneath the corrupted data in the tree then it won't be able to be reached. Can you comment on if/how mirroring or raidz mitigates this, or tree corruption in general? I have yet to lose a pool even on a machine with fairly pathological problems, but it is mirrored (and copies=2). I was also wondering if you could explain why the ZIL can't repair such damage. Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, and there must be all kinds of failure modes even on bare hardware where it never gets a chance to do one at shutdown. This is interesting if you do ZFS over iscsi because of the possibility of someone tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] The importance of ECC RAM for ZFS
On 07/24/09 04:35 PM, Bob Friesenhahn wrote: Regardless, it [VirtualBox] has committed a crime. But ZFS is a journalled file system! Any hardware can lose a flush; it's just more likely in a VM, especially when anything Microsoft is involved, and the whole point of journalling is to prevent things like this happening. However the issue is moot since CR 6667683 is being addressed. Here's a related thought - does it make sense to mirror ZFS on iscsi if the host drives are themselves ZFS mirrors? The whole question of the requirement for ECC depends on your tolerance for loss of files vs. errors in files. As Richard Elling points out, there are other sources of error (e.g., no checking of PCI parity). But that isn't relevant to the ECC on main memory question. You can disable checksumming, and then ZFS is no worse in this regard than any other file system; bad files get read and you either notice or you don't, but you won't lose any because of fatal checksum errors and you still have all the other great features of ZFS, If you don't mirror, all bets are off. You should set copies=2 or higher and cross your fingers. You should also disable file checksumming in ZFS and in that sense degenerate to the behavior of lesser file systems. However mirroring doesn't buy you much here because it evidently doesn't double buffer the write before calculating the checksum, so a stray bitflip can cause metatdata or data corruption, causing a mirrored file to have an unrecoverable checksum failure (of course there are many other reasons to mirror). The real question is - what is the probability of this occurring? IMO the typical SOHO user has a 1 in 10 to 1 in 100 chance of this happening in a year of reasonably constant operation (a few dozen writes/day). I believe that this can be mitigated by setting copies=2, a good idea anyway if you have biggish disks since, as Richard Elling has pointed out in his excellent blogs, if you need to resilver after a disk failure you have a rather large possibility of a disk read error causing file loss and copies=2 also mitigates this. Note that hopefully fixing CR 6667683 should eliminate any possibility of losing an entire mirrored or raidz pool. So, it seem to me ZFS has a definite dependency on ECC for reliable operation. However, for non-commercial uses (i.e., less than an hour or so a day of writes) the probability of losing a file is fairly small and can be mitigated still further by setting copies=2. But to eliminate the possibility entirely, you must have ECC. You should also make sure that the buses have at least parity if not ECC and that this is actually checked - maybe Richard can comment on this since I believe he thinks this is a more likely source of errors. HTH -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/21/09 01:21 PM, Richard Elling wrote: I never win the lottery either :-) Let's see. Your chance of winning a 49 ball lottery is apparently around 1 in 14*10^6, although it's much better than that because of submatches (smaller payoffs for matches on less than 6 balls). There are about 32*10^6 seconds in a year. If ZFS saves its writes for 30 seconds and batches them out, that means 1 write leaves the buffer exposed for roughly one millionth of a year. If you have 4GB of memory, you might get 50 errors a year, but you say ZFS uses only 1/10 of this for writes, so that memory could see 5 errors/year. If your single write was 1/70th of that (say around 6 MB), your chance of a hit is around 5/70/10^-6 or 1 in 14*10^6, so you are correct! So if you do one 6MB write/year, your chances of a hit in a year are about the same as that of winning a grand slam lottery. Hopefully not every hit will trash a file or pool, but odds are that you'll do many more writes than that, so on the whole I think a ZFS hit is quite a bit more likely than winning the lottery each year :-). Conversely, if you average one big write every 3 minutes or so (20% occupancy), odds are almost certain that you'll get one hit a year. So some SOHO users who do far fewer writes won't see any hits (say) over a 5 year period. But some will, and they will be most unhappy -- calculate your odds and then make a decision! I daresay the PC makers have done this calculation, which is why PCs don't have ECC, and hence IMO make for insufficiently reliable servers. Conclusions from what I've gleaned from all the discussions here: if you are too cheap to opt for mirroring, your best bet is to disable checksumming and set copies=2. If you mirror but don't have ECC then at least set copies=2 and consider disabling checksums. Actually, set copies=2 regardless, so that you have some redundancy if one half of the mirror fails and you have a 10 hour resilver, in which time you could easily get a (real) disk read error. It seems to me some vendor is going to cotton onto the SOHO server problem and make a bundle at the right price point. Sun's offerings seem unfortunately mostly overkill for the SOHO market, although the X4140 looks rather interesting... Shame there aren't any entry level SPARCs any more :-(. Now what would doctors' front offices do if they couldn't blame the computer for being down all the time? It is quite simple -- ZFS sent the flush command and VirtualBox ignored it. Therefore the bits on the persistent store are consistent. But even on the most majestic of hardware, a flush command could be lost, could it not? An obvious case in point is ZFS over iscsi and a router glitch. But the discussion seems to be moot since CR 6667683 is being addressed. Now about those writes to mirrored disks :) Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/19/09 06:10 PM, Richard Elling wrote: Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds. Yes, but memory hits are instantaneous. On a reasonably busy system there may be buffers in queue all the time. You may have a buffer in memory for 100uS but it only takes 1nS for that buffer to be clobbered. If that happened to be metadata about to be written to both sides of a mirror than you are toast. Good thing this never happens, right :-) Beware, if you go down this path of thought for very long, you'll soon be afraid to get out of bed in the morning... wait... most people actually die in beds, so perhaps you'll be afraid to go to bed instead :-) Not at all. As with any rational business, my servers all have ECC, and getting up and out isn't a problem :-). Maybe I've had too many disks go bad, so I have ECC, mirrors, and backup to a system with ECC and mirrors (and copies=2, as well). Maybe I've read too many of your excellent blogs :-). Sun doesn't even sell machines without ECC. There's a reason for that. Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems. Not sure I follow. We've had this discussion before. OSOL+ZFS lets you build enterprise class systems on cheap hardware that has errors. ZFS gives the illusion of being fragile because it, uniquely, reports these errors. Running OSOL as a VM in VirtualBox using MSWanything as a host is a bit like building on sand, but there's nothing in documentation anywhere to even warn folks that they shouldn't rely on software to get them out of trouble on cheap hardware. ECC is just one (but essential) part of that. On 07/19/09 08:29 PM, David Magda wrote: It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection. But it is going to happen! Sun sells only machines with ECC because that is the only way to ensure reliability. Someone who spends weeks building a media server at home isn't going to be happy if they lose one media file let alone a whole pool. At least they should be warned that without ECC at some point they will lose files. I'm not convinced that there is any reasonable scenario for losing an entire pool though, which was the original complaint in this thread. Even trusty old SPARCs occasionally hang without a panic (in my experience especially when a disk is about to go bad). If this happens, and you have to power cycle because even stop-A doesn't respond, are you all saying that there is a risk of losing a pool at that point? Surely the whole point of a journalled file system is that it is pretty much proof against any catastrophe, even the one described initially. There have been a couple of (to me) unconvincing explanations of how this pool was lost. Surely if there is a mechanism whereby unflushed i/os can cause fatal metadata corruption, this should be a high priority bug since this can happen on /any/ hardware; it is just more likely if the foundations are shaky, so the explanation must require more than that if it isn't a bug. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/19/09 05:00 AM, dick hoogendijk wrote: (i.e. non ECC memory should work fine!) / mirroring is a -must- ! Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic): http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction Tests[ecc]give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory. That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one/year per user on average. Some get more, some get less.That sounds like pretty bad odds... In most computers used for serious scientific or financial computing and as servers, ECC is the rule rather than the exception, as can be seen by examining manufacturers' specifications. Sun doesn't even sell machines without ECC. There's a reason for that. IMO you'd be nuts to run ZFS on a machine without ECC unless you don't care about losing some or all of the data. Having said that, we have yet to lose an entire pool - this is pretty hard to do! I should add that since setting copies=2 and forcing the files to be copied, there have been no more unrecoverable errors on a particularly low end machine that was plagued with them even with mirrors (and a UPS with a bad battery :-) ). On 19-Jul-09, at 7:12 AM, Russel wrote: As this was not clear to me. I use VB like others use vmware etc to run solaris because its the ONLY way I can, Given that PC hardware is so cheap these days (used SPARCS even cheaper), surely it makes far more sense to build a nice robust OSOL/ZFS based file server *with* ECC. Then you can use iscsi for your VirtualBox VMs and solve all kinds of interesting problems. But you still need to do backups. My solution for that is to replicate the server and backup to it using zfs send/recv. If a disk fails, you switch to the backup and no worries about the second disk of the mirror failing during a resilver. A small price to pay for peace of mind. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] no pool_props for OpenSolaris 2009.06 with old SPARC hardware
On 06/03/09 09:10 PM, Aurélien Larcher wrote: PS: for the record I roughly followed the steps of this blog entry = http://blogs.sun.com/edp/entry/moving_from_nevada_and_live Thanks for posting this link! Building pkg with gdb was an interesting exercise, but it worked, with the additional step of making the packages and pkgadding them. Curious as to why pkg isn't available as a pkgadd package. Is there any reason why someone shouldn't make them available for download? It would make it much less painful for those of us who are OBP version deprived - but maybe that's the point :-) During the install cycle, ran into this annoyance (doubtless this is documented somewhere): # zpool create rpool c2t2d0 creates a good rpool that can be exported and imported. But it seems to create an EFI label, and, as documented, attempting to boot results in a bad magic number error. Why does zpool silently create an apparently useless disk configuration for a root pool? Anyway, it was a good opportunity to test zfs send/recv of a root pool (it worked like a charm). Using format -e to relable the disk so that slice 0 and slice 2 both have the whole disk resulted in this odd problem: # zpool create -f rpool c2t2d0s0 # zpool list NAMESIZE USED AVAILCAP HEALTH ALTROOT rpool 18.6G 73.5K 18.6G 0% ONLINE - space 1.36T 294G 1.07T21% ONLINE - # zpool export rpool # zpool import rpool cannot import 'rpool': no such pool available # zpool list NAMESIZE USED AVAILCAP HEALTH ALTROOT space 1.36T 294G 1.07T21% ONLINE - # zdb -l /dev/dsk/c2t2d0s0 lists 3 perfectly good looking labels. Format says: ... selecting c2t2d0 [disk formatted] /dev/dsk/c2t2d0s0 is part of active ZFS pool rpool. Please see zpool(1M). /dev/dsk/c2t2d0s2 is part of active ZFS pool rpool. Please see zpool(1M). However this disk boots ZFS OpenSolaris just fine and this inability to import an exported pool isn't a problem. Just wondering if any ZFS guru had a comment about it. (This is with snv103 on SPARC). FWIW this is an old ide drive connected to a sas controller via a sata/pata adapter... Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] no pool_props for OpenSolaris 2009.06 with old SPARC hardware
On 06/03/09 09:10 PM, Aurélien Larcher wrote: PS: for the record I roughly followed the steps of this blog entry = http://blogs.sun.com/edp/entry/moving_from_nevada_and_live Thanks for posting this link! Building pkg with gcc 4.3.2 was an interesting exercise, but it worked, with the additional step of making the packages and pkgadding them. Curious as to why pkg isn't available as a pkgadd package. Is there any reason why someone shouldn't make them available for download? It would make it much less painful for those of us who are OBP version deprived - but maybe that's the point :-) During the install cycle, ran into this annoyance (doubtless this is documented somewhere): # zpool create rpool c2t2d0 creates a good rpool that can be exported and imported. But it seems to create an EFI label, and, as documented, attempting to boot results in a bad magic number error. Why does zpool silently create an apparently useless disk configuration for a root pool? Anyway, it was a good opportunity to test zfs send/recv of a root pool (it worked like a charm). Using format -e to relable the disk so that slice 0 and slice 2 both have the whole disk resulted in this odd problem: # zpool create -f rpool c2t2d0s0 # zpool list NAMESIZE USED AVAILCAP HEALTH ALTROOT rpool 18.6G 73.5K 18.6G 0% ONLINE - space 1.36T 294G 1.07T21% ONLINE - # zpool export rpool # zpool import rpool cannot import 'rpool': no such pool available # zpool list NAMESIZE USED AVAILCAP HEALTH ALTROOT space 1.36T 294G 1.07T21% ONLINE - # zdb -l /dev/dsk/c2t2d0s0 lists 3 perfectly good looking labels. Format says: ... selecting c2t2d0 [disk formatted] /dev/dsk/c2t2d0s0 is part of active ZFS pool rpool. Please see zpool(1M). /dev/dsk/c2t2d0s2 is part of active ZFS pool rpool. Please see zpool(1M). However this disk boots ZFS OpenSolaris just fine and this inability to import an exported pool isn't a problem. Just wondering if any ZFS guru had a comment about it. (This is with snv103 on SPARC). FWIW this is an old ide drive connected to a sas controller via a sata/pata adapter... Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] rpool mirroring
On 06/04/09 06:44 PM, cindy.swearin...@sun.com wrote: Hi Noz, This problem was reported recently and this bug was filed: 6844090 zfs should be able to mirror to a smaller disk Is this filed on bugs or defects? I had the exact same problem, and it turned out to be a rounding error in Solaris format/fdisk. The only way I could fix it was to use Linux (well, Fedora) sfdisk to make both partitions exactly the same number of bytes. The alternates partition seems to be hard wired on older disks and AFAIK there's no way to use that space. sfdisk is on the Fedora live CD if you don't have a handy Linux system to get it from. BTW the disks were nominally the same size but had different geometries. Since I can't find 6844090, I have no idea what it says, but this really seems to be a bug in fdisk, not ZFS, although I would think ZFS should be able to mirror to a disk that is only a tiny bit smaller... -- Frank I believe slice 9 (alternates) is an older method for providing alternate disk blocks on x86 systems. Apparently, it can be removed by using the format -e command. I haven't tried this though. I don't think removing slice 9 will help though if these two disks are not identical, hence the bug. You can workaround this problem by attaching a slightly larger disk. Cindy noz wrote: I've been playing around with zfs root pool mirroring and came across some problems. I have no problems mirroring the root pool if I have both disks attached during OpenSolaris installation (installer sees 2 disks). The problem occurs when I only have one disk attached to the system during install. After OpenSolaris installation completes, I attach the second disk and try to create a mirror but I cannot. Here are the steps I go through: 1) install OpenSolaris onto 16GB disk 2) after successful install, shutdown, and attach second disk (also 16GB) 3) fdisk -B 4) partition 5) zfs attach Step 5 fails, giving a disk too small error. What I noticed about the second disk is that it has a 9th partition called alternates that takes up about 15MBs. This partition doesn't exist in the first disk and I believe is what's causing the problem. I can't figure out how to delete this partition and I don't know why it's there. How do I mirror the root pool if I don't have both disks attached during OpenSolaris installation? I realize I can just use a disk larger than 16GBs, but that would be a waste. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Errors on mirrored drive
On 05/26/09 13:07, Kjetil Torgrim Homme wrote: also thank you, all ZFS developers, for your great job :-) I'll second that! A great achievement - puts Solaris in a league of its own, so much so, you'd want to run it on all your hardware, however crappy the hardware might be ;-) There are too many branches in this thread now. Going to summarize here without responding to some of the less than helpful comments, although death and taxes seems an ironic metaphor in the current climate :-) In some ways this isn't a technical issue. This much maligned machine and its ilk are running Solaris and ZFS quite happily and the users are pleased with the stability and performance. But their applications are running on machines (via xdmcp) with ECC, and ZFS mirror/raidz doesn't have a problem there. Picture a new convert with enthusiasm for ZFS, but has a less than perfect PC which has otherwise been apparently quite reliable. Perhaps it already has mirrored drives. He/she installs Solaris from the live CD (and finds that the installer doesn't support mirroring). The install fails, or worse, afterwords he/she loses that movie of Aunt Minnie playing golf, because a checksum error makes the file unrecoverable. This could be very frustrating and make the blogosphere go crazy, especially if the PC passes every diagnostic. Be even worse if a file is lost on a mirror. Unrecoverable files on mirrored drives simply shouldn't happen. What kind of hardware error (other than a rare bit flip) could conceivably cause 5 out of 15 checksum errors to be unrecoverable when mirrored during the write of around 20*10^10 bits? ZFS has both a larger spatial and temporal footprint than other file systems, so it is slightly more vulnerable to the once-a-month on average bit flip that will afflict many a PC with 4GB of memory. Perhaps someone with a statistical bent could step in and actually calculate the probability of random errors, perhaps assuming that half of available memory is used to queue writes, that there is a 95% chance of one bit flip per month per 4GB, and there is a (say) 10% duty cycle over say a period of a year. Alternatively, the chance of a 1 bit flip over a period of 6 hours at a 100% duty cycle repeated 1461 times (1461 installs per year at 100%). Seems to me intuitively that 6 out of 1461 installs will fail due to an unrecoverable checksum failure, but I'm not a statistician. Multiply that failure rate by the number of Live CD installs you expect over the next year (noting that *all* checksum failures are unrecoverable without mirroring) and you'll count quite a few frustrated would-be installers. Maybe ZFS without ECC and no mirroring should disable checksumming by default - it would be a little worse than UFS and ext3 (due to its larger spatial and temporal footprints) but still provide all the other great features. Proposed RFE #1 Add option to make files with unrecoverable checksum failures readable and to pass the best image possible back to the application. [How much do you bet most folks would select this option?] If both sides of the mirror could be read, it might help to diagnose the problem, which obviously must be in the hardware somewhere. If both images are identical, then it surely must be memory. If they differ, then what could it be? Proposed RFE#2 Add an option for machines with mirrored drives but without ECC to double buffer and only then calculate the checksums (for those who are reasonably paranoid about cosmic rays). Proposed RFE#3 (or is this a bug report?) Add diagnostics to the ZFS recv to help understand why a perfectly good ZFS send can't be received when the same machine can successfully compute a md5sum over the same stream. Even something like recv failed at block nnn would help. For example, it seems to fail suspiciously close to 2GB on a 32 bit machine. Proposed RFE #4 Disable checksumming by default if no mirroring and no ECC is detected. (Of course this assumes a install to mirror option). If it could still checksum, but make it a warning instead of an error, this could turn into a great feature for cheapskates with machines that have no ECC. --- 1 and #2 above could be fixed in the documentation. Random memory bit flips can theoretically cause unrecoverable checksum failures, even if the data is mirrored. Either disable the checksum feature or only run ZFS on systems with ECC memory if you have any data you don't want to risk losing [even with a 1 bit error]. None of this is meant as a criticism of ZFS, just suggestions to help make a merely superb file system into the unbeatable one it should be. (I suppose it really is a system of file systems, but ZFS it is...) Regards -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Errors on mirrored drive
On 05/23/09 10:21, Richard Elling wrote: preface This forum is littered with claims of zfs checksums are broken where the root cause turned out to be faulty hardware or firmware in the data path. /preface I think that before you should speculate on a redesign, we should get to the root cause. The hardware is clearly misbehaving. No argument. The questions is - how far out of reasonable behavior is it? Redesign? I'm not sure I can conceive an architecture that would make double buffering difficult to do. It is unclear how faulty hardware or firmware could be responsible for such a low error rate (1 in 4*10^10). Just asking if an option for machines with no ecc and their inevitable memory errors is a reasonable thing to suggest in an RFE. The checksum occurs in the pipeline prior to write to disk. So if the data is damaged prior to checksum, then ZFS will never know. Nor will UFS. Neither will be able to detect this. In Solaris, if the damage is greater than the ability of the memory system and CPU to detect or correct, then even Solaris won't know. If the memory system or CPU detects a problem, then Solaris fault management will kick in and do something, preempting ZFS. Exactly. My whole point. And without ECC there's no way of knowing. But if the data is damaged /after/ checksum but /before/ write, then you have a real problem... Memory diagnostics just test memory. Disk diagnostics just test disks. This is not completely accurate. Disk diagnostics also test the data path. Memory tests also test the CPU. The difference is the amount of test coverage for the subsystem. Quite. But the disk diagnostic doesn't really test memory beyond what it uses to run itself. Likewise it may not test the FPU forexample. ZFS keeps disks pretty busy, so perhaps it loads the power supply to the point where it heats up and memory glitches are more likely. In general, for like configurations, ZFS won't keep a disk any more busy than other file systems. In fact, because ZFS groups transactions, it may create less activity than other file systems, such as UFS. That's a point in it's favor, although not really relevant. If the disks are really busy they will load the PSU more and that could drag the supply down which in turn might make errors occur that otherwise wouldn't. Ironically, the Open Solaris installer does not allow for ZFS mirroring at install time, one time where it might be really important! Now that sounds like a more useful RFE, especially since it would be relatively easy to implement. Anaconda does it... This is not an accurate statement. The OpenSolaris installer does support mirrored boot disks via the Automated Installer method. http://dlc.sun.com/osol/docs/content/2008.11/AIinstall/index.html You can also install Solaris 10 to mirrored root pools via JumpStart. Talking about the live CD here. I prefer to install via jumpstart, but AFAIK Open Solaris (indiana) isn't available as an installable DVD. But most consumers are going to be installing from the live CD and they are the ones with the low end hardware without ECC. There was recently a suggestion on another thread about an RFE to add mirroring as an install option. I think a better test would be to md5 the file from all systems and see if the md5 hashes are the same. If they are, then yes, the finger would point more in the direction of ZFS. The send/recv protocol hasn't changed in quite some time, but it is arguably not as robust as it could be. Thanks! md5 hash is exactly the kind of test I was looking for. ms5sum on SPARC 9ec4f7da41741b469fcd7cb8c5040564 (local ZFS) md5sum on X86 9ec4f7da41741b469fcd7cb8c5040564 (remote NFS) ZFS send/recv use fletcher4 for the checksums. ZFS uses fletcher2 for data (by default) and fletcher4 for metadata. The same fletcher code is used. So if you believe fletcher4 is broken for send/recv, how do you explain that it works for the metadata? Or does it? There may be another failure mode at work here... (see comment on scrubs at the end of this extended post) [Did you forget the scrubs comment?] Never said it was broken. I assume the same code is used for both SPARC and X86, and it works fine on SPARC. It would seem that this machine gets memory errors so often (even though it passes the Linux memory diagnostic) that it can never get to the end of a 4GB recv stream. Odd that it can do the md5sum, but as mentioned, perhaps doing the i/o puts more strain on the machine and stresses it to where more memory faults occur. I can't quite picture a software bug that would cause random failures on specific hardware and I am happy to give ZFS the benefit of the doubt. It would have been nice if we were able to recover the contents of the file; if you also know what was supposed to be there, you can diff and then we can find out what was wrong. file on those files resulted in bus error. Is there a way to actually read a file reported by ZFS as unrecoverable to do just that (and to separately
Re: [zfs-discuss] Errors on mirrored drive
On 05/26/09 03:23, casper@sun.com wrote: And where exactly do you get the second good copy of the data? From the first. And if it is already bad, as noted previously, this is no worse than the UFS/ext3 case. If you want total freedom from this class of errors, use ECC. If you copy the code you've just doubled your chance of using bad memory. The original copy can be good or bad; the second copy cannot be better than the first copy. The whole point is that the memory isn't bad. About once a month, 4GB of memory of any quality can experience 1 bit being flipped, perhaps more or less often. If that bit happens to be in the checksummed buffer then you'll get an unrecoverable error on a mirrored drive. And if I understand correctly, ZFS keeps data in memory for a lot longer than other file systems and uses more memory doing so. Good features, but makes it more vulnerable to random bit flips. This is why decent machine have ECC. To argue that ZFS should work reliably on machines without ECC flies in the face of statistical reality and the reason for ECC in the first place. You can disable the checksums if you don't care. But I do care. I'd like to know if my files have been corrupted, or at least as much as possible. But there are huge classes of files for which the odd flipped bit doesn't matter and the loss of which would be very painful. Email archives and videos come to mind. An easy workaround is to simply store all important stuff on a machine with ECC. Problem solved... One broken bit may not have cause serious damage most things work. Exactly. Absolutely, memory diags are essential. And you certainly run them if you see unexpected behaviour that has no other obvious cause. Runs for days, as noted. Doesn't proof anything. Quite. But nonetheless, the unrecoverable errors did occur on mirrored drives and it seems to defeat the whole purpose of mirroring, which is AFAIK, keeping two independent copies of every file in case one gets lost. Writing both images from one buffer appears to violate the premise. I can think of two RFEs 1) Add an option to buffer writes on machines without ECC memory to avoid the possibility of random memory flips causing unrecoverable errors with mirrored drives. 2) An option to read files even if they have failed checksums. 1) could be fixed in the documentation - ZFS should be used with caution on machines with no ECC since random bit flips can cause unrecoverable checksum failures on mirrored drives. Or ZFS isn't supported on machines with memory that has no ECC. Disabling checksums is one way of working around 2). But it also disables a cool feature. I suppose you could optionally change checksum failure from an error to a warning, but ideally it would be file by file... Ironically, I wonder if this is even a problem with raidz? But grotty machines like these can't really support 3 or more internal drives... Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Errors on mirrored drive
On 05/22/09 21:08, Toby Thain wrote: Yes, the important thing is to *detect* them, no system can run reliably with bad memory, and that includes any system with ZFS. Doing nutty things like calculating the checksum twice does not buy anything of value here. All memory is bad if it doesn't have ECC. There are only varying degrees of badness. Calculating the checksum twice on its own would be nutty, as you say, but doing so on a separate copy of the data might prevent unrecoverable errors after writes to mirrored drives. You can't detect memory errors if you don't have ECC. But you can try to mitigate them. Without doing so makes ZFS less reliable than the memory it is running on. The problem is that ZFS makes any file with a bad checksum inaccessible, even if one really doesn't care if the data has been corrupted. A workaround might be a way to allow such files to be readable despite the bad checksum... In hindsight I probably should have merely reported the problem and left those with more knowledge to propose a solution. Oh well. If the memory is this bad then applications will be dying all over the place, compilers will be segfaulting, and databases will be writing bad data even before it reaches ZFS. But it isn't. Applications aren't dying, compilers are not segfaulting (it was even possible to compile GCC 4.3.2 with the supplied gcc); gdm is staying up for weeks at a time... And I wouldn't consider running a non-trivial database application on a machine without ECC. Absolutely, memory diags are essential. And you certainly run them if you see unexpected behaviour that has no other obvious cause. Runs for days, as noted. Your logic is rather tortuous. If the hardware is that crappy then there's not much ZFS can do about it. Well, it could. For example, it could make copies of the data before checksumming so that one memory hit doesn't result in an unrecoverable file on a mirrored drive. Either that or there's a bug in ZFS. I am more inclined to blame the memory, especially since the failure rate isn't much higher than the expected rate as reported elsewhere. Maybe this should be a new thread, but I suspect the following proves that the problem must be memory, and that begs the question as to how memory glitches can cause fatal ZFS checksum errors. Of course they can; but they will also break anything else on the machine. But they don't. Checksum errors are reasonable, but not unrecoverable ones on mirrors. How can a machine with bad memory work fine with ext3? It does. It works fine with ZFS too. Just really annoying unrecoverable files every now and then on mirrored drives. This shouldn't happen even with lousy memory and wouldn't (doesn't) with ECC. If there was a way to examine the files and their checksums, I would be surprised if they were different (If they were, it would almost certainly be the controller or the PCI bus itself causing the problem). But I speculate that it is predictable memory hits. -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Errors on mirrored drive
There have been a number of threads here on the reliability of ZFS in the face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC) hardware, but isn't it reasonable to expect it to run well on something less well engineered? I am a real ZFS fan, and I'd hate to see folks trash it because it appears to be unreliable. In an attempt to bolster the proposition that there should at least be an option to buffer the data before checksumming and writing, we've been doing a lot of testing on presumed flaky (cheap) hardware, with a peculiar result - see below. On 04/21/09 12:16, casper@sun.com wrote: And so what? You can't write two different checksums; I mean, we're mirroring the data so it MUST BE THE SAME. (A different checksum would be wrong: I don't think ZFS will allow different checksums for different sides of a mirror) Unless it does a read after write on each disk, how would it know that the checksums are the same? If the data is damaged before the checksum is calculated then it is no worse than the ufs/ext3 case. If data + checksum is damaged whilst the (single) checksum is being calculated, or after, then the file is already lost before it is even written! There is a significant probability that this could occur on a machine with no ecc. Evidently memory concerns /are/ an issue - this thread http://opensolaris.org/jive/thread.jspa?messageID=338148 even suggests including a memory diagnostic with the distribution CD (Fedora already does so). Memory diagnostics just test memory. Disk diagnostics just test disks. ZFS keeps disks pretty busy, so perhaps it loads the power supply to the point where it heats up and memory glitches are more likely. It might also explain why errors don't really begin until ~15 minutes after the busy time starts. You might argue that this problem could only affect systems doing a lot of disk i/o and such systems probably have ecc memory. But doing an o/s install is the one time where a consumer grade computer does a *lot* of disk i/o for quite a long time and is hence vulnerable. Ironically, the Open Solaris installer does not allow for ZFS mirroring at install time, one time where it might be really important! Now that sounds like a more useful RFE, especially since it would be relatively easy to implement. Anaconda does it... A Solaris install writes almost 4*10^10 bits. Quoting Wikipedia, look at Cypress on ECC, see http://www.edn.com/article/CA454636.html. Possibly, statistically likely random memory glitches could actually explain the error rate that is occurring. You are assuming that the error is the memory being modified after computing the checksums; I would say that that is unlikely; I think it's a bit more likely that the data gets corrupted when it's handled by the disk controller or the disk itself. (The data is continuously re-written by the DRAM controller) See below for an example where a checksum error occurs without the disk subsystem being involved. There seems to be no other plausible explanation other than an improbable bug in X86 ZFS itself. It would have been nice if we were able to recover the contents of the file; if you also know what was supposed to be there, you can diff and then we can find out what was wrong. file on those files resulted in bus error. Is there a way to actually read a file reported by ZFS as unrecoverable to do just that (and to separately retrieve the copy from each half of the mirror)? Maybe this should be a new thread, but I suspect the following proves that the problem must be memory, and that begs the question as to how memory glitches can cause fatal ZFS checksum errors. Here is the peculiar result (same machine) After several attempts, succeeded in doing a zfs send to a file on a NFS mounted ZFS file system on another machine (SPARC) followed by a ZFS recv of that file there. But every attempt to do a ZFS recv of the same snapshot (i.e., from NFS) on the local machine (X86) has failed with a checksum mismatch. Obviously, the file is good, since it was possible to do a zfs recv from it. You can't blame the IDE drivers (or the bus, or the disks) for this. Similarly, piping the snapshot though SSH fails, so you can't blame NFS either. Something is happening to cause checksum failures between after when the data is received by the PC and when ZFS computes its checksums. Surely this is either a highly repeatable memory glitch, or (most unlikely) a bug in X86 ZFS. ZFS recv to another SPARC over SSH to the same physical disk (accessed via a sata/pata adapter) was also successful. Does this prove that the data+checksum is being corrupted by memory glitches? Both NFS and SSH over TCP/IP provide reliable transport (via checksums), so the data is presumably received correctly. ZFS then calculates its own checksum and it fails. Oddly, it /always/ fails, but not at the same point, and far into the stream when both disks have been very busy for a while. It would be interesting to see if the
Re: [zfs-discuss] Errors on mirrored drive
On 04/17/09 12:37, casper@sun.com wrote: I'd like to submit an RFE suggesting that data + checksum be copied for mirrored writes, but I won't waste anyone's time doing so unless you think there is a point. One might argue that a machine this flaky should be retired, but it is actually working quite well, and perhaps represents not even the extreme of bad hardware that ZFS might encounter. I think it's a stupid idea. If you get two checksums, what can you do? The second copy is most likely suspect and you double your chance that you use bad memory. Casper If there were permanently bad memory locations, surely the diagnostics would reveal them. Here's an interesting paper on memory errors: http://www.ece.rochester.edu/~mihuang/PAPERS/hotdep07.pdf Given the inevitability of relatively frequent transient memory errors, I would think it behooves the file system to minimize the effects of such errors. But I won't belabor the point except to suggest that the cost of adding the suggested step would not be very expensive (either to implement or run). Memory diagnostics ran for a full 12 hours with no errors. Same goes for both disks, using Solaris format/ana/verify. So far, after creating 400,000 files, two files had permanent, apparently truly unrecoverable errors and could not be read by anything. Now it gets really funky. I detached one of the disks, and then found it couldn't be reattached. Turns out there is a rounding problem with Solaris fdisk (run from format) that can cause identical partitions on identical disks to have different sizes. I used the Linux sfdisk utility to repair the MBR and fix the Solaris2 partition sizes. Then it was possible to reattach the disk. Unfortunately it wasn't possible to boot from the result, but a reinstall went perfectly with no ZFS errors being reported at all. So it appears that the problem may be with the OpenSolaris fdisk. Is this worth reporting as a bug? It is likely to be quite hard to reproduce... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Errors on mirrored drive
Experimenting with OpenSolaris on an elderly PC with equally elderly drives, zpool status shows errors after a pkg image-update followed by a scrub. It is entirely possible that one of these drives is flaky, but surely the whole point of a zfs mirror is to avoid this? It seems unlikely that both drives failed at the same time. Could someone explain how this can happen? Another question (perhaps for the indiana folks) is how to restore these files? # zpool status -v pool: rpool state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: scrub completed after 0h24m with 2 errors on Wed Apr 15 09:15:40 2009 config: NAMESTATE READ WRITE CKSUM rpool ONLINE 0 069 mirrorONLINE 0 0 144 c3d0s0 ONLINE 0 0 145 128K repaired c3d1s0 ONLINE 0 0 151 168K repaired errors: Permanent errors have been detected in the following files: //lib/amd64/libsec.so.1 //lib/libdlpi.so.1 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Errors on mirrored drive
On 04/15/09 14:30, Bob Friesenhahn wrote: On Wed, 15 Apr 2009, Frank Middleton wrote: zpool status shows errors after a pkg image-update followed by a scrub. If a corruption occured in the main memory, the backplane, or the disk controller during the writes to these files, then the original data written could be corrupted, even though you are using mirrors. If the system experienced a physical shock, or power supply glitch, while the data was written, then it could impact both drives. Quite. Sounds like an architectural problem. This old machine probably doesn't have ecc memory (AFAIK still rare on most PCs), but it is on a serial UPS and isolated from shocks, and this has happened more than once. These drives on this machine recently passed both the purge and verify cycles (format/analyze) several times. Unless the data is written to both drives from the same buffer and checksum (surely not!), it is still unclear how it could get written to *both* drives with a bad checksum. It looks like the files really are bad - neither of them can be read - unless ZFS sensibly refuses to allow possibly good files with bad checksums to be read (cannot read: I/O error). BTW fmdump -ev doesn't seem to report any disk errors at all. So my question remains - even with the grottiest hardware, how can several files get written with bad checksums to mirrored drives? ZFS has so many cool features this would be easy to live with if there was a reasonably simple way to get copies of these files to restore them, short of getting the source and recompiling, or pkg uninstall followed by install (if you can figure out which pkg(s) the bad files are in), but it seems to defeat the purpose of softwaremirroring... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] jigdo or lofi can crash nfs+zfs
These problems both occur when accessing a ZFS dataset from Linux (FC10) via NFS. Jigdo is a fairly new bit-torrent-like downloader. It is not entirely bug free, and the one time I tried it, it recursively downloaded one directory's worth until ZFS eventually sort of died. It put all the disks into error, and even the (UFS) root disks became unreadable. It took a reboot to free everything up and some twiddling to get ZFS going again. I really don't want to even try to reproduce this! With 4GB physical, 10GB swap, and almost 3TB of raidz, it probably didn't run out of memory or disk space. There wasn't room on the boot disks to save the crash dump after halt, sync. Is there any point in submitting a bug report, and if so, what would you call it? Is there a practical way to force the crash dump to go to a ZFS dataset instead of the UFS boot disks? Also, there is a reasonably reproducible problem that causes a panic doing an NFS network install when the DVD image is copied to a ZFS dataset on snv103. I submitted this as a bug report to bugs.opensolaris.org, and it was acknowledged, but then it vanished. This is actually an NFS/ZFS problem, so maybe it was applied against the wrong group, or perhaps this was a transition issue. I wasn't able to get a crash core saved because there wasn't enough space on the boot (UFS) disks. I do have the panic traces for the 3 times I reproduced this. Should this be resubmitted to defect.opensolaris.org, and if so, against what? This problem doesn't happen of the DVD image is itself mounted via NFS, or is on a UFS partition. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can this be done?
On 03/29/09 11:58, David Magda wrote: On Mar 29, 2009, at 00:41, Michael Shadle wrote: Well I might back up the more important stuff offsite. But in theory it's all replaceable. Just would be a pain. And what is the cost of the time to replace it versus the price of a hard disk? Time ~ money. So what is best if you get a 4th drive for a 3 drive raidz? Is it better to keep it separate and use it for backups of the replaceable data (perhaps on a different machine), as a hot spare, second parity, or something else? Seems so un-green to have it spinning uselessly :-) LTO-4 tape drives at $200 just for the media? I guess not... There used to be a time when I like fiddling with computer parts. I now have other, more productive ways of wasting my time. :) Quite. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Growing a zpool mirror breaks on Adaptec 1205sa PCI
On 03/28/09 20:01, Harry Putnam wrote: Finding a sataII card is proving to be very difficult. The reason is that I only have PCI no PCI express. I haven't see a single one listed as SATAII compatible and have spent a bit time googling. It's even worse if you have an old SPARC system. We've had great results with some LSI LOGIC SAS3041XL-S cards we got on E-Bay in conjunction with 3x1.5TB Seagate drives, for 2.7TiB of raidz. The combination proved faster than mirrored 10,000 RPM SCSI disks using UFS, in an unscientific benchmark (bonnie). I don't think this LSI controller is SATA II, but it has no problems with the 1.5TB Seagates; they are Sun branded cards and worked right of of the box in 3.3V 66MHz PCI slots. 2.7TiB of raidz for around $400. Amazing... And ZFS is just plain incredible - makes every other file system look so antiquated :-) Hope this helps -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss