[zfs-discuss] zpool scrub bad block list
Hi list,

From the ZFS documentation it is unclear to me whether a zpool scrub will blacklist any bad blocks it finds so that they won't be used anymore. I know NetApp's WAFL scrub does reallocate bad blocks and mark them as unusable. Does ZFS have this kind of strategy?

Thanks.

-- Didier
Re: [zfs-discuss] zpool scrub bad block list
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Didier Rebeix
> from ZFS documentation it appears unclear to me if a zpool scrub will
> black list any found bad blocks so they won't be used anymore.

If there are any physically bad blocks, such that the hardware (hard disk) will return an error every time that block is used, then the disk should be replaced. All disks have a certain amount of error detection/correction built in, and remap bad blocks internally and secretly behind the scenes, transparent to the OS. So if there are any blocks regularly reported bad to the OS, it means there is a growing problem inside the disk. Offline the disk and replace it.

It is OK to get an occasional cksum error, say once a year, because the occasional cksum error will be re-read, and as long as the data is correct the second time, there is no problem.
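A quick way to check whether a drive is actually accumulating hardware-level errors, or whether ZFS is just seeing the occasional cksum error, is to compare the driver's error counters with the pool status. Both commands below are standard on Solaris-derived systems; "tank" is only a placeholder pool name:

    # Per-device error summary; watch the Hard Errors and Media Error counters:
    iostat -En

    # Checksum errors that ZFS itself caught, reported per vdev:
    zpool status -v tank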
Re: [zfs-discuss] zpool scrub bad block list
ZFS detects far more errors than traditional filesystems, which will simply miss them. This means that many of the possible causes of those errors will be something other than a real bad block on the disk.

As Edward said, the disk firmware should automatically remap real bad blocks, so if ZFS did that too, we'd not use the remapped block, which is probably fine. For other errors, there's nothing wrong with the real block on the disk - it's going to be firmware, the driver, cache corruption, or something else, so blacklisting the block will not solve the issue.

Also, with some types of disk (SSDs), block numbers are moved around to achieve wear leveling, so blacklisting a block number won't stop you reusing that real block.

--
Andrew Gabriel (from mobile)
Re: [zfs-discuss] zpool scrub bad block list
Very interesting... I didn't know disk firmwares were responsible for automagically relocating bad blocks. Knowing this, it makes no sense for a filesystem to try to deal with this kind of error. For now, any disk with detected read/write errors will be discarded from my filers and replaced...

Thanks!

On Tue, 08 Nov 2011 13:03:57, Andrew Gabriel andrew.gabr...@oracle.com wrote:
> ZFS detects far more errors than traditional filesystems, which will
> simply miss them. This means that many of the possible causes of those
> errors will be something other than a real bad block on the disk.

--
Didier REBEIX
Universite de Bourgogne
Direction des Systèmes d'Information
BP 27877
21078 Dijon Cedex
Tel: +33 380395205
[zfs-discuss] zfs sync=disabled property
Hi all,

I'm trying to evaluate the risks of running an NFS share of a ZFS dataset with the sync=disabled property. The clients are VMware hosts in our environment and the server is a SunFire X4540 "Thor" system. The general recommendation is not to do this, but after testing performance with the default setting and with sync=disabled, it's night and day, so it's really tempting to set sync=disabled!

Thanks for any suggestion.

Best regards,
Re: [zfs-discuss] zfs sync=disabled property
On Nov 8, 2011, at 6:38 AM, Evaldas Auryla wrote:
> I'm trying to evaluate what are the risks of running NFS share of zfs
> dataset with sync=disabled property. The clients are vmware hosts in our
> environment and server is SunFire X4540 Thor system.

The risks are: any changes your software clients expect to be written to disk - after having gotten confirmation that they did get written - might not actually be written if the server crashes or loses power for some reason.

You should consider a high-performance, low-latency SSD (it doesn't have to be very big) as a SLOG... it will do a lot for your performance without having to give up the commit guarantees that you lose with sync=disabled.

Of course, if the data isn't precious to you, then running with sync=disabled is probably OK. But if you love your data, don't do it.

- Garrett
Re: [zfs-discuss] zfs sync=disabled property
On Tue, November 8, 2011 09:38, Evaldas Auryla wrote:
> I'm trying to evaluate what are the risks of running NFS share of zfs
> dataset with sync=disabled property.

You may want to examine getting some good SSDs and attaching them as (mirrored?) slog devices instead:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices

You probably want a zpool version of 22 or better to do this, as from that point onward it becomes possible to remove the slog device(s) if desired. Prior to that, once you add them you're stuck with them.

Some interesting benchmarks on offloading the ZIL can be found at:

https://blogs.oracle.com/brendan/entry/slog_screenshots

Your SSDs don't have to be that large either: by default the ZIL can be at most 50% of RAM, so if your server has (say) 48 GB of RAM, then an SSD larger than 24 GB would really be a bit of a waste (though you could perhaps use the 'extra' space as L2ARC). Given that, it's probably better value to get a faster SLC SSD that's smaller, rather than a 'cheaper' MLC that's larger. Past discussions on zfs-discuss have favourably mentioned devices based on the SandForce SF-1500 and SF-2500/2600 chipsets (they often come with supercaps and such). Intel's 311 could be another option.
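For reference, attaching a mirrored slog pair is a one-liner, and on a pool version that supports log-device removal it can be taken out again; the pool name and device names below are placeholders, so treat this as a sketch rather than a recipe:

    # Add a mirrored log (slog) device pair to an existing pool:
    zpool add tank log mirror c2t0d0 c2t1d0

    # Later, if the pool version supports log device removal, take it out
    # again (use the vdev name shown by 'zpool status tank', e.g. mirror-1):
    zpool remove tank mirror-1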
Re: [zfs-discuss] zpool scrub bad block list
On Tue, Nov 8, 2011 at 9:14 AM, Didier Rebeix didier.reb...@u-bourgogne.fr wrote:
> Very interesting... I didn't know disk firmwares were responsible for
> automagically relocating bad blocks. Knowing this, it makes no sense for
> a filesystem to try to deal with this kind of error.

In the dark ages, hard drives came with bad block lists taped to them so you could load them into the device driver for that drive. New bad blocks would be mapped out by the device driver. All that functionality was moved into the drive a long time ago (at least 10-15 years).

Under Solaris, you can see the size of the bad block lists through FORMAT -> DEFECT: PRIMARY will give you the size of the list from the factory, and GROWN will give you the defects added since the drive left the factory. I tend to open a support case to have a drive replaced if the GROWN list is much above 0 or is growing. Keep in mind that any type of hardware RAID should report back 0 for both to the OS.

--
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
- Technical Advisor, RPI Players
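For anyone who hasn't poked at this before, the navigation inside format(1M) looks roughly like the sketch below (menu names as described above; exact prompts and subcommands can vary between releases, and you pick the suspect disk from the list format prints first):

    format                 # then select the disk by number
    format> defect         # enter the defect-list menu
    defect> primary        # extract the manufacturer's (factory) defect list
    defect> grown          # extract the grown defect list added in the field
    defect> print          # display the size of the extracted working list
    defect> quit
    format> quit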
[zfs-discuss] Couple of questions about ZFS on laptops
Hello all,

I am thinking about a new laptop. I see that there are a number of higher-performance models (incidentally, they are also marketed as gamer ones) which offer two 2.5" SATA bays and an SD flash card slot. Vendors usually position the two-HDD-bay part as either "get lots of capacity with RAID0 over two HDDs" or "get some capacity and some performance by mixing one HDD with one SSD". Some vendors go as far as suggesting "highest performance" with RAID0 over two SSDs.

Now, if I were to use this for work with ZFS on an OpenSolaris-descendant OS, and I like my data enough to want it mirrored, but still want an SSD performance boost (i.e. to run VMs in real time), I seem to have a number of options:

1) Use a ZFS mirror of two SSDs - seems too pricey.
2) Use an HDD with redundant data (copies=2 or mirroring over two partitions), and an SSD for L2ARC (+ maybe ZIL) - possible unreliability if the only HDD breaks.
3) Use a ZFS mirror of two HDDs - lowest performance.
4) Use a ZFS mirror of two HDDs and an SD card for L2ARC. Perhaps add another built-in flash card with PCMCIA adapters for CF, etc.

Now, there are a couple of question points for me here. One was raised in my recent questions about CF ports in a Thumper. The general reply was that even high-performance CF cards are aimed at linear read/write patterns and may be slower than HDDs for the random access needed by an L2ARC, so flash cards may actually lower the system performance. I wonder if the same is the case with SD cards, and/or if anyone has encountered (and can advise) some CF/SD cards with good random access performance (better than HDD random IOPS). Perhaps an extra IO path can be beneficial even if random performance is on the same scale - the HDDs would have less work anyway and could perform better at their other tasks?

On another hand, how would current ZFS behave if someone ejects an L2ARC device (flash card) and replaces it with another unsuspecting card, i.e. one from a photo camera? Would ZFS automatically replace the L2ARC device and kill the photos, or would the cache be disabled with no fatal implication for the pools nor for the other card? Ultimately, when the ex-L2ARC card gets plugged back in, would ZFS automagically attach it as the cache device, or does this have to be done manually?

The second question regards single-HDD reliability: I can do ZFS mirroring over two partitions/slices, or I can configure copies=2 for the datasets. Either way I think I get protection from bad blocks of whatever nature, as long as the spindle spins. Can these two methods be considered equivalent, or is one preferred (and for what reason)?

Also, how do other list readers place and solve their preferences with their OpenSolaris-based laptops? ;)

Thanks,
//Jim Klimov
[zfs-discuss] Single-disk rpool with inconsistent checksums, import fails
Hello all,

I have an oi_148a PC with a single root disk, and since recently it fails to boot - it hangs after the copyright message whenever I use any of my GRUB menu options.

Booting with an oi_148a LiveUSB I had around since installation, I ran some zdb traversals over the rpool and some zpool import attempts. The imports fail by running the kernel out of RAM (as recently discussed on the list with Paul Kraus's problems). However, in my current case the rpool has just 11.2 GB allocated with 8.7 GB available, so almost all of it could fit in the 8 GB RAM of this computer (no more can be placed into the motherboard), and I don't believe there is so much metadata as to exhaust the RAM during an import attempt. I have also tried rollback imports with -F, but they have also failed so far.

I am not ready to copy-paste the zdb/zpool outputs here (I have to get text files off that box), but in short:

1) zdb -bsvL -e rpool-GUID showed that there are some problems:

* The deferred-free block count is not zero, although small (144 blocks amounting to 1.4 MB), and it remained at this value over several import attempts. I had removed a swap volume some time before the failure, so this might be its leftovers.

* It also output this line:
  block traversal size 11986202624 != alloc 11986203136 (unreachable 512)
  I believe this refers to the allocated data size in bytes, and that one sector (512 bytes) is deemed unreachable. Is that so fatal?

2) zdb -bsvc -e rpool-GUID showed that there are some consistency problems. Namely, five blocks had mismatching checksums. They were named "plain file" blocks with no further details (like what files they might be parts of). But I hope this means no metadata was hurt so far.

3) I've tried importing the pool in several ways (including normal and rollback mounts, read-only and -n), but so far all attempts have led to the computer hanging within a minute (vmstat 1 shows free RAM plummeting towards the zero mark). I've tried preparing the system tunables as well:

:; echo aok/W 1 | mdb -kw
:; echo zfs_recover/W 1 | mdb -kw

and sometimes adding:

:; echo zfs_vdev_max_pending/W0t5 | mdb -kw
:; echo zfs_resilver_delay/W0t0 | mdb -kw
:; echo zfs_resilver_min_time_ms/W0t2 | mdb -kw
:; echo zfs_txg_synctime/W0t1 | mdb -kw

In this case I am not very hesitant to recreate the rpool and reinstall the OS - it was mostly needed to serve the separate data pool. However, this option is not always an acceptable one, so I wonder if anything can be done to repair an inconsistent non-redundant pool - at least to make it importable again in order to evacuate some of the settings and tunings that I've made over time.

//Jim
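For reference, the import variants described above boil down to a handful of zpool invocations; the flag combination below is only a sketch of what was tried (rpool is the pool from this thread, /a is an arbitrary altroot):

    # Forced, read-only import under an alternate root, without mounting datasets:
    zpool import -f -N -o readonly=on -R /a rpool

    # Recovery ("rollback") import, discarding the last few transactions if
    # needed; run with -n first for a dry run that only reports what would be done:
    zpool import -f -F -n rpool
    zpool import -f -F rpool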
Re: [zfs-discuss] Couple of questions about ZFS on laptops
On Tue, 8 Nov 2011, Jim Klimov wrote:
> Second question regards single-HDD reliability: I can do ZFS mirroring
> over two partitions/slices, or I can configure copies=2 for the datasets.
> Either way I think I can get protection from bad blocks of whatever
> nature, as long as the spindle spins. Can these two methods be considered
> equivalent, or is one preferred (and for what reason)?

Using two partitions on the same disk seems to give you most of the headaches associated with more disks without much of the benefit. If there is any minor issue, you will see zfs resilvering partitions, and resilvering will be slow due to the drive heads flailing back and forth between partitions. There is also the issue that the block allocation is not likely to be very efficient in terms of head movement if two partitions are used.

Bob

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Single-disk rpool with inconsistent checksums, import fails
2011-11-08 22:30, Jim Klimov wrote:
> Hello all, I have an oi_148a PC with a single root disk, and since
> recently it fails to boot - hangs after the copyright message whenever
> I use any of my GRUB menu options.

Thanks to my wife's sister, who is my hands and eyes near the problematic PC, here's some ZDB output from this rpool:

# zpool import
  pool: rpool
    id: 17995958177810353692
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:
        rpool       ONLINE
          c4t1d0s0  ONLINE

So here it is - a single-device rpool. There are some on-disk errors, so some of the zdb walks fail:

root@openindiana:~# time zdb -bb -e 17995958177810353692
Traversing all blocks to verify nothing leaked ...
Assertion failed: ss->ss_start <= start (0x79e22600 <= 0x79e1dc00),
  file ../../../uts/common/fs/zfs/space_map.c, line 173
Abort (core dumped)
real    0m12.184s
user    0m0.367s
sys     0m0.474s

root@openindiana:~# time zdb -bsvc -e 17995958177810353692
Traversing all blocks to verify checksums and verify nothing leaked ...
Assertion failed: ss->ss_start <= start (0x79e22600 <= 0x79e1dc00),
  file ../../../uts/common/fs/zfs/space_map.c, line 173
Abort (core dumped)
real    0m12.019s
user    0m0.360s
sys     0m0.458s

However, -bsvL and -bsvcL (with checksum checks) do finish; results of the latter, more complete test are listed below:

root@openindiana:~# time zdb -bsvcL -e 17995958177810353692
Traversing all blocks to verify checksums ...
zdb_blkptr_cb: Got error 50 reading 182, 19177, 0, 1
  DVA[0]=0:a8c8e600:2 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous
  unique single size=2L/2P birth=82L/82P fill=1
  cksum=3401f5fe522b:109ee10ba48ed38c:e7f49c220f7b8bc:ff405ef051b91e65 -- skipping
zdb_blkptr_cb: Got error 50 reading 182, 19202, 0, 1
  DVA[0]=0:a9030a00:2 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous
  unique single size=2L/2P birth=82L/82P fill=1
  cksum=11c4c738b0ba:7bb81bce3313913:8f85a7abf1b9e34:58e8746d63119393 -- skipping
zdb_blkptr_cb: Got error 50 reading 182, 24924, 0, 0
  DVA[0]=0:b1aaec00:14a00 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous
  unique single size=14a00L/14a00P birth=85L/85P fill=1
  cksum=270679cd905d:6119a969a134566:6f0f7da64c4d2d90:3ab86aa985abef02 -- skipping
zdb_blkptr_cb: Got error 50 reading 182, 24944, 0, 0
  DVA[0]=0:b1cdf000:10800 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous
  unique single size=10800L/10800P birth=85L/85P fill=1
  cksum=1ebb4d1ae9f5:3cf5f42afa9a332:757613fc2d2de7b3:5f197017333a4f89 -- skipping
zdb_blkptr_cb: Got error 50 reading 493, 947, 0, 165
  DVA[0]=0:b3efc200:2 [L0 ZFS plain file] fletcher4 uncompressed LE contiguous
  unique single size=2L/2P birth=26691L/26691P fill=1
  cksum=2cdc2ae22d10:b33d31bcbc0d8da:f1571c9975e151b0:a037073594569635 -- skipping

Error counts:
        errno  count
           50      5

block traversal size 11986202624 != alloc 11986203136 (unreachable 512)

        bp count:          405927
        bp logical:    15030449664    avg: 37027
        bp physical:   12995855872    avg: 32015    compression: 1.16
        bp allocated:  13172434944    avg: 32450    compression: 1.14
        bp deduped:     1186232320    ref>1: 12767  deduplication: 1.09
        SPA allocated: 11986203136    used: 56.17%

Blocks  LSIZE   PSIZE   ASIZE    avg    comp   %Total  Type
     -      -       -       -      -       -        -  unallocated
     2    32K      4K   12.0K  6.00K    8.00     0.00  object directory
     3  1.50K   1.50K   4.50K  1.50K    1.00     0.00  object array
     1    16K   1.50K   4.50K  4.50K   10.67     0.00  packed nvlist
     -      -       -       -      -       -        -  packed nvlist size
   197  24.2M   1.87M   5.61M  29.2K   12.92     0.04  bpobj
     -      -       -       -      -       -        -  bpobj header
     -      -       -       -      -       -        -  SPA space map header
 1.27K  6.79M   3.25M    9.8M  7.70K    2.09     0.08  SPA space map
     8   144K    144K    144K  18.0K    1.00     0.00  ZIL intent log
 26.6K   426M   91.1M    182M  6.86K    4.67     1.45  DMU dnode
    75   150K   39.0K   80.0K  1.07K    3.85     0.00  DMU objset
     -      -       -       -      -       -        -  DSL directory
    23  12.0K   11.5K   34.5K  1.50K    1.04     0.00  DSL directory child map
    21  11.5K   10.5K   31.5K  1.50K    1.10     0.00  DSL dataset snap map
    49   707K   79.5K    239K  4.87K    8.89     0.00  DSL props
     -      -       -       -      -       -        -  DSL dataset
     -      -       -       -      -       -        -  ZFS znode
     -      -       -       -      -       -        -  ZFS V0 ACL
  321K  12.0G   10.5G   10.5G  33.4K    1.14    85.46  ZFS plain file
 26.8K  41.5M   19.1M   38.2M  1.42K
Re: [zfs-discuss] Couple of questions about ZFS on laptops
2011-11-08 23:36, Bob Friesenhahn wrote:
> Using two partitions on the same disk seems to give you most of the
> headaches associated with more disks without much of the benefit. If
> there is any minor issue, you will see zfs resilvering partitions and
> resilvering will be slow due to the drive heads flailing back and forth
> between partitions. There is also the issue that the block allocation is
> not likely to be very efficient in terms of head movement if two
> partitions are used.

Thanks, Bob, I figured so... And would copies=2 save me from problems of data loss and/or inefficient resilvering? Does all required data and metadata get duplicated this way, so that any broken sector can be amended?

I read on this list recently that some metadata is already copies=2 or =3. To what extent? Should the trunk of the ZFS block tree be expected to always be secured, even on one disk?

Thanks,
//Jim Klimov
Re: [zfs-discuss] Couple of questions about ZFS on laptops
On Wed, 9 Nov 2011, Jim Klimov wrote:
> And would copies=2 save me from problems of data loss and/or inefficient
> resilvering? Does all required data and metadata get duplicated this way,
> so that any broken sector can be amended? I read on this list recently
> that some metadata is already copies=2 or =3. To what extent? Should the
> trunk of the ZFS block tree be expected to always be secured, even on
> one disk?

With only one disk partition in a vdev, there will be no resilvering, since there is nothing to resilver. Metadata is always stored with at least two copies. It is always possible to lose the whole pool if the device does not work according to specification (or you drop the laptop on the ground).

Using copies=2 and doing a scrub at least once after bulk data has been written should help avoid media read errors. ZFS will still resilver blocks which failed to read, as long as there is a redundant copy.

If you do want to increase reliability then you should mirror between disks, even if you feel that this will be slow. It will still be faster (for reads) than using just one disk.

Bob

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
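To make that advice concrete, the two pieces look like this (the dataset and pool names are placeholders, and copies=2 only applies to blocks written after the property is set):

    # Keep two copies of user data on a single-disk pool:
    zfs set copies=2 rpool/export/home

    # Scrub after bulk writes so latent media errors are found while a
    # redundant copy still exists, then check the result:
    zpool scrub rpool
    zpool status -v rpool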
Re: [zfs-discuss] Solaris Based Systems Lock Up - Possibly ZFS/memory related?
Hi All,

On Wed, Nov 2, 2011 at 5:24 PM, Lachlan Mulcahy lmulc...@marinsoftware.com wrote:
> Now trying another suggestion sent to me by a direct poster:
>
>   * Recommendation from Sun (Oracle) to work around a bug:
>   * 6958068 - Nehalem deeper C-states cause erratic scheduling behavior
>   set idle_cpu_prefer_mwait = 0
>   set idle_cpu_no_deep_c = 1
>
> Was apparently the cause of a similar symptom for them and we are using
> Nehalem. At this point I'm running out of options, so it can't hurt to
> try it.
>
> So far the system has been running without any lock ups since very late
> Monday evening -- we're now almost 48 hours on. So far so good, but it's
> hard to be certain this is the solution, since I could never prove it was
> the root cause. For now I'm just continuing to test and build confidence
> level. More time will make me more confident. Maybe a week or so.

We're now over a week running with C-states disabled and have not experienced any further system lock ups. I am feeling much more confident in this system now -- it will probably see at least another week or two in addition to more load/QA testing and then be pushed into production.

Will update if I see the issue crop up again, but for anyone else experiencing a similar symptom, I'd highly recommend trying this as a solution. So far it seems to have worked for us.

Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA
AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office: +1 (415) 671 6080
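If anyone else wants to try the same workaround, the lines quoted above go into /etc/system verbatim and take effect after a reboot; one way to confirm the running kernel picked them up is to read the variables back with mdb (this is just a sketch using the variable names from the workaround):

    # /etc/system entries (reboot required):
    set idle_cpu_prefer_mwait = 0
    set idle_cpu_no_deep_c = 1

    # Read the live values back from the running kernel:
    echo "idle_cpu_prefer_mwait/D" | mdb -k
    echo "idle_cpu_no_deep_c/D" | mdb -k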
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Hello all,

A couple of months ago I wrote up some ideas about clustered ZFS with shared storage, but the idea was generally disregarded as not something to be done in the near term due to technological difficulties.

Recently I stumbled upon a Nexenta+Supermicro report [1] about a "cluster-in-a-box" with shared storage, boasting an active-active cluster with transparent failover. Now, I am not certain how these two phrases fit in the same sentence, and maybe it is some marketing-people mixup, but I have a couple of options:

1) The shared storage (all 16 disks are accessible to both motherboards) is split into two ZFS pools, each mounted by one node normally. If a node fails, the other imports the pool and continues serving it.

2) All disks are aggregated into one pool, and one node serves it while the other is in hot standby.

Ideas (1) and (2) may possibly contradict the claim that the failover is seamless and transparent to clients. A pool import usually takes some time, maybe long if fixups are needed, and TCP sessions are likely to get broken. Still, maybe the clusterware solves this...

3) Nexenta did implement a shared ZFS pool with both nodes accessing all of the data instantly and cleanly. Can this be true? ;)

If this is not a deeply-kept trade secret, can the Nexenta people elaborate in technical terms how this cluster works?

[1] http://www.nexenta.com/corp/sbb?gclid=CIzBg-aEqKwCFUK9zAodCSscsA

Thanks,
//Jim Klimov
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Wed, Nov 09, 2011 at 03:52:49AM +0400, Jim Klimov wrote:
> Recently I stumbled upon a Nexenta+Supermicro report [1] about
> cluster-in-a-box with shared storage boasting an active-active cluster
> with transparent failover. Now, I am not certain how these two phrases
> fit in the same sentence, and maybe it is some marketing-people mixup,

One way they can not be in conflict is if each host normally owns 8 disks and is active with them, and standby for the other 8 disks. Not sure if this is what the solution in question is doing, just saying.

--
Dan.
Re: [zfs-discuss] zfs sync=disabled property
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Evaldas Auryla
> I'm trying to evaluate what are the risks of running NFS share of zfs
> dataset with sync=disabled property. The clients are vmware hosts in our
> environment and server is SunFire X4540 Thor system.

I know a lot of people will say "don't do it", but that's only a partial truth. The real truth is: at all times, if there's a server crash, ZFS will come back along at next boot or mount, and the filesystem will be in a consistent state that was indeed a valid state which the filesystem actually passed through at some moment in time. So as long as all the applications you're running can accept the possibility of going back in time as much as 30 sec following an ungraceful ZFS crash, then it's safe to disable the ZIL (set sync=disabled).

In your case, you have VMs inside the ZFS filesystem. In the event ZFS crashes ungracefully, you don't want the VM disks to go back in time while the VMs themselves are unaware anything like that happened. If you run with sync=disabled, you want to ensure your ZFS / NFS server doesn't come back up automatically. If ZFS crashes, you want to force the guest VMs to crash: force power down the VMs, then bring up NFS, remount NFS, and reboot the guest VMs. All the guest VMs will have gone back in time by as much as 30 sec.

This is generally acceptable for things like web servers, file servers, and Windows VMs in a virtualized desktop environment, etc. It's also acceptable for things running databases, as long as all the DB clients can go back in time (reboot them, whatever). It is NOT acceptable if you're processing credit card transactions, or if you're running a mailserver and you're unwilling to silently drop any messages, or... stuff like that.

Long story short, if you're willing to allow your server and all of the dependent clients to go back in time as much as 30 seconds, and you're willing/able to reboot everything that depends on it, then you can accept sync=disabled.

That's a lot of thinking, and a lot of faith or uncertainty. And in your case, it's kind of inconvenient, needing to manually start your NFS share every time you reboot your ZFS server. The safer/easier thing to do is add dedicated log devices to the server instead. It's not as fast as running with the ZIL disabled, but it's much faster than running without a dedicated log. When choosing a log device, focus on FAST. You really don't care about size; even 4G is usually all you need.
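For completeness, both routes are single commands per dataset or pool; the names below are placeholders for the Thor's actual pool and the VMware dataset, so treat this as a sketch:

    # Route 1: accept the rollback window and disable synchronous semantics
    # for the NFS-shared dataset only:
    zfs set sync=disabled tank/vmware
    zfs get sync tank/vmware        # confirm the setting

    # Route 2 (safer): add a small, fast dedicated log device to the pool:
    zpool add tank log c3t0d0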
Re: [zfs-discuss] Couple of questions about ZFS on laptops
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
> 1) Use a ZFS mirror of two SSDs - seems too pricey
> 2) Use a HDD with redundant data (copies=2 or mirroring over two
> partitions), and an SSD for L2ARC (+maybe ZIL) - possible unreliability
> if the only HDD breaks
> 3) Use a ZFS mirror of two HDDs - lowest performance
> 4) Use a ZFS mirror of two HDDs and an SD card for L2ARC. Perhaps add
> another built-in flash card with PCMCIA adapters for CF, etc.

The performance of an SSD, flash drive, or SD card is almost entirely dependent on the robustness/versatility of the built-in controller circuit. You can rest assured that no SD card and no USB device is going to have performance even remotely close to a decent SSD, except under the conditions that are specifically optimized for that device. The manufacturers, of course, will publish their maximum specs, and the real-world usage of the device might be an order of magnitude lower.

A little while back, I performed an experiment: I went out and bought the best-rated, most expensive USB3 flash drives I could find, and I benchmarked them against the cheapest USB2 hard drives I could find. The hard drives won by a clear margin, like 4x to 8x faster, except when running large sequential dd to/from the raw flash device on the first boot - in which case the flash won by a small margin (like 10%).

Given your hardware limitations, the only way to go fast is to use an SSD, and the only way to go fast with redundancy is to use a mirror of two SSDs. If you don't go for the SSDs, then your HDDs will be the second fastest option. Do not put any SD card into the mix. It will only hurt you.

> Second question regards single-HDD reliability: I can do ZFS mirroring
> over two partitions/slices, or I can configure copies=2 for the datasets.
> Either way I think I can get protection from bad blocks of whatever
> nature, as long as the spindle spins. Can these two methods be considered
> equivalent, or is one preferred (and for what reason)?

I would opt for the copies=2 method, because it's reconfigurable if you want, and it's designed to work within a single pool, so it more closely resembles your actual usage. If you mirror across two partitions on the same disk, there may be unintended performance consequences, because nobody expected you to do that when they wrote the code.

> Also, how do other list readers place and solve their preferences with
> their OpenSolaris-based laptops? ;)

I'm sorry to say, there is no ZFS-based OS and no laptop hardware that I consider to be a reliable combination. Of course I haven't tested them all, but I don't believe in any of them because it's unintended, uncharted, untested, unsupported.

I think you'll find the best support for this subject on the openindiana mailing lists. After Oracle acquired Sun, most of the home users and laptop users left the opensolaris mailing lists in favor of the openindiana lists. The people that remain here are primarily focused on enterprise and servers.
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
This is accomplished with the Nexenta HA Cluster plugin. The plugin is written by RSF, and you can read more about it here:

http://www.high-availability.com/

You can do either option 1 or option 2 that you put forth. There is some failover time, but in the latest version of Nexenta (3.1.1) there are some additional tweaks that bring the failover time down significantly. Depending on pool configuration and load, failover can be done in under 10 seconds, based on some of my internal testing.

-Matt Breitbach
[zfs-discuss] Data distribution not even between vdevs
Hi list,

My ZFS write performance is poor and I need your help. I created a zpool with 2 raidz1 vdevs. When the space was about to be used up, I added another 2 raidz1 vdevs to extend the zpool. After some days the zpool was almost full, so I removed some old data. But now, as shown below, the first 2 raidz1 vdevs' usage is about 78% and the last 2 raidz1 vdevs' usage is about 93%.

I have this line in /etc/system:

set zfs:metaslab_df_free_pct=4

so the performance degradation will happen when vdev usage is above 90%. All my files are small files of about 150 KB each.

Now the questions are:

1. Should I balance the data between the vdevs by copying the data and then removing the data located in the last 2 vdevs?
2. Is there any method to automatically re-balance the data, or any better solution to resolve this problem?

root@nas-01:~# zpool iostat -v
                                          capacity     operations    bandwidth
pool                                    used  avail   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
datapool                                21.3T  3.93T     26     96  81.4K  2.81M
  raidz1                                4.93T  1.39T      8     28  25.7K   708K
    c3t600221900085486703B2490FB009d0      -      -      3     10   216K   119K
    c3t600221900085486703B4490FB063d0      -      -      3     10   214K   119K
    c3t6002219000852889055F4CB79C10d0      -      -      3     10   214K   119K
    c3t600221900085486703B8490FB0FFd0      -      -      3     10   215K   119K
    c3t600221900085486703BA490FB14Fd0      -      -      3     10   215K   119K
    c3t6002219000852889041C490FAFA0d0      -      -      3     10   215K   119K
    c3t600221900085486703C0490FB27Dd0      -      -      3     10   214K   119K
  raidz1                                4.64T  1.67T      8     32  24.6K   581K
    c3t600221900085486703C2490FB2BFd0      -      -      3     10   224K  98.2K
    c3t6002219000852889041F490FAFD0d0      -      -      3     10   222K  98.2K
    c3t60022190008528890428490FB0D8d0      -      -      3     10   222K  98.2K
    c3t60022190008528890422490FB02Cd0      -      -      3     10   223K  98.3K
    c3t60022190008528890425490FB07Cd0      -      -      3     10   223K  98.3K
    c3t60022190008528890434490FB24Ed0      -      -      3     10   223K  98.3K
    c3t6002219000852889043949100968d0      -      -      3     10   224K  98.2K
  raidz1                                5.88T   447G      5     17  16.0K  67.7K
    c3t6002219000852889056B4CB79D66d0      -      -      3     12   215K  12.2K
    c3t600221900085486704B94CB79F91d0      -      -      3     12   216K  12.2K
    c3t600221900085486704BB4CB79FE1d0      -      -      3     12   214K  12.2K
    c3t600221900085486704BD4CB7A035d0      -      -      3     12   215K  12.2K
    c3t600221900085486704BF4CB7A0ABd0      -      -      3     12   216K  12.2K
    c3t6002219000852889055C4CB79BB8d0      -      -      3     12   214K  12.2K
    c3t600221900085486704C14CB7A0FDd0      -      -      3     12   215K  12.2K
  raidz1                                5.88T   441G      4      1  14.9K  12.4K
    c3t6002219000852889042B490FB124d0      -      -      1      1   131K  2.33K
    c3t600221900085486704C54CB7A199d0      -      -      1      1   132K  2.33K
    c3t600221900085486704C74CB7A1D5d0      -      -      1      1   130K  2.33K
    c3t600221900085288905594CB79B64d0      -      -      1      1   133K  2.33K
    c3t600221900085288905624CB79C86d0      -      -      1      1   132K  2.34K
    c3t600221900085288905654CB79CCCd0      -      -      1      1   131K  2.34K
    c3t600221900085288905684CB79D1Ed0      -      -      1      1   132K  2.33K
  c3t6B8AC6FF837605864DC9E9F1d0              0   928G      0  16289  1.47M
--------------------------------------  -----  -----  -----  -----  -----  -----
root@nas-01:~#
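Regarding question 1, the "copy and remove" approach works because ZFS allocates new writes preferentially on the vdevs with the most free space, so rewriting old data spreads it over all four raidz1 groups. A rough, hypothetical sketch of doing that with send/receive (dataset names are made up, and you need enough free space for a second copy while it runs):

    # Snapshot the unbalanced dataset and rewrite it into a new one:
    zfs snapshot datapool/data@rebalance
    zfs send datapool/data@rebalance | zfs receive datapool/data.new

    # After verifying the copy, retire the old dataset and rename the new one:
    zfs destroy -r datapool/data
    zfs rename datapool/data.new datapool/data

As far as I know there is no built-in automatic re-balancer; the distribution only evens out as old blocks are freed and new data is written.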
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Wed, Nov 09, 2011 at 11:09:45AM +1100, Daniel Carosone wrote:
> One way they can not be in conflict, is if each host normally owns 8
> disks and is active with it, and standby for the other 8 disks.

Which, now that I reread it more carefully, is your case 1. Sorry for the noise.

--
Dan.
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
To some people, "active-active" means all cluster members serve the same filesystems. To others, "active-active" means all cluster members serve some filesystems and can ultimately serve all filesystems by taking over from failed cluster members.

Nico
--