Re: [zfs-discuss] Optimal raidz3 configuration
From: David Magda [mailto:dma...@ee.ryerson.ca] On Wed, October 13, 2010 21:26, Edward Ned Harvey wrote: I highly endorse mirrors for nearly all purposes. Are you a member of BAARF? http://www.miracleas.com/BAARF/BAARF2.html Never heard of it. I don't quite get it ... They want people to stop talking about pros/cons of various types of raid? That's definitely not me. I think there are lots of pros/cons, and many of them have nuances, and vary by implementation... I think it's important to keep talking about it, and all us experts in the field can keep current on all this ... Take, for example, the number of people discussing things in this mailing list, who say they still use hardware raid. That alone demonstrates misinformation (in most cases) and warrants more discussion. ;-) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Toby Thain I don't want to heat up the discussion about ZFS managed discs vs. HW raids, but if RAID5/6 would be that bad, no one would use it anymore. It is. And there's no reason not to point it out. The world has

Well, neither one of the above statements is really fair. The truth is: raid5/6 are generally not that bad. Data integrity failures are not terribly common (maybe one bit per year out of 20 large disks, or something like that). And in order to reach the conclusion that nobody would use it, the people using it would have to first *notice* the failure. Which they don't. That's kind of the point. Since I started using ZFS in production, about a year ago, on three servers totaling approx 1.5TB used, I have had precisely one checksum error, which ZFS corrected. I have every reason to believe that if that had been on a raid5/6, the error would have gone undetected and nobody would have noticed.
Re: [zfs-discuss] Performance issues with iSCSI under Linux
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Ian D ok... we're making progress. After swapping the LSI HBA for a Dell H800 the issue disappeared. Now, I'd rather not use those controllers because they don't have a JBOD mode. We have no choice but to make individual RAID0 volumes for each disk, which means we need to reboot the server every time we replace a failed drive. That's not good...

I believe those are rebranded LSI controllers. I know the PERC controllers are. I use MegaCLI on Perc systems for this purpose. You should be able to find a utility which allows you to do this sort of thing while the OS is running. If you happen to find that MegaCLI is the right tool for your hardware, let me know, and I'll paste a few commands here, which will simplify your life. When I first started using it, I found it terribly cumbersome. But now I've gotten used to it, and MegaCLI commands just roll off the tongue.

To summarize the issue: when we copy files from/to the JBODs connected to that HBA using NFS/iSCSI, we get slow transfer rates (20MB/s) and a 1-2 second pause between each file. When we do the same experiment locally, using the external drives as a local volume (no NFS/iSCSI involved), it goes upward of 350MB/sec with no delay between files. Baffling.
Re: [zfs-discuss] Performance issues with iSCSI under Linux [SEC=UNCLASSIFIED]
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Wilkinson, Alex can you paste them anyway ?

Note: If you have more than one adapter, I believe you can specify -aALL in the commands below, instead of -a0. I have 2 disks (slots 4 and 5) that are removable and rotate offsite for backups.

To remove disks safely:
  zpool export removable-pool
  EnclosureID=`MegaCli -PDList -a0 | grep 'Enclosure Device ID' | uniq | sed 's/.* //'`
  for DriveNum in 4 5 ; do MegaCli -PDOffline PhysDrv[${EnclosureID}:${DriveNum}] -a0 ; done
The disks blink alternating orange/green. Safe to remove.

To insert disks safely: insert the disks, then:
  MegaCli -CfgForeign -Clear -a0
  MegaCli -CfgEachDskRaid0 -a0
  devfsadm -Cv
  zpool import -a

To clear foreign config off drives:
  MegaCli -CfgForeign -Clear -a0

To create a one-disk raid0 for each disk that's not currently part of another group:
  MegaCli -CfgEachDskRaid0 -a0

To configure all logical drives WriteThrough:
  MegaCli -LDSetProp WT -Lall -aALL

To configure all logical drives WriteBack:
  MegaCli -LDSetProp WB -Lall -aALL
Re: [zfs-discuss] adding new disks and setting up a raidz2
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Derek G Nokes r...@dnokes.homeip.net:~# zpool create marketData raidz2 c0t5000C5001A6B9C5Ed0 c0t5000C5001A81E100d0 c0t5000C500268C0576d0 c0t5000C500268C5414d0 c0t5000C500268CFA6Bd0 c0t5000C500268D0821d0 cannot label 'c0t5000C500268CFA6Bd0': try using fdisk(1M) and then provide a specific slice Any idea what this means?

I think it means there is something pre-existing on that drive. Maybe ZFS related, maybe not. You should probably double-check everything, to make sure there's no valuable data on that device... And then either zero the drive the long way via dd, or use your raid controller to initialize the device, which will virtually zero it the short way. In some cases you have no choice, and you need to do it the long way:

  time dd if=/dev/zero of=/dev/rdsk/c0t5000C500268CFA6Bd0 bs=1024k
[zfs-discuss] Running on Dell hardware?
I have a Dell R710 which has been flaky for some time. It crashes about once per week. I have literally replaced every piece of hardware in it, and reinstalled Sol 10u9 fresh and clean. I am wondering if other people out there are using Dell hardware, with what degree of success, and in what configuration? The failure seems to be related to the perc 6i. For some period around the time of crash, the system still responds to ping, and anything currently in memory or running from remote storage continues to function fine. But new processes that require the local storage ... Such as inbound ssh etc, or even physical login at the console ... those are all hosed. And eventually the system stops responding to ping. As soon as the problem starts, the only recourse is power cycle. I can't seem to reproduce the problem reliably, but it does happen regularly. Yesterday it happened several times in one day, but sometimes it will go 2 weeks without a problem. Again, just wondering what other people are using, and experiencing. To see if any more clues can be found to identify the cause.
Re: [zfs-discuss] Running on Dell hardware?
From: Markus Kovero [mailto:markus.kov...@nebula.fi] Sent: Wednesday, October 13, 2010 10:43 AM Hi, we've been running opensolaris on Dell R710 with mixed results, some work better than others and we've been struggling with the same issue as you are with the latest servers. I suspect some kind of power-saving issue gone wrong; system disks go to sleep and never wake up, or something similar. Personally, I cannot recommend using them with solaris; support is not even close to what it should be.

How consistent are your problems? If you change something and things get better or worse, will you be able to notice? Right now, I think I have improved matters by changing the Perc to WriteThrough instead of WriteBack. Yesterday the system crashed several times before I changed that, and afterward, I can't get it to crash at all. But as I said before ... Sometimes the system goes 2 weeks without a problem. Do you have all your disks configured as individual disks? Do you have any SSD? WriteBack or WriteThrough?
Re: [zfs-discuss] Running on Dell hardware?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Steve Radich, BitShop, Inc. Do you have dedup on? Remove large files, zfs destroy a snapshot or a zvol, and you'll see hangs like you are describing.

Thank you, but no. I'm running Sol 10u9, which does not have dedup yet, because dedup is not yet considered stable, for reasons like you mentioned. I will admit, when dedup is available in Sol 11, I do want it. ;-)
Re: [zfs-discuss] Running on Dell hardware?
From: edmud...@mail.bounceswoosh.org [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama Out of curiosity, did you run into this: http://blogs.everycity.co.uk/alasdair/2010/06/broadcom-nics-dropping-out-on-solaris-10/

I personally haven't had the broadcom problem. When my system crashes, surprisingly, it continues responding to ping, answers on port 22 (but you can't ssh in), and if there are any cron jobs that run from NFS, they're able to continue. For some period of time, anyway; eventually the whole thing crashes.
Re: [zfs-discuss] Running on Dell hardware?
Dell R710 ... Solaris 10u9 ... With stability problems ... Notice that I have several CPUs whose current_cstate is higher than the supported_max_cstate. Logically, that sounds like a bad thing. But I can't seem to find documentation that defines the meaning of supported_max_cstates, to verify that this is a bad thing. I'm looking for other people out there ... with and without problems ... to try this too, and see if a current_cstate higher than the supported_max_cstate might be a simple indicator of system instability.

kstat | grep current_cstate ; kstat | grep supported_max_cstate
  current_cstate (16 CPUs):        1 3 3 3 1 3 3 3 0 3 3 3 1 3 3 3
  supported_max_cstates (16 CPUs): 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Re: [zfs-discuss] Running on Dell hardware?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey Dell R710 ... Solaris 10u9 ... With stability problems ... Notice that I have several CPUs whose current_cstate is higher than the supported_max_cstate.

One more data point: Sun x4275 ... Solaris 10u6 fully updated (equivalent of 10u9??) ... No problems ... There are no current_cstates higher than supported_max_cstates.

kstat | grep current_cstate ; kstat | grep supported_max_cstate
  current_cstate (16 CPUs):        2 1 1 1 1 1 0 1 1 1 1 1 1 1 2 2
  supported_max_cstates (16 CPUs): 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Re: [zfs-discuss] Running on Dell hardware?
From: Henrik Johansen [mailto:hen...@scannet.dk] The 10g models are stable - especially the R905's are real workhorses.

You would generally consider all your machines stable now? Can you easily pdsh to all those machines?

kstat | grep current_cstate ; kstat | grep supported_max_cstates

I'd really love to see whether a current_cstate higher than supported_max_cstates is an accurate indicator of system instability. So far, the two data points I have support this theory.
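For anyone collecting data points in bulk, here's a small sketch that compares the two kstat lists automatically. It assumes kstat reports the counters in matching CPU order, as in the listings earlier in the thread; the sample input at the bottom is made up for illustration.

```shell
# Sketch: flag CPUs whose current_cstate exceeds supported_max_cstates.
# Assumes kstat emits the counters in matching CPU order, as in the
# listings above. For live use:  kstat | check_cstates
check_cstates() {
  awk '/current_cstate/        { cur[nc++] = $2 }
       /supported_max_cstates/ { max[nm++] = $2 }
       END {
         bad = 0
         for (i = 0; i < nc && i < nm; i++) if (cur[i] > max[i]) bad++
         print bad " of " nc " CPUs above supported max cstate"
       }'
}

# Demo on a made-up two-CPU sample:
printf 'current_cstate\t3\ncurrent_cstate\t1\nsupported_max_cstates\t2\nsupported_max_cstates\t2\n' | check_cstates
# -> 1 of 2 CPUs above supported max cstate
```

Pipe the real `kstat` output through it on each machine (e.g. via pdsh) and compare against whether the machine has been stable.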
Re: [zfs-discuss] zfs diff cannot stat shares
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of dirk schelfhout Wanted to test the zfs diff command and ran into this. What's zfs diff? I know it's been requested, but AFAIK, not implemented yet. Is that new feature being developed now or something?
Re: [zfs-discuss] Optimal raidz3 configuration
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Peter Taps If I have 20 disks to build a raidz3 pool, do I create one big raidz vdev or do I create multiple raidz3 vdevs? Is there any advantage of having multiple raidz3 vdevs in a single pool?

Whatever you do, *don't* configure one huge raidz3. Consider either 3 vdevs of 7-disk raidz1, or 3 vdevs of 7-disk raidz2, or something along those lines. Perhaps 3 vdevs of 6-disk raidz1, plus two hot spares. raidzN takes a really long time to resilver (the code is written inefficiently; it's a known problem). If you had a huge raidz3, it would literally never finish, because it couldn't resilver as fast as new data appears. A week later you'd destroy and rebuild your whole pool.

If you can afford mirrors, your risk is much lower. Because although it's physically possible for 2 disks to fail simultaneously and ruin the pool, the probability of that happening is smaller than the probability of 3 simultaneous disk failures on the raidz3, due to the smaller resilver window. I highly endorse mirrors for nearly all purposes.
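To make the capacity side of the tradeoff concrete, here's a back-of-envelope comparison for those 20 disks, counting only data disks (parity and hot spares excluded). The layouts are the illustrative ones from the discussion; real usable space also depends on disk sizes and overhead.

```shell
# Back-of-envelope usable capacity (in data disks) for a few 20-disk layouts.
awk 'BEGIN {
  printf "1 x 20-disk raidz3:            %d data disks\n", 20 - 3
  printf "3 x 6-disk raidz1 + 2 spares:  %d data disks\n", 3 * (6 - 1)
  printf "10 x 2-way mirrors:            %d data disks\n", 20 / 2
}'
```

So mirrors cost the most capacity; the argument above is that they buy that back in resilver speed and lower risk during the resilver window.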
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Stephan Budach You are implying that the issues resulted from the H/W raid(s) and I don't think that this is appropriate.

Please quote originals when you reply. If you don't, it's easy to follow the thread on the web forum, but not in email. So if you don't quote, you'll be losing a lot of the people following the thread.

I think it's entirely appropriate to imply that your problem this time stems from hardware. I'll say it outright: You have a hardware problem. Because if there is a repeatable checksum failure (bad disk), then if anything can find it, scrub can. And scrub is the best way to find it. If you have a nonrepeatable checksum failure (such as you have), then there is only one possibility: You are experiencing a hardware problem.

One possibility is that there's a failing disk in your hardware raid set, and your hardware raid controller is unable to detect it, because hardware raid doesn't do checksumming. Sometimes ZFS reads the device and gets an error. Sometimes the hardware raid controller reads the other side of the mirror, and there is no error. This is not the only possibility. There could be some other piece of hardware yielding your intermittent checksum errors. But there's one absolute conclusion: Your intermittent checksum errors are caused by hardware.

If scrub didn't find an error, then there was no error at the time of the scrub. If scrub didn't find an error, and then something else *did* find an error, it means one of two things: (a) the error only occurred after the scrub, or (b) the hardware raid controller, or some other piece of hardware, didn't produce corrupted data during the scrub, but will produce corrupted data at some other time.
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Stephan Budach c3t211378AC0253d0 ONLINE 0 0 0

How many disks are there inside of c3t211378AC0253d0? How are they configured? Hardware raid 5? A mirror of two hardware raid 5's? The point is: This device, as seen by ZFS, is not a pure storage device. It is a high-level device representing some LUN or something, which is configured and controlled by hardware raid.

If there's zero redundancy in that device, then scrub would probably find the checksum errors consistently and repeatably. If there's some redundancy in that device, then all bets are off. Sometimes scrub might read the good half of the data, and other times, the bad half. But then again, the error might not be in the physical disks themselves. The error might be somewhere in the raid controller(s) or the interconnect. Or even some weird unsupported driver or something.
Re: [zfs-discuss] Finding corrupted files
From: Stephan Budach [mailto:stephan.bud...@jvm.de] I now also got what you meant by good half but I don't dare to say whether or not this is also the case in a raid6 setup.

The same concept applies to raid5 or raid6. When you read the device, you never know if you're actually reading the data or the parity; in fact, they're mixed together in order to fully utilize all the hardware available. (Assuming you have some decently smart hardware.)

But all of that is mostly irrelevant. One fact remains: You have checksum errors. There is only one cause for checksum errors: hardware failure. It may be the physical disks themselves, or the raid card, or RAM, or CPU, or any of the interconnect in between. I suppose it could be a driver problem, but that's less likely.
Re: [zfs-discuss] Multiple SLOG devices per pool
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Ray Van Dolson I have a pool with a single SLOG device rated at Y iops. If I add a second (non-mirrored) SLOG device also rated at Y iops will my zpool now theoretically be able to handle 2Y iops? Or close to that?

Yes. But we're specifically talking about sync mode writes. Not async, and not read. And we're not measuring an actual number of IOPS, because of aggregation and so forth. But I don't think that's what you were asking. I don't think you are trying to quantify the number of IOPS. I think you're trying to confirm the qualitative characteristic, "If I have N slogs, I will write N times faster than a single slog." And that's a simple answer: Yes.
Re: [zfs-discuss] Moving camp, lock stock and barrel
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Harry Putnam beep beep beep beep beep beep I'm kind of having a brain freeze about this: So what are the standard tests or cmds to run to collect enough data to try to make a determination of what the problem is? Definitely hardware. To diagnose hardware, no standard test. Start replacing hardware. You'll know you fixed it when the problem stops.
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of David Dyer-Bennet I must say that this concept of scrub running w/o error when corrupted files, detectable to zfs send, apparently exist, is very disturbing.

As previously mentioned, the OP is using a hardware raid system. It is impossible for ZFS to read both sides of the mirror, which means it's pure chance. The hardware raid may fetch data from a bad disk one time, and fetch good data from another disk the next time. Or vice versa. You should always configure JBOD and allow ZFS to manage the raid. Don't do it in hardware; the OP of this thread is soundly demonstrating why.
[zfs-discuss] ZFS equivalent of inotify
Is there a ZFS equivalent (or alternative) of inotify? You have some thing, which wants to be notified whenever a specific file or directory changes. For example, a live sync application of some kind...
Re: [zfs-discuss] [RFC] Backup solution
From: Peter Jeremy [mailto:peter.jer...@alcatel-lucent.com] Sent: Thursday, October 07, 2010 10:02 PM On 2010-Oct-08 09:07:34 +0800, Edward Ned Harvey sh...@nedharvey.com wrote: If you're going raidz3, with 7 disks, then you might as well just make mirrors instead, and eliminate the slow resilver. There is a difference in reliability: raidzN means _any_ N disks can fail, whereas mirror means one disk in each mirror pair can fail. With a mirror, Murphy's Law says that the second disk to fail will be the pair of the first disk :-).

Maybe. But in reality, you're just guessing the probability of a single failure, the probability of multiple failures, and the probability of multiple failures within the critical time window and critical redundancy set. The probability of a 2nd failure within the critical time window is smaller whenever the critical time window is decreased, and the probability of that failure being within the critical redundancy set is smaller whenever your critical redundancy set is smaller. So if raidz2 takes twice as long to resilver as a mirror, and has a larger critical redundancy set, then you haven't gained any probable resiliency over a mirror. Although it's true that with mirrors it's possible for 2 disks to fail and result in loss of the pool, I think the probability of that happening is smaller than the probability of a 3-disk failure in the raidz2, due to the smaller resilver window. How much longer does a 7-disk raidz2 take to resilver, compared to a mirror? According to my calculations, it's in the vicinity of 10x longer.
Re: [zfs-discuss] ZFS equivalent of inotify
From: cas...@holland.sun.com [mailto:cas...@holland.sun.com] On Behalf Of casper@sun.com Is there a ZFS equivalent (or alternative) of inotify? Have you looked at port_associate and ilk? port_associate looks promising. But google is less than useful on ilk. Got any pointers, or additional search terms to narrow the context? Thanks...
Re: [zfs-discuss] [RFC] Backup solution
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk In addition to this comes another aspect. What if one drive fails and you find bad data on another in the same VDEV while resilvering? This is quite common these days, and for mirrors, that will mean data loss unless you mirror 3-way or more, which will be rather costly.

Like the resilver, scrub goes faster with mirrors. Scrub regularly.
Re: [zfs-discuss] Performance issues with iSCSI under Linux
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Ian D the help to community can provide. We're running the latest version of Nexenta on a pretty powerful machine (4x Xeon 7550, 256GB RAM, 12x 100GB Samsung SSDs for the cache, 50GB Samsung SSD for the ZIL, 10GbE on a dedicated switch, 11x pairs of 15K HDDs for the pool). We're

If you have a single SSD for dedicated log, that will surely be a bottleneck for you. All sync writes (which are all writes in the case of iscsi) will hit the log device before the main pool. But you should still be able to read fast... Also, with so much cache ram, it wouldn't surprise me a lot to see really low disk usage just because it's already cached. But that doesn't explain the ridiculously slow performance...

I'll suggest trying something completely different, just to verify there isn't something horribly wrong with your hardware (network):

  dd if=/dev/zero bs=1024k | pv | ssh othermachine 'cat > /dev/null'

In linux, run ifconfig ... You should see errors:0 Make sure each machine has an entry for the other in the hosts file. I haven't seen that cause a problem for iscsi, but certainly for ssh.
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Stephan Budach I conducted a couple of tests, where I configured my raids as jbods and mapped each drive out as a separate LUN, and I couldn't notice a difference in performance in any way.

Not sure if my original points were communicated clearly. Giving JBODs to ZFS is not for the sake of performance. The reason for JBOD is reliability. Hardware raid cannot detect or correct checksum errors. ZFS can. So it's better to skip the hardware raid and use JBOD, to give ZFS access to each separate side of the redundant data.
Re: [zfs-discuss] Finding corrupted files
From: edmud...@mail.bounceswoosh.org [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama On Wed, Oct 6 at 22:04, Edward Ned Harvey wrote: * Because ZFS automatically buffers writes in ram in order to aggregate as previously mentioned, the hardware WB cache is not beneficial. There is one exception. If you are doing sync writes to spindle disks, and you don't have a dedicated log device, then the WB cache will benefit you, approx half as much as you would benefit by adding dedicated log device. The sync write sort-of by-passes the ram buffer, and that's the reason why the WB is able to do some good in the case of sync writes. All of your comments made sense except for this one. (etc)

Your point about long-term fragmentation and significant drive emptiness is well received. I never let a pool get over 90% full, for several reasons, including this one. My target is 70%, which seems to be sufficiently empty. Also, as you indicated, blocks of 128K are not sufficiently large for reordering to benefit. There's another thread here where I calculated that you need blocks approx 40MB in size in order to reduce random seek time below 1% of total operation time. So all that I said will only be relevant or accurate if, within 30sec (or 5 sec in the future), there exists at least 40M of aggregatable sequential writes.

It's really easy to measure and quantify what I was saying. Just create a pool, and benchmark it in each configuration. Results that I measured (stripe of 2 mirrors) were:
  721 IOPS without WB or slog
  2114 IOPS with WB
  2722 IOPS with WB and slog
  2927 IOPS with slog, and no WB

There's a whole spreadsheet full of results that I can't publish, but the trend of WB versus slog was clear and consistent. I will admit the above were performed on relatively new, relatively empty pools. It would be interesting to see if any of that changes, if the test is run on a system that has been in production for a long time, with real user data in it.
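Normalizing those four measurements against the no-WB/no-slog baseline makes the trend easier to see (same numbers as above, just divided out):

```shell
# Speedup of each configuration relative to the 721-IOPS baseline above.
awk 'BEGIN {
  base = 721
  printf "WB only:   %.2fx\n", 2114 / base
  printf "WB + slog: %.2fx\n", 2722 / base
  printf "slog only: %.2fx\n", 2927 / base
}'
```

The slog alone beats WB alone by a wide margin, which is the trend described.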
Re: [zfs-discuss] Swapping disks in pool to facilitate pool growth
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Kevin Walker We are running a Solaris 10 production server being used for backup services within our DC. We have 8 500GB drives in a zpool and we wish to swap them out 1 by 1 for 1TB drives. I would like to know if it is viable to add larger disks to a zfs pool to grow the pool size and then remove the smaller disks? I would assume this would degrade the pool and require it to resilver?

Because it's a raidz, yes, it will be degraded each time you remove one disk. You will not be using "attach" and "detach." You will be using "zpool replace." Because it's a raidz, each resilver time will be unnaturally long. Raidz resilver code is inefficient. Just be patient and let it finish each time before you replace the next disk. Performance during resilver will be exceptionally poor. Exceptionally. Because of the inefficient raidz resilver code, do everything within your power to reduce IO on the system during the resilver. Of particular importance: Don't create snapshots while the system is resilvering. This will exponentially increase the resilver time. (I'm exaggerating by saying exponentially, so don't take it literally. But in reality, it *is* significant.) Because you're going to be degrading your redundancy, you *really* want to ensure all the disks are good before you do any degrading. This means: don't begin your replace until after you've completed a scrub.
Re: [zfs-discuss] Finding corrupted files
From: Cindy Swearingen [mailto:cindy.swearin...@oracle.com] I would not discount the performance issue... Depending on your workload, you might find that performance increases with ZFS on your hardware RAID in JBOD mode.

Depends on the raid card you're comparing to. I've certainly seen some raid cards that were too dumb to read from 2 disks in a mirror simultaneously for the sake of read performance enhancement. And many other similar situations. But I would not say that's generally true anymore. In the last several years, all the hardware raid cards that I've bothered to test were able to utilize all the hardware available. Just like ZFS. There are performance differences... like ... the hardware raid might be able to read 15% faster in raid5, while ZFS is able to write 15% faster in raidz, and so forth. Differences that roughly balance each other out. For example, here's one data point I can share (2 mirrors striped, results normalized):

                     ZFS    HW
  8 initial writers  1.43   2.00
  8 rewriters        2.99   2.54
  8 readers          5.05   2.96
  8 re-readers       4.19   3.02
  8 reverse readers  3.59   2.80
  8 stride readers   3.93   2.90
  8 random readers   2.57   1.99
  8 random mix       2.40   1.70
  8 random writers   1.69   1.73
  average            3.09   2.40

There were some categories where ZFS was faster. Some where HW was faster. On average, ZFS was faster, but they were all in the same ballpark, and the results were highly dependent on specific details and tunables. AKA, not a place you should explore, unless you have a highly specialized use case that you wish to optimize.
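The quoted averages can be reproduced from the nine normalized per-test results above:

```shell
# Recompute the overall averages from the nine per-test results above.
awk 'BEGIN {
  split("1.43 2.99 5.05 4.19 3.59 3.93 2.57 2.40 1.69", zfs, " ")
  split("2.00 2.54 2.96 3.02 2.80 2.90 1.99 1.70 1.73", hw,  " ")
  for (i = 1; i <= 9; i++) { sz += zfs[i]; sh += hw[i] }
  printf "average: ZFS %.2f  HW %.2f\n", sz / 9, sh / 9
}'
# -> average: ZFS 3.09  HW 2.40
```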
Re: [zfs-discuss] [RFC] Backup solution
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ian Collins I would seriously consider raidz3, given I typically see 80-100 hour resilver times for 500G drives in raidz2 vdevs. If you haven't already, If you're going raidz3 with 7 disks, then you might as well just make mirrors instead, and eliminate the slow resilver. Mirrors resilver enormously faster than raidzN. At least for now, until maybe one day the raidz resilver code is rewritten.
Re: [zfs-discuss] Increase size of 2-way mirror
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tony MacDoodle Is it possible to add 2 disks to increase the size of the pool below?

        NAME        STATE     READ WRITE CKSUM
        testpool    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0

It's important that you know the difference between the add and attach methods for increasing this size... If you add another mirror, then you'll have mirror-0, mirror-1, and mirror-2. You cannot remove any of the existing devices. If you attach a larger disk to mirror-0, and possibly fiddle with the autoexpand property plus a little bit of additional futzing (pretty basic: resilver, then detach the old devices), then you can effectively replace the existing devices with larger devices. No need to consume extra disk bays. It's all a matter of which is the more desirable outcome for you.
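A sketch of the attach route for the pool above, assuming hypothetical new larger disks c1t6d0 and c1t7d0:

```shell
zpool set autoexpand=on testpool     # if the property exists on your build

# Grow mirror-0 in place, one side at a time:
zpool attach testpool c1t2d0 c1t6d0  # attach a large disk alongside c1t2d0
zpool status testpool                # wait for the resilver to complete
zpool detach testpool c1t2d0         # then drop the old small disk
# ...repeat attach/resilver/detach for c1t3d0, and for mirror-1 if desired...

# By contrast, the add route consumes two more bays permanently:
zpool add testpool mirror c1t6d0 c1t7d0
```

Note the attach route never runs without redundancy: the mirror is three-wide during each resilver.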
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Stephan Budach Ian, yes, although these vdevs are FC raids themselves, so the risk is… uhm… calculated. Whenever possible, you should always JBOD the storage and let ZFS manage the raid, for several reasons (see below). Also, as counter-intuitive as this sounds (see below), you should disable the hardware write-back cache (even with BBU), because it hurts performance in any of these situations: (a) disable WB if you have access to SSD or another nonvolatile dedicated log device; (b) disable WB if you know all of your writes to be async mode and not sync mode; (c) disable WB if you've opted to disable the ZIL. * Hardware raid blindly assumes the redundant data written to disk is written correctly. So later, if you experience a checksum error (such as you have), it's impossible for ZFS to correct it. The hardware raid doesn't know a checksum error has occurred, and there is no way for the OS to read the other side of the mirror to attempt correcting the checksum via redundant data. * ZFS has knowledge of both the filesystem and the block level devices, while hardware raid has knowledge only of block level devices. Which means ZFS is able to optimize performance in ways that hardware raid cannot. For example, whenever there are many small writes taking place concurrently, ZFS is able to remap the physical disk blocks of those writes, to aggregate them into a single sequential write. Depending on your metric, this yields 1-2 orders of magnitude higher IOPS. * Because ZFS automatically buffers writes in ram in order to aggregate as previously mentioned, the hardware WB cache is not beneficial. There is one exception: if you are doing sync writes to spindle disks, and you don't have a dedicated log device, then the WB cache will benefit you, approx half as much as you would benefit by adding a dedicated log device. 
The sync write sort-of bypasses the ram buffer, and that's the reason the WB is able to do some good in the case of sync writes. Ironically, if you have WB enabled and you have a SSD log device, then the WB hurts you. You get the best performance with a SSD log and no WB. Because the WB lies to the OS, saying some tiny chunk of data has been written... the OS will happily write another tiny chunk, and another, and another. The WB is only buffering a lot of tiny random writes, and in aggregate, it will only go as fast as the random writes. It undermines ZFS's ability to aggregate small writes into sequential writes.
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Stephan Budach Now, scrub would reveal corrupted blocks on the devices, but is there a way to identify damaged files as well? I saw a lot of people offering the same knee-jerk reaction that I had: scrub. And that is the only correct answer, to make a best effort at salvaging data. But I think there is a valid question here which was neglected. *Does* scrub produce a list of the names of all the corrupted files? And if so, how does it do that? If scrub is operating at a block level (and I think it is), then how can checksum failures be mapped to file names? For example, this is a long-requested feature of zfs send which is fundamentally difficult or impossible to implement. Zfs send operates at a block level, and there is a desire to produce a list of all the incrementally changed files in a zfs incremental send, but no capability of doing that. It seems, if scrub is able to list the names of files that correspond to corrupted blocks, then zfs send should be able to list the names of files that correspond to changed blocks, right? I am reaching the opposite conclusion of what's already been said. I think you should scrub, but don't expect file names as a result. I think if you want file names, then tar to /dev/null will be your best friend. I didn't answer anything at first, because I was hoping somebody would have that answer. I only know that I don't know, and the above is my best guess.
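A sketch of the tar-to-/dev/null approach, assuming a hypothetical filesystem mounted at /tank/fs: reading every byte of every file forces ZFS to verify every checksum on the read path, so any file whose blocks are corrupt beyond repair surfaces as a read error with its name attached:

```shell
# Read everything, discard the archive; tar prints the name of
# any file it fails to read:
tar cf /dev/null /tank/fs

# Afterward, check the pool; -v prints details of the errors that
# were found during the reads:
zpool status -v tank
```

This is brute force (it reads the whole dataset), but unlike scrub it operates through the filesystem layer, which is exactly where names live.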
Re: [zfs-discuss] When is it okay to turn off the verify option.
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Peter Taps As I understand, the hash generated by sha256 is almost guaranteed not to collide. I am thinking it is okay to turn off the verify property on the zpool. However, if there is indeed a collision, we lose data. Scrub cannot recover such lost data. I am wondering, in real life, when is it okay to turn off the verify option? I guess for storing business critical data (HR, finance, etc.), you cannot afford to turn this option off. Right on all points. It's a calculated risk. If you have a hash collision, you will lose data undetected, and backups won't save you unless *you* are the backup. That is, if the good data, before it got corrupted by your system, happens to be saved somewhere else before it reached your system.
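To put a number on "almost guaranteed not to collide": a back-of-envelope birthday bound, assuming a hypothetical pool holding 2^32 unique blocks (roughly half a petabyte at 128K blocks):

```shell
# P(any two distinct blocks share a sha256 hash) ~ n^2 / 2^257.
# Work in log2 to avoid floating-point underflow:
#   log2(p) = 2*log2(n) - 257
awk 'BEGIN {
  n_log2 = 32                    # log2 of the number of unique blocks
  p_log2 = 2 * n_log2 - 257      # log2 of the collision probability
  printf "p ~ 2^%d\n", p_log2
}'
# prints: p ~ 2^-193
```

2^-193 is so far below the error rates of the hardware itself that, mathematically, the risk is dominated by everything *except* the hash. The decision to verify anyway is about policy, not probability.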
Re: [zfs-discuss] When is it okay to turn off the verify option.
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Scott Meilicke Why do you want to turn verify off? If performance is the reason, is it significant, on and off? Under most circumstances, verify won't hurt performance. It won't hurt reads of any kind, and it won't hurt writes when you're writing unique data, or when you're writing duplicate data that is warm in the read cache. It will basically only hurt write performance if you are writing duplicate data which was not read recently. This might be the case, for example, if this machine is the target for some remote machine to back up onto. The problem doesn't exist if you're copying local data, because you first read the data (now it's warm in cache) before writing it, so the verify operation takes essentially zero time in that case.
Re: [zfs-discuss] drive speeds etc
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk

                     extended device statistics
  device  r/s    w/s   kr/s    kw/s  wait  actv  svc_t  %w  %b
  sd1     0.5  140.3    0.3  2426.3   0.0   1.0    7.2   0  14
  sd2     0.0  138.3    0.0  2476.3   0.0   1.5   10.6   0  18
  sd3     0.0  303.9    0.0  2633.8   0.0   0.4    1.3   0   7
  sd4     0.5  306.9    0.3  2555.8   0.0   0.4    1.2   0   7
  sd5     1.0  308.5    0.5  2579.7   0.0   0.3    1.0   0   7
  sd6     1.0  304.9    0.5  2352.1   0.0   0.3    1.1   1   7
  sd7     1.0  298.9    0.5  2764.5   0.0   0.6    2.0   0  13
  sd8     1.0  304.9    0.5  2400.8   0.0   0.3    0.9   0   6

Unless I'm misunderstanding this output... It looks like all disks are doing approx the same data throughput, while sd1 and sd2 are doing half the IOPS. So sd1 and sd2 must be doing larger chunks. How are these drives configured? One vdev of raidz2? No cache/log devices, etc.? It would be easy to explain if you're striping mirrors. Difficult (at least for me) to explain if you're using raidzN.
Re: [zfs-discuss] [osol-discuss] zfs send/receive?
From: Richard Elling [mailto:richard.ell...@gmail.com] It is relatively easy to find the latest, common snapshot on two file systems. Once you know the latest, common snapshot, you can send the incrementals up to the latest. I've always relied on the snapshot names matching. Is there a way to find the latest common snapshot if the names don't match?
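One name-independent approach, sketched here under the assumption that your zfs version exposes the per-snapshot guid property (which is preserved across send/receive, so matching guids identify the same snapshot whatever either side calls it). Datasets tank/fs and backup/fs are hypothetical:

```shell
# List snapshots oldest-to-newest as "guid name" on each side:
zfs list -H -t snapshot -r -s creation -o guid,name tank/fs   > /tmp/src.list
zfs list -H -t snapshot -r -s creation -o guid,name backup/fs > /tmp/dst.list

# The last source snapshot whose guid also appears on the destination
# is the latest common snapshot:
awk 'NR == FNR { have[$1] = 1; next }    # first file: destination guids
     ($1 in have) { latest = $2 }        # second file: track last match
     END { print latest }' /tmp/dst.list /tmp/src.list
```

If your build doesn't expose guid through zfs list, the creation property is a weaker fallback, but guids are the only thing that actually proves the snapshots are the same.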
Re: [zfs-discuss] Long resilver time
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jason J. W. Williams I just witnessed a resilver that took 4h for 27gb of data. Setup is 3x raid-z2 stripes with 6 disks per raid-z2. Disks are 500gb in size. No checksum errors. 27G on a 6-disk raidz2 means approx 6.75G per disk at most. Ideally, a disk could write 7G = 56 Gbit sequentially in a couple of minutes if there were no other activity in the system. So you're right to suspect something is suboptimal, but the root cause is inefficient resilvering code in zfs, specifically for raidzN. The resilver code spends a *lot* of time seeking, because it's not optimized for the disk layout. This may change some day, but not in the near future. Mirrors don't suffer the same effect; at least, if they do, it's far less dramatic. For now, all you can do is: (a) factor this into your decision to use mirrors versus raidz, (b) ensure no snapshots and minimal IO during the resilver, and (c) if you opt for raidz, keep the number of disks in a raidz to a minimum. It is preferable to use 3 vdevs each of 7-disk raidz instead of a 21-disk raidz3. Your setup of 3x raidz2 is pretty reasonable, and a 4h resilver, although slow, is successful. Which is more than you could say if you had a 21-disk raidz3.
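The arithmetic above, made explicit. The 60 MB/s sustained sequential write rate is my assumed figure for a 500gb sata disk of that era, and the 6.75 GB is the worst case of all 27 GB landing on one 6-disk raidz2 (4 data disks):

```shell
# Upper bound on data to rewrite on the replaced disk, and how long a
# purely sequential rewrite of it would take:
awk 'BEGIN {
  gb   = 27 / 4           # GB on the resilvering disk, worst case
  secs = gb * 1024 / 60   # at an assumed 60 MB/s sequential write
  printf "%.2f GB, ~%.0f seconds sequential\n", gb, secs
}'
# prints: 6.75 GB, ~115 seconds sequential
```

Two minutes of sequential writing versus four hours observed: the gap is almost entirely seek overhead paid by the raidzN resilver code.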
Re: [zfs-discuss] Dedup relationship between pool and filesystem
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Stone For de-duplication to perform well you need to be able to fit the de-dup table in memory. Is a good rule-of-thumb for needed RAM Size=(pool capacity/avg block size)*270 bytes? Or perhaps it's Size/expected_dedup_ratio? For now, the rule of thumb is 3G ram for every 1TB of unique data, including snapshots and vdevs. After a system is running, I don't know how/if you can measure current mem usage, to gauge the results of your own predictions.
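The poster's formula, evaluated for two assumed average block sizes, shows why the rule of thumb is so fuzzy: the table size scales inversely with block size. The 270 bytes/entry comes from the question; both block sizes here are illustrative assumptions, not measurements:

```shell
ddt_gb() {  # usage: ddt_gb <unique_data_tb> <avg_block_kb>
  awk -v tb="$1" -v kb="$2" 'BEGIN {
    entries = tb * 1024 * 1024 * 1024 / kb     # one DDT entry per block
    printf "%.1f GB\n", entries * 270 / (1024 ^ 3)
  }'
}
ddt_gb 1 128   # 1 TB of 128K blocks -> prints: 2.1 GB
ddt_gb 1 64    # 1 TB of 64K blocks  -> prints: 4.2 GB
```

So the same formula spans the whole "a bit over 1G" to "3G" range debated below, purely on the average-block-size assumption.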
Re: [zfs-discuss] [osol-discuss] zfs send/receive?
From: opensolaris-discuss-boun...@opensolaris.org [mailto:opensolaris-discuss-boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk I'm using a custom snapshot scheme which snapshots every hour, day, week and month, rotating 24h, 7d, 4w and so on. What would be the best way to zfs send/receive these things? I'm a little confused about how this works for delta updates... Out of curiosity, why custom? It sounds like a default config. Anyway, as long as the present destination filesystem matches a snapshot from the source system, you can incrementally send any newer snapshot. Generally speaking, you don't want to send anything that's extremely volatile, such as hourlies... because if the snap on the source disappears, then you have nothing to send incrementally from anymore. Make sense? I personally send incrementals once a day, and only send the daily incrementals.
Re: [zfs-discuss] Dedup relationship between pool and filesystem
From: Roy Sigurd Karlsbakk [mailto:r...@karlsbakk.net] For now, the rule of thumb is 3G ram for every 1TB of unique data, including snapshots and vdev's. 3 gigs? Last I checked it was a little more than 1GB, perhaps 2 if you have small files. http://opensolaris.org/jive/thread.jspa?threadID=131761 The true answer is that it varies depending on things like block size, so whether you say 1G or 3G, despite sounding like a big difference, the difference is in the noise. We're only talking rule of thumb here, based on vague (very vague) and widely variable estimates of your personal usage characteristics. It's just a rule of thumb, and slightly over 1G ~= slightly under 3G in this context. Hence the comment: after a system is running, I don't know how/if you can measure current mem usage, to gauge the results of your own predictions.
Re: [zfs-discuss] Any zfs fault injection tools?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Freddie Cash The following works well: dd if=/dev/random of=/dev/disk-node bs=1M count=1 seek=whatever If you have long enough cables, you can move a disk outside the case and run a magnet over it to cause random errors. Plugging/unplugging the SATA/SAS cable from a disk while doing normal reads/writes is also fun. Using the controller software (if a RAID controller) to delete LUNs/disks is also fun. You don't have any friends that are computers anymore, do you? ;-) The words cruel and unusual come to mind.
Re: [zfs-discuss] Dedup relationship between pool and filesystem
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Peter Taps The dedup property is set on a filesystem, not on the pool. However, the dedup ratio is reported on the pool and not on the filesystem. As with most other ZFS concepts, the core functionality of ZFS is implemented in zpool. Hence, zpool is up to what ... version 25 or so now? Think of ZFS (the posix filesystem) as just an interface which tightly integrates the zpool features. ZFS is only up to what, version 4 now? Perfect example: if you create a zvol in linux, and format it ext3/4 instead of zfs, then you can still snapshot it, and I believe you can even zfs send and receive it. And so on. The core functionality is mostly present. But if you want to access the snapshot, you have to create some mountpoint, and mount the snapshot zvol read-only on that mountpoint. It's not automatic. It's barely any better than the crappy snapshot concept linux has in LVM. If you want good automatic snapshot creation and seamless automatic mounting, then you need the ZFS filesystem on top of the zpool. Because the ZFS filesystem knows about the underlying zpool features, and makes them a convenient, easy, good experience. ;-)
Re: [zfs-discuss] ZFS checksum errors (ZFS-8000-8A)
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bob Friesenhahn It is very unusual to obtain the same number of errors (probably same errors) from two devices in a pair. This should indicate a common symptom such as a memory error (does your system have ECC?), controller glitch, or a shared power supply issue. Bob's right. I didn't notice that both sides of the mirror have precisely 56 checksum errors. Ignore what I said about adding a 3rd disk to the mirror; it won't help. The 3rd mirror would have been useful only if the block corruption on these 2 disks weren't the same blocks. I think you have to acknowledge that you have corrupt data, and you should run some memory diagnostics on your system to see if you can detect some failing memory. The cause is not necessarily memory, as Bob pointed out, but a typical way to produce the result you're seeing is: ZFS calculates a checksum of a block it's about to write to disk, and of course that checksum is stored in ram. Unfortunately, if it's stored in corrupt ram, then when it's written to disk, the checksum will mismatch. And the faulty checksum gets written to both sides of the mirror. It is discovered later during your scrub. There is no un-corrupt copy of the data that ZFS thought it wrote. At least it's detected by ZFS. Without checksumming, that error would pass undetected.
Re: [zfs-discuss] resilver = defrag?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of David Dyer-Bennet For example, if you start with an empty drive, and you write a large amount of data to it, you will have no fragmentation. (At least, no significant fragmentation; you may get a little bit based on random factors.) As life goes on, as long as you keep plenty of empty space on the drive, there's never any reason for anything to become significantly fragmented. Sure, if only a single thread is ever writing to the disk store at a time. This has already been discussed in this thread. The threading model doesn't affect the outcome of files being fragmented or unfragmented on disk. The OS is smart enough to know that these blocks written by process A are all sequential, and those blocks written by process B are also sequential, but separate.
Re: [zfs-discuss] resilver = defrag?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Marty Scholes What appears to be missing from this discussion is any shred of scientific evidence that fragmentation is good or bad, and by how much. We also lack any detail on how much fragmentation does take place. Agreed. I've been rather lazily asserting a few things here and there that I expected to be challenged, so I've been thinking up tests to verify/dispute my claims, but then nobody challenged. Specifically: the blocks on disk are not interleaved just because multiple threads were writing at the same time. So there's at least one thing which is testable, if anyone cares. But there's also no way that I know of to measure fragmentation in a real system that's been in production for a year.
Re: [zfs-discuss] Best practice for Sol10U9 ZIL -- mirrored or not?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bryan Horstmann-Allen The ability to remove the slogs isn't really the win here, it's import -F. The Disagree. Although I agree the -F is important and good, I think log device removal is the main win. Prior to log device removal, if you lost your slog, then you lost your whole pool, and probably your system halted (or did something equally bad, which isn't strictly halting). Therefore you wanted your slog to be as redundant as the rest of your pool. With log device removal, if you lose a slog while the system is up, the worst case is performance degradation. With log device removal, there's only one thing you have to worry about: your slog goes bad, undetected. The system keeps writing to it, unaware that it will never be able to read it back, and therefore when you get a system crash, and for the first time your system tries to read that device, you lose information. Not your whole pool: you lose up to 30 sec of writes that the system thought it wrote but never did, and you require the -F to import. Historically, people have always recommended mirroring your log device, even with log device removal, to protect against the above situation. But in a recent conversation including Neil, it seems there might be a bug which causes the log device mirror to be ignored during import, thus rendering the mirror useless in the above situation. Neil, or anyone, is there any confirmation or development on that bug? Given all of this, I would say it's recommended to forget about mirroring log devices for now. In the past, the recommendation was yes, mirror. Right now, it's no, don't mirror, and after the bug is fixed, the recommendation will again become yes, mirror.
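The operations under discussion, sketched for a hypothetical pool tank with hypothetical SSD devices c2t0d0/c2t1d0 (log device removal requires a sufficiently recent zpool version):

```shell
# Mirrored slog (the historical recommendation):
zpool add tank log mirror c2t0d0 c2t1d0

# Or an unmirrored slog, relying on log device removal as the safety net:
zpool add tank log c2t0d0

# If an unmirrored slog misbehaves while the pool is imported, it can
# simply be removed; ZFS falls back to keeping the ZIL in the main pool:
zpool remove tank c2t0d0
```

The failure window discussed above is precisely the case where removal can't help: the slog dies *and* the system crashes before anyone notices.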
Re: [zfs-discuss] resilver that never finishes
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tom Bird We recently had a long discussion on this list about resilver times versus raid types. In the end, the conclusion was: the resilver code is very inefficient for raidzN. Someday it may be better optimized, but until that day comes, you really need to break your giant raidzN into smaller vdevs. 3 vdevs of 7-disk raidz are preferable over a 21-disk raidz3. If you want this resilver to complete, you should do anything you can to (a) stop taking snapshots, (b) not scrub, and (c) stop all IO possible. And be patient. Most people in your situation find it faster to zfs send to some other storage, and then destroy and recreate the pool. I know it stinks, but that's what you're facing.
Re: [zfs-discuss] ZFS file system without pool
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ramesh Babu I would like to know if I can create a ZFS file system without a ZFS storage pool. Also I would like to know if I can create a ZFS pool on a Veritas Volume. Unless I'm mistaken, you seem to be confused, thinking zpools can only be created from physical devices. You can make zpools from files, sparse files, physical devices, remote devices, in memory ... basically any type of storage you can communicate with. You create the zpool, and the zfs filesystem optionally comes along with it. All the magic is done in the pool - snapshots, dedup, etc. The only reason you would want a zfs filesystem is because it's specifically designed to leverage the magic of a zpool natively. If it were possible to create a zfs filesystem without a zpool, you might as well just use ufs.
Re: [zfs-discuss] resilver = defrag?
From: Richard Elling [mailto:rich...@nexenta.com] Suppose you want to ensure at least 99% efficiency of the drive. At most 1% time wasted by seeking. This is practically impossible on a HDD. If you need this, use SSD. Lately, Richard, you're saying some of the craziest illogical things I've ever heard about fragmentation and/or raid. It is absolutely not difficult to avoid fragmentation on a spindle drive at the level I described. Just keep plenty of empty space on your drive, and you won't have a fragmentation problem. (Except as required by COW.) How on earth do you conclude this is practically impossible? For example, if you start with an empty drive, and you write a large amount of data to it, you will have no fragmentation. (At least, no significant fragmentation; you may get a little bit based on random factors.) As life goes on, as long as you keep plenty of empty space on the drive, there's never any reason for anything to become significantly fragmented. Again, except for COW. It is known that COW will cause fragmentation if you write randomly in the middle of a file that is protected by snapshots.
Re: [zfs-discuss] resilver = defrag?
From: Richard Elling [mailto:rich...@nexenta.com] It is practically impossible to keep a drive from seeking. It is also The first time somebody (Richard) said you can't prevent a drive from seeking, I just decided to ignore it. But then it was said twice (Ian). I don't get why anybody is saying this; did anybody claim drives don't seek? I said you can quantify how much fragmentation is acceptable, given drive speed characteristics and a percentage of time you consider acceptable for seeking. I suggested acceptable was 99% efficiency and 1% time wasted seeking. Roughly calculated, I came up with 40 MB of sequential data per random seek to yield 99% efficiency. For some situations, that's entirely possible and likely to be the norm. For other cases, it may be unrealistic, and you may suffer badly from fragmentation. Is there some point we're talking about here? I don't get why the conversation has taken such a tangent.
Re: [zfs-discuss] resilver = defrag?
From: Haudy Kazemi [mailto:kaze0...@umn.edu] With regard to multiuser systems and how that negates the need to defragment, I think that is only partially true. As long as the files are defragmented enough so that each particular read request only requires one seek before it is time to service the next read request, further defragmentation may offer only marginal benefit. On the other Here's a great way to quantify how much fragmentation is acceptable: suppose you want to ensure at least 99% efficiency of the drive. At most 1% time wasted by seeking. Suppose you're talking about 7200rpm sata drives, which sustain 500Mbit/s transfer and have an average seek time of 8ms. 8ms is 1% of 800ms. In 800ms, the drive could read 400 Mbit of sequential data. That's 50 MB (call it 40-50 MB). So as long as the fragment sizes of your files are approx 40 MB or larger, fragmentation has a negligible effect on performance. One seek per every 40MB read/written will yield less than 1% performance impact. For the heck of it, let's see how that would have computed with 15krpm SAS drives: sustained transfer 1Gbit/s, and average seek 3.5ms. 3.5ms is 1% of 350ms. In 350ms, the drive could read 350 Mbit (call it 43MB). That's certainly in the same ballpark.
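The same arithmetic as code, using the drive figures assumed above (7200rpm SATA: 500 Mbit/s sustained, 8 ms average seek; 15k SAS: 1 Gbit/s, 3.5 ms). Note the SATA case comes out at 50 MB rather than 40, the same ballpark:

```shell
run_mb() {  # usage: run_mb <avg_seek_ms> <sustained_mbit_per_s>
  awk -v seek="$1" -v rate="$2" 'BEGIN {
    window_ms = seek * 100             # window in which one seek costs 1%
    mbit = rate * window_ms / 1000     # sequential data moved in that window
    printf "%.2f MB per seek\n", mbit / 8
  }'
}
run_mb 8 500      # 7200rpm SATA -> prints: 50.00 MB per seek
run_mb 3.5 1000   # 15k SAS      -> prints: 43.75 MB per seek
```

Faster seeks and faster sustained transfer roughly cancel, which is why both drive classes land near the same ~40-50 MB fragment-size threshold.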
Re: [zfs-discuss] resilver = defrag?
From: Richard Elling [mailto:rich...@nexenta.com] With appropriate write caching and grouping or re-ordering of writes algorithms, it should be possible to minimize the amount of file interleaving and fragmentation on write that takes place. To some degree, ZFS already does this. The dynamic block sizing tries to ensure that a file is written into the largest block[1] Yes, but the block sizes in question are typically at most 128K. As computed in my email a minute ago, the fragment size needs to be on the order of 40 MB in order to effectively eliminate the performance loss of fragmentation. Also, ZFS has an intelligent prefetch algorithm that can hide some performance aspects of fragmentation on HDDs. Unfortunately, prefetch can only hide fragmentation on systems that have idle disk time. Prefetch isn't going to help you if you actually need to transfer a whole file as fast as possible.
Re: [zfs-discuss] dedicated ZIL/L2ARC
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Wolfraider We are looking into the possibility of adding dedicated ZIL and/or L2ARC devices to our pool. We are looking into getting 4 – 32GB Intel X25-E SSD drives. Would this be a good solution to slow write speeds? If you have slow write speeds, a dedicated log device might help. (Log devices are for writes, not for reads.) It sounds like your machine is an iscsi target, in which case you're certainly doing a lot of sync writes, and therefore hitting your ZIL hard. So it's all but certain that adding dedicated log devices will help. One thing to be aware of: once you add a dedicated log, *all* of your sync writes will hit that log device. While a single SSD or pair of SSD's will have fast IOPS, they can easily become a new bottleneck with worse performance than what you had before ... If you've got 80 spindle disks now, and by any chance you perform sequential sync writes, then a single pair of SSD's won't compete. I'd suggest adding several SSD's for log devices, and no mirroring. Perhaps one SSD for every raidz2 vdev, or every other, or every third, depending on what you can afford. If you have slow reads, an l2arc cache might help. (Cache devices are for reads, not writes.) We are currently sharing out different slices of the pool to windows servers using comstar and fibrechannel. We are currently getting around 300MB/sec performance with 70-100% disk busy. You may be facing some other problem, aside from just lacking cache/log devices. I suggest giving us some more detail here. Such as ... Large sequential operations are good on raidz2, but random IO performs pretty poorly on raidz2. What sort of network are you using? I know you said comstar and fibrechannel, and sharing slices to windows ... I assume this means you're doing iscsi, right? Dual 4Gbit links per server? You're getting 2.4 Gbit and you expect what? 
You have a pool made up of 18 raidz2 vdevs with 5 drives each (capacity of 3 disks each) ... Is each vdev on its own bus? What type of bus is it? (Generally speaking, it is preferable to spread vdevs across buses, instead of putting 1 vdev on 1 bus, for reliability purposes.) ... How many disks, of what type, on each bus? What type of bus, at what speed? What are the usage characteristics, and how are you making your measurement?
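The several-unmirrored-slogs suggestion, sketched with hypothetical device names; when a pool has multiple log devices, ZFS spreads log writes across them, so more devices means more aggregate slog throughput:

```shell
# Four X25-E's as four independent (unmirrored) log devices:
zpool add tank log c3t0d0 c3t1d0 c3t2d0 c3t3d0

# On zpool versions with log device removal, any of them can later be
# pulled back out if it misbehaves or is needed elsewhere:
zpool remove tank c3t3d0
```

An L2ARC device is added similarly but with the cache keyword (zpool add tank cache <dev>); cache devices are always safe to lose, so they never need mirroring.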
Re: [zfs-discuss] resilver = defrag?
From: Richard Elling [mailto:rich...@nexenta.com] This operational definition of fragmentation comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, and multi-threaded environments. I don't know what you're saying, but I'm quite sure I disagree with it. Regardless of multithreading or multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model. Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks.
Re: [zfs-discuss] resilver = defrag?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Orvar Korvar I was thinking to delete all zfs snapshots before zfs send receive to another new zpool. Then everything would be defragmented, I thought. You don't need to delete snaps before zfs send, if your goal is to defragment your filesystem. Just perform a single zfs send, and don't do any incrementals afterward. The receiving filesystem will lay out the filesystem as it wishes. (I assume snapshots work this way: I snapshot once and do some changes, say delete file A and edit file B. When I delete the snapshot, the file A is still deleted and file B is still edited. In other words, deletion of a snapshot does not revert back the changes.) You are correct. A snapshot is a read-only image of the filesystem, as it was, at some time in the past. If you destroy the snapshot, you've only destroyed the snapshot. You haven't destroyed the most recent live version of the filesystem. If you wanted to, you could rollback, which destroys the live version of the filesystem, and restores you back to some snapshot. But that is a very different operation. Rollback is not at all similar to destroying a snapshot. These two operations are basically opposites of each other. All of this is discussed in the man pages. I suggest man zpool and man zfs. Everything you need to know is written there.
Re: [zfs-discuss] resilver = defrag?
From: Richard Elling [mailto:rich...@nexenta.com] Regardless of multithreading, multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model. Possible, yes. Probable, no. Consider that a file system is allocating space for multiple, concurrent file writers. Process A is writing. Suppose it starts writing at block 10,000 out of my 1,000,000 block device. Process B is also writing. Suppose it starts writing at block 50,000. These two processes write simultaneously, and no fragmentation occurs, unless Process A writes more than 40,000 blocks. In that case, A's file gets fragmented, and the 2nd fragment might begin at block 300,000. The concept which causes fragmentation (not counting COW) is the size of the span of unallocated blocks. Most filesystems will allocate blocks from the largest unallocated contiguous area of the physical device, so as to minimize fragmentation. I can't say authoritatively how ZFS behaves, but I'd be extremely surprised if two processes writing different files as fast as possible ended up with all their blocks interleaved with each other on physical disk. I think this is possible if you have multiple processes lazily writing at less-than-full speed, because then ZFS might remap a bunch of small writes into a single contiguous write. Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks. RAID works against the efforts to gain performance by contiguous access because the access becomes non-contiguous. These might as well have been words randomly selected from the dictionary to me - I recognize that it's a complete sentence, but you might as well have said processors aren't needed in computers anymore, or something equally illogical. Suppose you have a 3-disk raid stripe set, using traditional simple striping, because it's very easy to explain.
Suppose a process is writing as fast as it can, and suppose it's going to write block 0 through block 99 of a virtual device.
virtual block 0 = block 0 of disk 0
virtual block 1 = block 0 of disk 1
virtual block 2 = block 0 of disk 2
virtual block 3 = block 1 of disk 0
virtual block 4 = block 1 of disk 1
virtual block 5 = block 1 of disk 2
virtual block 6 = block 2 of disk 0
virtual block 7 = block 2 of disk 1
virtual block 8 = block 2 of disk 2
virtual block 9 = block 3 of disk 0
...
virtual block 96 = block 32 of disk 0
virtual block 97 = block 32 of disk 1
virtual block 98 = block 32 of disk 2
virtual block 99 = block 33 of disk 0
Thanks to buffering and command queueing, the OS tells the RAID controller to write blocks 0-8, and the raid controller tells disk 0 to write blocks 0-2, tells disk 1 to write blocks 0-2, and tells disk 2 to write blocks 0-2, simultaneously. So the total throughput is the sum of all 3 disks writing continuously and contiguously to sequential blocks. This accelerates performance for continuous sequential writes. It does not work against efforts to gain performance by contiguous access. The same concept is true for raid-5 or raidz, but it's more complicated. The filesystem or raid controller does in fact know how to write sequential filesystem blocks to sequential physical blocks on the physical devices, for the sake of performance enhancement on contiguous read/write. If you don't believe me, there's a very easy test to prove it: create a zpool with 1 disk in it, and time writing 100G (or some amount of data larger than RAM). Create a zpool with several disks in a raidz set, and time writing 100G. The speed scales up linearly with the number of disks, until you reach some other hardware bottleneck, such as bus speed or something like that.
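The block map above is easy to state as arithmetic. Here is a minimal sketch (plain Python, not any real controller's code): virtual block i lands on disk i mod N, at physical block i div N.

```python
# Simple striping across ndisks: a sketch of the block map listed above.
# Not any real RAID implementation -- just the arithmetic.

def stripe_map(vblock, ndisks=3):
    """Map a virtual block number to (disk, physical block)."""
    return (vblock % ndisks, vblock // ndisks)

assert stripe_map(0) == (0, 0)    # virtual block 0 = block 0 of disk 0
assert stripe_map(5) == (2, 1)    # virtual block 5 = block 1 of disk 2
assert stripe_map(99) == (0, 33)  # virtual block 99 = block 33 of disk 0
```

Grouping by disk shows why sequential throughput sums: virtual blocks 0-8 become three contiguous runs of physical blocks 0-2, one run per disk, all written simultaneously.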
Re: [zfs-discuss] zfs compression with Oracle - anyone implemented?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Hi! I'd been scouring the forums and web for admins/users who deployed zfs with compression enabled on Oracle backed by storage array luns. Any problems with cpu/memory overhead? I don't think your question is clear. What do you mean by "oracle backed by storage luns"? Do you mean on oracle hardware? Do you mean you plan to run an oracle database on the server, with ZFS under the database? Generally speaking, you can enable compression on any zfs filesystem, and the cpu overhead is not very big, and the compression level is not very strong by default. However, if the data you have is generally incompressible, any overhead is a waste.
Re: [zfs-discuss] resilver = defrag?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Orvar Korvar I am not really worried about fragmentation. I was just wondering if I attach new drives and zfs send receive to a new zpool, would count as defrag. But apparently, not. "Apparently not in all situations" would be more appropriate. The understanding I had was: If you do a single zfs send | receive, then it does effectively get defragmented, because the receiving filesystem is going to re-layout the received filesystem, and there is nothing pre-existing to make the receiving filesystem dance around... But if you're sending some initial, plus incrementals, then you're actually repeating the same operations that probably caused the original filesystem to become fragmented in the first place. And in fact, it seems unavoidable... Suppose you have a large file, which is all sequential on disk. You make a snapshot of it, which means all the individual blocks must not be overwritten. And then you overwrite a few bytes scattered randomly in the middle of the file. The nature of copy on write is such that, of course, it is impossible for the latest version of the file to remain contiguous. Your only choices are: to read and rewrite copies of the whole file, including multiple copies of what didn't change, or to leave the existing data in place where it is on disk, and instead write your new random bytes to other non-contiguous locations on disk. Hence fragmentation.
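The copy-on-write dilemma described above can be sketched with a toy block map (hypothetical, not ZFS's actual allocator): once a snapshot pins the original blocks, an overwrite must land somewhere else, so the live file stops being contiguous.

```python
# Toy copy-on-write model: a snapshot pins the original physical blocks,
# so an overwrite of one logical block goes to a new location.

def cow_overwrite(block_map, index, new_physical):
    """Redirect logical block `index` to a newly allocated physical block;
    the old physical block stays in place for the snapshot."""
    updated = list(block_map)
    updated[index] = new_physical
    return updated

live = list(range(10))               # file laid out on physical blocks 0..9
snap = list(live)                    # snapshot references the same blocks
live = cow_overwrite(live, 5, 100)   # overwrite a few bytes mid-file

assert snap == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]    # snapshot: still contiguous
assert live == [0, 1, 2, 3, 4, 100, 6, 7, 8, 9]  # live file: now fragmented
```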
Re: [zfs-discuss] Suggested RaidZ configuration...
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Freddie Cash No, it (21-disk raidz3 vdev) most certainly will not resilver in the same amount of time. In fact, I highly doubt it would resilver at all. My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE multilane controllers. Nice 10 TB storage pool. Worked beautifully as we filled it with data. Had less than 50% usage when a disk died. No problem, it's ZFS, it's meant to be easy to replace a drive, just offline, swap, replace, wait for it to resilver. Well, 3 days later, it was still under 10%, and every disk light was still solid green. SNMP showed over 100 MB/s of disk I/O continuously ... I don't believe your situation is typical. I think you either encountered a bug, or you had something happening that you weren't aware of (scrub, autosnapshots, etc) ... because the only time I've ever seen anything remotely similar to the behavior you described was the bug I've mentioned in other emails, which occurs when the disk is 100% full and a scrub is taking place. I know it's not the same bug for you, because you said your pool was only 50% full. But I don't believe that what you saw was normal or typical.
Re: [zfs-discuss] Suggested RaidZ configuration...
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Erik Trimble the thing that folks tend to forget is that RaidZ is IOPS limited. For the most part, if I want to reconstruct a single slab (stripe) of data, I have to issue a read to EACH disk in the vdev, and wait for that disk to return the value, before I can write the computed parity value out to the disk under reconstruction. If I'm trying to interpret your whole message, Erik, and condense it, I think I get the following. Please tell me if and where I'm wrong. In any given zpool, some number of slabs are used in the whole pool. In raidzN, a portion of each slab is written on each disk. Therefore, during resilver, if there are a total of 1 million slabs used in the zpool, it means each good disk will need to read 1 million partial slabs, and the replaced disk will need to write 1 million partial slabs. Each good disk receives a read request in parallel, and all of them must complete before a write is given to the new disk. Each read/write cycle is completed before the next cycle begins. (It seems this could be accelerated by allowing all the good disks to continue reading in parallel instead of waiting, right?) The conclusion I would reach is: Given no bus bottleneck: It is true that resilvering a raidz will be slower with many disks in the vdev, because the average latency for the worst of N disks will increase as N increases. But that effect is only marginal, and bounded between the average latency of a single disk, and the worst case latency of a single disk. The characteristic that *really* makes a big difference is the number of slabs in the pool. i.e. if your filesystem is composed of mostly small files or fragments, versus mostly large unfragmented files.
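The latency claim in that last paragraph can be checked with a quick simulation (the numbers are invented: per-disk latency uniform between 5 ms and 15 ms). The expected worst-of-N latency grows with N, but it is bounded by a single disk's worst case, so the growth is marginal.

```python
# Estimate the average read/write cycle time when each cycle must wait
# for the slowest of N disks. Latencies are a made-up uniform(5, 15) ms.
import random

random.seed(1)

def cycle_time(ndisks, trials=10_000):
    """Mean of max-of-N latencies over many simulated cycles, in ms."""
    return sum(max(random.uniform(5, 15) for _ in range(ndisks))
               for _ in range(trials)) / trials

few, many = cycle_time(3), cycle_time(12)
assert few < many < 15    # grows with N, but bounded by the 15 ms worst case
assert many - few < 5     # ...and the growth is marginal
```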
Re: [zfs-discuss] Suggested RaidZ configuration...
From: Hatish Narotam [mailto:hat...@gmail.com] PCI-E 8X 4-port ESata Raid Controller. 4 x ESata to 5Sata Port multipliers (each connected to a ESata port on the controller). 20 x Samsung 1TB HDD's. (each connected to a Port Multiplier). Assuming your disks can all sustain 500Mbit/sec, which I find to be typical for 7200rpm sata disks, and you have groups of 5 that all have a 3Gbit upstream bottleneck, it means each of your groups of 5 should be fine in a raidz1 configuration. You think that your sata card can do 32Gbit because it's on a PCIe x8 bus. I highly doubt it unless you paid a grand or two for your sata controller, but please prove me wrong. ;-) I think the backplane of the sata controller is more likely either 3G or 6G. If it's 3G, then you should use 4 groups of raidz1. If it's 6G, then you can use 2 groups of raidz2 (because 10 drives of 500Mbit can only sustain 5Gbit). If it's 12G or higher, then you can make all of your drives one big vdev of raidz3. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives gives you 10Gbps. I guarantee you this is not a sustainable speed for 7.2krpm sata disks. You can get a decent measure of sustainable speed by doing something like:
(write 1G byte)
time dd if=/dev/zero of=/some/file bs=1024k count=1024
(beware: you might get an inaccurate speed measurement here due to ram buffering. See below.)
(reboot to ensure nothing is in cache)
(read 1G byte)
time dd if=/some/file of=/dev/null bs=1024k
(Now you're certain you have a good measurement. If it matches the measurement you had before, that means your original measurement was also accurate. ;-) )
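The arithmetic behind those recommendations, as a quick sketch using the thread's numbers (500 Mbit/s sustained per disk is my assumption; 250 MB/s is Samsung's claimed maximum):

```python
# Aggregate bandwidth a group of disks can demand from its uplink.

def group_demand_gbit(per_disk_mbit, ndisks):
    """Total demand of `ndisks` disks, each at per_disk_mbit, in Gbit/s."""
    return per_disk_mbit * ndisks / 1000.0

# 5 disks at a realistic ~500 Mbit/s each fit under a 3 Gbit/s link...
assert group_demand_gbit(500, 5) == 2.5

# ...but at Samsung's claimed 250 MB/s (2000 Mbit/s), the same 5 disks
# would need 10 Gbit/s, far beyond a 3 Gbit/s port-multiplier uplink.
assert group_demand_gbit(2000, 5) == 10.0
```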
Re: [zfs-discuss] Suggested RaidZ configuration...
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey The characteristic that *really* makes a big difference is the number of slabs in the pool. i.e. if your filesystem is composed of mostly small files or fragments, versus mostly large unfragmented files. Oh, if at least some of my reasoning was correct, there is one valuable take-away point for hatish: Given some number X total slabs used in the whole pool. If you use a single vdev for the whole pool, you will have X partial slabs written on each disk. If you have 2 vdev's, you'll have approx X/2 partial slabs written on each disk. 3 vdevs ~ X/3 partial slabs on each disk. Therefore, the resilver time approximately divides by the number of separate vdev's you are using in your pool. So the largest factor affecting resilver time of a single large vdev versus many smaller vdev's is NOT the quantity of data written on each disk, but just the fact that fewer slabs are used on each disk when using smaller vdev's. If you want to choose between (a) a 21-disk raidz3 versus (b) 3 vdevs of 7-disk raidz1 each, then: The raidz3 provides better redundancy, but has the disadvantage that every slab must be partially written on every disk.
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bob Friesenhahn There should be little doubt that NetApp's goal was to make money by suing Sun. Nexenta does not have enough income/assets to make a risky lawsuit worthwhile. But in all likelihood, Apple still won't touch ZFS. Apple would be worth suing. A big fat juicy... One interesting take-away point, however: Oracle is now in a solid position to negotiate with Apple. If Apple wants to pay for ZFS and indemnification against a netapp lawsuit, Oracle can grant it.
Re: [zfs-discuss] resilver = defrag?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Orvar Korvar A) Resilver = Defrag. True/false? I think everyone will agree false on this question. However, more detail may be appropriate. See below. B) If I buy larger drives and resilver, does defrag happen? Scores so far: 2 No, 1 Yes. C) Does zfs send | zfs receive mean it will defrag? Scores so far: 1 No, 2 Yes. ... Does anybody here know what they're talking about? I'd feel good if perhaps Erik ... or Neil ... perhaps ... answered the question with actual knowledge. Thanks...
Re: [zfs-discuss] Suggested RaidZ configuration...
From: Haudy Kazemi [mailto:kaze0...@umn.edu] There is another optimization in the Best Practices Guide that says the number of devices in a vdev should be (N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 8. I.e. 2^n + P where n is 1, 2, or 3 and P is the RAIDZ level. I.e. optimal sizes:
RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev
This sounds logical, although I don't know how real it is. The logic seems to be ... Assuming slab sizes of 128K, the amount of data written to each disk within the vdev gets divided into something which is a multiple of 512b or 4K (newer drives supposedly starting to use 4K block sizes instead of 512b). But I have doubts about the real-ness here, because ... An awful lot of times, your actual slabs are smaller than 128K just because you're not performing sustained sequential writes very often. But it seems to make sense, whenever you *do* have some sequential writes, you would want the data written to each disk to be a multiple of 512b or 4K. If you had a 128K slab, divided into 5, then each disk would write 25.6K, and even for sustained sequential writes, some degree of fragmentation would be impossible to avoid. Actually, I don't think fragmentation is technically the correct term for that behavior. It might be more appropriate to simply say it forces a less-than-100% duty cycle. And another thing ... Doesn't the checksum take up some space anyway? Even if you obeyed the BPG and used ... let's say ... 4 disks for N ... then each disk has 32K of data to write, which is a multiple of 4K and 512b ... but each disk also needs to write the checksum. So each disk writes 32K + a few bytes. Which defeats the whole purpose anyway, doesn't it? The effect, if real at all, might be negligible. I don't know how small it is, but I'm quite certain it's not huge.
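The quoted rule reduces to simple arithmetic (a sketch only; whether the alignment effect matters in practice is exactly what's being questioned above):

```python
# The BPG rule quoted above: vdev width = 2**n + parity.

def optimal_widths(parity, max_n=3):
    """Vdev sizes the rule considers optimal for a given raidz level."""
    return [2**n + parity for n in range(1, max_n + 1)]

assert optimal_widths(1) == [3, 5, 9]    # raidz1
assert optimal_widths(2) == [4, 6, 10]   # raidz2
assert optimal_widths(3) == [5, 7, 11]   # raidz3

# Why it might matter: a 128K slab split across the data disks should
# come out as a multiple of the 512-byte (or 4K) sector size.
assert 128 * 1024 / 4 == 32768.0   # 4 data disks -> 32K each, aligned
assert 128 * 1024 / 5 == 26214.4   # 5 data disks -> 25.6K each, unaligned
```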
Re: [zfs-discuss] Suggested RaidZ configuration...
From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of Mattias Pantzare It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2 vdevs you have to read half the data compared to 1 vdev to resilver a disk. Let's suppose you have 1T of data. You have a 12-disk raidz2. So you have approx 100G on each disk, and you replace one disk. Then 11 disks will each read 100G, and the new disk will write 100G. Let's suppose you have 1T of data. You have 2 vdev's that are each 6-disk raidz1. Then we'll estimate 500G is on each vdev, so each disk has approx 100G. You replace a disk. Then 5 disks will each read 100G, and 1 disk will write 100G. Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdev's each containing a 7-disk raidz1. In my personal experience, approx 5 disks can max out approx 1 bus. (It actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks on a good bus, or good disks on a crap bus, but generally speaking people don't do that. Generally people get a good bus for good disks, and a cheap bus for cheap disks, so approx 5 disks max out approx 1 bus.) In my personal experience, servers are generally built with a separate bus for approx every 5-7 disk slots. So what it really comes down to is ... Instead of the Best Practices Guide saying "Don't put more than ___ disks into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck by constructing your vdev's using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses."
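A sketch of that arithmetic, with the thread's round numbers (parity counted the way the examples above count it):

```python
# Per-disk resilver traffic for a pool: each vdev's share of the data,
# spread across that vdev's data (non-parity) disks.

def per_disk_gb(total_gb, disks_per_vdev, parity, nvdevs):
    return (total_gb / nvdevs) / (disks_per_vdev - parity)

# 1T of data: one 12-disk raidz2 vs. two 6-disk raidz1 vdevs.
assert per_disk_gb(1000, 12, 2, 1) == 100.0  # ~100G on each of 12 disks
assert per_disk_gb(1000, 6, 1, 2) == 100.0   # ~100G on each disk per vdev

# Equal per-disk traffic -> equal resilver time, absent a bus bottleneck.
```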
Re: [zfs-discuss] Solaris 10u9
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of David Magda The 9/10 Update appears to have been released. Some of the more noticeable ZFS stuff that made it in: More at: http://docs.sun.com/app/docs/doc/821-1840/gijtg Awesome! Thank you. :-) Log device removal in particular, I feel is very important. (Got bit by that one.) Now when is dedup going to be ready? ;-)
Re: [zfs-discuss] Suggested RaidZ configuration...
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of hatish I have just read the Best Practices guide, and it says your group shouldn't have 9 disks. I think the value you can take from this is: Why does the BPG say that? What is the reasoning behind it? Anything that is a rule of thumb either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb and dismiss it as myth).
Re: [zfs-discuss] Suggested RaidZ configuration...
On Tue, Sep 7, 2010 at 4:59 PM, Edward Ned Harvey sh...@nedharvey.com wrote: I think the value you can take from this is: Why does the BPG say that? What is the reasoning behind it? Anything that is a rule of thumb either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb, dismiss it as myth.) Let's examine the myth that you should limit the number of drives in a vdev because of resilver time. The myth goes something like this: You shouldn't use more than ___ drives in a vdev raidz_ configuration, because all the drives need to read during a resilver, so the more drives are present, the longer the resilver time. The truth of the matter is: Only the size of used data is read. Because this is ZFS, it's smarter than a hardware solution which would have to read all disks in their entirety. In ZFS, if you have a 6-disk raidz1 with capacity of 5 disks, and a total of 50G of data, then each disk has roughly 10G of data in it. During resilver, 5 disks will each read 10G of data, and 10G of data will be written to the new disk. If you have a 11-disk raidz1 with capacity of 10 disks, then each disk has roughly 5G of data. 10 disks will each read 5G of data, and 5G of data will be written to the new disk. If anything, more disks means a faster resilver, because you're more easily able to saturate the bus, and you have a smaller amount of data that needs to be written to the replaced disk. Let's examine the myth that you should limit the number of disks for the sake of redundancy. It is true that a carefully crafted system can survive things like SCSI controller or tray failure. Suppose you have 3 scsi cards. Suppose you construct a raidz2 device using 2 disks from controller 0, 2 disks from controller 1, and 2 disks from controller 2. Then if a controller dies, you have only lost 2 disks, and you are degraded but still functional as long as you don't lose another disk. 
But you said you have 20 disks all connected to a single controller. So none of that matters in your case. Personally, I can't imagine any good reason to generalize don't use more than ___ devices in a vdev. To me, a 12-disk raidz2 is just as likely to fail as a 6-disk raidz1. But a 12-disk raidz2 is slightly more reliable than having two 6-disk raidz1's. Perhaps, maybe, a 64bit processor is able to calculate parity on an 8-disk raidz set in a single operation, but requires additional operations to calculate parity if your raidz has 9 or more disks in it ... But I am highly skeptical of this line of reasoning, and AFAIK, nobody has ever suggested this before me. I made it up just now. I'm grasping at straws and stretching my imagination to find *any* merit in the statement, don't use more than ___ disks in a vdev. I see no reasoning behind it, and unless somebody can say anything to support it, I think it's bunk.
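The resilver arithmetic from earlier in this message, as a quick sketch: with a fixed amount of data, a wider raidz leaves *less* data per disk, not more.

```python
# Rough data stored per disk: total data spread over the data disks
# (vdev width minus parity disks).

def gb_per_disk(total_gb, ndisks, parity):
    return total_gb / (ndisks - parity)

assert gb_per_disk(50, 6, 1) == 10.0   # 6-disk raidz1: ~10G on each disk
assert gb_per_disk(50, 11, 1) == 5.0   # 11-disk raidz1: ~5G on each disk
```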
Re: [zfs-discuss] zpool question
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of bear [b]Short Version[/b] I used zpool add instead of zpool replace while trying to move drives from an si3124 controller card. I can back up the data to other drives and destroy the pool, but would prefer not to since it involved around 4 TB of data and will take forever. [b]zpool add mypool c4t2d0[/b] instead of [b]zpool replace mypool c2t1d0 c4t2d0[/b] Yeah ... Unfortunately, you cannot remove a vdev from a pool once it's been added. So ... Temporarily, in order to get c4t2d0 back into your control for other purposes, you could create a sparse file somewhere, and replace this device with the sparse file. This should be very fast, and should not hurt performance, as long as you haven't written any significant amount of data to the pool since adding that device, and won't be writing anything significant until after all is said and done. Don't create the sparse file inside the pool. Create the sparse file somewhere in rpool, so you don't have a gridlock mount order problem. Rather than replacing each device one-by-one, I might suggest creating a new raidz2 on the new hardware, and then use zfs send | zfs receive to replicate the contents of the first raid set to the 2nd raid set... Then, just destroy (or export, or unmount) the first raid set, while changing the mountpoint of the 2nd raid set. (And export/import or unmount/mount.) Since you have data that's mostly not changing, the send/receive method should be extremely efficient. You do one send/receive, and you don't even have to follow up with any incrementals later...
Re: [zfs-discuss] zfs set readonly=on does not entirely go into read-only mode
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ian Collins However writes to already opened files are allowed. Think of this from the perspective of an application. How would write failure be reported? Both very good points. But I agree with Robert. write() has a known failure mode when the disk is full. I agree bad things can happen to applications that attempt write() when the disk is full ... however ... Only a user with root privs is able to set the readonly property. I expect the root user is doing this for a reason: willing and able to take responsibility for the consequences. The intuitive (generally expected) thing, when you're root and you make a filesystem readonly, is that it becomes readonly. If that is not the behavior ... Well, I can think of at least one really specific, important example problem. Suppose an application writes to a file infinitely. Fills up the filesystem. This is a known bad thing for ZFS, sometimes causing unrecoverable infinite IO and forcing a power-cycle (I don't have a bug # but see here: http://opensolaris.org/jive/thread.jspa?threadID=132383&tstart=0 ) ... If you find yourself in the infinite IO, would-be-forced to power cycle situation, the workaround is to reduce some reservation to free up space. Then you should be able to rm, destroy, and stop scrub. But if the application is still infinitely writing to the open file handle that it already owns ... then any space you can free up will just get consumed again immediately by the bad application. Another specific example ... Suppose you zfs send from a primary server to a backup server. You want the filesystems to be readonly on the backup fileserver, in order to receive incrementals. If you make a mistake, and start writing to the backup server filesystem, you want to be able to correct your mistake.
Make it readonly, stop anything from writing to it, rollback to the unmodified snapshot, so you're able to receive incrementals again. If setting readonly doesn't stop open filehandles from writing ... What can you do? You either have to flex your brain muscle to figure out some technique to find which application is performing writes (not always easy to do), or you basically have to unmount and remount the filesystem to force writes to stop, which might not be easy to do, because filehandles are in use. You might feel the need to simply reboot, instead of figuring out a way to do all this. You just complain to your colleagues and say "yeah, the stupid thing made me reboot in order to make the filesystem readonly."
Re: [zfs-discuss] zfs set readonly=on does not entirely go into read-only mode
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ian Collins so it should behave in the same way as an unmount in the presence of open files. +1 You can unmount lazy, or force, or by default, the unmount fails in the presence of open files. (I think.) So to keep everybody happy, let people do whatever they want. ;-) Setting the readonly property should fail in the presence of open files, or you can force it, which would truly sweep the rug out from under the writing processes. And if the developer(s) are feeling ambitious, implement lazy too. ;-)
Re: [zfs-discuss] zfs set readonly=on does not entirely go into read-only mode
From: Ian Collins [mailto:i...@ianshome.com] On 08/28/10 12:45 PM, Edward Ned Harvey wrote: Another specific example ... Suppose you zfs send from a primary server to a backup server. You want the filesystems to be readonly on the backup fileserver, in order to receive incrementals. If you make a mistake, and start writing to the backup server filesystem, you want to be able to correct your mistake. Make it readonly, stop anything from writing to it, rollback to the unmodified snapshot, so you're able to receive incrementals again. I think you have lost a not in there somewhere! Didn't miss any not, but it may not have been written clearly. If you *intended* to set the destination filesystem readonly before, and you only discovered it's not readonly later, evident by the fact that something wrote to it and now you can't receive incremental zfs snapshots... Then you want to correct your mistake. Whatever was writing to the backup fileserver, it shouldn't have been. So set the filesystem readonly, rollback to the latest snapshot that corresponds to the primary server, so you can again start receiving incrementals.
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
From: Neil Perrin [mailto:neil.per...@oracle.com] Hmm, I need to check, but if we get a checksum mismatch then I don't think we try other mirror(s). This is automatic for the 'main pool', but of course the ZIL code is different by necessity. This problem can of course be fixed. (It will be a week and a bit before I can report back on this, as I'm on vacation). Thanks... If indeed that is the behavior, then I would conclude:
* Call it a bug. It needs a bug fix.
* Prior to log device removal (zpool version 19), it is critical to mirror log devices.
* After the introduction of log device removal, but before this bug fix is available, it is pointless to mirror log devices.
* After this bug fix is introduced, it is again recommended to mirror slogs.
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of StorageConcepts So I would say there are 2 bugs / missing features in this: 1) the ZIL needs to report truncated transactions on ZIL corruption 2) the ZIL should use a mirrored counterpart to recover from bad block checksums Add to that: During scrubs, perform some reads on log devices (even if there's nothing to read). In fact, during scrubs, perform some reads on every device (even if it's actually empty.) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Neil Perrin This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent log we read each block and checksum it with the embedded checksum within the same block. If we can't read a block due to an IO error then that is reported, but if the checksum does not match then we assume it's the end of the intent log chain. Using this design means we the minimum number of writes to add write an intent log record is just one write. So corruption of an intent log is not going to generate any errors. I didn't know that. Very interesting. This raises another question ... It's commonly stated, that even with log device removal supported, the most common failure mode for an SSD is to blindly write without reporting any errors, and only detect that the device is failed upon read. So ... If an SSD is in this failure mode, you won't detect it? At bootup, the checksum will simply mismatch, and we'll chug along forward, having lost the data ... (nothing can prevent that) ... but we don't know that we've lost data? Worse yet ... In preparation for the above SSD failure mode, it's commonly recommended to still mirror your log device, even if you have log device removal. If you have a mirror, and the data on each half of the mirror doesn't match each other (one device failed, and the other device is good) ... Do you read the data from *both* sides of the mirror, in order to discover the corrupted log device, and correctly move forward without data loss? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
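The design Neil describes can be modeled in a few lines. This is an illustrative sketch, not ZFS code: each record carries an embedded checksum, and replay treats the first checksum mismatch as end-of-log, so a corrupted record is silently indistinguishable from the true end of the chain:

```python
# Toy model of a chained intent log with embedded per-record checksums.
# Replay walks the chain and stops at the first checksum mismatch, raising
# no error -- exactly why log corruption "is not going to generate any errors".
import zlib

def make_record(payload: bytes) -> bytes:
    # 4-byte CRC32 checksum prepended to the payload
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def replay(log: list) -> list:
    recovered = []
    for rec in log:
        cksum, payload = int.from_bytes(rec[:4], "big"), rec[4:]
        if zlib.crc32(payload) != cksum:
            break  # assumed to be the end of the chain -- no error is raised
        recovered.append(payload)
    return recovered

log = [make_record(b"write-1"), make_record(b"write-2"), make_record(b"write-3")]
log[1] = log[1][:4] + b"XXXXXXX"  # corrupt the middle record's payload

print(replay(log))  # only write-1 survives; write-2 and write-3 are silently lost
```

The trade-off is clear from the sketch: the embedded checksum lets each record be committed in a single write, at the cost of making corruption and end-of-log indistinguishable at replay time.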
Re: [zfs-discuss] ZFS Storage server hardware
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Dr. Martin Mundschenk devices attached. Unfortunately the USB and sometimes the FW devices just die, causing the whole system to stall, forcing me to do a hard reboot. Well, I wonder what components one would use to build a stable system without an enterprise solution: eSATA, USB, FireWire, FibreChannel? There is no such thing as a reliable external disk. Not unless you want to pay $1000 each, which is dumb. You have to scrap your mini, and use internal (or hotswappable) disks. Never expect a mini to be reliable. They're designed to be small and cute. Not reliable. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedup and handling corruptions - impossible?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of devsk If dedup is ON and the pool develops a corruption in a file, I can never fix it because when I try to copy the correct file on top of the corrupt file, the block hash will match with the existing blocks and only reference count will be updated. The only way to fix it is to delete all snapshots (to remove all references) and then delete the file and then copy the valid file. This is a pretty high cost if it is so (empirical evidence so far, I don't know internal details). Um ... If dedup is on, and a file develops corruption, the original has developed corruption too. It was probably corrupt before it was copied. This is what zfs checksumming and mirrors/redundancy are for. If you have ZFS, and redundancy, this won't happen. (Unless you have failing ram/cpu/etc) If you have *any* filesystem without redundancy, and this happens, you should stop trying to re-copy the file, and instead throw away your disk and restore from backup. If you run without redundancy, and without backup, you got what you asked for. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
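The mechanics behind the original question are easy to see in a toy model. This sketch is not ZFS internals, just an illustration of content-hash dedup: writing data whose hash already exists only bumps a reference count, so re-copying a "good" identical file cannot replace the single stored (possibly bad) copy of its blocks:

```python
# Toy model of block-level dedup (illustrative, not ZFS internals):
# blocks are stored by content hash. A write whose hash is already present
# only increments a reference count -- the on-disk block is not rewritten.
# This is why overwriting a corrupt file with an identical fresh copy does
# not heal anything: the incoming write dedups against the existing block.
import hashlib

store = {}    # content hash -> block data
refcnt = {}   # content hash -> reference count

def write_block(data: bytes) -> str:
    h = hashlib.sha256(data).hexdigest()
    if h in store:
        refcnt[h] += 1       # dedup hit: no new data is written
    else:
        store[h] = data
        refcnt[h] = 1
    return h

h1 = write_block(b"file contents")
h2 = write_block(b"file contents")   # "re-copying the file": just a refcount bump
print(h1 == h2, refcnt[h1])          # True 2
```

Which is exactly why the reply points at redundancy: the protection against a bad deduped block has to come from checksums plus a second physical copy, not from rewriting the data.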
Re: [zfs-discuss] dedup and handling corruptions - impossible?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of devsk What do you mean original? dedup creates only one copy of the file blocks. The file was not corrupt when it was copied 3 months ago. Please describe the problem. If you copied the file 3 months ago, and the new and old copies are both referencing the same blocks on disk thanks to dedup, and the new copy has become corrupt, then the original has also become corrupt. In the OP, you seem to imply that the original is not corrupt, but the new copy is corrupt, and you can't fix the new copy by overwriting it with a fresh copy of the original. This makes no sense. If you have ZFS, and redundancy, this won't happen. (Unless you have failing ram/cpu/etc) You are saying ZFS will detect and rectify this kind of corruption in a deduped pool automatically if enough redundancy is present? Can that fail sometimes? Under what conditions? I'm saying ZFS checksums every block on disk, read or written, and if any checksum mismatches, then ZFS automatically checks the other copy ... from the other disk in the mirror, or reconstructed from the redundancy in raid, or whatever. By having redundancy, ZFS will automatically correct any checksum mismatches it encounters. If a checksum is mismatched on *both* sides of the mirror, it means either (a) both disks went bad at the same time, which is unlikely but has nonzero probability, or (b) there's faulty ram or cpu or some other single point of failure in the system. I raised a technical question and you are going all personal on me. Woah. Where did that come from??? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
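The self-healing read being described can be sketched concretely. Again this is an illustration of the idea, not ZFS code: the block's expected checksum is known independently, a mismatch on one side of the mirror triggers a read of the other side, and the bad copy is repaired from the good one:

```python
# Sketch of a self-healing mirrored read: verify each copy against a
# known-good checksum, fall back to the other side on a mismatch, and
# repair the bad copy in place. Only if *both* copies fail is the read lost.
import zlib

def read_with_repair(mirror_a: dict, mirror_b: dict, addr: int, cksum: int) -> bytes:
    for primary, other in ((mirror_a, mirror_b), (mirror_b, mirror_a)):
        data = primary[addr]
        if zlib.crc32(data) == cksum:
            if zlib.crc32(other[addr]) != cksum:
                other[addr] = data  # heal the corrupt copy from the good one
            return data
    raise IOError("both copies failed checksum")  # case (a)/(b) in the text above

good = b"block payload"
cksum = zlib.crc32(good)
side_a = {0: b"garbage data!"}   # side A has silently gone bad
side_b = {0: good}               # side B is still good

print(read_with_repair(side_a, side_b, 0, cksum))  # returns the good payload
print(side_a[0] == good)                           # True: side A was repaired
```

The double-failure branch at the end corresponds to the reply's cases (a) and (b): both disks bad at once, or a single point of failure such as bad RAM corrupting data before it was ever checksummed.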
Re: [zfs-discuss] ZFS development moving behind closed doors
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Eric D. Mudama On Sat, Aug 21 at 4:13, Orvar Korvar wrote: And by the way: Wasn't there a comment from Linus Torvalds recently that people should move their low-quality code into the codebase ??? ;) Does anyone know the link? Good against the Linux fanboys. :o) Can't find the original reference, but I believe he was arguing that by moving code into the kernel and marking it as experimental, it's more likely to be tested and have the bugs worked out, than if it forever lives as patchsets. Given the test environment, can't say I can argue against that point of view. Besides defending the point of view (checking in experimental changes to an experimental area, to accelerate code review) ... which seems like a fair point of view ... Who finds it necessary to have ammunition against Linux fanboys? Linux is good in its own way. You got something against Linux? Just converse on the points of merit, and both you and they will reach the best conclusions you can, rather than pushing an agenda or encouraging unnecessary bias. Each OS is better in its own way. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS development moving behind closed doors
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Linder, Doug there are an awful lot of places that actively DO NOT want the latest and greatest, and for good reason. Agreed. Latest-greatest has its place, which is not 24/7 must-stay-up core servers. Each OS - sol10 vs osol (or more appropriately now ... something like fedora vs rhel) Each OS has its place. Each one satisfies different requirements. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 64-bit vs 32-bit applications
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Peter Jeremy My interpretation of those results is that you can't generalise: The only way to determine whether your application is faster in 32-bit or 64-bit mode is to test it. And your choice of algorithm is at least as important as whether it's 32-bit or 64-bit. Not just your choice of algorithm, but architecture. Consider the dramatic architecture difference between Intel and AMD. Though they may have the same instruction set (within reason), the internal circuits to process those instructions are dramatically different, and hence, performance is dramatically different. Intel might be 4x faster at some instruction, while AMD is 4x faster at some other instruction. The same dramatic difference is present for 32 vs 64. As soon as you change the mode of your CPU, the architecture of the chip might as well be totally different. If you want to optimize performance, you have to first be able to classify your work load. If you cannot create a job which is truly typical of your work load, all bets are off. Don't even bother. For general computing, the more you spend, the faster it goes. Only if you have some task which will be repeated for long periods of time ... Then you can benefit by trying this CPU, or that CPU, or this mode, or that mode, or this chipset, or tweaking the compile flags, etc. If you have one task which is faster in 32-bit mode, it's not representative of 32 vs 64 in general. And vice-versa. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in Linux (was Opensolaris is apparently dead)
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Joerg Schilling 1) The OpenSource definition http://www.opensource.org/docs/definition.php section 9 makes it very clear that an OSS license must not restrict other software and must not prevent bundling different works under different licenses on one medium. 2) Given the fact that the GPL is an approved OSS license, it obviously complies with the OSS definition. Even if there is a compatibility problem between GPL and ZFS, it's all but irrelevant. Because the Linux kernel can load modules which aren't required to be GPL. If they're compiled as modules, separately from the kernel, then there's no argument over derived work or anything like that ... All you would need is a /boot partition, where the kernel is able to load the ZFS modules, and then you're home free. Much as we do today, with grub loading the solaris kernel, and then the solaris kernel using the bootfs property to determine which ZFS filesystem to mount as /. So even if there is a license compatibility problem, I think it's all but irrelevant. Because it's easily legally solvable, or avoidable. The reasons for ZFS not being in Linux must be more than just the license issue. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Alxen4 Disabling ZIL converts all synchronous calls to asynchronous, which makes ZFS acknowledge data before it has actually been written to stable storage, which in turn improves performance but might cause data corruption in case of a server crash. Is that correct? It is partially correct. With the ZIL disabled, you could lose up to 30 sec of writes, but it won't cause an inconsistent filesystem, or corrupt data. If you make a distinction between corrupt and lost data, then this is valuable for you to know: Disabling the ZIL can result in up to 30 sec of lost data, but not corrupt data. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Alxen4 For example I'm trying to use a ramdisk as a ZIL device (ramdiskadm ) Other people have already corrected you about ramdisk for log. It's already been said: use an SSD, or disable the ZIL completely. But this was not said: In many cases, you can gain a large performance increase by enabling the write-back cache of your ZFS server's raid controller card. You only want to do this if the card has a battery backup unit (BBU). The performance gain is *not* quite as good as using a nonvolatile log device, but certainly worth checking anyway. Because it's low cost, and doesn't consume slots... Also, if you get a log device, you want two of them, and mirror them. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Ethan Erchinger We've had a failed disk in a fully supported Sun system for over 3 weeks, Explorer data turned in, and been given the runaround forever. The 7000 series support is no better, possibly worse. That is really weird. What are you calling failed? If you're getting either a red blinking light, or a checksum failure on a device in a zpool... You should get your replacement with no trouble. I have had wonderful support, up to and including recently, on my Sun hardware. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Garrett D'Amore interpretation. Since it no longer is relevant to the topic of the list, can we please either take the discussion offline, or agree to just let the topic die (on the basis that there cannot be an authoritative answer until there is some case law upon which to base it)? Compatibility of ZFS and Linux, as well as the future development of ZFS, and the health and future of OpenSolaris / Solaris, Oracle, Sun ... are definitely relevant to this list. People are allowed to conjecture. If you don't have interest in a thread, just ignore the thread. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 64-bit vs 32-bit applications
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Will Murnane I am surprised by the performance of some 64-bit multi-threaded applications on my AMD Opteron machine. For most of the applications, the performance of the 32-bit version is almost the same as the performance of the 64-bit version. However, for a couple of applications, 32-bit versions This list discusses the ZFS filesystem. Perhaps you'd be better off posting to perf-discuss or tools-gcc? That said, you need to provide more information. What compiler and flags did you use? What does your program (broadly speaking) do? What did you measure to conclude that it's slower in 64-bit mode? Not only that, for most things the 32- vs 64-bit architectures are expected to perform about the same. The 64-bit architecture exists mostly for wider memory addressing, not for twice the performance. YMMV. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: Garrett D'Amore [mailto:garr...@nexenta.com] Sent: Sunday, August 15, 2010 8:17 PM (The only way I could see this changing would be if there was a sudden license change which would permit either ZFS to overtake btrfs in the Linux kernel, or permit btrfs to overtake zfs in the Solaris kernel. I Of course this has been discussed extensively, but I believe the reasons for ZFS not being in the Linux kernel go beyond just the license incompatibility. ZFS does raid, and mirroring, and resilvering, and partitioning, and NFS, and CIFS, and iSCSI, and device management via vdevs, and so on. So ZFS steps on a lot of Linux people's toes. They already have code to do this, or that; why should they kill off all these other projects, and turn the world upside down, and bow down and acknowledge that anyone else did anything better than what they did? No, they just want a copy-on-write filesystem, and nothing more. Something which more closely complies with the architecture model that they're already using. Something which doesn't hurt their ego when they accept it... And of course by "they" I'm mostly referring to Linus. And all the people who work on the kernel, ext fs, software raid, and all these other things which already exist in a more Linuxy way... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
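The breadth being described is visible from the command line. Device names here are hypothetical, and `shareiscsi` reflects the OpenSolaris-era interface (later Solaris releases moved iSCSI to COMSTAR), but the sketch shows how many separate Linux subsystems ZFS covers in a handful of commands:

```shell
# Hypothetical device names; a sketch of the territory ZFS covers, each piece
# of which is a separate subsystem (md, lvm, nfsd, samba, ietd, ...) on Linux.
zpool create tank raidz2 c0t1d0 c0t2d0 c0t3d0 c0t4d0   # RAID + volume management
zfs create tank/home
zfs set sharenfs=on tank/home                          # NFS export
zfs set sharesmb=on tank/home                          # CIFS share
zfs create -V 100G tank/lun0                           # block volume (zvol)
zfs set shareiscsi=on tank/lun0                        # iSCSI target (pre-COMSTAR)
```

Each of those lines replaces a component that already has a maintainer and a community in the Linux world, which is the point of the paragraph above.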
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of David Dyer-Bennet However, if Oracle makes a binary release of BTRFS-derived code, they must release the source as well; BTRFS is under the GPL. When a copyright holder releases something under GPL, it only means they've granted you and the rest of the world permission to use it according to the terms of GPL. The copyright holder always retains permission for themselves to redistribute in any form, under a different license if they want to. If you (Microsoft) are a developer of a proprietary product, and you want to link in some GPL library and keep it private and proprietary, you can attempt negotiations with the copyright holder, to get that code released to you for your purposes, under terms which are not GPL. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- Can someone provide a link to the requisite source files so that we can see the copyright statements? It may well be that Oracle assigned the copyright to some other party. BTRFS is inside the Linux kernel. Copyright (C) 1989, 1991 Free Software Foundation, Inc. There is no other copyright written in there (that I can find with grep), but the GPL does say something to contributors, which could blur the question of copyright ownership for contributions added by somebody outside the FSF: "it is not the intent of this section to claim rights or contest your rights to work written entirely by you" So maybe the contributor retains some rights to reproduce their work in other situations, under a different license. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Jerome Warnier Do not forget Btrfs is mainly developed by ... Oracle. Will it survive better than Free Solaris/ZFS? It's GPL, just as ZFS is CDDL. They cannot undo or revoke the free license they've granted to use and develop upon whatever they've released. ZFS is not dead, although it is yet to be seen if future development will be closed source. BTRFS is not dead, and cannot be any more dead than ZFS. So honestly ... your comment above ... really has no bearing in reality. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Bob Friesenhahn The $400 number is bogus since the amount that Oracle quotes now depends on the value of the hardware that the OS will run on. For my Using the same logic, if I said MS Office costs $140, that's a bogus number, because different vendors sell it at different prices. It's $450 for 1yr, or $1200 for 3yrs to buy solaris 10 with basic support on a dell server. It costs more with a higher level of support, and it costs less if you have a good relationship with Dell with a strong corporate discount, or if you buy it at the end of Dell's quarter, when they have the best sales going on. I don't know how much it costs at other vendors. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Tim Cook The cost discussion is ridiculous, period. $400 is a steal for support. You'll pay 3x or more for the same thing from Redhat or Novell. Actually, as a comparison with the message I sent 1 minute ago... in order to compare apples to apples ... [Solaris is] $450 for 1yr, or $1200 for 3yrs to buy solaris 10 with basic support on a dell server. It costs more with a higher level of support, and it costs less if you have a good relationship with Dell with a strong corporate discount, or if you buy it at the end of Dell's quarter, when they have the best sales going on. If you buy RHEL ES support with the same dell servers, the cost would be $350/yr for basic support. Plus or minus, based on AS and level of support and your relationship with Dell. Solaris costs more, but the ballpark is certainly the same. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and VMware
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey #3 I previously believed that vmfs3 was able to handle sparse files amazingly well, like, when you create a new vmdk, it appears almost instantly regardless of size, and I believed you could copy sparse vmdk's efficiently, not needing to read all the sparse consecutive zeroes. I was wrong. Correction: I was originally right. ;-) In ESXi, if you go to command line (which is busybox) then sparse copies are not efficient. If you go into vSphere, and browse the datastore, and copy vmdk files via gui, then it DOES copy efficiently. The behavior is the same, regardless of NFS vs iSCSI. You should always copy files via GUI. That's the lesson here. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Compress ratio???
From: cyril.pli...@gmail.com [mailto:cyril.pli...@gmail.com] On Behalf Of Cyril Plisko The compressratio shows you how much *real* data was compressed. The file in question, however, can be sparse file and have its size vastly different from what du says, even without compression. Ahhh. Thank you. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
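The sparse-file effect Cyril describes is easy to demonstrate. This sketch assumes a filesystem that supports holes (e.g. ext4, ZFS, tmpfs); the apparent size (what `ls -l` reports) can be vastly larger than the on-disk usage (what `du` reports), entirely independent of compression:

```python
# Create a 100 MB sparse file without writing any data, then compare its
# apparent size (st_size, what ls shows) to its on-disk usage (st_blocks,
# what du measures). On a hole-supporting filesystem these differ wildly.
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.truncate(100 * 1024 * 1024)   # 100 MB apparent size, zero data written
    path = f.name

st = os.stat(path)
apparent = st.st_size               # what ls -l reports
on_disk = st.st_blocks * 512        # what du measures
print(apparent, on_disk)            # e.g. 104857600 0
os.unlink(path)
```

So a file can show a huge `ls` size and a tiny `du` size with compression turned off entirely, which is why compressratio only accounts for *real* data that was compressed.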
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Russ Price For me, Solaris had zero mindshare since its beginning, on account of being prohibitively expensive. I hear that a lot, and I don't get it. $400/yr does generally keep it out of people's basements, and keeps sol10 out of enormous clustering facilities that don't have special purposes or free alternatives. But I wouldn't call it prohibitively expensive, for a whole lot of purposes. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Andrej Podzimek Or Btrfs. It may not be ready for production now, but it could become a serious alternative to ZFS in one year's time or so. (I have been using I will much sooner pay for sol11 than use btrfs. Stability, speed, and maturity greatly outweigh a few hundred dollars a year, if you run your business on it. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] one ZIL SLOG per zpool?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Chris Twa My plan now is to buy the ssd's and do extensive testing. I want to focus my performance efforts on two zpools (7x146GB 15K U320 + 7x73GB 10k U320). I'd really like two ssd's for L2ARC (one ssd per zpool) and then slice the other two ssd's and then mirror the slices for SLOG (one mirrored slice per zpool). I'm worried that the ZILs won't be significantly faster than writing to disk. But I guess that's what testing is for. If the ZIL in this arrangement isn't beneficial then I can have four disks for L2ARC instead of two (or my wife and I get ssd's for our laptops). Remember that the ZIL is only for sync writes. So if you're not doing sync writes, there is no benefit to a dedicated log device. Also, for a lot of purposes, disabling the ZIL is actually viable. It's zero cost, and it guarantees absolute optimal performance on spindle disks. Nothing is faster. To quantify the risk, here's what you need to know: In the event of an ungraceful crash, up to 30 sec of async writes are lost. Period. But as long as you have not disabled the ZIL, the sync writes were not lost. If you have the ZIL disabled, then sync=async. Up to 30 sec of all writes are lost. Period. But there is no corruption or data written out-of-order. The end result is as if you halted the server suddenly, flushed all the buffers to disk, and then powered off. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
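Since the poster plans to test whether a slog helps, the workload that matters is easy to isolate. This is a hedged micro-benchmark sketch, not a rigorous tool: it contrasts buffered (async) writes with writes followed by `fsync()`, which is the synchronous path that the ZIL, a slog, or disabling the ZIL all affect. Absolute numbers depend entirely on the device and filesystem:

```python
# Feel the difference between async and sync writes: each "sync" iteration
# calls fsync(), forcing data to stable storage before the write is
# acknowledged -- the operation a slog exists to make fast. Async writes
# just land in the page cache and return immediately.
import os
import tempfile
import time

def timed_writes(n: int, sync: bool) -> float:
    fd, path = tempfile.mkstemp()
    start = time.perf_counter()
    for _ in range(n):
        os.write(fd, b"x" * 4096)
        if sync:
            os.fsync(fd)            # this is the cost the ZIL/slog absorbs
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(path)
    return elapsed

print("async:", timed_writes(200, sync=False))
print("sync: ", timed_writes(200, sync=True))
```

On spinning disks the sync case is typically orders of magnitude slower, which is exactly the gap a fast slog (or a disabled ZIL, with the 30 sec risk described above) closes. If the two numbers are close on the target workload, a dedicated log device won't help.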