Re: [zfs-discuss] Zpool with data errors
> didn't seem like we would need zfs to provide that redundancy also.

There was a time when I fell for this line of reasoning too. The problem (if you want to call it that) with zfs is that it will show you, front and center, the corruption taking place in your stack.

> Since we're on SAN with Raid internally

Your situation would suggest that your RAID silently corrupted data and didn't even know about it. Until you can trust the volumes behind zfs (and I don't trust any of them anymore, regardless of the brand name on the cabinet), give zfs at least some redundancy so that it can pick up the slack.

By the way, I used to trust storage because I didn't believe it was corrupting data, but I had no proof one way or the other, so I gave it the benefit of the doubt. Since I have been using zfs, my standards have gone up considerably. Now I trust storage because I can *prove* it's correct.

If someone can't prove that a volume is returning correct data, don't trust it. Let zfs manage it.

--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
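The "prove it's correct" property comes from end-to-end checksums. A toy sketch of the idea in Python (not ZFS code; ZFS actually keeps each block's checksum in the parent block pointer, which this simplifies to a side-by-side value):

```python
import hashlib

def store(data):
    # Keep a strong checksum alongside the data, computed at write time.
    return data, hashlib.sha256(data).hexdigest()

def read_verified(data, checksum):
    # On every read, recompute and compare; silent corruption anywhere
    # below this layer is detected instead of being passed through.
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError("checksum mismatch: storage returned bad data")
    return data

block, csum = store(b"important payload")
corrupted = b"important paylaod"  # simulated silent bit rot in the stack
```

With redundancy, the mismatch would also tell zfs which copy to repair from; without it, zfs can only report the error.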
Re: [zfs-discuss] # disks per vdev
Funny you say that. My Sun v40z connected to a pair of Sun A5200 arrays running OSol 128a can't see the enclosures. The luxadm command comes up blank. Except for that annoyance (and similar other issues) the Sun gear works well with a Sun operating system.

Sent from Yahoo! Mail on Android
Re: [zfs-discuss] # disks per vdev
> Lights. Good.

Agreed. In a fit of desperation and stupidity I once enumerated disks by pulling them one by one from the array to see which zfs device faulted. On a busy array it is hard even to use the leds as indicators.

It makes me wonder how large shops with thousands of spindles handle this.
Re: [zfs-discuss] Server with 4 drives, how to configure ZFS?
> Has there been any change to the server hardware with respect to number of
> drives since ZFS has come out? Many of the servers around still have an even
> number of drives (2, 4) etc. and it seems far from optimal from a ZFS
> standpoint. All you can do is make one or two mirrors, or a 3 way mirror and
> a spare, right?

With four drives you could also make a RAIDZ3 set, allowing you to have the lowest usable space, poorest performance and worst resilver times possible.

Sorry, couldn't resist.
Re: [zfs-discuss] # disks per vdev
It sounds like you are getting a good plan together.

> The only thing though I seem to remember reading that adding vdevs to
> pools way after the creation of the pool and data had been written to it,
> that things aren't spread evenly - is that right? So it might actually make
> sense to buy all the disks now and start fresh with the final build.

In this scenario, balancing would not hurt your performance. You would start with the performance of a single vdev. Adding the second vdev later will only increase performance, even if horribly imbalanced. Over time it will start to balance itself. If you want it balanced sooner, you can force zfs to start balancing by copying files and then deleting the originals.

> Starting with only 6 disks would leave growth for another 6 disk
> raid-z2 (to keep matching geometry) leaving 3 disks spare which is
> not ideal.

Maintaining identical geometry only matters if all of the disks are identical. If you later add 2TB disks, then pick whatever geometry works for you. The most important thing is to maintain consistent vdev types, e.g. all RAIDZ2.

> I do like the idea of having a hot spare

I'm not sure I agree. In my anecdotal experience, sometimes my array would go offline (for whatever reason) and zfs would try to replace as many disks as it could with the hot spares. If there weren't enough hot spares for the whole array, then the pool was left irreversibly damaged, having several disks in the middle of being replaced. This has only happened once or twice, and in the panic I might have handled it incorrectly, but it has spooked me from having hot spares.

> This is a bit OT, but can you have one vdev that is a duplicate of
> another vdev? By that I mean say you had 2x 7 disk raid-z2 vdevs,
> instead of them both being used in one large pool could you have one
> that is a backup of the other, allowing you to destroy one of them
> and re-build without data loss?

Absolutely.
I do this very thing with large, slow disks holding a backup for the main disks. My home server has an SMF service which regularly synchronizes the time-slider snapshots from each main pool to the backup pool. This has saved me when a whole pool disappeared (see above) and has allowed me to make changes to the layout of the main pools.
Re: [zfs-discuss] ZFS for Linux?
Just for completeness, there is also VirtualBox, which runs Solaris nicely.
Re: [zfs-discuss] # disks per vdev
I am assuming you will put all of the vdevs into a single pool, which is a good idea unless you have a specific reason for keeping them separate, e.g. you want to be able to destroy / rebuild a particular vdev while leaving the others intact.

Fewer disks per vdev implies more vdevs, providing better random performance, lower scrub and resilver times and the ability to expand a vdev by replacing only the few disks in it. The downside of more vdevs is that you dedicate parity to each vdev, e.g. RAIDZ2 would need two parity disks per vdev.

> I'm in two minds with mirrors. I know they provide the best performance
> and protection, and if this was a business critical machine I wouldn't
> hesitate.
>
> But as it is for a home media server, which is mainly WORM access and
> will be storing (legal!) DVD/Bluray rips, I'm not so sure I can
> sacrifice the space.

For a home media server, all accesses are essentially sequential, so random performance should not be a deciding factor.

> 7x 2 way mirrors would give me 7TB usable with 1 hot spare, using 1TB
> disks, which is a big drop from 12TB! I could always jump to 2TB disks
> giving me 14TB usable but I already have 6x 1TB disks in my WHS build
> which I'd like to re-use.

I would be tempted to start with a 4+2 (six disk RAIDZ2) vdev using your current disks and plan from there. There is no reason you should feel compelled to buy more 1TB disks just because you already have some.

> Am I right in saying that single disks cannot be added to a raid-z*
> vdev so a minimum of 3 would be required each time. However a mirror is
> just 2 disks so if adding disks over a period of time mirrors would be
> cheaper each time.

That is not correct. You cannot ever add disks to a vdev. Well, you can add additional disks to a mirror vdev, but otherwise, once you set the geometry, a vdev is stuck for life. However, you can add any vdev you want to an existing pool.
You can take a pool with a single vdev set up as a 6x RAIDZ2 and add a single disk to that pool. That is a horrible idea, because it makes the entire pool dependent upon a single disk, but it illustrates that you can add any type of vdev to a pool. Most agree it is best to make the pool from vdevs of identical geometry, but that is not enforced by zfs.
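To put numbers on the space trade-off discussed above, a quick back-of-the-envelope calculation (plain Python, assuming 1TB disks and ignoring formatting/metadata overhead):

```python
def usable_tb(disks_per_vdev, parity_per_vdev, vdev_count, disk_tb=1):
    # Usable space = data disks per vdev * number of vdevs * disk size.
    return (disks_per_vdev - parity_per_vdev) * vdev_count * disk_tb

mirrors_7x2 = usable_tb(2, 1, 7)  # 7x 2-way mirrors of 1TB disks -> 7 TB
raidz2_4p2  = usable_tb(6, 2, 1)  # one 4+2 RAIDZ2                -> 4 TB
raidz2_two  = usable_tb(6, 2, 2)  # two 4+2 RAIDZ2 vdevs          -> 8 TB
```

Starting with one 4+2 vdev and adding a second identical vdev later ends up ahead of the 7x mirror layout on space, which is the poster's main constraint.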
Re: [zfs-discuss] ZFS receive checksum mismatch
> I stored a snapshot stream to a file

The tragic irony here is that the file was stored on a non-zfs filesystem. You had undetected bitrot which silently corrupted the stream. Other files might have been corrupted as well. You may have just made one of the strongest cases yet for zfs and its assurances.
Re: [zfs-discuss] ZFS receive checksum mismatch
> If it is true that unlike ZFS itself, the replication stream format has
> no redundancy (even of ECC/CRC sort), how can it be used for long-term
> retention "on tape"?

It can't. I don't think it has been documented anywhere, but I believe it has been well understood that if you don't trust your storage (tape, disk, floppies, punched cards, whatever), then you shouldn't trust your incremental streams on that storage. It's as if the ZFS design assumed that all incremental streams would be either perfect or retryable. This is a huge problem for tape retention, not so much for disk retention.

On a personal level I have handled this with a separate pool of fewer, larger and slower drives which serves solely as backup, taking incremental streams from the main pool every 20 minutes or so. Unfortunately that approach breaks the legacy backup strategy of pretty much every company.

I think the message is that unless you can ensure the integrity of the stream, either backups should go to another pool or zfs send/receive should not be a critical part of the backup strategy.
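If streams must be kept on untrusted media anyway, one workaround (a sketch of the idea, not a ZFS feature) is to store a strong checksum next to each stream file and verify it before ever feeding the stream to zfs receive. Note this only detects corruption; recovering from it would need a separate forward-error-correction layer (par2-style) on top:

```python
import hashlib
import os
import tempfile

def save_stream(stream, path):
    # Write the send stream plus a sidecar checksum file.
    with open(path, "wb") as f:
        f.write(stream)
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(stream).hexdigest())

def verify_stream(path):
    # Refuse to hand a bit-rotted stream to `zfs receive`.
    with open(path, "rb") as f:
        data = f.read()
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    return hashlib.sha256(data).hexdigest() == expected

tmp = os.path.join(tempfile.mkdtemp(), "pool.zsend")
save_stream(b"fake send stream", tmp)
ok_before = verify_stream(tmp)
with open(tmp, "r+b") as f:  # flip one byte to simulate bitrot on tape
    f.seek(3)
    f.write(b"\x00")
ok_after = verify_stream(tmp)
```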
Re: [zfs-discuss] L2ARC and poor read performance
> This is not a true statement. If the primarycache policy is set to the
> default, all data will be cached in the ARC.

Richard, you know this stuff so well that I am hesitant to disagree with you. At the same time, I have seen this myself, trying to load video files into L2ARC without success.

> The ARC statistics are nicely documented in arc.c and available as kstats.

And I looked in the source. My C is a little rusty, yet it appears that prefetch items are not stored in L2ARC by default. Prefetches will satisfy a good portion of sequential reads but won't go to L2ARC.
Re: [zfs-discuss] L2ARC and poor read performance
> > Are some of the reads sequential? Sequential reads don't go to L2ARC.
>
> That'll be it. I assume the L2ARC is just taking metadata. In situations
> such as mine, I would quite like the option of routing sequential read
> data to the L2ARC also.

The good news is that it is almost a certainty that actual iSCSI usage will be of a (more) random nature than your tests, suggesting higher L2ARC usage in real-world application. I'm not sure how zfs makes the distinction between a random and a sequential read, but the more you think about it, the more sense it makes not to cache sequential requests.
Re: [zfs-discuss] L2ARC and poor read performance
I'll throw out some (possibly bad) ideas.

Is ARC satisfying the caching needs? 32 GB of ARC should almost cover the 40 GB of total reads, suggesting that the L2ARC doesn't add any value for this test.

Are the SSD devices saturated from an I/O standpoint? Put another way, can ZFS put data to them fast enough? If they aren't taking writes fast enough, then maybe they can't effectively load for caching. Certainly if they are saturated for writes they can't do much for reads.

Are some of the reads sequential? Sequential reads don't go to L2ARC.

What does iostat say for the SSD units? What does arc_summary.pl (maybe spelled differently) say about the ARC / L2ARC usage? How much of the SSD units are in use as reported in zpool iostat -v?
Re: [zfs-discuss] How to properly read "zpool iostat -v" ? ;)
While I am by no means an expert on this, I went through a similar mental exercise previously and came to the conclusion that in order to service a particular read request, zfs may need to read more from the disk. For example, a 16KB request might need to retrieve a full 128KB record, if only to verify the checksum of the record prior to returning 16KB to the OS.

If I understand it correctly, then the vdev numbers refer to the amount of data returned to the OS to satisfy requests, while the individual disk numbers refer to the amount of disk I/O required to satisfy the requests. Does that make sense?

Standard disclaimers apply: I could be wrong, I often am wrong, etc.
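If that explanation is right, the gap between the two sets of iostat numbers is just read amplification. A toy calculation under that assumption (128KB is the default recordsize; the request is assumed to fall within a single record):

```python
RECORD = 128 * 1024  # assumed recordsize in bytes

def disk_read_bytes(request_bytes, record_bytes=RECORD):
    # The whole record must be read so its checksum can be verified,
    # even when the caller asked for only part of it.
    return max(request_bytes, record_bytes)

request = 16 * 1024
amplification = disk_read_bytes(request) / request  # 8x more disk I/O than payload
```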
Re: [zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
> 2011/5/26 Eugen Leitl:
> > How bad would raidz2 do on mostly sequential writes and reads
> > (Athlon64 single-core, 4 GByte RAM, FreeBSD 8.2)? The best way to go
> > is striping mirrored pools, right? I'm worried about losing the two
> > "wrong" drives out of 8. These are all 7200.11 Seagates, refurbished.
> > I'd scrub once a week, that'd probably suck on raidz2, too? Thanks.
>
> Sequential? Let's suppose no spares.
>
> 4 mirrors of 2 = sustained bandwidth of 4 disks
> raidz2 with 8 disks = sustained bandwidth of 6 disks
>
> So :)

Turn it around and discuss writes. Reads may or may not give 8x throughput with mirrors. In either setup, writes will require 8x storage bandwidth, since all drives will be written to. Mirrors will deliver 4x throughput and RAIDZ2 will deliver 6x throughput.

For what it's worth, I ran a 22 disk home array as a single RAIDZ3 vdev (19+3) for several months and it was fine. These days I run a 32 disk array laid out as four vdevs, each an 8 disk RAIDZ2, i.e. 4x 6+2.

The best advice is simply to test your workload against different configurations. ZFS lets you pick what works for you.
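The write arithmetic above, spelled out (an idealized model in Python, with per-disk streaming bandwidth normalized to 1 and controller/bus limits ignored):

```python
def sustained_write_payload(total_disks, disks_per_vdev, parity_per_vdev):
    # Every disk is busy during a write, but only the data disks carry
    # payload, so payload throughput = total data disks.
    vdevs = total_disks // disks_per_vdev
    return vdevs * (disks_per_vdev - parity_per_vdev)

mirrors = sustained_write_payload(8, 2, 1)  # 4 mirror pairs -> 4x payload
raidz2  = sustained_write_payload(8, 8, 2)  # one 6+2 vdev   -> 6x payload
```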
Re: [zfs-discuss] Myth? 21 disk raidz3: "Don't put more than ___ disks in a vdev"
Richard wrote:
> Untrue. The performance of a 21-disk raidz3 will be nowhere near the
> performance of a 20 disk 2-way mirror.

You know this stuff better than I do. Assuming no bus/cpu bottlenecks, a 21 disk raidz3 should provide sequential throughput of 18 disks and random throughput of 1 disk. A 20 disk 2-way mirror set should provide sequential read throughput of (at best) 20 disks, sequential write throughput of (at best) 10 disks, random read throughput of between 2 and 20 disks and random write throughput of between 1 and 10 disks. At one extreme, mirrors are marginally better, and at the other extreme mirrors give 10x the write and 20x the read performance. That's a wide range.

> Taking this to a limit, would you say a 1,000 disk raidz3 set is a good
> thing? 10,000 disks?

I don't know, maybe. Even if we accept that there is some magic X where stripes wider than X are bad, what is that X and how do we determine it? Likely it depends on several factors, including read/write iops (both of which can be mitigated by L2ARC and SLOG) and resilver times.

If seek time were a non-issue (flash?) then there is no real case for mirrors. Mirrors can, if the data is laid out perfectly, provide sequential throughput which grows linearly with the vdev count. RAIDZN will always provide sequential throughput which grows linearly with the stripe width. Therefore, with low access time and low throughput storage (flash?), RAIDZN with very wide stripes makes an awful lot of sense.

> ZFS is open source, feel free to modify and share your ideas for
> improvement.

And that's what we are doing here: sharing ideas.
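The throughput ranges claimed above can be written down as a small model (Python for illustration; per-disk throughput normalized to 1, best/worst cases given as ranges):

```python
def raidz_throughput(disks, parity):
    # Sequential scales with the data disks; random I/O touches the
    # whole stripe, so the vdev behaves like one disk.
    return {"seq": disks - parity, "random": 1}

def mirror_throughput(pairs):
    # Best/worst cases for a stripe of 2-way mirrors: reads can come
    # from either side of each mirror, writes must hit both sides.
    return {"seq_read": 2 * pairs,
            "seq_write": pairs,
            "rand_read": (2, 2 * pairs),
            "rand_write": (1, pairs)}
```

Running `raidz_throughput(21, 3)` against `mirror_throughput(10)` reproduces the 18-vs-(2..20) comparison in the post.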
Re: [zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver
Richard wrote:
> Yep, it depends entirely on how you use the pool. As soon as you come up
> with a credible model to predict that, then we can optimize
> accordingly :-)

You say that somewhat tongue-in-cheek, but Edward's right. If the resilver code progresses in slab/transaction-group/whatever-the-correct-term-is order, then a pool with any significant use will have the resilver code seeking all over the disk. If instead resilver blindly moved in block number order, then it would have very little seek activity and the effective throughput would be close to that of pure sequential I/O for both the new disk and the remaining disks in the vdev.

Would it make sense for scrub/resilver to be more aware of operating in disk order instead of zfs order?
Re: [zfs-discuss] Optimal raidz3 configuration
> On Fri, Oct 15, 2010 at 3:16 PM, Marty Scholes wrote:
> > My home server's main storage is a 22 (19 + 3) disk RAIDZ3 pool backed
> > up hourly to a 14 (11+3) RAIDZ3 backup pool.
>
> How long does it take to resilver a disk in that pool? And how long does
> it take to run a scrub?
>
> When I initially setup a 24-disk raidz2 vdev, it died trying to resilver
> a single 500 GB SATA disk. I/O under 1 MBps, all 24 drives thrashing
> like crazy, could barely even login to the system and type onscreen. It
> was a nightmare. That, and normal (no scrub, no resilver) disk I/O was
> abysmal. Since then, I've avoided any vdev with more than 8 drives in it.

My situation is kind of unique. I picked up 120 15K 73GB FC disks early this year for $2 per. As such, spindle count is a non-issue. As a home server, it has very little need for write iops, and I have 8 disks for L2ARC on the main pool.

Main pool is at 40% capacity and backup pool is at 65% capacity. Both take about 70 minutes to scrub. The last time I tested a resilver it took about 3 hours. The difference is that these are low capacity 15K FC spindles and the pool has very little sustained I/O; it only bursts now and again. Resilvers would go mostly uncontested, and with RAIDZ3 + autoreplace=off, I can actually schedule a resilver.
Re: [zfs-discuss] Optimal raidz3 configuration
Sorry, I can't not respond...

Edward Ned Harvey wrote:
> whatever you do, *don't* configure one huge raidz3.

Peter, whatever you do, *don't* make a decision based on blanket generalizations.

> If you can afford mirrors, your risk is much lower. Because although
> it's physically possible for 2 disks to fail simultaneously and ruin the
> pool, the probability of that happening is smaller than the probability
> of 3 simultaneous disk failures on the raidz3.

Edward, I normally agree with most of what you have to say, but this has gone off the deep end. I can think of counter-use-cases far faster than I can type.

> Due to smaller resilver window.

Coupled with a smaller MTTDL, smaller cabinet space yield, smaller $/GB ratio, etc.

> I highly endorse mirrors for nearly all purposes.

Clearly.

Peter, go straight to the source: http://blogs.sun.com/roch/entry/when_to_and_not_to

In short:
1. vdev_count = spindle_count / (stripe_width + parity_count)
2. IO/s is proportional to vdev_count
3. Usable capacity is proportional to stripe_width * vdev_count
4. A mirror can be approximated by a stripe of width one
5. Mean time to data loss increases exponentially with parity_count
6. Resilver time increases (super)linearly with stripe width

Balance capacity available, storage needed, performance needed and your own level of paranoia regarding data loss. My home server's main storage is a 22 (19 + 3) disk RAIDZ3 pool backed up hourly to a 14 (11+3) RAIDZ3 backup pool. Clearly this is not a production Oracle server. Equally clear is that my paranoia index is rather high.

ZFS will let you choose the combination of stripe width and parity count which works for you. There is no "one size fits all."
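The first four rules above can be written down directly (a rough Python model, not a simulator; it ignores the exponential MTTDL and resilver-time rules, which don't reduce to one-liners):

```python
def layout(spindle_count, stripe_width, parity_count):
    vdev_count = spindle_count // (stripe_width + parity_count)  # rule 1
    return {
        "vdevs": vdev_count,
        "iops": vdev_count,                     # rule 2: IO/s ~ vdev_count
        "capacity": stripe_width * vdev_count,  # rule 3
    }

# Rule 4: a 2-way mirror is approximated as a width-1, parity-1 stripe.
mirrors_20 = layout(20, 1, 1)
# The author's 22-disk 19+3 RAIDZ3 pool:
raidz3_22 = layout(22, 19, 3)
```

The two layouts make the trade plain: 10x the random IOPS for the mirrors versus roughly 2x the usable capacity for the wide RAIDZ3.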
Re: [zfs-discuss] Performance issues with iSCSI under Linux
> I've had a few people sending emails directly suggesting it might have
> something to do with the ZIL/SLOG. I guess I should have said that the
> issue happens both ways, whether we copy TO or FROM the Nexenta box.

You mentioned a second Nexenta box earlier. To rule out client-side issues, have you considered testing with Nexenta as the iSCSI/NFS client?
Re: [zfs-discuss] Performance issues with iSCSI under Linux
> Here are some more findings...
>
> The Nexenta box has 3 pools:
> syspool: made of 2 mirrored (hardware RAID) local SAS disks
> pool_sas: made of 22 15K SAS disks in ZFS mirrors on 2 JBODs on 2 controllers
> pool_sata: made of 42 SATA disks in 6 RAIDZ2 vdevs on a single controller
>
> When we copy data from any linux box to either pool_sas or pool_sata,
> it is painfully slow.
> When we copy data from any linux box directly to the syspool, it is
> plenty fast.
> When we copy data locally on the Nexenta box from the syspool to either
> pool_sas or pool_sata, it is crazy fast.
>
> We also see the same pattern whether we use iSCSI or NFS. We've also
> tested using different NICs (some at 1GbE, some at 10GbE) and even tried
> bypassing the switch by directly connecting the two boxes with a cable -
> and it didn't make any difference. We've also tried not using the SSD
> for the ZIL.
>
> So... We've ruled out iSCSI, the networking, the ZIL device, even the
> HBAs, as it is fast when it is done locally.
>
> Where should we look next?
>
> Thank you all for your help!
> Ian

Looking at the list suggested earlier:
1. Linux network stack
2. Linux iSCSI issues
3. Network cabling/switch between the devices
4. Nexenta CPU constraints (unlikely, I know, but let's be thorough)
5. Nexenta network stack
6. COMSTAR problems

It looks like you have ruled out everything. The only thing that still stands out is that network operations (iSCSI and NFS) to the external drives are slow, correct?

Just for completeness, what happens if you scp a file to the three different pools? If the results are the same as NFS and iSCSI, then I think the network can be ruled out. I would be leaning toward thinking there is some mismatch between the network protocols and the external controllers/cables/arrays. Are the controllers the same hardware/firmware/driver for the internal vs. external drives?

Keep digging. I think you are getting close.
Cheers,
Marty
[zfs-discuss] Ubuntu iSCSI install to COMSTAR zfs volume Howto
I apologize if this has been covered before. I have not seen a blow-by-blow installation guide for Ubuntu onto an iSCSI target. The install guides I have seen assume that you can make a target visible to all, which is a problem if you want multiple iSCSI installations on the same COMSTAR target. During install Ubuntu generates three random initiators and you have to deal with them to get things working correctly.

I did this for a few reasons:
1. I have some PCs which already have another OS installed on them and want Ubuntu available without any changes to the local drive
2. I want each PC to netboot Ubuntu with no interaction from the user and some assurance that each machine will boot the correct image
3. It's cool
4. Because I can

I am confident that there are things here which can be done better. Any and all feedback is appreciated.

Server is OpenSolaris build 128a at 192.168.223.147. Client is an Acer laptop with PXE boot enabled. DHCP server is a dd-wrt router with DHCP modifications.

I have the following modifications made to the DHCP server:

dhcp-match=gpxe,175
dhcp-option=175,8:1:1
dhcp-boot=net:#gpxe,gpxe-1.0.1-undionly.kpxe,v40z,192.168.223.147
dhcp-boot=net:gpxe,menu.gpxe,v40z,192.168.223.147

I have added the following files to /tftpboot:

* /tftpboot/gpxe-1.0.1-undionly.kpxe
This is available from www.etherboot.org

* /tftpboot/menu.gpxe
This file is needed to get gpxe to do an iSCSI boot to a target using an initiator based on the client uuid. The contents of my file follow.
#!gpxe
# initialize
dhcp net0
# keep our iSCSI mappings around even if the drive does not resolve
set keep-san 1
# set the initiator using our uuid
set initiator-iqn iqn.1993-08.org.debian:${uuid}
# set the target
set root-path iscsi:192.168.223.147::::iqn.1986-03.com.sun:02:41fb1720-66ce-c72a-81fb-bbf396db7849
# try to boot from the iSCSI device
echo "Attempting to boot from san ${root-path}"
sanboot ${root-path}
# if we made it here, then boot failed, probably a new disk, chainload
# ubuntu installer
chain pxelinux.0
# for some reason, the silly system stalls and doesn't bother to chainload

* The Ubuntu Lucid netboot files, found at http://archive.ubuntu.com/ubuntu/dists/lucid/main/installer-amd64/current/images/netboot/netboot.tar.gz

Just follow the 8 steps below, and you have a fully installed Ubuntu client on iSCSI.

STEP 1 -- Create a sparse zfs volume on OpenSolaris

bash-4.0$ pfexec zfs create -s -V 320G tank/export/iscsi/acer-ubuntu
bash-4.0$ zfs get all tank/export/iscsi/acer-ubuntu
NAME                           PROPERTY       VALUE                  SOURCE
tank/export/iscsi/acer-ubuntu  type           volume                 -
tank/export/iscsi/acer-ubuntu  creation       Mon Oct 11 13:30 2010  -
tank/export/iscsi/acer-ubuntu  used           54.5K                  -
tank/export/iscsi/acer-ubuntu  available      709G                   -
tank/export/iscsi/acer-ubuntu  referenced     54.5K                  -
tank/export/iscsi/acer-ubuntu  compressratio  1.00x                  -
tank/export/iscsi/acer-ubuntu  reservation    none                   default
tank/export/iscsi/acer-ubuntu  volsize        320G                   -
tank/export/iscsi/acer-ubuntu  checksum       on                     default
tank/export/iscsi/acer-ubuntu  compression    on                     inherited from tank
tank/export/iscsi/acer-ubuntu  readonly       off                    default
tank/export/iscsi/acer-ubuntu  shareiscsi     off                    inherited from tank/export/iscsi
tank/export/iscsi/acer-ubuntu  copies         1                      default
tank/export/iscsi/acer-ubuntu  refreservation none                   default
tank/export/iscsi/acer-ubuntu  primarycache   all                    default
tank/export/iscsi/acer-ubuntu  secondarycache all                    default
tank/export/iscsi/acer-ubuntu  usedbysnapshots 0                     -
tank/export/iscsi/acer-ubuntu  usedbydataset  54.5K                  -
tank/export/iscsi/acer-ubuntu  usedbychildren 0                      -
tank/export/iscsi/acer-ubuntu  usedbyrefreservation 0                -
tank/export/iscsi/acer-ubuntu  logbias        latency                default
tank/export/iscsi/acer-ubuntu  dedup          off                    default
tank/export/iscsi/acer-ubuntu  mlslabel       none                   default
tank/export/iscsi/acer-ubuntu  com.sun:auto-snapshot true            inherited from tank/export/iscsi
Re: [zfs-discuss] Performance issues with iSCSI under Linux
Ok, let's think about this for a minute.

The log drive is c1t11d0 and it appears to be almost completely unused, so we probably can rule out a ZIL bottleneck. I run Ubuntu booting iSCSI against OSol 128a and the writes do not appear to be synchronous. So, writes aren't the issue.

From the Linux side, it appears the drive in question is either sdb or dm-3, and both appear to be the same drive. Since switching to zfs, my Linux-disk-fu has become a bit rusty. Is one an alias for the other? The Linux disk appears to top out at around 50MB/s or so. That looks suspiciously like it is running on a gigabit connection with some problems.

I agree that the zfs side looks like it has plenty of bandwidth and iops to spare.

From what I can see, you can narrow the search down to a few things:
1. Linux network stack
2. Linux iSCSI issues
3. Network cabling/switch between the devices
4. Nexenta CPU constraints (unlikely, I know, but let's be thorough)
5. Nexenta network stack
6. COMSTAR problems

As another poster pointed out, testing some NFS and ssh traffic can eliminate 1, 3 and 5 above. I recommend going down the list and testing every piece in isolation as much as possible to narrow the list.

Good luck and let us know what you learn.

Cheers,
Marty
Re: [zfs-discuss] Bursty writes - why?
I think you are seeing ZFS store up the writes, coalesce them, then flush them to disk every 30 seconds. Unless the writes are synchronous, the ZIL won't be used; the writes will be cached instead, then flushed. If you think about it, this is far more sane than flushing to disk every time the write() system call is used.
Re: [zfs-discuss] scrub doesn't finally finish?
Have you had a lot of activity since the scrub started? I have noticed what appears to be extra I/O at the end of a scrub when activity took place during the scrub. It's as if the scrub estimator does not take the extra activity into account.
Re: [zfs-discuss] drive speeds etc
Roy Sigurd Karlsbakk wrote:
> device  r/s   w/s    kr/s  kw/s     wait  actv  svc_t  %w  %b
> cmdk0   0.0   0.0    0.0   0.0      0.0   0.0   0.0    0   0
> cmdk1   0.0   163.6  0.0   20603.7  1.6   0.5   12.9   24  24
> fd0     0.0   0.0    0.0   0.0      0.0   0.0   0.0    0   0
> sd0     0.0   0.0    0.0   0.0      0.0   0.0   0.0    0   0
> sd1     0.5   140.3  0.3   2426.3   0.0   1.0   7.2    0   14
> sd2     0.0   138.3  0.0   2476.3   0.0   1.5   10.6   0   18
> sd3     0.0   303.9  0.0   2633.8   0.0   0.4   1.3    0   7
> sd4     0.5   306.9  0.3   2555.8   0.0   0.4   1.2    0   7
> sd5     1.0   308.5  0.5   2579.7   0.0   0.3   1.0    0   7
> sd6     1.0   304.9  0.5   2352.1   0.0   0.3   1.1    1   7
> sd7     1.0   298.9  0.5   2764.5   0.0   0.6   2.0    0   13
> sd8     1.0   304.9  0.5   2400.8   0.0   0.3   0.9    0   6

Something is going on with how these writes are ganged together. The first two drives (sd1 and sd2) average about 17KB per write and the other six about 8.7KB per write. The aggregate statistics listed show less of a disparity, but one still exists. I have to wonder if there is some "max transfer length" type of setting on each drive which is different, allowing the Hitachi drives to accept larger transfers, resulting in fewer I/O operations, each having a longer service time.

Just to avoid confusion, the svc_t field is "service time," not "seek time." The service time is the total time to service a request, including seek time, controller overhead, time for the data to transit the SATA bus and time to write the data. If the requests are larger, all else being equal, the service time will ALWAYS be higher, but that does NOT imply the drive is slower. On the contrary, it often implies a faster drive which can service more data per request.

At any rate, there is a reason that the Hitachi drives are handling larger requests than the WD drives. I glanced at the code for a while but could not figure out where the max transfer size is determined or used.
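The per-write averages quoted above fall straight out of the iostat columns (kw/s divided by w/s). A quick check in Python, using the sd1 and sd3 rows from the listing:

```python
def avg_write_kb(w_per_s, kw_per_s):
    # Average write size = write throughput / write operation rate.
    return kw_per_s / w_per_s

first_group = avg_write_kb(140.3, 2426.3)  # sd1: ~17.3 KB per write
other_group = avg_write_kb(303.9, 2633.8)  # sd3: ~8.7 KB per write
```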
Re: [zfs-discuss] drive speeds etc
Is this a sector size issue? I see two of the disks each doing the same amount of work in roughly half the I/O operations, with each operation taking about twice the time, compared to each of the remaining six drives. I know nothing about either drive, but I wonder if one type of drive has twice the sector size of the other? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sliced iSCSI device for doing RAIDZ?
Alexander Skwar wrote:
> Okay. This contradicts the ZFS Best Practices Guide, which states:
>
> # For production environments, configure ZFS so that it can repair data
> # inconsistencies. Use ZFS redundancy, such as RAIDZ, RAIDZ-2, RAIDZ-3,
> # mirror, or copies > 1, regardless of the RAID level implemented on the
> # underlying storage device. With such redundancy, faults in the
> # underlying storage device or its connections to the host can be
> # discovered and repaired by ZFS.
>
> Anyway. Without redundancy, ZFS cannot do recovery, can it? As far as I
> understand, it could detect block level corruption, even if there's no
> redundancy. But it could not correct such a corruption.
>
> Or is that a wrong understanding?
>
> If I got the gist of what you wrote, it boils down to how reliable the
> SAN is? But also SANs could have "block level" corruption, no? I'm a bit
> confused, because of the (perceived?) contradiction to the Best
> Practices Guide… :)
This comes down to how much you trust your "storage device", whatever that may be. If you have full faith in your SAN (and I don't have full faith in it, no matter what its make/model), then ignore ZFS redundancy. When I first deployed a hardware RAID solution around 1995, the vendor proudly stated that the device could scrub mirrors and correct errors. I asked how, when it found a discrepancy, it knew which side of the mirror was correct. He stammered for a while, but it basically came down to the device flipping a coin. ZFS will ensure integrity, even when the underlying device fumbles. When you mirror the iSCSI devices, be sure that they are configured in such a way that a failure on one iSCSI "device" does not imply a failure on the other iSCSI device. 
As a simple example, if you sliced a disk into three partitions and then presented them as a three way mirror to ZFS, then a single disk failure will wipe out everything, even though you have the illusion of redundancy at the ZFS level. I have seen some systems where the SAN has presented what appeared to be independent devices, but a failure on the underlying disk faulted both devices, rendering ZFS helpless. Good luck, Marty -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
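To make the sliced-disk illusion concrete, here is a minimal sketch. The device names are hypothetical, and obviously don't run the first command anywhere you care about:

```shell
# ANTI-PATTERN: three slices of the SAME physical disk presented to ZFS
# as a three-way mirror; one disk failure takes out all three "sides".
zpool create badpool mirror c0t0d0s0 c0t0d0s1 c0t0d0s3

# What you actually want: each side of the mirror on an independent
# device (here, two iSCSI LUNs that must be backed by different disks).
zpool create goodpool mirror c2t1d0 c3t1d0
```

The zpool commands look equally healthy either way; only knowledge of what sits behind the device names tells you whether the redundancy is real.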
Re: [zfs-discuss] resilver = defrag?
David Dyer-Bennet wrote: > Sure, if only a single thread is ever writing to the > disk store at a time. > > This situation doesn't exist with any kind of > enterprise disk appliance, > though; there are always multiple users doing stuff. Ok, I'll bite. Your assertion seems to be that "any kind of enterprise disk appliance" will always have enough simultaneous I/O requests queued that any sequential read from any application will be sufficiently broken up by requests from other applications, effectively rendering all read requests as random. If I follow your logic, since all requests are essentially random anyway, then where they fall on the disk is irrelevant. I might challenge a couple of those assumptions. First, if the data is not fragmented, then ZFS would coalesce multiple contiguous read requests into a single large read request, increasing total throughput regardless of competing I/O requests (which also might benefit from the same effect). Second, I am unaware of an enterprise requirement that disk I/O run at 100% busy, any more than I am aware of the same requirement for full network link utilization, CPU utilization or PCI bus utilization. What appears to be missing from this discussion is any shred of scientific evidence that fragmentation is good or bad and by how much. We also lack any detail on how much fragmentation does take place. Let's see if some people in the community can get some real numbers behind this stuff in real world situations. Cheers, Marty -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver = defrag?
Richard Elling wrote: > Define "fragmentation"? Maybe this is the wrong thread. I have noticed that an old pool can take 4 hours to scrub, with the pool disks busy reading at 150+ MB/s for a large portion of the time while zpool iostat reports only 2 MB/s of scrub read speed. My naive interpretation is that the data the scrub is looking for has become fragmented. Should I refresh the pool by zfs sending it to another pool and then zfs receiving the data back again, the same scrub can take less than an hour, with zpool iostat reporting more sane throughput. On an old pool which had lots of snapshots come and go, the scrub throughput is awful. On that same data, refreshed via zfs send/receive, the throughput is much better. It would appear to me that this is an artifact of fragmentation, although I have nothing scientific on which to base this. Additional unscientific observations lead me to believe these same "refreshed" pools also perform better for non-scrub activities. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver = defrag?
I am speaking from my own observations and nothing scientific such as reading the code or designing the process. > A) Resilver = Defrag. True/false? False > B) If I buy larger drives and resilver, does defrag > happen? No. The first X sectors of the bigger drive are identical to the smaller drive, fragments and all. > C) Does zfs send zfs receive mean it will defrag? Yes. The data is laid out on the receiving side in a sane manner, until it later becomes fragmented. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Suggested RaidZ configuration...
Erik wrote: > Actually, your biggest bottleneck will be the IOPS > limits of the > drives. A 7200RPM SATA drive tops out at 100 IOPS. > Yup. That's it. > So, if you need to do 62.5e6 IOPS, and the rebuild > drive can do just 100 > IOPS, that means you will finish (best case) in > 62.5e4 seconds. Which > is over 173 hours. Or, about 7.25 WEEKS. My OCD is coming out and I will split that hair with you. 173 hours is just over a week. This is a fascinating and timely discussion. My personal (biased and unhindered by facts) preference is wide-stripe RAIDZ3. Ned is right that I kept reading that RAIDZx should not exceed _ devices and couldn't find real numbers behind those conclusions. Discussions in this thread have opened my eyes a little and I am in the middle of deploying a second 22-disk fibre array on my home server, so I have been struggling with the best way to allocate pools. Up until reading this thread, the biggest downside to wide stripes that I was aware of has been low iops. And let's be clear: while on paper the iops of a wide stripe are the same as a single disk's, in practice they are worse. In truth, the service time for any request on a wide stripe is the service time of the SLOWEST disk for that request. The slowest disk may vary from request to request, but will always delay the entire stripe operation. Since all of the 44 spindles are 15K disks, I am about to convince myself to go with two pools of wide stripes and keep several spindles for L2ARC and SLOG. The thinking is that other background operations (scrub and resilver) can take place with little impact to application performance, since those will be using L2ARC and SLOG. Of course, I could be wrong on any of the above. Cheers, Marty -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
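For anyone who wants to split the same hair, the back-of-envelope arithmetic using the numbers quoted above (62.5e6 rebuild I/Os at ~100 IOPS per 7200RPM SATA drive) is:

```shell
# Rebuild-time estimate from the figures in the thread.
IOS=62500000   # rebuild I/O operations needed
IOPS=100       # what one 7200RPM SATA drive can sustain
SECS=$((IOS / IOPS))
HOURS=$((SECS / 3600))
DAYS=$((SECS / 86400))
echo "${HOURS} hours (~${DAYS} days)"   # 173 hours (~7 days)
```

173 hours is just over seven days, so "a week", not "weeks".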
Re: [zfs-discuss] shrink zpool
> Is it currently or near future possible to shrink a > zpool "remove a disk" As others have noted, no, not until the mythical bp_rewrite() function is introduced. So far I have found no documentation on bp_rewrite(), other than that it is the solution to evacuating a vdev, restriping a vdev, defragmenting a vdev, solving world hunger and bringing peace to the Middle East. If you search the forums you will find all sorts of discussion around this elusive feature, but nothing concrete. I think it's hiding behind the unicorn located at the end of the rainbow. With Oracle withdrawing/inhousing/whatever development, it's a safe bet that bp_rewrite() now rests in the hands of the community, possibly to be born in Nexenta-land. Maybe it's time for me to quit whining, dust off my K&R book and get to work on the weekends coming up with an honest implementation plan. Anyone want to join a task force for getting bp_rewrite() implemented as a community effort? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (preview) Whitepaper - ZFS Pools Explained - feedback welcome
This paper is exactly what is needed -- giving an overview to a wide audience of the ZFS fundamental components and benefits. I found several grammar errors (to be expected in a draft) and I think at least one technical error. The paper seems to imply that multiple vdevs will induce striping across the vdevs, a la RAIDx0. Though I haven't looked at the code, my understanding is that records are confined to a single vdev. The clarification that each vdev gives iops roughly equivalent to a single disk is useful information not generally understood. I was glad to see it there. Overall, this is a terrific step forward for understanding ZFS and encouraging its adoption. Now if only SRSS would work under Nexenta... -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Backup zpool
Script attached. Cheers, Marty -- This message posted from opensolaris.org zfs_sync Description: Binary data ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Backup zpool
> Hello, > > I would like to backup my main zpool (originally > called "data") inside an equally originally named > "backup" zpool, which will also hold other kinds of > backups. > > Basically I'd like to end up with > backup/data > backup/data/dataset1 > backup/data/dataset2 > backup/otherthings/dataset1 > backup/otherthings/dataset2 > > this is quite simply doable by using zfs send / zfs > receive. > > the problem is with compression. I have default > compression enabled on my data pool, but I'd like to > use gzip-2 on backup/data. > I am using b134 with zpool version 22, which I read > had some new features regarding this use case > (http://arc.opensolaris.org/caselog/PSARC/2009/510/20090924_tom.erickson). The problem is, I don't > understand how to do this. I don't really care about > maintaining former properties but of course that would > be a plus. I have a similar situation where dedup is enabled on the backup, but not the main pool, for performance reasons. Once the pools are set, I have a script which does exactly what you are looking for using the time-slider snaps. It finds the latest snap common to the main and backup pool, rolls back the backup to that snap, then sends the incrementals in between. It also handles the case of no destination file system and tries to send the first snap. At least in 128a, the auto snapshot seems to delete the old snaps from both pools, even though it is not configured to snap the backup pool, which keeps the snap count sane on the backup pool. I would never claim the script is world-class, but I run it hourly from cron and it keeps the stuff in sync without me having to do anything. Say the word and I'll send you a copy. Good luck, Marty -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
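For the curious, the approach described above can be sketched in a few lines. This is a rough sketch for a single filesystem, not the actual script; pool and dataset names are assumptions:

```shell
#!/bin/sh
# Find the newest snapshot common to source and backup, roll the backup
# back to it, then send everything since in one incremental stream.
SRC=data
DST=backup/data

# newest snapshot on the source side
LATEST=$(zfs list -H -t snapshot -o name -s creation -d 1 "$SRC" |
         tail -1 | cut -d@ -f2)

# newest snapshot that exists on BOTH sides
COMMON=$(zfs list -H -t snapshot -o name -s creation -d 1 "$DST" |
         cut -d@ -f2 |
         while read s; do
             zfs list -H "$SRC@$s" >/dev/null 2>&1 && echo "$s"
         done | tail -1)

# discard anything written on the backup since the common snap, then
# send the intermediate snapshots across
zfs rollback -r "$DST@$COMMON"
zfs send -I "@$COMMON" "$SRC@$LATEST" | zfs receive -F "$DST"
```

Error handling (no common snapshot, no destination filesystem yet) is omitted here; a real script needs both cases, as the post notes.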
Re: [zfs-discuss] Raidz - what is stored in parity?
Peter wrote: > One question though. Marty mentioned that raidz > parity is limited to 3. But in my experiment, it > seems I can get parity to any level. > > You create a raidz zpool as: > > # zpool create mypool raidzx disk1 disk2 > > Here, x in raidzx is a numeric value indicating the > desired parity. > > In my experiment, the following command seems to > work: > > # zpool create mypool raidz10 disk1 disk2 ... > > In my case, it gives an error that I need at least 11 > disks (which I don't) but the point is that raidz > parity does not seem to be limited to 3. Is this not > true? You have piqued my curiosity. I was asking for that feature in these forums last year. What OS, version and ZFS version are you running? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Raidz - what is stored in parity?
Erik Trimble wrote: > On 8/10/2010 9:57 PM, Peter Taps wrote: > > Hi Eric, > > > > Thank you for your help. At least one part is clear > now. > > > > I still am confused about how the system is still > functional after one disk fails. > > > > Consider my earlier example of 3 disks zpool > configured for raidz-1. To keep it simple let's not > consider block sizes. > > > > Let's say I send a write value "abcdef" to the > zpool. > > > > As the data gets striped, we will have 2 characters > per disk. > > > > disk1 = "ab" + some parity info > > disk2 = "cd" + some parity info > > disk3 = "ef" + some parity info > > > > Now, if disk2 fails, I lost "cd." How will I ever > recover this? The parity info may tell me that > something is bad but I don't see how my data will get > recovered. > > > > The only good thing is that any newer data will now > be striped over two disks. > > > > Perhaps I am missing some fundamental concept about > raidz. > > > > Regards, > > Peter > > Parity is not intended to tell you *if* something is > bad (well, it's not > *designed* for that). It tells you how to RECONSTRUCT > something should > it be bad. ZFS uses Checksums of the data (which are > stored as data > themselves) to tell if some data is bad, and thus > needs to be re-written To follow up Erik's post, parity is used both to detect and correct errors in a string of equal-sized numbers; each parity is equal in size to each of the numbers. In the old serial protocols, one bit was used to detect an error in a string of 7 bits, so each "number" in the string was a single bit. In the case of ZFS, each "number" in the string is a disk block. The length of the string of numbers is completely arbitrary. I am rusty on parity math, but Reed-Solomon is used (of which XOR is a degenerate case) such that each parity is independent of the other parities. RAIDZ can support up to three parities per stripe. 
Generally, a single parity can either detect a single corrupt number in a string or, if it is known which number is corrupt, correct that number. Traditional RAID5 makes the assumption that it knows which number (i.e. block) is bad because the disk failed and therefore can use the parity block to reconstruct it. RAID5 cannot reconstruct a random bit-flip. RAIDZ takes a different approach, where the checksum for the number string (i.e. stripe) exists in a different, already validated stripe. With that checksum in hand, ZFS knows when a stripe is corrupt but not which block. ZFS will then reconstruct each data block in the stripe using the parity block, one data block at a time, until the checksum matches. At that point ZFS knows which block is bad and can rebuild it and write it to disk. A scrub does this for all stripes and all parities in each stripe. Using the example above, the disk layout would look more like the following for a single stripe, and as Erik mentioned, the location of the data and parity blocks will change from stripe to stripe: disk1 = "ab" disk2 = "cd" disk3 = parity info Again using the example above, if disk 2 fails, or even stays online but produces bad data, the information can be reconstructed from disk 3. The beauty of ZFS is that it does not depend on parity to detect errors; your stripes can be as wide as you want (up to 100-ish devices) and you can choose 1, 2 or 3 parity devices. Hope that makes sense, Marty -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
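A toy illustration of the single-parity (XOR) case, shrunk from disk blocks down to one byte per "block". This is only the degenerate XOR case; the second and third RAIDZ parities use the fancier Reed-Solomon math:

```shell
# Two data "blocks" and their XOR parity, one byte each.
D1=$((0xAB))     # block on disk1
D2=$((0xCD))     # block on disk2
P=$((D1 ^ D2))   # parity block on disk3

# disk2 dies (or survives but returns garbage): rebuild its block
# from the surviving data block and the parity block.
REBUILT=$((D1 ^ P))
printf 'rebuilt=0x%X\n' "$REBUILT"   # rebuilt=0xCD -- the lost block
```

Note what the shell can't show: nothing in the XOR itself says *which* block was bad. That is exactly the gap the stripe checksum fills for ZFS.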
Re: [zfs-discuss] Disk space on Raidz1 configuration
> ahh that explains it all, god damn that base 1000 > standard , only usefull for sales people :) As much as it all annoys me too, the SI prefixes are used correctly pretty much everywhere except in operating systems. A kilometer is not 1024 meters and a megawatt is not 1048576 watts. We, the IT community, grabbed a set of well-defined prefixes used by the rest of creation, redefined them, and then became angry because the remainder of civilization uses the correct terms. We have no one to blame but ourselves. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
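The arithmetic behind the perennial "where did my space go" question is simple enough to do in the shell: the vendor counts in powers of 1000, the OS reports in powers of 1024.

```shell
# A "1 TB" drive as the vendor counts it, shown in the OS's 2^30 units.
TB_SI=$((1000 * 1000 * 1000 * 1000))   # 10^12 bytes
GIB=$((1024 * 1024 * 1024))            # one binary gigabyte (GiB)
echo "$((TB_SI / GIB)) GiB"            # 931 GiB -- no bytes were lost
```

Nothing is missing; the same number of bytes is simply measured with two different rulers.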
Re: [zfs-discuss] slog/L2ARC on a hard drive and not SSD?
> Hi, > Out of pure curiosity, I was wondering, what would > happen if one tries to use a regular 7200RPM (or 10K) > drive as slog or L2ARC (or both)? I have done both with success. At one point my backup pool was a collection of USB attached drives (please keep the laughter down) with dedup=verify. Solaris' slow USB performance coupled with slow drives and dedup reads gave abysmal write speeds, so much so that at times it had trouble keeping the snapshots synchronized. To help it along, I took an unused fast, small SCSI disk and made it L2ARC, which significantly improved write performance on the pool. During testing of some iSCSI applications, I ran into a scenario where a client was performing many small, synchronous writes to a zvol in a wide RAIDZ3 stripe. Since synchronous writes can double the write activity (once for the zil and once for the actual pool), actual throughput from the client was below 2MB/s, even though the pool would sustain 200MB/s on sequential writes. As above, I added a mirrored slog which was two small, fast SCSI drives. While I expected the throughput to double, it actually went up by a factor of 4, to 8MB/s. Even though 8MB/s wasn't mind-numbing, it was enough that it was close to saturating the client's 100Mb ethernet link, so it was ok. I think the reason that the slog improved things so much is that the slog disks were not bothered with other i/o and were doing very little seeking. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
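For anyone wanting to repeat the experiment above, the commands are roughly as follows. Pool and device names are hypothetical:

```shell
# Add a mirrored slog (two small, fast disks) to an existing pool.
zpool add tank log mirror c4t0d0 c4t1d0

# Add a spare fast disk as L2ARC.
zpool add tank cache c5t0d0

# Watch the log and cache devices absorb the synchronous/read traffic.
zpool iostat -v tank 5
```

Both log and cache devices can later be removed with "zpool remove", so the experiment is cheap to undo.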
Re: [zfs-discuss] Help identify failed drive
> If the format utility is not displaying the WD drives > correctly, > then ZFS won't see them correctly either. You need to > find out why. > > I would export this pool and recheck all of your > device connections. I didn't see it in the postings, but are the same serial numbers showing up multiple times? Is accidental multipathing taking place here? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Help identify failed drive
Michael Shadle wrote: >Actually I guess my real question is why iostat hasn't logged any > errors in its counters even though the device has been bad in there > for months? One of my arrays had a drive in slot 4 fault -- lots of reset-something-or-other errors. I cleared the errors and the pool and it did it again, even though the drive was showing ok in smartmontools and passed its internal self test. I replaced the drive with my cold spare and a week later the replacement drive in slot 4 had the same errors. Clearly it was the chassis and not the drive. I blew out the connector on slot 4 and it did it again a week later. Again I cleared the error, cycled the power on the array and haven't had the problem in the past 5 weeks. Sometimes things just happen, I guess. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Help identify failed drive
> > ' iostat -Eni ' indeed outputs Device ID on some of > > the drives,but I still > > can't understand how it helps me to identify model > > of specific drive. Get and install smartmontools. Period. I resisted it for a few weeks but it has been an amazing tool. It will tell you more than you ever wanted to know about any disk drive in the /dev/rdsk/ tree, down to the serial number. I have seen zfs remember original names in a pool after they have been renamed by the OS, such that "zpool status" can list c22t4d0 as a drive in the pool when there exists no such drive on the system. > Why has it been reported as bad (for probably 2 > months now, I haven't > got around to figuring out which disk in the case it > is etc.) but the > iostat isn't showing me any errors. Start a scrub or do an obscure find, e.g. "find /tank_mountpoint -name core", and watch the drive activity lights. The drive in the pool which isn't blinking like crazy is a faulted/offlined drive. Ugly and oh-so-hackerish, but it works. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
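A quick sketch of both techniques; the device path and mountpoint are hypothetical, and on some HBAs smartctl needs a "-d scsi" or "-d sat" device-type hint:

```shell
# Identify a drive down to model, serial number and firmware.
smartctl -i /dev/rdsk/c22t4d0s0

# The low-tech alternative: generate pool-wide read I/O and watch the
# activity lights for the one drive that stays dark.
find /tank_mountpoint -name core > /dev/null 2>&1
```

Matching the serial number that smartctl prints against the label on the drive sled is still the only step that can't be done from the keyboard.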
Re: [zfs-discuss] Move Fedora or Windows disk image to ZFS (iScsi Boot)
> I've found plenty of documentation on how to create a > ZFS volume, iscsi share it, and then do a fresh > install of Fedora or Windows on the volume. Really? I have found just the opposite: how to move your functioning Windows/Linux install to iSCSI. I am fumbling through this process for Ubuntu on a laptop using a Frankenstein mishmash of PXE -> gPXE -> menu.cfg -> sanboot -> grub -> initrd -> Ubuntu. The initial install is through Ubuntu's netboot pxelinux.0 files which make iSCSI installs fairly painless as long as there are no initiator restrictions on the LUN. I couldn't find the magic formula in dnsmasq (on my router) to set the target and initiators which is needed to allow multiple devices to see their own iSCSI volumes, so I used a ${uuid} suffix for both in a gPXE menu.cfg file. Stranger still, it seems that only one LUN can be allocated system-wide, so I can't map LUN0 to target iqn.foo and another LUN0 to target iqn.bar, which means each initiator gets a non-zero LUN. It doesn't seem to bother the iSCSI stacks, but it bugs me. The other poster is correct, all of this has to match in gPXE, initrd and Ubuntu. Either I am more daft than I thought (always a safe choice), or the same thing is very difficult in Windows. To be honest, I have not braved a raw Windows install to iSCSI yet, but will once I conquer Ubuntu. The advantage of going straight to iSCSI is that the zvol can be arbitrarily large and you only allocate the blocks which have been touched. If you install to a disk then do the dd if=localdisk of=iSCSIdisk approach, the zvol will be completely allocated. Worse, the iSCSI volume is limited to the size of the original disk, which kind of misses the point of thin provisioning. Good luck. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] preparing for future drive additions
Cindy wrote: > Mirrored pools are more flexible and generally > provide good performance. > > You can easily create a mirrored pool of two disks > and then add two > more disks later. You can also replace each disk with > larger disks > if needed. See the example below. There is no dispute that multiple vdevs (mirrors or otherwise) allow changing the drives in a single vdev without requiring a change to the whole pool. There also is no dispute that mirrors provide better read iops than any other vdev type. On the other hand, situation after situation exists where 2+ drives go offline in a pool, leaving the RAIDZ1 and single-mirror vdevs in real trouble. As I write this, the first thread in this forum is about an invalid pool because one drive died and another is offline, leaving the pool corrupted. This stuff just happens in the real world with non-DMX-class gear. One major point I read over and over about zfs was that it allowed the same level of protection without needing to spend $35 per GB of storage from an enterprise vendor. The only way to make this happen is with significant redundancy. I choose n+3 redundancy and love it. It's like having two prebuilt hot spares. To achieve n+3 redundancy with mirrors would require quadrupling the costs and spindle count vs. unprotected storage. It would seem that any vdev with n+1 protection is not adequate protection using sub-million-dollar storage equipment. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Remove non-redundant disk
> I think the request is to remove vdev's from a pool. > Not currently possible. Is this in the works? Actually, I think this is two requests, hashed over hundreds of times in this forum: 1. Remove a vdev from a pool 2. Nondisruptively change vdev geometry #1 above has a stunningly obvious use case. Suppose, despite your best efforts, QA, planning and walkthroughs, you accidentally fat finger a "zpool attach" and unintentionally "zpool add" a disk to a pool. There is no way to reverse that operation without *significant* downtime. I have discussed #2 above multiple times and it has at least one obvious use case. Suppose, just for a minute, that over the years since you deployed a zfs pool with nearly constant uptime, your business needs change and you need to add a disk to a RAIDZ vdev, or move from RAIDZ1 to RAIDZ2, or disks have grown so big that you wish to remove a disk from a vdev. The responses from the community on the two requests seem to be: 1. Don't ever make this mistake and if you do, then tough luck 2. No business ever changes, or technology never changes, or zfs deployments have short lives, or businesses are perfectly ok with large downtimes to effect geometry changes. Both responses seem antithetical to the zfs ethos of survivability in the face of errors and nondisruptive flexibility. Honestly, I still don't understand the resistance to adding those features. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] optimal ZFS filesystem layout on JBOD
Joachim Worringen wrote: > Greetings, > > we are running a few databases of currently 200GB > (growing) in total for data warehousing: > - new data via INSERTs for (up to) millions of rows > per day; sometimes with UPDATEs > - most data in a single table (=> 10 to 100s of > millions of rows) > - queries SELECT subsets of this table via an index > - for effective parallelisation, queries create > (potentially large) non-temporary tables which are > deleted at the end of the query => lots of simple > INSERTs and SELECTs during queries > - large transactions: they may contain millions of > INSERTs/UPDATEs > - running version PostgreSQL 8.4.2 > > We are moving all this to a larger system - the > hardware is available, therefore fixed: > - Sun X4600 (16 cores, 64GB) > - external SAS JBOD with 24 2,5" slots: > o 18x SAS 10k 146GB drives > o 2x SAS 10k 73GB drives > o 4x Intel SLC 32GB SATA SSD > JBOD connected to Adaptec SAS HBA with BBU > - Internal storage via on-board RAID HBA: > o 2x 73GB SAS 10k for OS (RAID1) > o 2x Intel SLC 32GB SATA SSD for ZIL (RAID1) (?) > - OS will be Solaris 10 to have ZFS as filesystem > (and dtrace) > - 10GigE towards client tier (currently, another > X4600 with 32cores and 64GB) > > What would be the optimal storage/ZFS layout for > this? I checked solarisinternals.com and some > PostgreSQL resources and came to the following > concept - asking for your comments: > - run the JBOD without HW-RAID, but let all > redundancy be done by ZFS for maximum flexibility > - create separate ZFS pools for tablespaces (data, > index, temp) and WAL on separate devices (LUNs): > - use the 4 SSDs in the JBOD as Level-2 ARC cache > (can I use a single cache for all pools?) 
w/o > redundancy - use the 2 SSDs connected to the on-board HBA as > RAID1 for ZFS ZIL > > Potential issues that I see: > - the ZFS ZIL will not benefit from a BBU (as it is > connected to the backplane, driven by the > onboard-RAID), and might be too small (32GB for ~2TB > of data with lots of writes)? > - the pools on the JBOD might have the wrong size for > the tablespaces - like: using the 2 73GB drives as > RAID 1 for temp might become too small, but adding a > 146GB drive might not be a good idea? > - with 20 spindles, does it make sense at all to use > dedicated devices for the tabelspaces, or will the > load be distributed well enough across the spindles > anyway? > > thanks for any comments & suggestions, > > Joachim I'll chime in based on some tuning experience I had under UFS with Pg 7.x coupled with some experience with ZFS, but not experience with later Pg on ZFS. Take this with a grain of salt. Pg loves to push everything to the WAL and then dribble the changes back to the datafiles when convenient. At a checkpoint, all of the changes are flushed in bulk to the tablespace. Since the changes to WAL and disk are synchronous, ZIL is used, which I believe translates to all data being written four times under ZFS: once to WAL ZIL, then to WAL, then to tablespace ZIL, then to tablespace. For writes, I would break WAL into its own pool and then put an SSD ZIL mirror on that. It would allow all writes to be nearly instant to WAL and would keep the ZIL needs to the size of the WAL, which probably won't exceed the size of your SSD. The ZIL on WAL will especially help with large index updates which can cause cascading b-tree splits and result in large amounts of small synchronous I/O, bringing Pg to a crawl. Checkpoints will still slow things down when the data is flushed to the tablespace pool, but that will happen with coalesced writes, so iops would be less of a concern. 
For reads, I would either keep indexes and tables on the same pool and back them with as much L2ARC as needed for the working set, or if you lack sufficient L2ARC, break the indexes into their own pool and L2ARC those instead, because index reads generally are more random and heavily used, at least for well tuned queries. Full table scans for well-vacuumed tables are generally sequential in nature, so table iops again are less of a concern. If you have to break the indexes into their own pool for dedicated SSD L2ARC, you might consider adding some smaller or short-stroked 15K drives for L2ARC on the table pool. For geometry, find the redundancy that you need, e.g. +1, +2 or +3, then decide which is more important, space or iops. If L2ARC and ZIL reduce your need for iops, then go with RAIDZ[123]. If you still need the iops, pile a bunch of [123]-way mirrors together. Yes, I would avoid HW raid and run pure JBOD and would be tempted to keep temp tables on the index or table pool. Like I said above, take this with a grain of salt and feel free to throw out, disagree with or lampoon me for anything that does not resonate with you. Whatever you do, make sure you stress-test the configuration with production-size data and workloads before you deploy it. Good luck, Marty -- This message posted from
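The layout suggested above could be sketched roughly like this. All device names are hypothetical, the geometry is one of several defensible choices, and the 8k recordsize is a commonly cited match for PostgreSQL's page size rather than anything from the original post:

```shell
# WAL pool: small mirror plus a mirrored SSD slog, so synchronous
# commits land on the SSDs.
zpool create wal mirror c1t0d0 c1t1d0 log mirror c2t0d0 c2t1d0

# Table/index pool: striped mirrors for iops, SSDs as L2ARC for the
# random index reads.
zpool create tables mirror c3t0d0 c3t1d0 mirror c3t2d0 c3t3d0 \
    cache c4t0d0 c4t1d0

# Match the filesystem recordsize to Pg's 8k block size for the data.
zfs create -o recordsize=8k tables/pgdata
```

As the post says, whatever variant you pick, stress-test it with production-size data before committing to it.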
Re: [zfs-discuss] Depth of Scrub
> I have a small question about the depth of scrub in a raidz/2/3 configuration. I'm quite sure scrub does not check spares or unused areas of the disks (it could check if the disks detect any errors there). But what about the parity?

From some informal performance testing of RAIDZ2/3 arrays, I am confident that scrub reads the parity blocks and normal reads do not. You can see this for yourself with "iostat -x" or "zpool iostat -v". Start monitoring and watch read I/O. You will see regularly that a RAIDZ3 array reads from all but three drives, which I presume is the unread parity. Do the same monitoring while a scrub is underway and you will see all drives being read from equally. My experience suggests something similar is taking place with mirrors.

If you think about it, having a scrub check everything but the parity would be a rather pointless operation. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
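For anyone wanting to repeat the informal test, the commands are just these (pool name is a placeholder; intervals are arbitrary):

```shell
# Watch per-device reads during normal I/O: on a RAIDZ3 vdev you
# should see all but three disks servicing each read.
zpool iostat -v tank 5

# Kick off a scrub and watch again: now all disks read about equally,
# which suggests parity is being read and verified too.
zpool scrub tank
zpool iostat -v tank 5

# Or observe the same thing from the OS side:
iostat -x 5
```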
Re: [zfs-discuss] one more time: pool size changes
On Jun 3, 2010 7:35 PM, David Magda wrote:
> On Jun 3, 2010, at 13:36, Garrett D'Amore wrote:
> > Perhaps you have been unlucky. Certainly, there is a window with N+1 redundancy where a single failure leaves the system exposed in the face of a 2nd fault. This is a statistics game...
>
> It doesn't even have to be a drive failure, but an unrecoverable read error.

Well said. Also include a controller burp, a bit flip somewhere, a drive going offline briefly, a momentary fibre cable interruption, etc. The list goes on.

My experience is that these weirdo "once in a lifetime" issues tend to present in clumps which are not as evenly distributed as statistics would lead you to believe. Rather, like my kids, they save up their fun into coordinated bursts. When these bursts happen, you get to have the ensuing conversations with stakeholders about how all of this "redundancy" you tricked them into purchasing has left them exposed. Not good times. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] one more time: pool size changes
David Dyer-Bennet wrote:
> My choice of mirrors rather than RAIDZ is based on the fact that I have only 8 hot-swap bays (I still think of this as LARGE for a home server; the competition, things like the Drobo, tends to have 4 or 5), that I don't need really large amounts of storage (after my latest upgrade I'm running with 1.2TB of available data space), and that I expected to need to expand storage over the life of the system. With mirror vdevs, I can expand them without compromising redundancy even temporarily, by attaching the new drives before I detach the old drives; I couldn't do that with RAIDZ. Also, the fact that disk is now so cheap means that 100% redundancy is affordable, I don't have to compromise on RAIDZ.

Maybe I have been unlucky too many times doing storage admin in the 90s, but simple mirroring still scares me. Even with a hot spare (you do have one, right?) the rebuild window leaves the entire pool exposed to a single failure.

One of the nice things about zfs is that it allows "to each his own." My home server's main pool is 22x 73GB disks in a Sun A5000 configured as RAIDZ3. Even without a hot spare, it takes several failures to get the pool into trouble. At the same time, there are several downsides to a wide stripe like that, including relatively poor iops and longer rebuild windows. As noted above, until bp_rewrite arrives, I cannot change the geometry of a vdev, which kind of limits the flexibility.

As a side rant, I still find myself baffled that Oracle/Sun correctly touts the benefits of zfs in the enterprise, including tremendous flexibility and simplicity of filesystem provisioning and nondisruptive changes to filesystems via properties. These forums are filled with people stating that the enterprise demands simple, flexible and nondisruptive filesystem changes, yet somehow no enterprise cares about simple, flexible and nondisruptive pool/vdev changes, e.g. changing a vdev geometry or evacuating a vdev.
I can't accept that zfs flexibility is critical and zpool flexibility is unwanted. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] creating a fast ZIL device for $200
I have a Sun A5000, 22x 73GB 15K disks in split-bus configuration, two dual 2Gb HBAs and four fibre cables from server to array, all for just under $200. The array gives 4Gb of aggregate throughput in each direction across two 11-disk buses. Right now it is the main array, but when we outgrow its storage it will become a multiple external ZIL / L2ARC array for a slow sata array. Admittedly, it is rare for all of the pieces to come together at the right price like this, and since it is unsupported no one would seriously consider it for production. At the same time, it makes blistering main storage today and will provide amazing iops against slow storage later. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposed idea for enhancement - damage control
I can't stop myself; I have to respond. :-)

Richard wrote:
> The ideal pool has one inexpensive, fast, and reliable device :-)

My ideal pool has become one inexpensive, fast and reliable "device" built on whatever I choose.

> I'm not sure how to connect those into the system (USB 3?)

Me neither, but if I had to start guessing about host connections, I would probably think FC.

> but when you build it, let us know how it works out.

While it would be a fun project, a toy like that would certainly exceed my feeble hardware experience and even more feeble budget. At the same time, I could make a compelling argument that this sort of arrangement (stripes of flash) is the future of tier-one storage. We already are seeing SSD devices which internally are stripes of flash. More and more disk farms are taking on the older roles of tape. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposed idea for enhancement - damage control
Bob Friesenhahn wrote:
> It is unreasonable to spend more than 24 hours to resilver a single drive. It is unreasonable to spend more than 6 days resilvering all of the devices in a RAID group (the 7th day is reserved for the system administrator). It is unreasonable to spend very much time at all on resilvering (using current rotating media) since the resilvering process kills performance.

Bob, I agree with the vast majority of your post. At the same time, I might disagree with a couple of things. I don't really care how long a resilver takes (hours, days, months) given a couple things:

* Sufficient protection exists on the degraded array during rebuild
** Put another way, the array is NEVER in danger
* Rebuild takes a back seat to production demands

Since I am on a rant, I suspect there is also room for improvement in the scrub. Why would I rescrub a stripe that was read (and presumably validated) 30 seconds ago by a production application? Wouldn't it make more sense for scrub to "play nice" with production, moving at a leisurely pace and only scrubbing stripes not read in the past X hours/days/weeks/whatever?

I also agree that an ideal pool would lower the device capacity and radically increase the device count. In my perfect world, I would have a RAID set of 200+ cheap, low-latency, low-capacity flash drives backed by an additional N% parity, e.g. 40-ish flash drives. A setup like this would give massive throughput (200x each flash drive), amazing IOPS and incredible resiliency. Rebuild times would be low due to lower capacity. One could probably build such a beast in 1U using MicroSDHC cards or some such thing.

End rant.

Cheers,

Marty -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
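To make the "play nice" scrub idea concrete, the policy could look something like this sketch. Nothing like this exists in zfs; the function, the block-id bookkeeping, and the 24-hour window are all illustrative assumptions:

```python
import time

# Arbitrary policy knob: skip blocks validated by a read in the last day.
RECENT_WINDOW = 24 * 60 * 60


def blocks_to_scrub(last_validated, now=None):
    """Return block ids whose last checksum-validated read is older than
    RECENT_WINDOW. A recent application read already proved those blocks
    good, so a lazy scrub can skip them and focus on cold data."""
    now = time.time() if now is None else now
    return [blk for blk, t in last_validated.items() if now - t > RECENT_WINDOW]
```

Under a policy like this, a busy pool would scrub mostly cold data, which is exactly the data most likely to be harboring undetected rot.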
Re: [zfs-discuss] Cores vs. Speed?
>> Was my raidz2 performance comment above correct? That the write speed is that of the slowest disk? That is what I believe I have read.

> You are sort-of-correct that its the write speed of the slowest disk.

My experience is not in line with that statement. RAIDZ will write a complete stripe plus parity (RAIDZ2 -> two parities, etc.). The write speed of the entire stripe will be brought down to that of the slowest disk, but only for its portion of the stripe. In the case of a 5-spindle RAIDZ2, 1/3 of the stripe will be written to each of three disks and parity to the other two disks. The throughput would be 3x the slowest disk for read or write.

> Mirrored drives will be faster, especially for random I/O. But you sacrifice storage for that performance boost.

Is that really true? Even after glancing at the code, I don't know if zfs overlaps mirror reads across devices. Watching my rpool mirror leads me to believe that it does not. If true, then mirror reads would be no faster than a single disk. Mirror writes are no faster than the slowest disk.

As a somewhat related rant, there seems to be confusion about mirror IOPS vs. RAIDZ[123] IOPS. Assuming mirror reads are not overlapped, a mirror vdev will read and write at roughly the same throughput and IOPS as a single disk (ignoring bus and cpu constraints). Also ignoring bus and cpu constraints, a RAIDZ[123] vdev will read and write at roughly the throughput of a single disk multiplied by the number of data drives: three in the config being discussed. A RAIDZ[123] vdev will have IOPS performance similar to that of a single disk.

A stack of mirror vdevs will, of course, perform much better than a single mirror vdev in terms of throughput and IOPS. A stack of RAIDZ[123] vdevs will also perform much better than a single RAIDZ[123] vdev in terms of throughput and IOPS.
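The back-of-envelope model above can be written down. This is only the simple reasoning from the post (ignoring bus/CPU limits, caching, and any mirror read overlapping), not a benchmark:

```python
def raidz_throughput(disks, parity, disk_mbps):
    """A RAIDZ[123] vdev streams at roughly (data-disk count) x (slowest disk)."""
    return (disks - parity) * disk_mbps


def mirror_throughput(disk_mbps):
    """Assuming reads are not overlapped across sides, a mirror vdev
    reads and writes at roughly single-disk speed."""
    return disk_mbps


def pool_iops(vdevs, disk_iops):
    """IOPS scale with the vdev count; each vdev behaves like ~one disk."""
    return vdevs * disk_iops


# The 5-spindle RAIDZ2 from the thread: three data disks' worth of streaming.
print(raidz_throughput(5, 2, 100))  # 300 (MB/s, given 100 MB/s disks)
print(pool_iops(4, 200))            # 800 (four vdevs of 200-IOPS disks)
```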
RAIDZ tends to have more CPU overhead and provides more flexibility in choosing the optimal data to redundancy ratio. Many read IOPS problems can be mitigated by L2ARC, even a set of small, fast disk drives. Many write IOPS problems can be mitigated by ZIL. My anecdotal conclusions backed by zero science, Marty -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] adpu320 scsi timeouts only with ZFS
> To fix it, I swapped out the Adaptec controller and > put in LSI Logic > and all the problems went away. I'm using Sun's built-in LSI controller with (I presume) the original internal cable shipped by Sun. Still, no joy for me at U320 speeds. To be precise, when the controller is set at U320, it runs amazingly fast until it freezes, at which point it is quite slow. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] adpu320 scsi timeouts only with ZFS
> Any news regarding this issue? I'm having the same > problems. Me too. My v40z with U320 drives in the internal bay will lock up partway through a scrub. I backed the whole SCSI chain down to U160, but it seems a shame that U320 speeds can't be used. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] $100 SSD = >5x faster dedupe
--- On Thu, 1/7/10, Tiernan OToole wrote: > Sorry to hijack the thread, but can you > explain your setup? Sounds interesting, but need more > info... This is just a home setup to amuse me and placate my three boys, each of whom has several Windows instances running under Virtualbox. Server is a Sun v40z: quad 2.4 GHz Opteron with 16GB. Internal bays hold a pair of 73GB drives as a mirrored rpool and a pair of 36GB drives for spares to the array plus a 146GB drive I use as cache to the usb pool (a single 320GB sata drive). The array is an HP MSA30 with 14x36GB drives configured as RAIDZ3 using the spares listed above with auto snapshots as the tank pool. Tank is synchronized hourly to the usb pool. It's all connected via four HP 4000M switches (one at the server and one at each workstation) which are meshed via gigabit fiber. Two workstations are triple-head sunrays. One station is a single sunray 150 integrated unit. This is a work in progress with plenty of headroom to grow. I started the build in November and have less than $1200 into it so far. Thanks for letting me hijack the thread by sharing! Cheers, Marty ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] $100 SSD = >5x faster dedupe
Ian wrote:
> Why did you set dedup=verify on the USB pool?

Because that is my last-ditch copy of the data and it MUST be correct. At the same time, I want to cram as much data as possible into the pool. If I ever go to the USB pool, something has already gone horribly wrong and I am desperate. I can't comprehend the anxiety I would have if one or more stripes had a birthday collision, giving me silent data corruption that I found out about months or years later. It's probably paranoid, but it's a level of paranoia I can live with.

Good question, by the way. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] $100 SSD = >5x faster dedupe
Michael Herf wrote:
> I've written about my slow-to-dedupe RAIDZ. After a week of waiting, I finally bought a little $100 30GB OCZ Vertex and plugged it in as a cache. After <2 hours of warmup, my zfs send/receive rate on the pool is >16MB/sec (reading and writing each at 16MB as measured by zpool iostat). That's up from <3MB/sec with a RAM-only cache on a 6GB machine. The SSD has about 8GB utilized right now, and the L2ARC benefit is amazing. Quite an amazing improvement for $100...recommend you don't dedupe without one.

I did something similar, but with a SCSI drive. I keep a large external USB drive as a "last ditch" recovery pool which is synchronized hourly from the main pool, kind of like a poor man's tape backup. When I enabled dedup=verify on the USB pool, the sync performance went south, because the USB drive had to read stripes to verify that they were actual dups. Since I had an unused 146GB SCSI drive plugged in, I made the SCSI drive L2ARC for the USB pool. Write performance skyrocketed by a factor of 6 and is now faster than before dedupe was enabled.

Marty -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
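For anyone wanting to replicate the arrangement, the moving parts are just two commands; pool and device names here are placeholders standing in for my setup:

```shell
# Verify every dedup match byte-for-byte instead of trusting the hash alone.
zfs set dedup=verify usbpool

# Press the spare SCSI drive into service as L2ARC for the slow pool;
# dedup-table lookups and verify reads then come from the cache device.
zpool add usbpool cache c4t2d0

# Watch the cache warm up.
zpool iostat -v usbpool 5
```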
Re: [zfs-discuss] raidz data loss stories?
risner wrote:
> If I understand correctly, raidz{1} is 1 drive protection and space is (drives - 1) available. Raidz2 is 2 drive protection and space is (drives - 2) etc. Same for raidz3 being 3 drive protection.

Yes.

> Everything I've seen you should stay around 6-9 drives for raidz, so don't do a raidz3 with 12 drives. Instead make two raidz3 with 6 drives each (which is (6-3)*1.5 * 2 = 9 TB array.)

From what I can tell, this is purely a function of needed IOPS. Wider stripe = better storage/bandwidth utilization = fewer IOPS. For home usage I run a 14-drive RAIDZ3 array.

> As for whether or not to do raidz, for me the issue is performance. I can't handle the raidz write penalty.

If there is a RAIDZ write penalty over mirroring, I am unaware of it. In fact, sequential writes are faster under RAIDZ.

> If I needed triple drive protection, a 3way mirror setup would be the only way I would go.

That will give high IOPS with 33% storage utilization and 33% bandwidth utilization. In other words, for every MB of data read/written by an application, 3MB is read/written from/to the array and stored. Multiply all storage and bandwidth needs by three.

> I don't yet quite understand why a 3+ drive raidz2 vdev is better than a 3 drive mirror vdev? Other than a 5 drive setup is 3 drives of space when a 6 drive setup using 3 way mirror is only 2 drive space.

Part of the question you answered yourself. The other part is that with a 6-drive RAIDZ3, I can lose ANY three drives and still be running. With three mirrors, I can lose the pool if the wrong two drives die. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
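The 33% utilization arithmetic generalizes; a tiny helper makes the space trade explicit (this is only the simple model in the post, nothing more):

```python
from fractions import Fraction


def mirror_usable(ways):
    """An N-way mirror stores one full copy per side: 1/N of raw space usable."""
    return Fraction(1, ways)


def raidz_usable(disks, parity):
    """RAIDZ-P on D disks keeps (D - P) disks' worth of data."""
    return Fraction(disks - parity, disks)


# 3-way mirror vs the 6-disk RAIDZ3 from the thread: similar protection depth,
# but 1/3 vs 1/2 of raw space usable.
print(mirror_usable(3), raidz_usable(6, 3))
```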
Re: [zfs-discuss] raidz data loss stories?
Bob Friesenhahn wrote:
> On Tue, 22 Dec 2009, Marty Scholes wrote:
> > That's not entirely true, is it?
> > * RAIDZ is RAID5 + checksum + COW
> > * RAIDZ2 is RAID6 + checksum + COW
> > * A stack of mirror vdevs is RAID10 + checksum + COW
>
> These are layman's simplifications that no one here should be comfortable with.

Well, ok. They do seem to capture the essence of what the different flavors of ZFS protection do, but I'll take you at your word. We do seem to be spinning off on a tangent, tho.

> Zfs borrows proven data recovery technologies from classic RAID but the data layout on disk is not classic RAID, or even close to it. Metadata and file data are handled differently. Metadata is always duplicated, with the most critical metadata being strewn across multiple disks. Even "mirror" disks are not really mirrors of each other.

I am having a little trouble reconciling the above statements, but again, ok. I haven't read the official RAID spec, so again, I'll take you at your word. Honestly, those seem like important nuances, but nuances nonetheless.

> Earlier in this discussion thread someone claimed that if a raidz disk was lost that the pool was then just one data error away from total disaster

That would be me. Let me substitute the phrase "user data loss in some way, shape or form which disrupts availability" for the words "total disaster." Honestly, I think we are splitting hairs here. Everyone agrees that RAIDZ takes RAID5 to a new level. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz data loss stories?
Bob Friesenhahn wrote: > Why are people talking about "RAID-5", RAID-6", and > "RAID-10" on this > list? This is the zfs-discuss list and zfs does not > do "RAID-5", > "RAID-6", or "RAID-10". > > Applying classic RAID terms to zfs is just plain > wrong and misleading > since zfs does not directly implement these classic > RAID approaches > even though it re-uses some of the algorithms for > data recovery. > Details do matter. That's not entirely true, is it? * RAIDZ is RAID5 + checksum + COW * RAIDZ2 is RAID6 + checksum + COW * A stack of mirror vdevs is RAID10 + checksum + COW While there isn't an actual one-to-one mapping, many traditional RAID concepts do seem to apply to ZFS discussions, don't they? Marty -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz data loss stories?
> > Hi Ross,
> >
> > What about good old raid10? It's a pretty reasonable choice for heavy loaded storages, isn't it?
> >
> > I remember when I migrated raidz2 to 8-drive raid10 the application administrators were just really happy with the new access speed. (we didn't use striped raidz2 though as you are suggesting).
>
> Raid10 provides excellent performance and if performance is a priority then I recommend it, but I was under the impression that resiliency was the priority, as raidz2/raidz3 provide greater resiliency for a sacrifice in performance.

My experience is in line with Ross' comments. There is no question that more independent vdevs will improve IOPS, e.g. RAID10 or even a pile of RAIDZ vdevs.

I have been burnt too many times to let an array go critical (no redundancy). Never, ever, ever again. With RAID1 or RAID10, one disk loss puts the whole pool critical, just one bad sector from disaster. One prays the hot spare can be rebuilt in time. With RAIDZ, the same is true. I think of triple (or even quad) mirroring the same way as I think of RAIDZ3: it's like having prebuilt hot spares.

I suspect that the IOPS problems of wide stripes are being mitigated by L2ARC/ZIL and that the trend will be toward wide stripes with ever higher parity counts. Sun's recent storage offerings tend to confirm this trend: slower, cheaper and bigger SATA drives fronted by SSD L2ARC and ZIL. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Stupid to have 2 disk raidz?
Erik Trimble wrote:
> As always, the devil is in the details. In this case, the primary problem I'm having is maintaining two different block mapping schemes (one for the old disk layout, and one for the new disk layout) and still being able to interrupt the expansion process. My primary problem is that I have to keep both schemes in memory during the migration, and if something should happen (i.e. reboot, panic, etc) then I lose the current state of the zpool, and everything goes to hell in a handbasket.

It might not be that bad, if only zfs would allow mirroring a raidz pool.

Back when I did storage admin for a smaller company where availability was hyper-critical (but we couldn't afford EMC/Veritas), we had a hardware RAID5 array. After a few years of service, we ran into some problems:

* Need to restripe the array? Screwed.
* Need to replace the array because the current one is EOL? Screwed.
* Array controller barfed for whatever reason? Screwed.
* Need to flash the controller with the latest firmware? Screwed.
* Need to replace a component on the array, e.g. NIC, controller or power supply? Screwed.
* Need to relocate the array? Screwed.

If we could stomach downtime or short-lived storage solutions, none of this would have mattered. To get around this, we took two hardware RAID arrays and mirrored them in software. We could offline/restripe/replace/upgrade/relocate/whatever-we-wanted on an individual array, since it was only a mirror which we could offline/online or detach/attach.

I suspect this could be simulated today by setting up a mirrored pool on top of zvols from raidz pools. That involves a lot of overhead, doing parity/checksum calculations multiple times for the same data. On the plus side, setting this up might make it possible to defrag a pool.
Should zfs simply allow mirroring one pool with another, then with a few spare disks lying around, altering the geometry of an existing pool could be done with zero downtime using steps similar to the following:

1. Create spare_pool as large as current_pool using spare disks
2. Attach spare_pool to current_pool
3. Wait for resilver to complete
4. Detach and destroy current_pool
5. Create new_pool the way you want it now
6. Attach new_pool to spare_pool
7. Wait for resilver to complete
8. Detach/destroy spare_pool
9. Chuckle at the fact that you completely remade your production pool while fully available

I did this dance several times over the course of many years back in the Disksuite days.

Thoughts?

Marty -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
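The zvol-based simulation I suspect would work today could be sketched like this; every pool and device name is hypothetical, and the double-parity overhead caveat stands:

```shell
# Two raidz pools, each exporting one big zvol.
zpool create poolA raidz c1t0d0 c1t1d0 c1t2d0
zpool create poolB raidz c2t0d0 c2t1d0 c2t2d0
zfs create -V 500g poolA/vol
zfs create -V 500g poolB/vol

# Mirror the two zvols as the pool applications actually use.
zpool create datapool mirror /dev/zvol/dsk/poolA/vol /dev/zvol/dsk/poolB/vol

# Later, one side can be detached, rebuilt with new geometry, re-attached:
zpool detach datapool /dev/zvol/dsk/poolB/vol
# ...destroy and recreate poolB however you like, then resilver...
zpool attach datapool /dev/zvol/dsk/poolA/vol /dev/zvol/dsk/poolB/vol
```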
Re: [zfs-discuss] RAIDZ versus mirrroed
> Yes. This is a mathematical way of saying "lose any P+1 of N disks."

I am hesitant to beat this dead horse, yet it is a nuance that either I have completely misunderstood or many people I've met have completely missed. Whether a stripe of mirrors or a mirror of stripes, any single failure makes the array critical, i.e. one failure from disaster.

For example, suppose a stripe of four mirror sets. That stripe has 8 disks total: four data and four mirrors. If one disk fails, say on mirror set 3, then set 3 is running on a single disk. Should that remaining disk in set 3 fail, the whole stripe is lost. Yes, the stripe is safe as long as the next failure is not from set 3.

Contrast that with RAIDZ3. Suppose seven total disks with the same effective pool size: 4 data and 3 parity. If any single disk is lost, the array is not critical and can still survive any other loss. In fact, it can survive any three disk failures before it becomes critical.

I see it too often: someone states that a stripe of four mirror sets can sustain four disk failures. Yes, that's true, as long as the correct four disks fail. If we could control which disks fail, then none of this would even be necessary, so that argument seems rather silly. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
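The nuance is easy to check by brute force. This sketch compares second-failure odds for the 8-disk stripe of four mirrors versus the 7-disk RAIDZ3 described above (simple model: failures are independent and equally likely):

```python
from fractions import Fraction


def mirror_second_failure_fatal(pairs):
    """After one disk in a stripe of two-way mirrors dies, what fraction of
    possible second failures lose the pool? Only the dead disk's partner
    is fatal, out of (2 * pairs - 1) surviving disks."""
    survivors = 2 * pairs - 1
    return Fraction(1, survivors)


def raidz_second_failure_fatal(parity):
    """RAIDZ-P tolerates P failures; with P >= 2 no second loss is fatal."""
    return Fraction(0) if parity >= 2 else Fraction(1)


print(mirror_second_failure_fatal(4))  # 1/7 of second failures kill the stripe
print(raidz_second_failure_fatal(3))   # 0: two parities still to spare
```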
Re: [zfs-discuss] RAIDZ versus mirrroed
> This line of reasoning doesn't get you very far. It is much better to take a look at the mean time to data loss (MTTDL) for the various configurations. I wrote a series of blogs to show how this is done. http://blogs.sun.com/relling/tags/mttdl

I will play the Devil's advocate here and point out that the chart shows MTTDL for RAIDZ2, both 6 and 8 disk, is much better than mirroring. The chart does show that three-way mirroring is better still, and I would guess that RAIDZ3 surpasses that. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send older version?
Lori Alt wrote: > As for being able to read streams of a later format > on an earlier > version of ZFS, I don't think that will ever be > supported. In that > case, we really would have to somehow convert the > format of the objects > stored within the send stream and we have no plans to > implement anything > like that. If that is true, then it at least makes sense to include a "zfs downgrade" and "zpool downgrade" option, does it not? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAIDZ versus mirrroed
> Generally speaking, striping mirrors will be faster than raidz or raidz2, but it will require a higher number of disks and therefore higher cost to
> The main reason to use raidz or raidz2 instead of striping mirrors would be to keep the cost down, or to get higher usable space out of a fixed number of drives.

While it has been a while since I have done storage management for critical systems, the advantage I see with RAIDZN is better fault tolerance: any N drives may fail before the set goes critical. With straight mirroring, failure of the wrong two drives will invalidate the whole pool.

The advantage of striped mirrors is that it offers a better chance of higher iops (assuming the I/O is distributed correctly). Also, it might be easier to expand a mirror by upgrading only two drives with larger drives. With RAIDZ, the entire stripe of drives would need to be upgraded. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send older version?
> The zfs send stream is dependent on the version of the filesystem, so the only way to create an older stream is to create a back-versioned filesystem:
>
> zfs create -o version=N pool/filesystem
>
> You can see what versions your system supports by using the zfs upgrade command:

Thanks for the feedback. So if I have a version X pool/filesystem set, does that mean the way to move it back to an older version of TANK is to do something like:

* Create OLDTANK with version=N
* For each snapshot in TANK:
** (cd tank_snapshot; tar cvf -) | (cd old_tank; tar xvf -)
** zfs snapshot oldtank@the_snapshot_name

This seems rather involved to get my current files/snaps into an older format. What did I miss?

Thanks again,

Marty -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
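Spelled out as commands, the dance would look something like the following; pool names, mount points, and the target version number are all placeholders, and this is exactly the laborious path being complained about:

```shell
# Back-versioned target that the older release can read.
zpool create oldtank c5t0d0
zfs create -o version=14 oldtank/data   # version number illustrative

# Replay each snapshot's contents, oldest first, re-snapshotting as we go.
for snap in $(zfs list -H -t snapshot -o name -s creation -r tank/data); do
    name=${snap#*@}
    (cd /tank/data/.zfs/snapshot/"$name" && tar cf - .) | \
        (cd /oldtank/data && tar xf -)
    zfs snapshot oldtank/data@"$name"
done
```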
[zfs-discuss] zfs send older version?
After moving from SXCE to 2009.06, my ZFS pools/file systems were at too new of a version. I upgraded to the latest dev and recently upgraded to 122, but am not too thrilled with the instability, especially zfs send / recv lockups (don't recall the bug number). I keep a copy of all of my critical stuff along with the original auto snapshots on a USB drive. I really want to move back to 2009.06 and keep all of my files / snapshots. Is there a way somehow to zfs send an older stream that 2009.06 will read so that I can import that into 2009.06? Can I even create an older pool/dataset using 122? Ideally I would provision an older version of the data and simply reinstall 2009.06 and just import the pool created under 122. It seems this would be a regular request. If I understand it correctly, an older BE cannot read upgraded pools and file systems, so a boot image upgrade followed by a zfs and zpool upgrade would kill a shop's ability to fall back. Or am I mistaken? Is there a way to send older streams? Thanks, Marty -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss