Re: [zfs-discuss] RAIDZ one of the disk showing unavail
Miles Nordin wrote:
> Ralf, aren't you missing this obstinacy error:
>
>   sc> the following errors must be manually repaired:
>   sc> /dev/dsk/c0t2d0s0 is part of active ZFS pool export_content.
>
> and he used the -f flag.

No, I saw it. My understanding is that the drive was already unavailable right after the *creation* of the zpool, and replacing a broken drive with itself doesn't make sense. After replacing the drive with a working one, ZFS should recognize this automatically.

--
Ralf Ramge
Re: [zfs-discuss] RAIDZ one of the disk showing unavail
Srinivas Chadalavada wrote:
> I see the first disk as unavailable. How do I make it online?

By replacing it with a non-broken one.

--
Ralf Ramge
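For completeness, this is roughly what the replacement looks like once the new drive is physically in place - a minimal sketch that borrows the pool and device names from this thread; adjust both to your system:

---
# tell ZFS to rebuild onto the swapped-in drive
zpool replace export_content c0t2d0s0
# watch the resilver progress and the pool state
zpool status -v export_content
---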
Re: [zfs-discuss] x4500 vs AVS ?
[...] This is a cheap workaround, but honestly: you can use something like this for your own datacenter, but I bet nobody wants to sell it to a customer as a supported solution ;-)

--
Ralf Ramge
Re: [zfs-discuss] Inconsistent df and du output ?
Juris Krumins wrote:
> The lun.0 file, which is at least 20 Gb big, resides on /export/storage. Why does df show only 4.9 GB?
>
> ---
> -bash-3.2# ls -la
> total 4194871
> drwxr-xr-x   2 root     sys            3 Sep 18 17:28 .
> drwxr-xr-x   5 root     root           8 Sep 18 17:44 ..
> -rw-------   1 root     sys   2147483648 Sep 18 17:42 lun.0
> ---

Let's make the total file size a bit more human-readable: 2,147,483,648 bytes. That's 2 GB, not 20. Try `ls -alh`.

Concerning df:

---
-bash-3.2# df -h
Filesystem             size   used  avail capacity  Mounted on
[...]
rpool/export            65G   4.8G    54G     9%    /export
[...]
---

Looks good to me. Or did I miss something and misunderstand you?

--
Ralf Ramge
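The arithmetic can be checked directly in the shell; a trivial illustration, not specific to this system:

---
# 2147483648 bytes expressed in whole GiB
$ echo $((2147483648 / 1024 / 1024 / 1024))
2
---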
Re: [zfs-discuss] x4500 vs AVS ?
Jorgen Lundman wrote:
> If we were interested in finding a method to replicate data to a 2nd x4500, what other options are there for us?

If you already have an X4500, I think the best option for you is a cron job with incremental 'zfs send'. Or rsync.

--
Ralf Ramge
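A minimal sketch of the incremental 'zfs send' approach; the pool name, the snapshot names and the host "thumper2" are placeholders, and the receiving side must already hold the previous snapshot:

---
# take today's snapshot and ship only the delta since yesterday's
zfs snapshot tank/data@2008-09-23
zfs send -i tank/data@2008-09-22 tank/data@2008-09-23 | \
    ssh thumper2 zfs receive -F tank/data
---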
Re: [zfs-discuss] [storage-discuss] A few questions
gm_sjo wrote:
> Are you not in fact losing performance by reducing the amount of spindles used for a given pool?

This depends. RAIDZ1/2 usually doesn't perform well when it comes to random-access read I/O, for instance. If I wanted to scale performance by adding spindles, I would use mirrors (RAID 10). If you want to scale filesystem sizes, RAIDZ is your friend.

I once had the problem that I needed high random I/O performance and at least an 11 TB filesystem on an X4500. Mirroring was out of the question (not enough disk space left), and RAIDZ gave me only about 25% of the performance of the existing Linux ext2 boxes I had to compete with. But in the end, striping 13 RAIDZ sets of 3 drives each plus 1 hot spare delivered acceptable results in both categories. It took me a lot of benchmarks to get there, though.

--
Ralf Ramge
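To make the two layouts concrete, here are hedged sketches of both pool types; the pool and device names are illustrative and not taken from the post:

---
# scale random I/O by adding spindles: a stripe of 2-way mirrors (RAID 10)
zpool create fastpool mirror c0t0d0 c1t0d0 mirror c0t1d0 c1t1d0 mirror c0t2d0 c1t2d0

# scale capacity instead: a stripe of small RAIDZ sets plus a hot spare
zpool create bigpool raidz c0t0d0 c1t0d0 c4t0d0 \
                     raidz c0t1d0 c1t1d0 c4t1d0 \
                     spare c7t7d0
---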
Re: [zfs-discuss] x4500 vs AVS ?
Matt Beebe wrote:
> But what happens to the secondary server? Specifically to its bit-for-bit copy of drive #2... presumably it is still good, but ZFS will offline that disk on the primary server, replicate the metadata, and when/if I promote the secondary server, it will also be running in a degraded state (i.e. 3 out of 4 drives). Correct?

Correct.

> In this scenario, my replication hasn't really bought me any increased availability... or am I missing something?

No. You gain availability when the entire primary node goes down, but you're not particularly safer when it comes to degraded zpools.

> Also, if I do choose to fail over to the secondary, can I just do a scrub of the "broken" drive (which isn't really broken, but the zpool would be inconsistent at some level with the other online drives) and get back to full speed quickly? Or will I always have to wait until one of the servers resilvers itself (from scratch?) and re-replicates itself?

I have not tested this scenario, so I can't say anything about it.

--
Ralf Ramge
Re: [zfs-discuss] x4500 vs AVS ?
Richard Elling wrote:
>> Yes, you're right. But sadly, in the mentioned scenario of having replaced an entire drive, the entire disk is rewritten by ZFS.
>
> No, this is not true. ZFS only resilvers data.

Okay, I see we have a communication problem here. Probably my fault; I should have written "the entire data and metadata". I made the assumption that a 1 TB drive in an X4500 may have up to 1 TB of data on it, simply because nobody buys the 1 TB X4500 just to use 10% of the disk space - he would have bought the 250 GB, 500 GB or 750 GB model instead. In any case, and for any disk size, that's something you don't want to have on your network if there's a chance to avoid it.

--
Ralf Ramge
Re: [zfs-discuss] x4500 vs AVS ?
[...] of trying to start a flame war. From now on, I leave the rest to you, because I earn my living with products of Sun Microsystems, too, and I don't want to damage either Sun or this mailing list.

--
Ralf Ramge
Re: [zfs-discuss] x4500 vs AVS ?
[EMAIL PROTECTED] wrote:
> War wounds? Could you please expand on the why a bit more?

- ZFS is not aware of AVS. On the secondary node you always have to force the `zpool import` because of the unnoticed changes to the metadata ("zpool in use"). No mechanism to prevent data loss exists; for example, zpools can be imported while the replicator is *not* in logging mode (see the sketch after this list).

- AVS is not ZFS-aware. For instance, if ZFS resilvers a mirrored disk, e.g. after replacing a drive, the complete disk is sent over the network to the secondary node, even though the replicated data on the secondary is intact. That's a lot of fun with today's disk sizes of 750 GB and 1 TB, resulting in usually 10+ hours without real redundancy (customers who use Thumpers to store important data usually don't have the budget to connect their data centers with 10 Gbit/s, so expect 10+ hours *per disk*).

- ZFS + AVS + X4500 leads to bad error handling. The zpool must not be imported on the secondary node during replication, and the X4500 does not have a RAID controller which signals (and handles) drive faults. Drive failures on the secondary node may go unnoticed until the primary node goes down and you want to import the zpool on the secondary node with the broken drive. Since ZFS doesn't offer a recovery mechanism like fsck, data loss of up to 20 TB may occur. If you use AVS with ZFS, make sure you have storage which handles drive failures without OS interaction.

- 5 hours for scrubbing a 1 TB drive, if you're lucky. Up to 48 drives in total.

- An X4500 has no battery-buffered write cache; ZFS uses the server's RAM as a cache, 15 GB+. I don't want to find out how much time a resilver over the network after a power outage may take (a full reverse replication would take up to 2 weeks and is no valid option in a serious production environment). But the underlying question I asked myself is why I should want to replicate data in such an expensive way when I consider the 48 TB of data itself not important enough to be protected by a battery.

- I gave AVS a set of 6 drives just for the bitmaps (using SVM soft partitions). That wasn't enough; the replication was still very slow, probably because of an insane amount of head movement, and it scales badly. Putting the bitmap of a drive on the drive itself (if I remember correctly, this is recommended in one of the most referenced howto blog articles) is a bad idea. Always use ZFS on whole disks if performance and caching matter to you.

- AVS seems to require additional shared storage when building failover clusters with 48 TB of internal storage. That may be hard to explain to the customer. But I'm not 100% sure about this, because I just didn't find a way and didn't ask on a mailing list for help.

If you want a fail-over solution for important data, use the external JBODs. Use AVS only to mirror complete clusters; don't use it to replicate single boxes with local drives. And, in case OpenSolaris is not an option for you due to your company policies or support contracts, building a real cluster is also A LOT cheaper.

--
Ralf Ramge
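A rough sketch of what the forced takeover on the secondary looks like in practice; the set/group arguments to sndradm are omitted here and depend entirely on your configuration, so treat this as an outline rather than a recipe:

---
# on the secondary node: drop the SNDR set(s) into logging mode first,
# so no replication I/O hits the disks while the pool is being imported
sndradm -n -l
# the import always has to be forced, because the pool was never exported cleanly
zpool import -f tank
---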
Re: [zfs-discuss] x4500 vs AVS ?
Jorgen Lundman wrote:
> We did ask our vendor, but we were just told that AVS does not support x4500.

The officially supported AVS has worked on the X4500 since the X4500 came out. But, although Jim Dunham and others will tell you otherwise, I absolutely can *not* recommend using it on this hardware with ZFS, especially with the larger disk sizes. At least not for important or even business-critical data - in such a case, using X41x0 servers with J4500 JBODs and an HAStoragePlus cluster instead of AVS may be a much better and more reliable option, for basically the same price.

--
Ralf Ramge
Re: [zfs-discuss] x4500 vs AVS ?
Brent Jones wrote:
> I did some Googling, but I saw some limitations sharing your ZFS pool via NFS while using the HAStorage Cluster product as well. [...] If you are using the zettabyte file system (ZFS) as the exported file system, you must set the sharenfs property to off.

That's not a limitation, it just looks like one. The cluster's resource type called SUNW.nfs decides whether a file system is shared or not, and it does this with the usual share and unshare commands in a separate dfstab file. The ZFS sharenfs flag is set to off to avoid conflicts.

--
Ralf Ramge
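In practice that boils down to something like the following; the dataset name and share path are placeholders, and the exact location of the SUNW.nfs dfstab file depends on how the resource was configured:

---
# keep ZFS itself out of the sharing business; SUNW.nfs handles share/unshare
zfs set sharenfs=off shares/production

# corresponding entry in the SUNW.nfs dfstab file for the resource
share -F nfs -o rw /shares/production
---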
Re: [zfs-discuss] ZFS hangs/freezes after disk failure,
[...] the percentage of pain during a disaster by spending more money, e.g. by making the SATA controllers redundant and creating a mirror (then controller 1 will hang, but controller 2 will continue working), but you must not forget that your PCI bridges, fans, power supplies, etc. remain single points of failure which can take the entire service down, just like your pulling of the non-hotpluggable drive did.

c) If you want both, you should buy a second server and create an NFS cluster.

Hope I could help you a bit,

Ralf
Re: [zfs-discuss] ZFS hangs/freezes after disk failure,
Ralf Ramge wrote:
> [...]

Oh, and please excuse the grammar mistakes and typos. I'm in a hurry, not a retard ;-) At least I think so.

--
Ralf Ramge
Re: [zfs-discuss] Performance with Sun StorageTek 2540
Mertol Ozyoney wrote:
> The 2540 controller can achieve a maximum of 250 MB/sec on writes on the first 12 drives. So you are pretty close to maximum throughput already. RAID 5 can be a little bit slower.

I'm a bit irritated now. I have ZFS running for some Sybase ASE 12.5 databases using X4600 servers (8x dual core, 64 GB RAM, Solaris 10 11/06) and 4 GBit/s lowest-cost Infortrend FibreChannel JBODs with a total of 4x 16 FC drives, imported in a single mirrored zpool. I benchmarked them with tiobench, using a file size of 64 GB and 32 parallel threads. With an untweaked ZFS, the average throughput I got was: sequential/random read 1 GB/s, sequential write 296 MB/s, random write 353 MB/s, leading to a total of approx. 650,000 IOPS with a maximum latency of 350 ms after the databases went into production; the bottleneck is basically the FC HBAs. These are averages; the peaks flatline as soon as they reach the 4 GBit/s FibreChannel maximum capacity.

I'm a bit disturbed because I'm thinking about switching to 2530/2540 shelves, and a maximum of 250 MB/sec would disqualify them instantly, even with individual RAID controllers for each shelf. So my question is: can I do the same thing I did with the IFT shelves, can I buy only 2501 JBODs and attach them directly to the server, thus *not* using the 2540 RAID controller and still having access to the single drives? I'm quite nervous about this, because I'm not just talking about a single database - I'd need a total of 42 shelves, and I'm pretty sure Sun doesn't offer Try & Buy deals at such a scale.

--
Ralf Ramge
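For anyone trying to reproduce numbers like these, the quickest sanity check is to watch the pool and the devices while the benchmark runs; generic commands, not tied to the hardware above:

---
# aggregate and per-vdev throughput, sampled every 5 seconds
zpool iostat -v 5
# per-device service times, queue depths and errors
iostat -xnz 5
---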
Re: [zfs-discuss] Avoiding performance decrease when pool usage is over 80%
Thomas Liesner wrote:
> Nobody out there who ever had problems with low diskspace?

Okay, I found your original mail :-) Quotas are applied to file systems, not pools, and as such are pretty independent of the pool size. I found it best to give every user his/her own filesystem and apply individual quotas afterwards.

--
Ralf Ramge
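A minimal sketch of that per-user setup; pool and user names are placeholders:

---
# one file system per user, each with its own quota
zfs create tank/home/alice
zfs set quota=6G tank/home/alice
---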
Re: [zfs-discuss] Avoiding performance decrease when pool usage is over 80%
Thomas Liesner wrote:
> Does this mean that if I have a pool of 7 TB with one filesystem for all users with a quota of 6 TB, I'd be alright?

Yep. Although I *really* recommend creating individual file systems: e.g. if you have 1,000 users on your server, I'd create 1,000 file systems with a quota of 6 GB each. Easier to handle, more flexible to use, easier to back up; it allows better use of snapshots and it's easier to migrate single users to other servers.

> The usage of that fs would never be over 80%, right?

Nope. Don't mix up pools and file systems: your pool of 7 TB will only be filled to a maximum of 6 TB, but the file system will be 100% full - which shouldn't impact your overall performance. Like in the following example for the pool "shares" with a pool size of 228G and one fs with a quota of 100G:

---
shares              228G    28K   220G     1%    /shares
shares/production   100G   8,4G    92G     9%    /shares/production
---

> This would suit me perfectly, as this would be exactly what I wanted to do ;)

Yep, you got it.

--
Ralf Ramge
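If you do go the one-filesystem-per-user route, the setup is easily scripted; a hypothetical sketch with made-up user names and pool layout:

---
# create a file system plus quota for each user
for user in alice bob carol; do
    zfs create shares/home/$user
    zfs set quota=6G shares/home/$user
done
---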
Re: [zfs-discuss] X4500 ILOM thinks disk 20 is faulted, ZFS thinks not.
Jason J. W. Williams wrote:
> Have any of y'all seen a condition where the ILOM considers a disk faulted (status is 3 instead of 1), but ZFS keeps writing to the disk and doesn't report any errors? I'm going to do a scrub tomorrow and see what comes back. I'm curious what caused the ILOM to fault the disk. Any advice is greatly appreciated.

What does `iostat -E` tell you? I've experienced several times that ZFS is very fault-tolerant - a bit too tolerant for my taste - when it comes to faulting a disk. I've seen external FC drives with hundreds or even thousands of errors, even entire hanging loops or drives with hardware trouble, and neither ZFS nor /var/adm/messages reported a problem. So I prefer examining the iostat output over `zpool status` - but with the unattractive side effect that it's not possible to reset the error count which iostat reports without a reboot, so this method is not suitable for monitoring purposes.

--
Ralf Ramge
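For reference, the two views I would compare side by side; nothing here is specific to the X4500:

---
# cumulative soft/hard/transport error counters per device since boot
iostat -En
# what ZFS itself thinks about the pool and its devices
zpool status -v
---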
Re: [zfs-discuss] I/O freeze after a disk failure
Gino wrote:
> [...] Just a few examples:
>
> - We lost several zpools with S10U3 because of the spacemap bug, and -nothing- was recoverable. No fsck here :(

Yes, I criticized the lack of zpool recovery mechanisms, too, during my AVS testing. But I don't have the know-how to judge whether there are technical reasons for it.

> - We had tons of kernel panics because of ZFS. Here a reboot must be planned a couple of weeks in advance and done only on Saturday night ..

Well, I'm sorry, but if your datacenter runs into problems when a single server isn't available, you probably have much worse problems. ZFS is a file system. It's not a substitute for hardware trouble or a misplanned infrastructure. What would you do if you had the fsck you mentioned earlier? Or with another file system like UFS, ext3, whatever? Boot a system into single-user mode and fsck several terabytes, after planning it a couple of weeks in advance?

> - Our 9900V and HP EVAs work really BAD with ZFS because of the large cache. (echo zfs_nocacheflush/W 1 | mdb -kw) did not solve the problem. Only helped a bit.

Use JBODs. Or tell the cache controllers to ignore the flushing requests. Should be possible; even the $10k low-cost StorageTek arrays support this.

> - ZFS performs badly with a lot of small files. (about 20 times slower than UFS with our millions-of-files rsync procedures)

I have large Sybase database servers and file servers with billions of inodes running on ZFSv3. They are attached to X4600 boxes running Solaris 10 U3, 2x 4 GBit/s dual FibreChannel, using dumb and cheap Infortrend FC JBODs (2 GBit/s) as storage shelves. All my benchmarks (both on the command line and within applications) show that the FibreChannel is the bottleneck, even with random reads. ZFS doesn't do this out of the box, but a bit of tuning helped a lot.

> - ZFS+FC JBOD: a failed hard disk needs a reboot :( (frankly unbelievable in 2007!)

No. Read the thread carefully. It was mentioned that you don't have to reboot the server; all you need to do is pull the hard disk. Shouldn't be a problem, except if you don't want to replace the faulty one anyway. No other manual operations will be necessary, except for the final `zpool replace`. You could also try cfgadm to get rid of ZFS pool problems, perhaps it works - I'm not sure about this, because I had the idea *after* I solved that problem, but I'll give it a try someday.

> Anyway we happily use ZFS on our new backup systems (snapshotting with ZFS is amazing), but to tell you the truth we are keeping 2 large zpools in sync on each system because we fear another zpool corruption.

May I ask how you accomplish that? And why are you doing this? You should replicate your zpool to another host instead of mirroring locally. Where's your redundancy in that?

--
Ralf Ramge
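The mdb one-liner quoted above only patches the running kernel; if you decide to go that route, the setting is usually made persistent in /etc/system. Shown here as a sketch - the tunable name is, to my knowledge, the standard one, but verify it against your Solaris release before relying on it:

---
* /etc/system: tell ZFS not to send cache-flush requests to the array.
* Only safe when the array has a battery-backed (non-volatile) write cache.
set zfs:zfs_nocacheflush = 1
---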
Re: [zfs-discuss] I/O freeze after a disk failure
Gino wrote:
> The real problem is that ZFS should stop forcing kernel panics.

I found these panics very annoying, too - and even more so that the zpool was faulted afterwards. But my problem is that when someone asks me what ZFS should do instead, I have no idea.

>> I have large Sybase database servers and file servers with billions of inodes running on ZFSv3. They are attached to X4600 boxes running Solaris 10 U3, 2x 4 GBit/s dual FibreChannel, using dumb and cheap Infortrend FC JBODs (2 GBit/s) as storage shelves.
>
> Are you using FATA drives?

Seagate FibreChannel drives, Cheetah 15k, ST3146855FC, for the databases. For the NFS filers we use Infortrend FC shelves with SATA inside.

>> All my benchmarks (both on the command line and within applications) show that the FibreChannel is the bottleneck, even with random reads. ZFS doesn't do this out of the box, but a bit of tuning helped a lot.
>
> You found another good point. I think that with ZFS and JBOD, FC links will soon be the bottleneck. What tuning have you done?

That depends on the individual requirements of each service. Basically, we change the recordsize according to the transaction size of the databases; on the filers, the performance results were best when the recordsize was a bit lower than the average file size (the average file size is 12K, so I set a recordsize of 8K). I set a vdev cache size of 8K, and our databases worked best with a vq_max_pending of 32. ZFSv3 was used, that's the version which is shipped with Solaris 10 11/06.

> It is a problem if your apps hang waiting for you to power down/pull out the drive! Almost in a time=money environment :)

Yes, but why doesn't your application fail over to a standby? I'm also working in a "time is money and failure is not an option" environment, and I doubt I would sleep better if I were responsible for an application under such a service level agreement without full high availability. If a system reboot can be a single point of failure, what about the network infrastructure? Hardware errors? Or power outages? I'm definitely NOT some kind of know-it-all, don't misunderstand me. Your statement just set my alarm bells ringing, and that's why I'm asking.

--
Ralf Ramge
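A hedged sketch of the tuning mentioned above; the dataset name is a placeholder, and I'm assuming that the "vq_max_pending of 32" from the post maps to the zfs_vdev_max_pending tunable of that era - check the name against your release:

---
# match the recordsize to the database transaction size (set before loading data)
zfs set recordsize=8K tank/sybase

* /etc/system: per-vdev queue depth, the assumed equivalent of vq_max_pending
set zfs:zfs_vdev_max_pending = 32
---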
Re: [zfs-discuss] Mirrored zpool across network
[...] degraded state may potentially last longer than a weekend, and when you're directly responsible for the mail of millions of users and you know that any non-availability will place your name on Slashdot (or the name of your CEO, which equals placing your head on a scaffold), I'm sure you'll think twice about using ZFS with AVS or letting the Linux dudes continue to play with their inefficient boxes :-)

> But if a disaster happened on the primary node, and a decision was made to ZFS import the storage pool on the secondary, ZFS will detect the inconsistency, mark the drive as failed, swap in the secondary HSP disk. Later, when the primary site comes back, and a reverse synchronization is done to restore writes that happened on the secondary, the primary ZFS file system will become aware that a HSP swap occurred, and continue on right where the secondary node left off.

I'll try that as soon as I have a chance again (which means: as soon as Sun gets the Sun Cluster working on an X4500).

>> c) You *must* force every single `zpool import` on the secondary host. Always.
>
> Correct, but this is the case even without AVS! If one configured ZFS on SAN-based storage and your primary node crashed, one would need to force every single `zpool import`. This is not an AVS issue, but a ZFS protection.

Right. Too bad ZFS reacts this way. I have to admit that you made me nervous once, when you wrote that forcing zpool imports would be a bad idea ...

[X] Zfsck now! Let's organize a petition. :-)

> Correct, but this is the case even without AVS! Take the same SAN-based storage scenario above, go to a secondary system on your SAN, and force every single `zpool import`.

Yes, but on a SAN I don't have to worry about zpool inconsistency, because the zpool always resides on the same devices.

> In the case of a SAN, where the same physical disk would be written to by both hosts, you would likely get complete data loss, but with AVS, where ZFS is actually on two physical disks, and AVS is tracking writes, even if they are inconsistent writes, AVS can and will recover if an update sync is done.

My problem is that there's no ZFS mechanism which allows me to verify the zpool consistency before I actually try to import it. Like I said before: AVS does it right, it's just ZFS that doesn't (and otherwise it wouldn't make sense to discuss it on this mailing list anyway :-) ). It would really help me with AVS if there was something like `zpool check <zpool>`, something for checking a zpool before an import - see the sketch after this post. I could run a cron job which puts the secondary host into logging mode, runs a zpool check and continues with the replication a few hours afterwards. It would let me sleep better, and I wouldn't have to pray to the IT gods before an import. You know, I saw literally *hundreds* of kernel panics during my tests; that made me nervous. I have scripts which do the job now, but I saw the risks and the things which can go wrong if someone without my experience does it (like the infamous forgetting to manually place the secondary in logging mode before trying to import a zpool).

> You are quite correct in that although ZFS is intuitively easy to use, AVS is painfully complex. Of course the mindsets of AVS and ZFS are as far apart as they are in the alphabet.

:-O AVS was easy to learn and isn't very difficult to work with. All you need is 1 or 2 months of testing experience. Very easy with UFS.

> With AVS in Nevada, there is now an opportunity for leveraging the ease of use of ZFS, with AVS. Being also the iSCSI Target project lead, I see a lot of value in the ZFS option "set shareiscsi=on", to get end users into using iSCSI.

Too bad the X4500 has too few PCI slots to consider buying iSCSI cards. The two existing slots are already needed for the Sun Cluster interconnect. I think iSCSI won't be a real option unless the servers are shipped with it onboard, like it has been done in the past with the SCSI or ethernet interfaces.

> I would like to see "set replication=AVS:<secondary host>", configuring a locally named ZFS storage pool to the same named pair on some remote host. Starting down this path would afford things like ZFS replication monitoring, similar to what ZFS does with each of its own vdevs.

Yes! Jim, I think we'll become friends :-) Who do I have to send the bribe money to?

--
Ralf Ramge
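There is at least a partial stand-in for the wished-for "zpool check": running the import command without a pool name only scans the devices and reports which pools it would import and in which state, without actually importing anything. A generic sketch, meant to be run on the secondary while the SNDR sets are in logging mode:

---
# list importable pools and the health ZFS would assign to them
zpool import
# optionally restrict the scan to a specific device directory
zpool import -d /dev/dsk
---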
Re: [zfs-discuss] Is ZFS efficient for large collections of small files?
Brandorr wrote:
> Is ZFS efficient at handling huge populations of tiny-to-small files - for example, 20 million TIFF images in a collection, each between 5 and 500k in size? I am asking because I could have sworn that I read somewhere that it isn't, but I can't find the reference.

If you're worried about I/O throughput, you should avoid RAIDZ1/2 configurations. Random read performance will be disastrous if you do; I've seen random read rates of less than 1 MB/s on an X4500 with 40 dedicated data disks. If you don't have to worry about disk space, use mirrors; I got my best results during my extensive X4500 benchmarking sessions when I mirrored single slices instead of complete disks (resulting in 40 two-way mirrors on 40 physical disks, mirroring c0t0d0s0-c0t1d0s1 and c0t1d0s0-c0t0d0s1, and so on). If you're worried about disk space, you should consider striping several RAIDZ1 arrays, each one consisting of three disks or slices. Sequential access will go off a cliff, but random reads will be boosted.

You should also adjust the recordsize. Try to measure the average I/O transaction size. There's a good chance that your I/O performance will be best if you set your recordsize to a smaller value. For instance, if your average file size is 12 KB, try an 8K or even 4K recordsize; stay away from 16K or higher.

--
Ralf Ramge
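As a concrete illustration of both suggestions: the pool name is made up, the slice pattern follows the post, and the recordsize value only makes sense if your own measurements point that way:

---
# cross-mirrored slices as described above (first two disks shown, pattern repeats)
zpool create imgpool mirror c0t0d0s0 c0t1d0s1 mirror c0t1d0s0 c0t0d0s1
# set the recordsize before loading the image collection
zfs set recordsize=8K imgpool
zfs create imgpool/tiff
---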
Re: [zfs-discuss] Mirrored zpool across network
Torrey McMahon wrote:
> AVS?

Jim Dunham will probably shoot me, or worse, but I recommend thinking twice about using AVS for ZFS replication. Basically, you only have a few options:

1) using a battery-buffered hardware RAID controller, which leads to bad ZFS performance in many cases,
2) building three-way mirrors to avoid complete data loss in several disaster scenarios, due to the missing ZFS recovery mechanisms like `fsck`, which makes AVS/ZFS-based solutions quite expensive, or
3) additionally using another form of backup, e.g. tapes.

For instance, one scenario which made me think: imagine you have an X4500. 48 internal disks, 500 GB each. This leads to a ZFS pool on 40 disks (you need 1 for the system, plus 3x RAID 10 for the bitmap volumes - otherwise your performance will be very bad - plus 2x HSP). Using 40 disks leads to a total of 40 separate replications. Now imagine the following scenarios:

a) A disk in the primary fails. What happens? An HSP jumps in and 500 GB will be rebuilt. These 500 GB are synced over a single 1 GBit/s crossover cable. This takes a bit of time and is 100% unnecessary - and it will become much worse in the future, because disk capacities rocket up into the sky while performance isn't improved as much. During this time, your service lacks redundancy, and we're not talking about a few minutes. Well, now try to imagine what will happen if another disk fails during this rebuild, this time in the secondary ...

b) A disk in the secondary fails. What happens now? No HSP will jump in on the secondary, because the zpool isn't imported there and ZFS doesn't know about the failure. Instead, you'll end up with 39 active replications instead of 40; the one which replicates to the failed drive becomes inactive. But ... oh damn, the zpool isn't mounted on the secondary host, so ZFS doesn't report the drive failure to our server monitoring. That can be funny. The only way to become aware of the problem I found after a minute of thinking was asking sndradm about the health status (see the sketch after this post) - which would lead to a misleading alarm on Host A, because the failed disk is in Host B, and operators are usually not bright enough to change the disk in Host B after they get notified about a problem on Host A. But even if everything works: what happens if the primary fails before an administrator has fixed the problem, the missing replication is running again and the replacement disk has been completely synced? Hello, kernel panic, and goodbye, 12 TB of data.

c) You *must* force every single `zpool import` on the secondary host. Always. Because you usually need your secondary host after your primary crashed, you won't have the chance to export your zpool on the primary first - and if you do, you don't need AVS at all. Bring some Kleenex to get rid of the sweat on your forehead when you have to switch to your secondary host, because a single mistake (like forgetting to put the secondary host into logging mode manually before you try to import the zpool) will lead to complete data loss. I bet you won't even trust your own failover scripts.

You can use AVS and ZFS together - I do it myself. But I made sure that I know what I'm doing. Most people probably don't.

Btw: I have to admit that I haven't tried the newest Nevada builds during the tests. It's possible that AVS and ZFS work better together than they did under Solaris 10 11/06 and AVS 4.0. But there's a reason I haven't tried: Sun Cluster 3.2 instantly crashes on Thumpers with SATA-related kernel panics, and the OpenHA Cluster isn't available yet.

--
Ralf Ramge
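For scenario b), the health check the post alludes to could look roughly like this; both commands are part of AVS, but the exact output format varies between releases, so treat the sketch as an outline for a monitoring cron job rather than a finished check:

---
# detailed status of all SNDR sets; anything not "replicating" deserves a look
sndradm -P
# I/O and sync statistics for the remote-mirror sets
dsstat -m sndr
---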
Re: [zfs-discuss] ? ZFS dynamic striping over RAID-Z
Tim Thomas wrote:
> If I create a storage pool with multiple RAID-Z stripes in it, does ZFS dynamically stripe data across all the RAID-Z stripes in the pool automagically? If I relate this back to my storage array experience, this would be "plaiding", which is/was creating a RAID-0 logical volume across multiple h/ware RAID-5 stripes.

I did this one week ago, while trying to get at least a bit of random read performance out of an X4500. Normal RAIDZ(2) performance was between 0.8 and 5 MB/s, which was way too slow for our needs, so I used striped RAIDZ to get a boost.

My test configuration:

- c5t0/t4: system (mirrored)
- c5t1/t5 - c5t3/t7: AVS bitmap volumes (mirrored)

This left 40 disks. I created 13 RAIDZ sets with 3 disks each, that's 39 disks in total. The script I used is appended below. And yes, it results in striped RAIDZ arrays (I call it RAIDZ0), and my data throughput was 13 times higher, as expected. Hope this helps you a bit.

---
#!/bin/sh
/usr/sbin/zpool create -f big raidz c0t0d0s0 c1t0d0s0 c4t0d0s0 spare c7t7d0s0
/usr/sbin/zpool add -f big raidz c6t0d0s0 c7t0d0s0 c0t1d0s0
/usr/sbin/zpool add -f big raidz c1t1d0s0 c4t1d0s0 c6t1d0s0
/usr/sbin/zpool add -f big raidz c7t1d0s0 c0t2d0s0 c1t2d0s0
/usr/sbin/zpool add -f big raidz c4t2d0s0 c6t2d0s0 c7t2d0s0
/usr/sbin/zpool add -f big raidz c0t3d0s0 c1t3d0s0 c4t3d0s0
/usr/sbin/zpool add -f big raidz c6t3d0s0 c7t3d0s0 c0t4d0s0
/usr/sbin/zpool add -f big raidz c1t4d0s0 c4t4d0s0 c6t4d0s0
/usr/sbin/zpool add -f big raidz c7t4d0s0 c0t5d0s0 c1t5d0s0
/usr/sbin/zpool add -f big raidz c4t5d0s0 c6t5d0s0 c7t5d0s0
/usr/sbin/zpool add -f big raidz c0t6d0s0 c1t6d0s0 c4t6d0s0
/usr/sbin/zpool add -f big raidz c6t6d0s0 c7t6d0s0 c0t7d0s0
/usr/sbin/zpool add -f big raidz c1t7d0s0 c4t7d0s0 c6t7d0s0
/usr/sbin/zpool status
---

--
Ralf Ramge
Re: [zfs-discuss] [AVS] Question concerning reverse synchronization of a zpool
Ralf Ramge wrote:
> Questions:
>
> a) I don't understand why the kernel panics at the moment. The zpool isn't mounted on both systems, the zpool itself seems to be fine after a reboot ... and switching the primary and secondary hosts just for resyncing seems to force a full sync, which isn't an option.
>
> b) I'll try a sndradm -m -r the next time ... but I'm not sure if I like that thought. I would accept this if I replaced the primary host with another server, but having to do a 24 TB full sync just because the replication itself had been disabled for a few minutes would be hard to swallow. Or did I do something wrong?

I've answered these questions myself in the meantime (with a nice employee of Sun Hamburg giving me the hint). For Google: during a reverse sync, neither side of the replication is allowed to have the zpool imported, because after the reverse sync finishes, SNDR enters replication mode. This renders reverse syncs useless for HA scenarios; switch primary and secondary instead.

> c) What performance can I expect from an X4500 with a 40-disk zpool when using slices, compared to LUNs? Any experiences?

Any input on this question will still be appreciated :-)

--
Ralf Ramge
[zfs-discuss] [AVS] Question concerning reverse synchronization of a zpool
Hi,

I'm struggling to get a stable ZFS replication using Solaris 10 11/06 (current patches) and AVS 4.0 for several weeks now. We tried it on VMware first and ended up in kernel panics en masse (yes, we read Jim Dunham's blog articles :-). Now we try on the real thing, two X4500 servers. Well, I have no trouble replicating our kernel panics there, too ... but I think I learned some important things as well. One problem is still remaining, though.

I have a zpool on host A. Replication to host B works fine.

* zpool export tank on the primary - works.
* sndradm -d on both servers - works (paranoia mode).
* zpool import <id> on the secondary - works.

So far, so good. I change the contents of the file system, add some files, delete some others ... no problems. The secondary is in production use now, everything is fine. Okay, let's imagine I switched to the secondary host because I had a problem with the primary. Now it's repaired and I want my redundancy back.

* sndradm -E -f on both hosts - works.
* sndradm -u -r on the primary for refreshing the primary - works. `nicstat` shows me a bit of traffic.

Good, let's switch back to the primary. Actual status: the zpool is imported on the secondary and NOT imported on the primary.

* zpool export tank on the secondary - *kernel panic*

Sadly, the machine dies fast; I don't see the kernel panic with `dmesg`. And disabling the replication again later and mounting the zpool on the primary again shows me that the update sync didn't take place: the file system changes I made on the secondary weren't replicated. Exporting the zpool on the secondary works *after* the system has rebooted. I use slices for the zpool, not LUNs, because I think many of my problems were caused by exclusive locking, but it doesn't help with this one.

Questions:

a) I don't understand why the kernel panics at the moment. The zpool isn't mounted on both systems, the zpool itself seems to be fine after a reboot ... and switching the primary and secondary hosts just for resyncing seems to force a full sync, which isn't an option.

b) I'll try a sndradm -m -r the next time ... but I'm not sure if I like that thought. I would accept this if I replaced the primary host with another server, but having to do a 24 TB full sync just because the replication itself had been disabled for a few minutes would be hard to swallow. Or did I do something wrong?

c) What performance can I expect from an X4500 with a 40-disk zpool when using slices, compared to LUNs? Any experiences?

And another thing: I did some experiments with zvols, because I wanted to make disasters and the AVS configuration itself easier to handle - there won't be a full sync after replacing a disk, because AVS doesn't see that a hot spare is being used, and hot spares won't be replicated to the secondary host although the original drive on the secondary never failed. I used the zvols with UFS, and this kind of hardware-RAID-controller emulation by ZFS works pretty well, but the performance went off a cliff. Sunsolve told me that this is a flushing problem and that there's a workaround in Nevada build 53 and higher. Has somebody done a comparison, can you share some experiences? I only have a few days left and I don't want to waste time installing Nevada for nothing ...

Thanks,

Ralf
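Since the box dies too fast to catch anything with `dmesg`, the panic message can usually be recovered from the crash dump after the reboot. Standard Solaris tooling; the unix.0/vmcore.0 file names depend on how many dumps have already been saved:

---
# check that crash dumps are enabled and where savecore writes them
dumpadm
# after the reboot, save the dump if the savecore service didn't do it already
savecore
# read the panic string and the kernel message buffer from the saved dump
cd /var/crash/`hostname`
echo "::panicinfo" | mdb unix.0 vmcore.0
echo "::msgbuf"    | mdb unix.0 vmcore.0
---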