[zfs-discuss] ZFS Hard link space savings
Hi All,

I have an interesting question that may or may not be answerable from some internal ZFS semantics. I have a Sun Messaging Server which has 5 ZFS based email stores. The Sun Messaging Server uses hard links to link identical messages together. Messages are stored in standard SMTP MIME format, so the binary attachments are included in the message ASCII. Each individual message is stored in a separate file.

So as an example, if a user sends an email with a 2MB attachment to the staff mailing list and there are 3 staff stores with 500 users on each, it will generate space usage like:

/store1 = 1 x 2MB + 499 x 1KB
/store2 = 1 x 2MB + 499 x 1KB
/store3 = 1 x 2MB + 499 x 1KB

So total storage used is around ~7.5MB due to the hard linking taking place on each store. If hard linking capability had been turned off, this same message would have used 1500 x 2MB = 3GB worth of storage.

My question is: are there any simple ways of determining the space savings on each of the stores from the usage of hard links? The reason I ask is that our educational institute wishes to migrate these stores to M$ Exchange 2010, which doesn't do message single instancing. I need to try and project what the storage requirement will be on the new target environment.

If anyone has any ideas, be it ZFS based or any useful scripts that could help here, I am all ears. I may post this to Sun Managers as well to see if anyone there might have any ideas on this.

Regards,

Scott.
Re: [zfs-discuss] ZFS Hard link space savings
On 13/06/11 10:28 AM, Nico Williams wrote:

On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson scott.law...@manukau.ac.nz wrote:

I have an interesting question that may or may not be answerable from some internal ZFS semantics.

This is really standard Unix filesystem semantics.

I understand this, just wanting to see if there is any easy way before I trawl through 10 million little files.. ;)

[... original space-savings question quoted ...]

But... you just did! :) It's: number of hard links * (file size + sum(size of link names and/or directory slot size)). For sufficiently large files (say, larger than one disk block) you could approximate that as: number of hard links * file size. The key is the number of hard links, which will typically vary, but for e-mails that go to all users, well, you know the number of links then is the number of users.

Yes, this number varies based on the number of recipients, so could be as many as...

You could write a script to do this -- just look at the size and hard-link count of every file in the store, apply the above formula, add up the inflated sizes, and you're done.

Looks like I will have to; just looking for a tried and tested method before I have to create my own one, if possible. Just was looking for an easy option before I have to sit down and develop and test a script. I have resigned from my current job of 9 years and finish in 15 days, and have a heck of a lot of documentation and knowledge transfer I need to do around other UNIX systems, and am running very short on time...

Nico

PS: Is it really the case that Exchange still doesn't deduplicate e-mails? Really? It's much simpler to implement dedup in a mail store than in a filesystem...

As a side note, Exchange 2002 + Exchange 2007 do do this. But apparently M$ decided in Exchange 2010 that they no longer wished to do this and dropped the capability. Bizarre to say the least, but it may come down to what they have done in the underlying store technology changes..
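[For reference, Nico's formula translates fairly directly into a one-pipeline prototype. A minimal sketch, assuming Solaris find(1)/awk and that all links to a given inode live under the store being measured; the path is hypothetical:

  find /store1 -type f -links +1 -ls | \
    awk '{ print $1, $4, $7 }' | sort -n | uniq | \
    awk '{ saved += ($2 - 1) * $3 } END { printf("%.0f bytes saved by hard links\n", saved) }'

The first awk keeps the inode ($1), link count ($4) and size ($7) from 'find -ls'; sort | uniq collapses each inode to one line; the second awk counts every link beyond the first as space saved.]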
Re: [zfs-discuss] ZFS Hard link space savings
On 13/06/11 11:36 AM, Jim Klimov wrote:

Some time ago I wrote a script to find any duplicate files and replace them with hardlinks to one inode. Apparently this is only good for same files which don't change separately in future, such as distro archives. I can send it to you offlist, but it would be slow in your case because it is not quite the tool for the job (it will start by calculating checksums of all of your files ;) )

What you might want to do and script up yourself is a recursive listing 'find /var/opt/SUNWmsqsr/store/partition... -ls'. This would print you the inode numbers, file sizes and link counts. Pipe it through something like this:

find ... -ls | awk '{print $1, $4, $7}' | sort | uniq

And you'd get 3 columns - inode, count, size. My AWK math is a bit rusty today, so I present a monster-script like this to multiply and sum up the values:

( find ... -ls | awk '{print $1, $4, $7}' | sort | uniq | awk '{ print $2*$3 "+\\" }'; echo 0 ) | bc

This looks something like what I thought would have to be done; I was just looking to see if there was something tried and tested before I had to invent something. I was really hoping in zdb there might have been some magic information I could have tapped into.. ;)

Can be done cleaner, i.e. in a Perl one-liner, and if you have many values that would probably complete faster too. But as a prototype this would do. HTH, //Jim

PS: Why are you replacing the cool Sun Mail? Is it about Oracle licensing and the now-required purchase and support cost?

Yes, it is about cost mostly. We had Sun Mail for our staff and students. We had 20,000+ students on it up until Christmas time as well. We have now migrated them to M$ Live@EDU. This leaves us with 1500 staff left who all like to use LookOut. The Sun connector for LookOut is a bit flaky at best. But the Oracle licensing cost for Messaging and Calendar starts at 10,000 users plus, and so is now rather expensive for what mailboxes we have left. M$ also heavily discounts Exchange CALs to Edu, and Oracle is not as friendly as Sun was with their JES licensing. So it is bye bye Sun Messaging Server for us.
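[The Perl variant Jim mentions could look something like the following; a sketch only, assuming the same 'find -ls' column layout (inode in field 1, link count in field 4, size in field 7) and a hypothetical store path:

  find /store1 -type f -links +1 -ls | \
    perl -lane '$sv{$F[0]} ||= ($F[3] - 1) * $F[6];
                END { $t += $_ for values %sv; print "$t bytes saved by hard links" }'

Each inode is recorded once with its saved bytes ((links - 1) * size); the END block sums the hash, avoiding the sort | uniq pass entirely.]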
Re: [zfs-discuss] OT: anyone aware how to obtain 1.8.0 for X2100M2?
Hi,

Took me a couple of minutes to find the download for this in My Oracle Support. Search for the patch like this:

Patches and Updates Panel -> Patch Search -> Patch Name or Number is: 10275731

Pretty easy really.

Scott.

PS. I found that patch by using product or family equals x2100 and it found it for me easily.

On 20/12/2010 1:04 p.m., Jerry Kemp wrote:

Eugen, I would 2nd your observation. I *do* have several support contracts, and as I review my Oracle profile, it does show that I am authorized to download patches, among other items. I really haven't downloaded a lot since SunSolve was killed off. Do others on the list have access to download stuff like this? Or is there some other place within Oracle's site that makes Eugen's link obsolete?

Jerry

On 12/19/10 12:28, Eugen Leitl wrote:

I realize this is off-topic, but Oracle has completely screwed up the support site from Sun. I figured someone here would know how to obtain Sun Fire X2100 M2 Server Software 1.8.0. Image contents:

* BIOS is version 3A21
* SP is updated to version 3.24 (ELOM)
* Chipset driver is updated to 9.27

from http://www.sun.com/servers/entry/x2100/downloads.jsp

I've been trying for an hour, and I'm at the end of my rope.
Re: [zfs-discuss] ZFS HW RAID
Bob Friesenhahn wrote:

On Fri, 18 Sep 2009, David Magda wrote:

If you care to keep your pool up and alive as much as possible, then mirroring across SAN devices is recommended. One suggestion I heard was to get a LUN that's twice the size, and set copies=2. This way you have some redundancy for incorrect checksums.

This only helps for block-level corruption. It does not help much at all if a whole LUN goes away. It seems best for single disk rpools.

I second this. In my experience you are more likely to have a single LUN go missing for some reason or another, and it seems most prudent to support any production data volume with at the very minimum a mirror. This also gives you 2 copies in a far more resilient way generally. (And per my other post, there can be other niceties that come with it as well when coupled with SAN based LUNs.)
Re: [zfs-discuss] problem with zfs
The latest official Solaris 10 is actually 05/09. There are update patch bundles available on SunSolve for free download that will take you to 05/09. It may well be worth applying these to see if they remedy the problem for you. They certainly allow you to bring ZFS up to version 10, from recollection. I have upgraded 30 plus systems with these and haven't experienced any issues (both SPARC and x86).

http://sunsolve.sun.com/pdownload.do?target=10_sparc_0509_patchbundle_part1.zip
http://sunsolve.sun.com/pdownload.do?target=10_sparc_0509_patchbundle_part2.zip
http://sunsolve.sun.com/pdownload.do?target=10_sparc_0509_patchbundle_part3.zip
http://sunsolve.sun.com/pdownload.do?target=10_sparc_0509_patchbundle_part4.zip

serge goyette wrote:

for release sorry i meant

Solaris 10 10/08 s10s_u6wos_07b SPARC Copyright 2008 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 27 October 2008
Re: [zfs-discuss] problem with zfs
serge goyette wrote:

actually i did apply the latest recommended patches

Recommended patches and upgrade clusters are different, by the way: 10_Recommended != Upgrade Cluster. An upgrade cluster will effectively upgrade the system to the Solaris release that the cluster targets, minus any new features that arrived in the newer OS release.

SunOS VL-MO-ZMR01 5.10 Generic_139555-08 sun4v sparc SUNW,SPARC-Enterprise-T5120

but still perhaps you are not doing much import - export because when i do not do, i do not experience much problem but when doing it, outch ...

Sure, I import and export pools. But generally this is for moving the pool to another system. I think we would need more information about the pool and its file systems to be able to help you. Specifically, maybe the output of 'zpool history' and 'zfs list' for starters. This will at least allow some specific data to try and help resolve your issues. The question as it stands is pretty generic. Have you upgraded your pools after the patches as well, with 'zpool upgrade' and 'zfs upgrade'?

a reboot will solve until next time -sego-
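[For reference, the commands suggested above; the pool name 'tank' is hypothetical, and note that upgrading pool/filesystem versions is one-way:

  zpool history tank      # show every administrative action on the pool
  zfs list -r tank        # show all file systems and their space usage
  zpool upgrade tank      # upgrade the pool to the newest on-disk version
  zfs upgrade -r tank     # upgrade all file systems in the pool
]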
Re: [zfs-discuss] How to find poor performing disks
Also you may wish to look at the output of 'iostat -xnce 1' as well. You can post those to the list if you have a specific problem. You want to be looking for error counts increasing, and specifically 'asvc_t' for the service times on the disks. A higher number for asvc_t may help to isolate poorly performing individual disks.

Scott Meilicke wrote:

You can try:

zpool iostat -v pool_name 1

This will show you IO on each vdev at one second intervals. Perhaps you will see different IO behavior on any suspect drive.

-Scott
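[As a rough illustration, a one-liner along these lines can flag disks with high service times; the 50 ms threshold is an arbitrary assumption, tune it for your storage:

  iostat -xn 1 10 | awk '$NF ~ /^c[0-9]/ && $8 + 0 > 50 { print $NF, "asvc_t =", $8, "ms" }'

This prints the device name and its asvc_t (field 8 of 'iostat -xn' output) whenever a one-second sample exceeds the threshold.]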
Re: [zfs-discuss] zfs fragmentation
Ed Spencer wrote:

I don't know of any reason why we can't turn 1 backup job per filesystem into, say, up to 26, based on the cyrus file and directory structure.

No reason whatsoever. Sometimes the more the better, as per the rest of this thread. The key here is to test and tweak till you get the optimal arrangement of backup window time and performance. Performance tuning is a little bit of a journey that sooner or later has a final destination. ;)

The cyrus file and directory structure is designed with users located under the directories A, B, C, D, etc. to deal with the millions of little files issue at the filesystem layer.

The Sun Messaging Server actually hashes the user names into a structure which looks quite similar to a squid cache store. This has a top level of 128 directories, each of which in turn contains 128 directories, which then contain a folder for each user that has been mapped into that structure by the hash algorithm on the user name. I use a wildcard mapping to split this into 16 streams to cover the 0-9, a-f of the hexadecimal directory structure names, e.g. /mailstore1/users/0*

Our backups will have to be changed to use this design feature. There will be a little work on the front end to create the jobs, but once done the full backups should finish in a couple of hours.

The nice thing about this work is it really is only a one-off configuration in the backup software and then it is done. Certainly works a lot better than something like ALL_LOCAL_DRIVES in NetBackup, which effectively forks one backup thread per file system.

As an aside, we are currently upgrading our backup server to a sun4v machine. This architecture is well suited to running more jobs in parallel.

I use a T5220 with staging to a J4500 with 48 x 1 TB disks in a zpool with 6 file systems. This then gets streamed to 6 LTO4 tape drives in an SL500. Needless to say this supports a high degree of parallelism and generally finds the source server to be the bottleneck. I also take advantage of the 10 GigE capability built straight into the UltraSPARC T2. The only major bottleneck in this system is the SAS interconnect to the J4500.

Thanx for all your help and advice. Ed

On Tue, 2009-08-11 at 22:47, Mike Gerdts wrote:

On Tue, Aug 11, 2009 at 9:39 AM, Ed Spencer ed_spen...@umanitoba.ca wrote:

We backup 2 filesystems on tuesday, 2 filesystems on thursday, and 2 on saturday. We backup to disk and then clone to tape. Our backup people can only handle doing 2 filesystems per night. Creating more filesystems to increase the parallelism of our backup is one solution, but it's a major redesign of the mail system.

What is magical about a 1:1 mapping of backup job to file system? According to the Networker manual[1], a save set in Networker can be configured to back up certain directories. According to some random documentation about Cyrus[2], mail boxes fall under a pretty predictable hierarchy.

1. http://oregonstate.edu/net/services/backups/clients/7_4/admin7_4.pdf
2. http://nakedape.cc/info/Cyrus-IMAP-HOWTO/components.html

Assuming that the way that your mailboxes get hashed falls into a structure like $fs/b/bigbird and $fs/g/grover (and not just $fs/bigbird and $fs/grover), you should be able to set a save set per top level directory or per group of a few directories. That is, create a save set for $fs/a, $fs/b, etc. or $fs/a - $fs/d, $fs/e - $fs/h, etc. If you are able to create many smaller save sets and turn the parallelism up, you should be able to drive more throughput.
I wouldn't get too worried about ensuring that they all start at the same time[3], but it would probably make sense to prioritize the larger ones so that they start early and the smaller ones can fill in the parallelism gaps as the longer-running ones finish.

3. That is, there is sometimes benefit in having many more jobs to run than you have concurrent streams. This avoids having one save set that finishes long after all the others because of poorly balanced save sets.

Couldn't agree more, Mike.

-- Mike Gerdts http://mgerdts.blogspot.com/
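[Scott's 16-stream wildcard split could be prototyped with something like the following; a sketch only - 'backup_cmd' is a placeholder for the real backup client invocation, and the store path is hypothetical:

  #!/bin/ksh
  # One backup stream per top-level hex bucket of the hashed user tree.
  STORE=/mailstore1/users
  for d in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
      backup_cmd "$STORE/$d"* &    # launch each stream in the background
  done
  wait                             # block until all 16 streams finish
]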
Re: [zfs-discuss] zfs fragmentation
[...] will not scale well as the number of files increases. It doesn't matter what file system you use, the scalability will look more-or-less similar. For millions of files, ZFS send/receive works much better. More details are in my paper.

I look forward to reading this, Richard. I think it will be an interesting read for members of this list.

We will have to do something to address the problem. A combination of what I just listed is our probable course of action. (Much testing will have to be done to ensure our solution will address the problem, because we are not 100% sure what is the cause of the performance degradation.) I'm also dealing with Network Appliance to see if there is anything we can do at the filer end to increase performance. But I'm holding out little hope.

DNLC hit rate? Also, is atime on? Turning atime off may make a big difference for you. It certainly does for Sun Messaging Server. Maybe worth doing and reposting the result?

But please, don't miss the point I'm trying to make. ZFS would benefit from a utility or a background process that would reorganize files and directories in the pool to optimize performance. A utility to deal with filesystem entropy. Currently a zfs pool will live as long as the lifetime of the disks that it is on, without reorganization. This can be a long, long time. Not to mention slowly expanding the pool over time contributes to the issue.

This does not come for free in either performance or risk. It will do nothing to solve the directory walker's problem.

Agree. It will have little bearing on the outcome for the reason you mention.

NB, people who use UFS don't tend to see this because UFS can't handle millions of files.

It can, but only if you have file systems smaller than about 1 TB - not large by ZFS standards. They do work, but with the same performance issue for directory walker backups. Heaven help you in fsck'ing them after a system crash. Hours and hours.

-- richard
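[If atime is on, turning it off is a one-line property change; the pool/filesystem name here is hypothetical:

  zfs set atime=off mailpool/store    # stop updating access times on every read
  zfs get atime mailpool/store        # confirm the setting took effect
]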
Re: [zfs-discuss] grow zpool by replacing disks
Tobias Exner wrote:

Hi list, some months ago I spoke with a ZFS expert at a Sun Storage event. He told me it's possible to grow a zpool by replacing every single disk with a larger one. After replacing and resilvering all disks of this pool, ZFS will provide the new size automatically. Now I found time to check that, and I was not able to grow the pool. It's still the same size as before. Maybe I missed one step? Or this functionality is not implemented yet?

Can you post the steps that you took? Output of 'zpool history' would be helpful.

For your info: I tested this using VMware and Solaris 10u6. Any comments or ideas?

Did you export and then import the pool afterwards? From recollection, this step is needed for the system to make the new capacity available in the pool. I did this about 3-4 weeks ago on a mirror, and this step was required to make the change in capacity seen by the system. This was on Solaris 10 10/08, which is the same as S10u6.

Thank you in advance... Tobias
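[A sketch of the full sequence for a two-way mirror, with hypothetical device names; on releases of that era the export/import at the end is what exposes the extra capacity:

  zpool replace tank c1t0d0 c2t0d0   # swap in the first larger disk
  zpool status tank                  # wait until resilvering completes
  zpool replace tank c1t1d0 c2t1d0   # then swap the second disk
  zpool status tank                  # wait for the second resilver
  zpool export tank
  zpool import tank                  # pool now reports the larger size
]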
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40
Dave Stubbs wrote:

I don't mean to be offensive Russel, but if you do ever return to ZFS, please promise me that you will never, ever, EVER run it virtualized on top of NTFS (a.k.a. worst file system ever) in a production environment. Microsoft Windows is a horribly unreliable operating system in situations where things like protecting against data corruption are important. Microsoft knows this

Oh WOW! Whether or not our friend Russel virtualized on top of NTFS (he didn't - he used raw disk access) this point is amazing! System5 - based on this thread I'd say you can't really make this claim at all. Solaris suffered a crash and the ZFS filesystem lost EVERYTHING! And there aren't even any recovery tools? HANG YOUR HEADS!!! Recovery from the same situation is EASY on NTFS. There are piles of tools out there that will recover the file system, and failing that, locate and extract data.

You mean the data that you don't know you have lost yet? ZFS allows you to be very paranoid about data protection with things like copies=2,3,4 etc etc..

The key parts of the file system are stored in multiple locations on the disk just in case. It's been this way for over 10 years. I'd say it seems from this thread that my data is a lot safer on NTFS than it is on ZFS! I can't believe my eyes as I read all these responses blaming system engineering and hiding behind ECC memory excuses and well, you know, ZFS is intended for more professional systems and not consumer devices, etc etc. My goodness! You DO realize that Sun has this website called opensolaris.org which actually proposes to have people use ZFS on commodity hardware, don't you? I don't see a huge warning on that site saying ATTENTION: YOU PROBABLY WILL LOSE ALL YOUR DATA.

I recently flirted with putting several large Unified Storage 7000 systems on our corporate network. The hype about ZFS is quite compelling and I had positive experience in my lab setting. But because of not having Solaris capability on our staff we went in another direction instead.

You do realize that the 7000 series machines are appliances and have no prerequisite for you to have any Solaris knowledge whatsoever? They are a supported device just like any other disk storage system that you can purchase from any vendor and have supported as such. To use it all you need is a web browser. That's it. This is no different than your EMC array or HP StorageWorks hardware, except that the underpinnings of the storage system are there for all to see in the form of open source code contributed to the community by Sun.

Reading this thread, I'm SO glad we didn't put ZFS in production in ANY way. Guys, this is the real world. Stuff happens. It doesn't matter what the reason is - hardware lying about cache commits, out-of-order commits, failure to use ECC memory, whatever. It is ABSOLUTELY unacceptable for the filesystem to be entirely lost. No excuse or rationalization of any type can be justified. There MUST be at least the base suite of tools to deal with this stuff. Without it, ZFS simply isn't ready yet.

Sounds like you have no real world experience of ZFS in production environments and its true reliability. As many people here report, there are thousands if not millions of zpools out there containing business critical environments that are happily fixing broken hardware on a daily basis. I have personally seen all sorts of pieces of hardware break, and ZFS corrected and fixed things for me. I personally manage 50 plus ZFS zpools that are anywhere from 100GB to 30 TB.
Works very, very, very well for me. I have never lost anything despite having had plenty of pieces of hardware break in some form underneath ZFS.

I am saving a copy of this thread to show my colleagues and also those Sun Microsystems sales people that keep calling.
Re: [zfs-discuss] avail drops to 32.1T from 40.8T after create -o mountpoint
Glen Gunselman wrote:

Here is the output from my J4500 with 48 x 1 TB disks. It is almost the exact same configuration as yours. This is used for NetBackup. As Mario just pointed out, zpool list includes the parity drive in the space calculation whereas zfs list doesn't. [r...@xxx /]# zpool status

Scott, Thanks for the sample zpool status output. I will be using the storage for NetBackup, also. (I am booting the X4500 from a SAN - 6140 - and using an SL48 w/2 LTO4 drives.)

Glen

Glen, if you want any more info about our configuration, drop me a line. It works very, very well and we have had no issues at all. This system is a T5220 (32 GB RAM) with the 48 TB J4500 connected via SAS. The system also has 3 dual port fibre channel HBAs feeding 6 LTO4 drives in a 540 slot SL500. The server is 10 gig attached straight to our network core routers and needless to say achieves very high throughput. I have seen it pushing the full capacity of the SAS link to the J4500 quite commonly. This is probably the choke point for this system.

/Scott
Re: [zfs-discuss] avail drops to 32.1T from 40.8T after create -o mountpoint
Glen Gunselman wrote:

This is my first ZFS pool. I'm using an X4500 with 48 x 1 TB drives. Solaris is 5/09. After the create, zpool list shows 40.8T, but after creating 4 filesystems/mountpoints the available space in zfs list drops 8.8TB to 32.1TB. What happened to the 8.8TB? Is this much overhead normal?

zpool create -f zpool1 raidz c1t0d0 c2t0d0 c3t0d0 c5t0d0 c6t0d0 \
    raidz c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 \
    raidz c6t1d0 c1t2d0 c2t2d0 c3t2d0 c4t2d0 \
    raidz c5t2d0 c6t2d0 c1t3d0 c2t3d0 c3t3d0 \
    raidz c4t3d0 c5t3d0 c6t3d0 c1t4d0 c2t4d0 \
    raidz c3t4d0 c5t4d0 c6t4d0 c1t5d0 c2t5d0 \
    raidz c3t5d0 c4t5d0 c5t5d0 c6t5d0 c1t6d0 \
    raidz c2t6d0 c3t6d0 c4t6d0 c5t6d0 c6t6d0 \
    raidz c1t7d0 c2t7d0 c3t7d0 c4t7d0 c5t7d0 \
    spare c6t7d0 c4t0d0 c4t4d0

zpool list
NAME     SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
zpool1  40.8T   176K  40.8T    0%  ONLINE  -

## create multiple file systems in the pool
zfs create -o mountpoint=/backup1fs zpool1/backup1fs
zfs create -o mountpoint=/backup2fs zpool1/backup2fs
zfs create -o mountpoint=/backup3fs zpool1/backup3fs
zfs create -o mountpoint=/backup4fs zpool1/backup4fs

zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
zpool1             364K  32.1T  28.8K  /zpool1
zpool1/backup1fs  28.8K  32.1T  28.8K  /backup1fs
zpool1/backup2fs  28.8K  32.1T  28.8K  /backup2fs
zpool1/backup3fs  28.8K  32.1T  28.8K  /backup3fs
zpool1/backup4fs  28.8K  32.1T  28.8K  /backup4fs

Thanks, Glen

(PS. As I said this is my first time working with ZFS, if this is a dumb question - just say so.)

Here is the output from my J4500 with 48 x 1 TB disks. It is almost the exact same configuration as yours. This is used for NetBackup. As Mario just pointed out, zpool list includes the parity drive in the space calculation whereas zfs list doesn't.

[r...@xxx /]# zpool status
  pool: nbupool
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        nbupool      ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t2d0   ONLINE       0     0     0
            c2t3d0   ONLINE       0     0     0
            c2t4d0   ONLINE       0     0     0
            c2t5d0   ONLINE       0     0     0
            c2t6d0   ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t7d0   ONLINE       0     0     0
            c2t8d0   ONLINE       0     0     0
            c2t9d0   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
            c2t11d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t12d0  ONLINE       0     0     0
            c2t13d0  ONLINE       0     0     0
            c2t14d0  ONLINE       0     0     0
            c2t15d0  ONLINE       0     0     0
            c2t16d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t17d0  ONLINE       0     0     0
            c2t18d0  ONLINE       0     0     0
            c2t19d0  ONLINE       0     0     0
            c2t20d0  ONLINE       0     0     0
            c2t21d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t22d0  ONLINE       0     0     0
            c2t23d0  ONLINE       0     0     0
            c2t24d0  ONLINE       0     0     0
            c2t25d0  ONLINE       0     0     0
            c2t26d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t27d0  ONLINE       0     0     0
            c2t28d0  ONLINE       0     0     0
            c2t29d0  ONLINE       0     0     0
            c2t30d0  ONLINE       0     0     0
            c2t31d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t32d0  ONLINE       0     0     0
            c2t33d0  ONLINE       0     0     0
            c2t34d0  ONLINE       0     0     0
            c2t35d0  ONLINE       0     0     0
            c2t36d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t37d0  ONLINE       0     0     0
            c2t38d0  ONLINE       0     0     0
            c2t39d0  ONLINE       0     0     0
            c2t40d0  ONLINE       0     0     0
            c2t41d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t42d0  ONLINE       0     0     0
            c2t43d0  ONLINE       0     0     0
            c2t44d0  ONLINE       0     0     0
            c2t45d0  ONLINE       0     0     0
            c2t46d0  ONLINE       0     0     0
        spares
          c2t47d0    AVAIL
          c2t48d0    AVAIL
          c2t49d0    AVAIL

errors: No known data errors

[r...@xxx /]# zfs list
NAME                    USED  AVAIL  REFER  MOUNTPOINT
NBU                     113G  20.6G   113G  /NBU
nbupool                27.5T  4.58T  30.4K  /nbupool
nbupool/backup1        6.90T  4.58T  6.90T  /backup1
nbupool/backup2        6.79T  4.58T  6.79T  /backup2
nbupool/backup3        7.28T  4.58T  7.28T  /backup3
nbupool/backup4        6.43T  4.58T  6.43T  /backup4
nbupool/nbushareddisk  20.1G  4.58T  20.1G  /nbushareddisk
nbupool/zfscachetest   69.2G  4.58T  69.2G  /nbupool/zfscachetest

[r...@xxx /]# zpool list
NAME      SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
NBU       136G   113G  22.8G  83%  ONLINE  -
nbupool  40.8T  34.4T  6.37T  84%  ONLINE  -
[r...@solnbu1 /]#
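[The missing space is almost exactly the raidz1 parity. A back-of-the-envelope check for Glen's layout - 9 x 5-disk raidz1 vdevs, so 36 of the 45 pool disks hold data:

  echo "scale=1; 40.8 * 36 / 45" | bc
  # => 32.6  -- close to the 32.1T that zfs list reports; the small
  #    remainder is pool metadata and reservation overhead.
]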
[zfs-discuss] L2ARC support in Solaris 10 (Update 8?)
Hi All,

Can anyone shed some light on whether L2ARC support will be included in the next Solaris 10 update, or if it is included in a kernel patch over and above the standard kernel patch rev that ships in 05/09 (AKA U7)?

The reason I ask is that I have standardised on S10 here and am not keen to deploy OpenSolaris in production. (Just another platform and patching system to document and maintain. I don't want to debate this here. It's the way it is.)

I am currently speccing some x4240's with SSDs for some upgraded Squid proxy caches that will be handling caching duties for around 40-60 megabits/s. Large disk caches and L1ARC for Squid will make these systems really fly. (These are replacing two v240's that are getting a little long in the tooth and won't keep up with the bandwidth jump.)

The plan is to have a couple of x4240's with dual quad core processors, 16 GB RAM and 6 x 146 GB 10K SAS drives plus 1 x 32 GB SSD as L2ARC. I can add this later if support is not available at build time but is road-mapped for U8.

ZFS config will be a pair of 146 GB drives mirrored as boot drives (and possibly access logging) and then a RAIDZ1 of 4 drives for max capacity (the data is disposable as it is purely cached object data). Compression will be enabled on the disk cache RAIDZ1 to increase performance of cached data read from disk (seeing as I have many CPU cycles to burn in these systems ;) ).

I am hoping that these systems will have an L1ARC of around 10GB, L2ARC of 32GB and a cache volume of ~420GB RAIDZ plus compression. We may add more drives or RAIDZ1s as we tweak the Squid cached object size. We are hoping to cache objects up to around 100 MB.

Any comments on either system configuration and/or L2ARC support are invited from the list.

Thanks, Scott.
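[For reference, on releases where L2ARC is supported, adding the SSD is a one-liner; the pool and device names here are hypothetical:

  zpool add cachepool cache c1t6d0   # attach the SSD as an L2ARC cache device
  zpool iostat -v cachepool          # the device appears under a 'cache' section
]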
Re: [zfs-discuss] Migrating a zfs pool to another server
Peter Farmer wrote:

Super! Does the export need to be called just before I import the pool to another server?

Yes, that is correct.

Or can the export be called at the time the pool is created?

No. It must be done on the server that is exporting the pool, so that it can be imported as Daniel explained.

Because in a failover I wouldn't be able to export the pool before importing it.

In that case you would do a 'zpool import -f $mypoolname' on the new server. The '-f' will forcibly import the pool into the system, provided the pool isn't horribly broken. In my experience this has worked fine on numerous occasions where this circumstance has arisen.

Thanks, Peter

2009/7/20 Daniel J. Priem daniel.pr...@disy.net:

Peter Farmer pfarmer.li...@googlemail.com writes:

Hi All, I have a zfs pool setup on one server, the pool is made up of 4 iSCSI LUNs. Is it possible to migrate the zfs pool to another server? Each of the iSCSI LUNs would be available on the other server. Thanks,

yes.
zpool export $mypoolname on the old server
zpool import $mypoolname on the new server

regards daniel
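[Putting the two cases together; the pool name 'tank' is hypothetical:

  # Planned migration - clean hand-off:
  zpool export tank        # run on the old server
  zpool import tank        # run on the new server

  # Failover - the old server died and never exported:
  zpool import -f tank     # force the import on the new server
]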
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
I added a second LUN identical in size as a mirror and reran the test. Results are more in line with yours now.

./zfs-cache-test.ksh test1
System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise M3000 Server
System architecture: sparc
System release level: 5.10 Generic_139555-08
CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc

Pool configuration:
  pool: test1
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Wed Jul 15 11:38:54 2009
config:

        NAME                                       STATE     READ WRITE CKSUM
        test1                                      ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c3t600A0B8000562264039B4A257E11d0  ONLINE       0     0     0
            c3t600A0B8000336DE204394A258B93d0  ONLINE       0     0     0

errors: No known data errors

zfs create test1/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /test1/zfscachetest ...
Done!
zfs unmount test1/zfscachetest
zfs mount test1/zfscachetest

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real 3m25.13s
user 0m2.67s
sys 0m28.40s

Doing second 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real 8m53.05s
user 0m2.69s
sys 0m32.83s

Feel free to clean up with 'zfs destroy test1/zfscachetest'.

Scott Lawson wrote:
[... earlier M3000 results quoted; the full message appears later in this thread ...]

Bob Friesenhahn wrote:
[... original test announcement quoted; the full message appears later in this thread ...]
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob Friesenhahn wrote:

On Wed, 15 Jul 2009, Scott Lawson wrote:

            c3t600A0B8000562264039B4A257E11d0  ONLINE       0     0     0
            c3t600A0B8000336DE204394A258B93d0  ONLINE       0     0     0

Each of these LUNs is a pair of 146GB 15K drives in a RAID 1 on Crystal firmware on a 6140. The LUNs are 2km apart in different data centres: 1 LUN where the server is, 1 remote. Interestingly, by creating the mirror vdev the first run got faster, and the second much, much slower. The second cpio took an extra 2 minutes by virtue of it being a mirror. I ran the script once again prior to adding the mirror and the results were pretty much the same as the first run posted (plus or minus a couple of seconds, which is to be expected as these LUNs are on prod arrays feeding other servers as well). I will try these tests on some of my J4500's when I get a chance shortly. My interest is now piqued.

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real 3m25.13s
user 0m2.67s
sys 0m28.40s

It is quite impressive that your little two disk mirror reads as fast as mega Sun systems with 38+ disks and striped vdevs to boot. Incredible! Does this have something to do with your well-managed power and cooling? :-)

Maybe it is Bob, maybe it is. ;) haha.
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
This system has 32 GB of RAM so I will probably need to increase the data set size.

[r...@x tmp]# ./zfs-cache-test.ksh nbupool
System Configuration: Sun Microsystems sun4v SPARC Enterprise T5220
System architecture: sparc
System release level: 5.10 Generic_141414-02
CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc

Pool configuration:
  pool: nbupool
 state: ONLINE
 scrub: none requested
config: [the same 9 x 5-disk raidz1 layout with 3 spares listed in full in the 'avail drops' thread above]

errors: No known data errors

zfs create nbupool/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /nbupool/zfscachetest ...
Done!
zfs unmount nbupool/zfscachetest
zfs mount nbupool/zfscachetest

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real 3m37.24s
user 0m9.87s
sys 1m54.08s

Doing second 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real 1m59.11s
user 0m9.93s
sys 1m49.15s

Feel free to clean up with 'zfs destroy nbupool/zfscachetest'.

Scott Lawson wrote:
[... earlier M3000 results quoted; the full message appears later in this thread ...]
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Bob,

Output of my run for you. System is an M3000 with 16 GB RAM and 1 zpool called test1, which is contained on a RAID 1 volume on a 6140 with 7.50.13.10 firmware on the RAID controllers. The RAID 1 is made up of two 146GB 15K FC disks. This machine is brand new with a clean install of S10 05/09. It is destined to become an Oracle 10 server with ZFS filesystems for zones and DB volumes.

[r...@xxx /]# uname -a
SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise
[r...@xxx /]# cat /etc/release
Solaris 10 5/09 s10s_u7wos_08 SPARC
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms. Assembled 30 March 2009
[r...@xxx /]# prtdiag -v | more
System Configuration: Sun Microsystems sun4u Sun SPARC Enterprise M3000 Server
System clock frequency: 1064 MHz
Memory size: 16384 Megabytes

Here is the run output for you.

[r...@xxx tmp]# ./zfs-cache-test.ksh test1
zfs create test1/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /test1/zfscachetest ...
Done!
zfs unmount test1/zfscachetest
zfs mount test1/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real 4m48.94s
user 0m21.58s
sys 0m44.91s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 6m39.87s
user 0m21.62s
sys 0m46.20s

Feel free to clean up with 'zfs destroy test1/zfscachetest'.

Looks like a 25% performance loss for me. I was seeing around 80MB/s sustained on the first run and around 60MB/s sustained on the second.

/Scott.

Bob Friesenhahn wrote:

There has been no forward progress on the ZFS read performance issue for a week now. A 4X reduction in file read performance due to having read the file before is terrible, and of course the situation is considerably worse if the file was previously mmapped as well. Many of us have sent a lot of money to Sun and were not aware that ZFS is sucking the life out of our expensive Sun hardware.

It is trivially easy to reproduce this problem on multiple machines. For example, I reproduced it on my Blade 2500 (SPARC) which uses a simple mirrored rpool. On that system there is a 1.8X read slowdown from the file being accessed previously.

In order to raise visibility of this issue, I invite others to see if they can reproduce it in their ZFS pools. The script at

http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh

implements a simple test. It requires a fair amount of disk space to run, but the main requirement is that the disk space consumed be more than available memory so that file data gets purged from the ARC. The script needs to run as root since it creates a filesystem and uses mount/umount. The script does not destroy any data. There are several adjustments which may be made at the front of the script. The pool 'rpool' is used by default, but the name of the pool to test may be supplied via an argument similar to:

# ./zfs-cache-test.ksh Sun_2540
zfs create Sun_2540/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /Sun_2540/zfscachetest ...
Done!
zfs unmount Sun_2540/zfscachetest
zfs mount Sun_2540/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real 2m54.17s
user 0m7.65s
sys 0m36.59s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 11m54.65s
user 0m7.70s
sys 0m35.06s

Feel free to clean up with 'zfs destroy Sun_2540/zfscachetest'.

And here is a similar run on my Blade 2500 using the default rpool:

# ./zfs-cache-test.ksh
zfs create rpool/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under /rpool/zfscachetest ...
Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real 13m3.91s
user 2m43.04s
sys 9m28.73s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real 23m50.27s
user 2m41.81s
sys 9m46.76s

Feel free to clean up with 'zfs destroy rpool/zfscachetest'.

I am interested to hear about systems which do not suffer from this bug.

Bob
Re: [zfs-discuss] ZFS, power failures, and UPSes
Haudy Kazemi wrote:

Hello, I've looked around Google and the zfs-discuss archives but have not been able to find a good answer to this question (and the related questions that follow it): How well does ZFS handle unexpected power failures (e.g. environmental power failures, power supply dying, etc.)? Does it consistently gracefully recover?

Mostly, unless you are unlucky. Backups are your friend in *any* environment though.

Should having a UPS be considered a (strong) recommendation or a don't even think about running without it item?

There has been quite an interesting thread on this over the last few months. I won't repeat my comments, but it is there in digital posterity in the zfs-discuss archives. Certainly in a large environment with a lot of data being written, one should consider this a mandatory requirement if you care about your data, particularly if there are many links in your storage chain that could suffer data corruption from a power failure.

Are there any communications/interfacing caveats to be aware of when choosing the UPS? In this particular case, we're talking about a home file server running OpenSolaris 2009.06.

As far as a home server goes, particularly if it is not write intensive, then you will 'most likely' be fine. I have a home one with a v120 running S10 u6 with a D1000 and 7 x 300 GB SCSI disks in a RAIDZ2 that has seen numerous power interruptions with no faults. This machine is a Samba server for my Macs and printing business. I also have another mail / web server on another v120 which experiences the same power faults and regularly bounces back without issues. But your mileage may vary. It all really depends on how much you care about the data. I haven't used OpenSolaris specifically however, as I prefer the generally more well supported S10 releases. (Yes, I know you can get support for OpenSolaris, but I tend to be conservative and standardize as much as possible. I do have millions of files stored on ZFS volumes for our Uni and I sleep well ;).) Actual environment power failures are generally 1 per year.

I know there are a few blog articles about this type of application, but I don't recall seeing any (or any detailed) discussion about power failures and UPSes as they relate to ZFS. I did see that the ZFS Evil Tuning Guide says cache flushes are done every 5 seconds.

The flush time you mention is based on older versions of ZFS; newer ones can have a flush interval as long as 30 seconds, I believe.

Here is one post that didn't get any replies about a year ago after someone had a power failure, then UPS battery failure while copying data to a ZFS pool: http://lists.macosforge.org/pipermail/zfs-discuss/2008-July/000670.html

Both theoretical answers and real life experiences would be appreciated, as the former tells me where ZFS is needed while the latter tells me where it has been or is now.

Thanks, -hk
Re: [zfs-discuss] ZFS, power failures, and UPSes
David Magda wrote:

On Jun 30, 2009, at 14:08, Bob Friesenhahn wrote:

I have seen UPSs help quite a lot for short glitches lasting seconds, or a minute. Otherwise the outage is usually longer than the UPSs can stay up, since the problem required human attention. A standby generator is needed for any long outages.

Can't remember where I read the claim, but supposedly if power isn't restored within about ten minutes, then it will probably be out for a few hours. If this 'statistic' is true, it would mean that your UPS should last (say) fifteen minutes, and after that you really need a generator.

Most UPSs from any vendor are designed to run for around 12 minutes at full load. So that would appear to back that claim up, and from my experience that is pretty much on the money...

At $WORK we currently have about thirty minutes worth of juice at full load, but as time drags on and we start shutting down less essential stuff we can increase that. The PBX and security system have their own UPSes in their own racks, so there are two layers of battery there.

The problem comes when the power cut happens in the middle of the night and you aren't there. Then you either need an automated shutdown system instigated by traps from the UPS (shutting things down in the correct order) or a generator. About here the generator becomes a very good option. The no-generator scenario above needs to be consistently tested to maintain its validity, which is a royal pain in the neck. Gen sets are worth their weight in gold. I can't even count how many times in the last few years they have saved our bacon (through both planned and unplanned outages).
Re: [zfs-discuss] Increase size of ZFS mirror
Thomas Maier-Komor wrote:

Ben schrieb:

Thomas, could you post an example of what you mean (i.e. commands in the order to use them)? I've not played with ZFS that much and I don't want to muck my system up (I have data backed up, but am more concerned about getting myself in a mess and having to reinstall, thus losing my configurations). Many thanks for both of your replies, Ben

I'm not an expert on this, and I haven't tried it, so beware:

1) If the pool you want to expand is not the root pool:

$ zpool export mypool
# now replace one of the disks with a new disk
$ zpool import mypool
# zpool status will show that mypool is in a degraded state because of a missing disk
$ zpool replace mypool replaceddisk
# now the pool will start resilvering. Once it is done with resilvering:
$ zpool detach mypool otherdisk
# now physically replace otherdisk
$ zpool replace mypool otherdisk

This will all work well. But I have a couple of suggestions for you as well. If you are using mirrored vdevs, then you can also grow the vdev by making it a 3 or 4 way mirror. This way you don't lose your resiliency in your vdev whilst you are migrating to larger disks. Of course you have to be able to accommodate the extra device in your system, either via a spare drive bay in a storage enclosure, or SAN or iSCSI based LUNs. When you have a lot of data and the business requires you to minimize risk as much as possible, this is a good idea. The pool was only offline for 14 seconds to gain the extra space, and at all times there were *always* two devices in my mirror vdev.

Here is a cut and paste of this process from just the other day with a live production server where the maintenance window was only 5 minutes. This pool was increased from 300 to 500 GB on LUNs from two disparate datacentres.

2009-06-17.13:57:05 zpool attach blackboard c4t600C0FF00924686710D4CF02d0 c4t600C0FF00082CA2312B99E05d0
2009-06-17.18:12:14 zpool detach blackboard c4t600C0FF00080797CC7A87F02d0
2009-06-17.18:12:57 zpool attach blackboard c4t600C0FF00924686710D4CF02d0 c4t600C0FF00086136F22B65F05d0
2009-06-17.20:02:00 zpool detach blackboard c4t600C0FF00924686710D4CF02d0
2009-06-18.05:58:52 zpool export blackboard
2009-06-18.05:59:06 zpool import blackboard

For home users this is probably overkill, but I thought I would mention it for more enterprise type people that are maybe familiar with DiskSuite and not so much with ZFS.

2) If you are working on the root pool, just skip the export/import part and boot with only one half of the mirror. Don't forget to run installgrub after replacing a disk.

HTH, Thomas
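[Generalized, Scott's history above corresponds to a sequence like this; a sketch with hypothetical device names, waiting for each resilver to finish before detaching:

  zpool attach tank c1t0d0 c2t0d0   # grow the mirror to 3-way with the first large LUN
  zpool detach tank c1t1d0          # after resilver: drop one small LUN
  zpool attach tank c2t0d0 c2t1d0   # add the second large LUN
  zpool detach tank c1t0d0          # after resilver: drop the last small LUN
  zpool export tank                 # a brief export/import picks up
  zpool import tank                 # the new, larger capacity
]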
Re: [zfs-discuss] ZFS Path = ???
Howard, I have certainly seen this with other apps making the ZFS console disappear. This is what I use to make it available again. I have commonly had this happen when I have installed CAM into the webconsole in the past. (Don't think it happened the last time I did however ;)) # wcadmin deploy -a zfs -x zfs /usr/share/webconsole/webapps/zfs Also, if you wish to make the webconsole accessible from more than just the localhost, use: # svccfg -s svc:/system/webconsole setprop options/tcp_listen = true # smcwebserver restart Hope this helps, Scott. cindy.swearin...@sun.com wrote: Hi Howard, Which Solaris release is this? You shouldn't have to register the ZFS app, but other problems prevented the ZFS GUI tool from launching successfully in the Solaris 10 release. If you can provide the Solaris release info and specific error messages, I can try to get some answers. Thanks, Cindy Howard Huntley wrote: I am running ZFS on a Sun Blade 100 with two 80 GB drives. When I upgraded the OS to the latest version of Solaris, ZFS did not register in the Java Web Console. I have to run the command smreg add -a /directory/containing/application-file to manually reregister the zfs app. Can anyone tell me the correct path to use in this command? Password = howard http://imageevent.com/hhuntley/computerlab/computerlab -- _ Scott Lawson Systems Architect Information Communication Technology Services Manukau Institute of Technology Private Bag 94006 South Auckland Mail Centre Manukau 2240 Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz __ perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);' __ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SAS 15K drives as L2ARC
Roger Solano wrote: Hello, Does it make any sense to use a bunch of 15K SAS drives as L2ARC cache for several TBs of SATA disks? For example: A STK2540 storage array with this configuration: * Tray 1: Twelve (12) 146 GB @ 15K SAS HDDs. * Tray 2: Twelve (12) 1 TB @ 7200 SATA HDDs. Just thought I would point out that these are hardware-backed RAID arrays. You might be better off using the J4200 instead for this, so ZFS can manage the disks completely as well. It will probably be cheaper too! The savings could be put towards some SSDs or more system RAM for the L1ARC. I was thinking about using disks from Tray 1 as L2ARC for Tray 2 and putting all of these disks in one (1) ZFS storage pool. This pool would be used mainly as an astronomical image repository, shared via NFS from a Sun Fire X2200. Is it worth doing? Thanks in advance for any help. Regards, Roger -- http://www.sun.com *Roger Solano* Solutions Architect ACC Region - Venezuela *Sun Microsystems, Inc.* Phone: +58-212-905-3800 Fax: +58-212-905-3811 Email: roger.sol...@sun.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
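For reference, cache devices can be added to (and removed from) an existing pool at any time, so this sort of experiment is cheap to try and cheap to undo. Pool and device names below are made up:

# zpool add tank cache c1t0d0 c1t1d0   # add two fast drives as L2ARC
# zpool remove tank c1t0d0             # cache devices can be removed again

Note that zpool remove works for cache and spare devices, not for data vdevs, so there is little lock-in in trying it.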
Re: [zfs-discuss] SAS 15K drives as L2ARC
Bob Friesenhahn wrote: On Thu, 7 May 2009, Scott Lawson wrote: A STK2540 storage array with this configuration: * Tray 1: Twelve (12) 146 GB @ 15K SAS HDDs. * Tray 2: Twelve (12) 1 TB @ 7200 SATA HDDs. Just thought I would point out that these are hardware-backed RAID arrays. You might be better off using the J4200 instead for this, so ZFS can manage the disks completely as well. It will probably be cheaper too! The savings could be put towards some SSDs or more system RAM for the L1ARC. Something nice about the STK2540 solution is that if the server system dies, the STK2540s can quickly be swung over to another system via a quick 'zpool import'. Sure, provided they have it attached to a fibre channel switch or have a nice long fibre lead. The difference is negligible other than cost. Roger replied off list and mentioned the customer has the 2540 already, so my suggestion is moot for him anyway. FYI, I have relocated zpools both ways, with SAN-attached 3510s, 3511s and 6140s, and SAS-attached J4500's. Both ways work just fine. One is cheaper. ;) Being that he was mentioning astronomical data, which I know means large datasets, I just thought I would point out that the 2540 probably wouldn't offer the best bang for buck for this NFS server. That's all. If SSDs are embedded inside the server system then it is necessary to physically move the log devices to the new system. It is possible to buy these J series JBODs with bundled SSDs as well right now. The log device would be contained in this chassis, which would facilitate easy importing and exporting in the case of a system shift being required. The issue of how to quickly recover after the server dies seems to rarely be discussed here. Embedded log devices tend to make issues more complex. A dumb SAS array is certainly much cheaper and will perform at least as well, but it does seem like these newfangled embedded log devices cause an issue when maximum availability is desired. With SAS it is necessary to physically swing the cables to the replacement server, and of course the replacement server needs to be very close by. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + EMC Cx310 Array (JBOD ? Or Singe MetaLUN ?)
Wilkinson, Alex wrote: 0n Thu, Apr 30, 2009 at 11:11:55AM -0500, Bob Friesenhahn wrote: On Thu, 30 Apr 2009, Wilkinson, Alex wrote: I currently have a single 17TB MetaLUN that I am about to present to an OpenSolaris initiator and it will obviously be ZFS. However, I am constantly reading that presenting a JBOD and using ZFS to manage the RAID is best practice? I'm not really sure why? And isn't that a waste of a high-performing RAID array (EMC)? The JBOD advantage is that then ZFS can schedule I/O for the disks and there is less chance of an unrecoverable pool, since ZFS is assured to lay out redundant data on redundant hardware, and ZFS uses more robust error detection than the firmware on any array. When using mirrors there is considerable advantage since writes and reads can be concurrent. That said, your EMC hardware likely offers much nicer interfaces for indicating and replacing bad disk drives. With the ZFS JBOD approach you have to back-track from what ZFS tells you (a Solaris device ID) and figure out which physical drive is not behaving correctly. EMC tech support may not be very helpful if ZFS says there is something wrong but the RAID array says there is not. Sometimes there is value in taking advantage of what you paid for. So forget ZFS and use UFS? Or use UFS with a ZVOL? Or just use Vx{VM,FS}? It kinda sucks that you get no benefit from using such a killer volume manager + filesystem with an EMC array :( -aW Besides the volume management aspects of ZFS and self-healing etc, you still get other benefits by virtue of using ZFS. Depending on *your* requirements, they can arguably be more beneficial, if you are happy with the reliability of your underlying storage. Specifically I am talking of ZFS snapshots, rollbacks, cloning, clone promotion, file system quotas, multiple block copies, compression, (encryption soon) etc. I have used snapshots, rollbacks and cloning quite successfully in complex upgrades of systems with multiple packages and complex dependencies. A case in point was a Blackboard upgrade which had two servers, both with ZFS file systems: one for Blackboard and one for Oracle. The upgrade involved going through 3 versions of Oracle and 4 versions of Blackboard, where the process had potentially many places to go wrong. At every step of the way we performed a snapshot on both Oracle and Blackboard to allow us to roll back any particular part that we got wrong. This saved us an immense amount of time and money and is a good real-world example of where this side of ZFS has been extremely helpful. On the Oracle side this was infinitely faster than having to roll back the database itself. BB had some very large tables! Of course, to take maximum advantage of ZFS, then as everyone has mentioned it is a good idea to let ZFS manage the underlying raw disks if possible. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
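For anyone wanting to copy that upgrade pattern, the checkpointing described above is just recursive snapshots plus rollback. Dataset and snapshot names here are hypothetical:

# zfs snapshot -r orapool/oradata@pre-step3   # checkpoint before each upgrade step
# ... run the upgrade step; if it goes wrong:
# zfs rollback -r orapool/oradata@pre-step3   # -r also discards any snapshots newer than the target

Snapshots take seconds to create and seconds to roll back, regardless of how large the database underneath is.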
Re: [zfs-discuss] Raidz vdev size... again.
Leon, RAIDZ2 is roughly equivalent to RAID6: two disks of parity data, allowing a double drive failure while still having the pool available. If possible though, you would be best to let the 3ware controller expose the 16 disks as a JBOD to ZFS and create a RAIDZ2 within Solaris, as you will then gain the full benefits of ZFS: block self-healing etc. There isn't an issue in using a larger number of disks in a RAIDZ2, just that it is not the optimal size. Expect longer rebuild times for larger vdevs in a zpool (although this is proportional to how full the pool is). Two parity disks give you greater cover in the event of a drive failing in a large vdev stripe. /Scott Leon Meßner wrote: Hi, I'm new to the list so please bear with me. This isn't an OpenSolaris-related problem but I hope it's still the right list to post to. I'm on the way to moving a backup server to ZFS-based storage, but I don't want to spend too many drives on parity (the 16 drives are attached to a 3ware RAID controller so I could also just use RAID6 there). I want to be able to sustain two parallel drive failures, so I need raidz2. The man page of zpool says the recommended vdev size is somewhere between 3-9 drives (for raidz). Is this just for getting the best performance or are there stability issues? There won't be anything like heavy multi-user IO on this machine, so couldn't I just put all 16 drives in one raidz2 and have all the benefits of ZFS without sacrificing 2 extra drives to parity (compared to RAID6)? Thanks in advance, Leon -- ___ Scott Lawson Systems Architect Manukau Institute of Technology Information Communication Technology Services Private Bag 94006 Manukau City Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
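For what it's worth, the single wide stripe Leon describes is a one-liner once the controller exposes the disks as a JBOD. Device names are invented:

# zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0

ZFS accepts a 16-wide raidz2 without complaint; the trade-off, as the rest of this thread discusses, is purely rebuild time and random IO performance.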
Re: [zfs-discuss] Raidz vdev size... again.
Michael Shadle wrote: On Mon, Apr 27, 2009 at 5:32 PM, Scott Lawson scott.law...@manukau.ac.nz wrote: One thing you haven't mentioned is the drive type and size that you are planning to use, as this greatly influences what people here would recommend. RAIDZ2 is built for big, slow SATA disks, as reconstruction times in large RAIDZ's and RAIDZ2's increase the risk of vdev failure significantly due to the time taken to resilver to a replacement drive. Hot spares are your friend! Well these are Seagate 1.5TB SATA disks. So.. big slow disks ;) Then RAIDZ2 is your friend! The resilver time on a large RAIDZ2 stripe of these would be significant. The probability of another drive failing during this rebuild time is quite high. I have in my time seen numerous double disk failures in hardware-backed RAID5's resulting in complete volume failure. You did also state that this is a system to be used for backups? So availability is five 9's? Are you planning on using OpenSolaris or mainstream Solaris 10? Mainstream Solaris 10 is more conservative and is capable of being placed under a support agreement if need be. Nope. Home storage (DVDs, music, etc) - I'd be fine with mainstream Solaris; the only reason I went with SXCE was for the in-kernel CIFS, which I wound up not using anyway due to some weird bug. I have a v240 at home with a 12-bay D1000 chassis with 11 x 300GB SCSI's in a RAIDZ2 with 1 hot spare. Makes a great NAS for me. Mostly for photos and music, so the capacity is fine. Speed is very, very quick as these are 10K drives. I have a printing business on the side where we store customer images on this, and have gigabit to all the Macs that we use for Photoshop. The assurance that RAIDZ2 gives me allows me to sleep comfortably (coupled with daily snapshots ;)). I use S10 10/08 with Samba for my network clients. Runs like a charm. -- ___ Scott Lawson Systems Architect Manukau Institute of Technology Information Communication Technology Services Private Bag 94006 Manukau City Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
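Adding the hot spare mentioned above to an existing pool is similarly trivial; it sits idle until a device faults and resilvering to it begins. Names are invented:

# zpool add tank spare c3t5d0
# zpool status tank   # the spare shows up in its own section of the output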
Re: [zfs-discuss] Raidz vdev size... again.
Richard Elling wrote: Some history below... Scott Lawson wrote: Michael Shadle wrote: On Mon, Apr 27, 2009 at 4:51 PM, Scott Lawson scott.law...@manukau.ac.nz wrote: If possible though, you would be best to let the 3ware controller expose the 16 disks as a JBOD to ZFS and create a RAIDZ2 within Solaris, as you will then gain the full benefits of ZFS: block self-healing etc. There isn't an issue in using a larger number of disks in a RAIDZ2, just that it is not the optimal size. Expect longer rebuild times for larger vdevs in a zpool (although this is proportional to how full the pool is). Two parity disks give you greater cover in the event of a drive failing in a large vdev stripe. Hmm, this is a bit disappointing to me. I would have dedicated only 2 disks out of 16 then to a single large raidz2 instead of two 8-disk raidz2's (meaning 4 disks went to parity). No, I was referring to a single RAIDZ2 vdev of 16 disks in your pool. So you would lose ~2 disks to parity effectively. The larger the stripe, potentially the slower the rebuild. If you had multiple vdevs in a pool that were smaller stripes you would get less performance degradation by virtue of IO isolation. Of course here you lose pool capacity. With smaller vdevs, you could also potentially just use RAIDZ and not RAIDZ2, and then you would have the equivalent size pool still with two parity disks, one per vdev. A few years ago, Sun introduced the X4500 (aka Thumper) which had 48 disks in the chassis. Of course, the first thing customers did was to make a single-level 46 or 48 disk raidz set. The second thing they did was complain that the resulting performance sucked. So the solution was to try and put some sort of practical limit into the docs to help people not hurt themselves. After much research (down at the pub? :-) the recommendation you see in the man page was the consensus. It has absolutely nothing to do with correctness of design or implementation. It has everything to do with setting expectations of goodness. Sure, I understand this. I was a beta tester for the J4500 because I prefer SPARC systems mostly for Solaris. Certainly for these large disk systems the preferred layout of around 5-6 drives per vdev is what I use on my assortment of *4500 series devices. My production J4500's with 48 x 1 TB drives yield around 31 TB usable. A T5220, 10 Gig attached, will pretty much saturate the 3Gb/s SAS HBA connecting it to the J4500. ;) Being that this is a home NAS for Michael serving large contiguous files with fairly low random access requirements, most likely I would imagine that these rules of thumb can be relaxed a little. As you state, they are a rule of thumb for generic loads. This list does appear to be attracting people wanting to use ZFS for home, and capacity tends to be a bigger requirement than performance. As I always advise people: test with *your* workload, as *your* requirements may be different to the next man's. If you favor capacity over performance, then a larger vdev of a dozen or so disks will work 'OK' in my experience. (I do routinely get referred to Sun customers in NZ as a site that actually uses ZFS in production and doesn't just play with it.) I have tested the aforementioned Thumpers with just this sort of config myself, with varying results on varying workloads. Video servers, Sun email etc... A long time ago now. I also have hardware-backed RAID6's consisting of 16 drives in 6000 series storage on Crystal firmware, which work just fine in the hardware RAID world (where I want capacity over speed). 
This is real-world, production-class stuff. Works just fine. I have ZFS overlaid on top of this as well. But it is good that we are emphasizing the trade-offs that any config has. Everyone can learn from these sorts of discussions. ;) One thing you haven't mentioned is the drive type and size that you are planning to use, as this greatly influences what people here would recommend. RAIDZ2 is built for big, slow SATA disks, as reconstruction times in large RAIDZ's and RAIDZ2's increase the risk of vdev failure significantly due to the time taken to resilver to a replacement drive. Hot spares are your friend! The concern with large drives is unrecoverable reads during resilvering. One contributor to this is superparamagnetic decay, where the bits are lost over time as the medium tries to revert to a more steady state. To some extent, periodic scrubs will help repair these while the disks are otherwise still good. At least one study found that this can occur even when scrubs are done, so there is an open research opportunity to determine the risk and recommend scrubbing intervals. To a lesser extent, hot spares can help reduce the hours it may take to physically repair the failed drive. +1 I was still operating under the impression that vdevs larger than 7-8 disks typically make baby Jesus nervous. You did also state that this is a system to be used for backups?
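To make the layout trade-off concrete, the multiple-vdev alternative for the same 16 disks is a single create with two raidz2 groups. Device names are invented:

# zpool create tank \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
    raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0

Four disks of parity instead of two, but writes stripe across both vdevs and a resilver only has to walk half the pool, which is the IO isolation point made above.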
Re: [zfs-discuss] Can this be done?
Michael Shadle wrote: On Tue, Apr 7, 2009 at 5:22 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: No. The two vdevs will be load shared rather than creating a mirror. This should double your multi-user performance. Cool - now a follow-up: when I attach this new raidz2, will ZFS auto-rebalance data between the two, or will it keep the other one empty and do some sort of load balancing between the two for future writes only? Future writes only, as far as I am aware. You will however potentially get increased IO. (The total increase will depend on controller layouts etc.) Is there a way (perhaps a scrub? or something?) to get the data spread around to both? No. You could back up and restore though. (Or, if you have a small number of big files, you could I guess copy them around inside the pool to get them rebalanced.) -- ___ Scott Lawson Systems Architect Manukau Institute of Technology Information Communication Technology Services Private Bag 94006 Manukau City Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
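The copy-around trick is nothing fancier than a rewrite, since newly written blocks are allocated across all vdevs, including the new one. Paths are invented:

# cp /tank/media/big.iso /tank/media/big.iso.new
# mv /tank/media/big.iso.new /tank/media/big.iso

Bear in mind that the old blocks stay referenced by any existing snapshots, so this only rebalances space that isn't held by a snapshot.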
Re: [zfs-discuss] Can this be done?
Michael Shadle wrote: On Mon, Mar 30, 2009 at 4:13 PM, Michael Shadle mike...@gmail.com wrote: Sounds like a reasonable idea, no? Follow-up question: can I add a single disk to the existing raidz2 later on (if somehow I found more space in my chassis), so instead of a 7-disk raidz2 (5+2) it becomes a 6+2? No. There is no way to expand a RAIDZ or RAIDZ2 at this point. It is a feature that is often discussed and that people would like, but it has been seen by Sun as more of a feature home users would like rather than enterprise users. Enterprise users are expected to buy 4 or more disks, create another RAIDZ2 vdev, and add it to the pool to increase space. You would of course have this option. However, by the time you fill it there might be a solution. Adam Leventhal proposed a way this could be implemented on his blog, so I suspect at some point in the next few years somebody will implement it and you will possibly have the option to do so then (after an OS and ZFS version upgrade). http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z Thanks... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can this be done?
Bob Friesenhahn wrote: On Sat, 28 Mar 2009, Michael Shadle wrote: Well this is for a home storage array for my DVDs and such. If I have to turn it off to swap a failed disk that's fine. It does not need to be highly available and I do not need extreme performance like a database, for example. 45MB/sec would even be acceptable. I can see that 14 disks costs a lot for a home storage array, but to you the data on your home storage array may be just as important as data on some business's enterprise storage array. In fact, it may be even more critical, since it seems unlikely that you will have an effective backup system in place like large businesses do. The main problem with raidz1 is that if a disk fails and you replace it, then if a second disk substantially fails during resilvering (which needs to successfully read all data on the remaining disks), your ZFS pool (or at least part of the files) may be toast. The more data which must be read during resilvering, the higher the probability that there will be a failure. If 12TB of data needs to be read to resilver a 1TB disk, then that is a lot of successful reading which needs to go on. This is a very good point for anyone following this and wondering why RAIDZ2 is a good idea. I have seen over the years several large RAID5 hardware arrays go belly up as a second drive fails during a rebuild, with the end result of the entire RAID set being rendered useless. If you can afford it then you should use it. RAID6 or RAIDZ2 was made for big SATA drives. If you do use it though, make sure you have a reasonable CPU, as it does require a bit more grunt to run than RAIDZ. The bigger the disks and the bigger the stripe, the more likely you are to encounter an issue during the rebuild of a failed drive. Plain and simple. In order to lessen risk, you can schedule a periodic zfs scrub via a cron job, so that there is less probability of encountering data which cannot be read. This will not save you from entirely failed disk drives though. As for Tim's post that NOBODY recommends anything better than RAID5, I hardly consider companies like IBM and NetApp to be NOBODY. Only Sun RAID hardware seems to lack RAID6, but Sun offers ZFS's raidz2 so it does not matter. Plenty of Sun hardware comes with RAID6 support out of the box these days, Bob. Certainly all of the 4140, 4150, 4240 and 4250 two-socket x86/x64 systems have hardware controllers for this. Also, all of the 6140, 6540 and 6780 disk arrays have RAID6 if they have Crystal firmware, and of course the Open Storage 7000 series machines do as well, being that they are OpenSolaris and ZFS based. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ -- _ Scott Lawson Systems Architect Information Communication Technology Services Manukau Institute of Technology Private Bag 94006 South Auckland Mail Centre Manukau 2240 Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz __ perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);' __ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
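The periodic scrub mentioned above is a one-line affair in root's crontab; the pool name is invented and the Sunday-3am schedule is picked arbitrarily:

0 3 * * 0 /usr/sbin/zpool scrub tank

zpool scrub returns immediately and the scrub proceeds in the background; check progress and any repaired errors with zpool status tank.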
Re: [zfs-discuss] ZFS on a SAN
Grant, Yes, this is correct. If host A goes belly up, you can deassign the LUN from host A and assign it to host B. Because host A has not gracefully exported its zpool, you will need to 'zpool import -f poolname' to force the import, since the pool was never exported due to the unexpected inaccessibility of host A. It is possible to have the LUN visible to both machines at the same time, just not in use by both machines. This is in general how clusters work. Be aware that if you do this and access the disk on both systems, you run a very real risk of corruption of the volume. I use the first approach here quite regularly in what I call 'poor man's clustering'. ;) I tend to install all my software and data environments on SAN-based LUNs, which allows ease of moving just by exporting the zpool, reassigning the LUN, then importing it on the new system. Works well as long as the target system is at the same OS revision or greater. /Scott. Grant Lowe wrote: Hi Erik, A couple of questions about what you said in your email. In synopsis 2, if hostA has gone belly up and is no longer accessible, then a step that is implied (or maybe I'm just inferring it) is to go to the SAN and reassign the LUN from hostA to hostB. Correct? - Original Message From: Erik Trimble erik.trim...@sun.com To: Grant Lowe gl...@sbcglobal.net Cc: zfs-discuss@opensolaris.org Sent: Wednesday, March 11, 2009 1:42:06 PM Subject: Re: [zfs-discuss] ZFS on a SAN I'm not 100% sure what your question here is, but let me give you a (hopefully) complete answer: (1) ZFS is NOT a clustered file system, in the sense that it is NOT possible for two hosts to have the same LUN mounted at the same time, even if both are hooked to a SAN and can normally see that LUN. (2) ZFS can do failover, however. If you have a LUN from a SAN on hostA, create a ZFS pool on it, and use as normal. Should you wish to fail over the LUN to hostB, you need to do a 'zpool export zpool' on hostA, then 'zpool import zpool' on hostB. If hostA has been lost completely (hung/died/etc) and you are unable to do an 'export' on it, you can force the import on hostB via 'zpool import -f zpool'. ZFS requires that you import/export entire POOLS, not just filesystems. So, given what you seem to want, I'd recommend this: On the SAN, create (2) LUNs - one for your primary data, and one for your snapshots/backups. On hostA, create a zpool on the primary data LUN (call it zpool A), and another zpool on the backup LUN (zpool B). Take snapshots on A, then use 'zfs send' and 'zfs receive' to copy the clone/snapshot over to zpool B, then 'zpool export B'. On hostB, import the snapshot pool: 'zpool import B'. It might just be as easy to have two independent zpools on each host, and just do a 'zfs send' on hostA and a 'zfs receive' on hostB to copy the snapshot/clone over the wire. -Erik On Wed, 2009-03-11 at 13:18 -0700, Grant Lowe wrote: Hi All, I'm new to ZFS, so I hope this isn't too basic a question. I have a host where I set up ZFS. The Oracle DBAs did their thing and I now have a number of ZFS datasets with their respective clones and snapshots on serverA. I want to export some of the clones to serverB. Do I need to zone serverB to see the same LUNs as serverA? Or does it have to have preexisting, empty LUNs to import the clones? Please help. Thanks. 
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- ___ Scott Lawson Systems Architect Manukau Institute of Technology Information Communication Technology Services Private Bag 94006 Manukau City Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
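Condensed, the 'poor man's clustering' dance is just this (pool name invented):

hostA# zpool export apppool     # graceful case: flushes everything and marks the pool exported
hostB# zpool import apppool
hostB# zpool import -f apppool  # hostA died without exporting, so force the import

Just be certain hostA really is down before forcing; importing one pool on two hosts at once will corrupt it, as noted above.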
Re: [zfs-discuss] Comstar production-ready?
Jacob Ritorto wrote: Caution: I built a system like this and spent several weeks trying to get iscsi share working under Solaris 10 u6 and older. It would work fine for the first few hours, but then performance would start to degrade, eventually becoming so poor as to actually cause panics on the iscsi initiator boxes. Couldn't find resolution through the various Solaris knowledge bases. The closest I got was to find out that there's a problem only in the *Solaris 10* iscsi target code that incorrectly frobs some counter when it shouldn't, violating the iscsi target specifications. The problem is fixed in Nevada/OpenSolaris. Can't say I have had a problem myself. The initiator is the default Microsoft Vista initiator. Mine has been running fine for at least 6-9 months. It doesn't get an absolute hammering though. I'm using it to provide extra storage to IT staff desktops who need it, and at the same time allowing staff to play with iSCSI. I run a largish fibre channel shop and mostly prefer that anyway. Long story short, I tried OpenSolaris 2008.11 and the iscsi crashes ceased and things ran smoothly. Not the solution I was hoping for, since this was to eventually be a prod box, but then Sun announced that I could purchase OpenSolaris support, so I was covered. On OS, my two big filers have been running really nicely for months and months now. If you get Solaris 10 support, Sun will provide fixes for that too, I imagine. Again, can't say I have had a problem myself. But as I mentioned in my previous email, I can't stress enough how important it is to *test* a solution in your environment with *your* workload on the hardware/OS *you* choose. The Sun try-and-buy on hardware is a great way to do this relatively risk free. If it doesn't work, send it back. Also, they have Startup Essentials, which will potentially allow you to get the try-and-buy hardware on the cheap if you are a new customer. Don't try to use Solaris 10 as a filer OS unless you can identify and resolve the iscsi target issue. If iSCSI is truly broken then one could log a support call on this, if you take basic maintenance. This is cheaper than RHEL for the entry level stuff, by the way... flame Being that this is a Linux shop you are selling into, OpenSolaris might be the best way to go, as the GNU userland might be more familiar to them and they might not understand having to change their shell paths to get the userland that they want ;) /flame On Wed, Mar 4, 2009 at 2:47 AM, Scott Lawson scott.law...@manukau.ac.nz wrote: Stephen Nelson-Smith wrote: Hi, I recommended a ZFS-based archive solution to a client needing to have a network-based archive of 15TB of data in a remote datacentre. I based this on an X2200 + J4400, Solaris 10 + rsync. This was enthusiastically received, to the extent that the client is now requesting that their live system (15TB data on cheap SAN and Linux LVM) be replaced with a ZFS-based system. The catch is that they're not ready to move their production systems off Linux - so web, db and app layer will all still be on RHEL 5. At some point I am sure you will convince them to see the light! ;) As I see it, if they want to benefit from ZFS at the storage layer, the obvious solution would be a NAS system, such as a 7210, or something built from a JBOD and a head node that does something similar. The 7210 is out of budget - and I'm not quite sure how it presents its storage - is it NFS/CIFS? The 7000 series devices can present NFS, CIFS and iSCSI. 
Looks very nice if you need a nice GUI, don't know the command line, or need nice analytics. I had a play with one the other day and am hoping to get my mitts on one shortly for testing. I would like to give it a real good crack with VMware for VDI VMs. If so, presumably it would be relatively easy to build something equivalent, but without the (awesome) interface. For sure the above gear would be fine for that. If you use standard Solaris 10 10/08 you have NFS and iSCSI ability directly in the OS, and also available to be supported via a support contract if needed. The best bet would probably be NFS for the Linux machines, but you would need to test in *their* environment with *their* workload. The interesting alternative is to set up Comstar on SXCE, create zpools and volumes, and make these available either over a fibre infrastructure, or iSCSI. I'm quite excited by this as a solution, but I'm not sure if it's really production ready. If you want a fibre channel target then you will need to use OpenSolaris or SXDE, I believe. It's not available in mainstream Solaris yet. I am personally waiting till then, when it has been *well* tested in the bleeding edge community. I have too much data to take big risks with it. What other options are there, and what advice/experience can you share? I do very similar stuff here with J4500's and T2K's for compliance archives, NFS and iSCSI targets for Windows machines.
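On plain Solaris 10 of that vintage, standing up an iSCSI target is a couple of commands, assuming the iSCSI target packages are installed (pool and volume names below are invented):

# zfs create -V 100g tank/vol1        # create a 100 GB zvol
# zfs set shareiscsi=on tank/vol1     # export the zvol as an iSCSI target
# iscsitadm list target               # confirm the target exists

COMSTAR later replaced this shareiscsi mechanism, but for a pre-COMSTAR Solaris 10 box this is the supported path.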
Re: [zfs-discuss] Comstar production-ready?
Right on the money there, Bob. Without knowing more detail about the client's workload it would be hard to advise either way. I would imagine, based purely on the small amount of info around the client's apps and workload, that NFS would most likely be the appropriate solution on top of ZFS. You will make more efficient use of your ZFS storage this way, and provide all the niceties like snapshots and rollbacks from the Solaris-based filer whilst still maintaining Linux front ends. Do take heed of the various list posts around ZFS/NFS with certain types of workloads, however. Bob Friesenhahn wrote: On Wed, 4 Mar 2009, Stephen Nelson-Smith wrote: The interesting alternative is to set up Comstar on SXCE, create zpools and volumes, and make these available either over a fibre infrastructure, or iSCSI. I'm quite excited by this as a solution, but I'm not sure if it's really production ready. While this is indeed exciting, the solutions you have proposed vary considerably in the type of functionality they offer. Comstar and iSCSI provide access to a storage volume similar to SAN storage. This volume is then formatted with some alien filesystem which is unlikely to support the robust features of ZFS. Even though the storage volume is implemented in robust ZFS, the client still has the ability to scramble its own filesystem. ZFS snapshots can help defend against that by allowing the entire content of the storage volume to be rewound to a former point in time. With the NFS/CIFS server model, only ZFS is used. There is no dependence on a client filesystem. With the Comstar/iSCSI approach, you are balkanizing (http://en.wikipedia.org/wiki/Balkanization) your storage so that each client owns its own filesystems, without the ability to share the data unless the client does it. With the native ZFS server approach, all clients share the pool storage and can share files on the server if the server allows it. A drawback of the native ZFS server approach is that the server needs to know about the users on the clients in order to support access control. Regardless, there are cases where Comstar/iSCSI makes the most sense, and cases where the ZFS fileserver makes the most sense. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ -- ___ Scott Lawson Systems Architect Manukau Institute of Technology Information Communication Technology Services Private Bag 94006 Manukau City Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
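NFS sharing from ZFS, for comparison, needs no /etc/dfs/dfstab fiddling at all; the share is a dataset property (dataset name and network below are invented):

# zfs create tank/export
# zfs set sharenfs=rw tank/export                 # or e.g. sharenfs=rw=@10.1.0.0/16 to restrict clients

The share follows the dataset around, surviving reboots and pool export/import on another host.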
Re: [zfs-discuss] Comstar production-ready?
Stephen Nelson-Smith wrote: Hi, I recommended a ZFS-based archive solution to a client needing to have a network-based archive of 15TB of data in a remote datacentre. I based this on an X2200 + J4400, Solaris 10 + rsync. This was enthusiastically received, to the extent that the client is now requesting that their live system (15TB data on cheap SAN and Linux LVM) be replaced with a ZFS-based system. The catch is that they're not ready to move their production systems off Linux - so web, db and app layer will all still be on RHEL 5. At some point I am sure you will convince them to see the light! ;) As I see it, if they want to benefit from ZFS at the storage layer, the obvious solution would be a NAS system, such as a 7210, or something built from a JBOD and a head node that does something similar. The 7210 is out of budget - and I'm not quite sure how it presents its storage - is it NFS/CIFS? The 7000 series devices can present NFS, CIFS and iSCSI. Looks very nice if you need a nice GUI, don't know the command line, or need nice analytics. I had a play with one the other day and am hoping to get my mitts on one shortly for testing. I would like to give it a real good crack with VMware for VDI VMs. If so, presumably it would be relatively easy to build something equivalent, but without the (awesome) interface. For sure the above gear would be fine for that. If you use standard Solaris 10 10/08 you have NFS and iSCSI ability directly in the OS, and also available to be supported via a support contract if needed. The best bet would probably be NFS for the Linux machines, but you would need to test in *their* environment with *their* workload. The interesting alternative is to set up Comstar on SXCE, create zpools and volumes, and make these available either over a fibre infrastructure, or iSCSI. I'm quite excited by this as a solution, but I'm not sure if it's really production ready. If you want a fibre channel target then you will need to use OpenSolaris or SXDE, I believe. It's not available in mainstream Solaris yet. I am personally waiting till then, when it has been *well* tested in the bleeding edge community. I have too much data to take big risks with it. What other options are there, and what advice/experience can you share? I do very similar stuff here with J4500's and T2K's for compliance archives, NFS and iSCSI targets for Windows machines. Works fine for me. The biggest system is 48TB on a J4500 for Veritas NetBackup DDT staging volumes. Very good throughput indeed. Perfect in fact, based on the large files that are created in this environment. One of these J4500's can keep 4 LTO4 drives in an SL500 saturated with data on a T5220. (4 streams at ~160 MB/sec) I think you have pretty much the right idea though. Certainly if you use Sun kit you will be able to deliver a commercially supported solution for them. Thanks, S. -- _ Scott Lawson Systems Architect Information Communication Technology Services Manukau Institute of Technology Private Bag 94006 South Auckland Mail Centre Manukau 2240 Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz __ perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);' __ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on SAN?
Hi Andras, No problem writing directly. Answers inline below. (If there are any typos it's because it's late and I have had a very long day ;)) andras spitzer wrote: Scott, Sorry for writing you directly, but most likely you have missed my questions regarding your SW design; whenever you have time, would you reply to that? I really value your comments and appreciate it, as it seems you have great experience with ZFS in a professional environment, and this is something not so frequent today. That was my e-mail, in response to your e-mail (it's in the thread): Scott, That is an awesome reference you wrote. I totally understand and agree with your idea of having everything redundant (dual path, redundant switches, dual controllers) at the SAN infrastructure. I would have some questions about the SW design you use, if you don't mind. - are you using MPxIO as DMP? Yes, configured via 'stmsboot'. I have used Sun MPxIO for quite a few years now and have found it works well (it was the SAN Foundation Kit for many years). - as I understood from your e-mail, all of your ZFS pools are ZFS mirrored? (you don't have a non-redundant ZFS configuration) Certainly the ones that are built from SAN-based disk. No, there are no non-redundant ZFS configurations. All storage is doubled up. Expensive, but we tend to stick to modular storage for this and spread the cost over many years. The storage budget is at least 50% of the systems group infrastructure budget. There are many other ZFS file systems which aren't SAN attached and are in mirrors, RAIDZ's etc. I mentioned the Lokis, aka J4500, which are in RAIDZ's. Very nice, and they have worked very reliably so far. I would strongly advocate these units for ZFS if you want a lot of disk reasonably cheaply that performs well... - why did you decide to use ZFS mirror instead of ZFS raidz or raidz2? Because we already have hardware-based RAID5 from our arrays (Sun 3510, 3511, 6140's). The ZFS file systems are used mostly for mirroring purposes, but also to take advantage of the other nice things ZFS brings, like snapshots, cloning, clone promotions etc. - you have RAID5-protected LUNs from the SAN, and you put a ZFS mirror on top of them? Yes. Covered above, I think. Could you please share some details about your configuration regarding SAN redundancy vs ZFS redundancy (I guess you use both here), and also some background on why you decided to go with that? I've been doing it for many years. Not just with ZFS, but UFS and VxFS as well, and also quite a large number of NTFS machines. We have two geographically separate data centers which are a few kilometers apart, with redundant dark fibre links over different routes. All core switches are in a full mesh with two cores per site, each with a redundant connection to the two cores at the other site, one via each route. We believe strongly that storage is the key to our business. Servers are but processing to work the data and are far easier to replace. We tend to standardize on particular models and then buy a bunch of 'em, and not necessarily maintenance for them. There are a lot of key things to building a reliable data center. I have been having a lively discussion on this with Toby and Richard, which has been raising some interesting points. I do firmly believe in getting things right from the ground up. I start with power and environment. Storage comes next in my book. Regards, sendai One point I'm really interested in is that it seems you deploy ZFS with ZFS mirror even when you have RAID redundancy at the HW/SAN level, which means extra costs to you obviously. 
I'm looking for a fairly decisive opinion on whether it is safe to use a ZFS configuration without redundancy when you have RAID redundancy in your high-end SAN, or whether you would still go with ZFS redundancy (ZFS mirror in your case, not even raidz or raidz2) because of the extra self-healing feature and the lowered risk of total pool failure? I think this has also been covered in recent list posts. The important thing is really to have two copies of blocks if you wish to be able to self-heal. The cost, I guess, is what value you place on the availability and reliability of your data. ZFS mirrors are faster for resilvering as well. Much, much faster in my experience. We recently used this during a data center move and rebuild. Our SAN fabric was extended to 3 sites and we moved blocks of storage one piece at a time and resynced them at the new location once they were in place, with 0% disruption to the business. I do think the Fishworks stuff is going to prove to be a game changer in the near future for many people, as it will offer many of the features we want in our storage. Once COMSTAR has been integrated into this line I might buy some. (I have a large investment in fibre channel, and I don't trust networking people as far as I can kick them when it comes to understanding the potential problems that can arise from disconnecting block targets that are coming in over an IP network.)
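The mirrored-across-arrays layout described above is plain zpool syntax, with one LUN from each array (or site) per mirror pair. Device names are invented; in practice, MPxIO WWN-based names would be much longer:

# zpool create dbpool mirror c4t0d0 c6t0d0 mirror c4t1d0 c6t1d0   # c4* from array A, c6* from array B

Lose an entire array (or data centre) and every vdev still has one healthy side; ZFS self-healing then repairs any stale blocks once the missing side returns and resilvers.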
Re: [zfs-discuss] qmail on zfs
Robert Milkowski wrote: Hello Asif, Wednesday, February 18, 2009, 1:28:09 AM, you wrote: AI On Tue, Feb 17, 2009 at 5:52 PM, Robert Milkowski mi...@task.gda.pl wrote: Hello Asif, Tuesday, February 17, 2009, 7:43:41 PM, you wrote: AI Hi All AI Does anyone have any experience of running qmail on Solaris 10 with ZFS only? AI I would appreciate it if you would share your findings, suggestions and gotchas. It just works. AI Is there any performance penalty over UFS? I did some testing years ago and I honestly do not remember - nevertheless we migrated to ZFS, so either it wasn't slower or it was faster. I run exim (which is a pretty similar sort of MTA) on 2-3 year old X4200's with ZFS mirrored on local SAS drives. These perform better than they did with UFS due to the ARC. I have two of these. Each one manages to process around 300-500K inbound messages per day quite easily, with SpamAssassin running on them as well. The spec is dual dual-core Opteron with 4 GB RAM and 4 x 73 GB 10K 2.5" SAS disks. As I think Tony mentioned, the best thing to do is to test with your specific workload. I don't think you will see a drop in performance at all. (And you will gain so many other lovely things that ZFS brings ;)) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
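None of this needs exotic tuning, though a couple of commonly suggested tweaks for mail spools are worth trying (these are general suggestions, not necessarily what the posters above run; dataset names are invented):

# zfs set atime=off mailpool/spool        # skip an access-time update on every message read
# zfs set compression=on mailpool/spool   # message text compresses well

Both are safe to flip on a live filesystem; compression only affects data written afterwards, while atime takes effect immediately.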
Re: [zfs-discuss] ZFS on SAN? Availability edition.
Robin, From recollection, the business case for investment in power protection technology was relatively simple. We calculated what the downtime per hour was worth and how frequently it happened. We used to have several if not more incidents per year, and they would cause major system outages, with over 1000 staff and multiple remote sites depending on our data center (now data centers, plural). Calculate the cost per hour for staff wages alone and it becomes quite easy to justify. (I am not even going to factor in loss of reputation and the media, or our most important customers: our students.) I cannot *stress* just how important power and environment protection is to data. It is the main consideration I take into account when deploying new sites. (This discussion went off list yesterday and I was mentioning these same things there.) My analogy here is: what would be the first thing NASA designs into a new spacecraft? Life support. Without it you don't even leave the ground. Electricity *is* the lifeblood of available storage. Case in point: last year we had an arsonist set fire to a critical point in our campus infrastructure, which burnt down a building that just happened to have one of the main communication and power trenches running through it. It knocked out around 5 buildings on that campus for two weeks. Immense upheaval and disruption followed. Our brand new DR data center was on that site. It kept running because of redundant fibre paths to the SAN switches and core routers, so that we could still provide service to the rest of the campus and maintain active DR to our primary site. Emergency power via generator was also available until main power could be rerouted to the data center. This stuff does happen. When you have been around for a while you see it. I will take a look at the TwinStrata website (as should others). Sorry to all if we are diverging too much from zfs-discuss. /Scott Robin Harris wrote: Calculating the availability and economic trade-offs of configurations is hard. Rule of thumb seems to rule. I recently profiled an availability/reliability tool on StorageMojo.com that uses Bayesian analysis to estimate datacenter availability. You can quickly (minutes, not days) model systems and compare availability and recovery times as well as OpEx and CapEx implications. One hole: AFAIK, ZFS isn't in their product catalog. There's a free version of the tool at http://www.twinstrata.com/ Feedback on the tool from this group is invited. Robin StorageMojo.com Date: Tue, 17 Feb 2009 21:36:38 -0800 From: Richard Elling richard.ell...@gmail.com To: Toby Thain t...@telegraphics.com.au Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] ZFS on SAN? Message-ID: 499b9e66.2010...@gmail.com Content-Type: text/plain; charset=ISO-8859-1; format=flowed Toby Thain wrote: Not at all. You've convinced me. Your servers will never, ever lose power unexpectedly. Methinks living in Auckland has something to do with that :-) http://en.wikipedia.org/wiki/1998_Auckland_power_crisis When services are reliable, then complacency brings risk. My favorite example recently is the levees in New Orleans. Katrina didn't top the levees, they were undermined. 
-- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- ___ Scott Lawson Systems Architect Manukau Institute of Technology Information Communication Technology Services Private Bag 94006 Manukau City Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on SAN? Availability edition.
Miles Nordin wrote: sl == Scott Lawson scott.law...@manukau.ac.nz writes: sl Electricity *is* the lifeblood of available storage. I never meant to suggest computing machinery could run without electricity. My suggestion is, if your focus is _reliability_ rather than availability, meaning you don't want to lose the contents of a pool, you should think about what happens when power goes out, not just how to make sure power Never goes out Ever Absolutely because we Paid and our power is PERFECT. My focus is on both. And I understand that nothing is ever perfect, only that one should strive for it if possible. But when one lives in a place like NZ where our power grid is creaky, it starts becoming a real liability that needs mitigation, that's all. I am sure there are plenty of ZFS users in the same boat. * pools should not go corrupt when power goes out. Absolutely agree. * UPS does not replace the need for NVRAMs to have batteries in them, because there are things between the UPS and the NVRAM like cords and power supplies, the UPSes themselves are not reliable enough if you have only one, and the controller containing the NVRAM may need to be hard-booted because of bugs. Fully understand this too. If, as I do, you use hardware RAID arrays behind zpool vdevs, then it is very important that this stuff is maintained, that the batteries backing the RAID array write caches are good, and that you have power available to allow them to flush cache to disk before the batteries go flat. This is certainly true of any file system that is built upon LUNs from hardware-backed RAID arrays. * supplying superexpensive futuristic infallible fancypower to all disk shelves does not mean the SYNC CACHE command can be thrown out. Maybe the power is still not infallible, or maybe there will be SAN outages or blown controllers or shelves with junky software in them that hang the whole array when one drive goes bad. In general this is why I use mirrored vdevs with LUNs provided from two different arrays, geographically isolated; hopefully less likely to be a problem. But yes, anything that ignores SYNC CACHE could pose a serious problem if it is hidden from ZFS by an array controller. If you really care about availability: * reliability crosses into availability if you are planning to have fragile pools backed by a single SAN LUN, which may become corrupt if they lose power. Maybe you're planning to destroy the pool and restore from backup in that case, and you have some carefully-planned offsite backup hierarchy that's always recent enough to capture all the data you care about. But a restore could take days, which turns two minutes of unavailable power into one day of unavailable data. If there were no reliability problem causing pool loss during power loss, two minutes of unavailable power maybe means 10 minutes of unavailable data. Agreed, and this is why I would recommend against a single hardware RAID SAN LUN for a zpool. At bare minimum for this you would want to use copies=2 if you really care about your data. If you don't care about the data then no problem, go ahead. I do use zpools for transient data that I don't care about, where I favor capacity over resiliency. (The main thing I want for these is L2ARC; think squid proxy server caches.) * there are reported problems with systems that take hours to boot up, e.g. with thousands of filesystems, snapshots, or nfs exports, which isn't exactly a reliability problem, but is a problem. That open issue falls into the above outage-magnification category, too. Have seen this myself. 
Not nice after a system reboot. Can't recall if I have seen it recently though? I seem to recall it was more around S10 U2 or U3. I just don't like the idea that people are building fancy space-age data centers and then thinking they can safely run crappy storage software that won't handle power outages, because they're above having to worry about all that little-guy nonsense. A big selling point of the last step forward in filesystems (metadata logging) was that they'd handle power failures with better consistency guarantees and faster reboots---at the time, did metadata logging appeal only to people with unreliable power? I hope not. I am just trying to put forward the perspective of a big user here. I have already generated numerous off-list posts from people wanting more info on the methodology that we like to use. If I can be of help to people I will. never mind those of us who find these filesystem features important because we'd like cheaper or smaller systems, with cords that we sometimes trip over, that are still useful. I think having such protections in the storage software, and having them actually fully working, not just imaginary or fragile, is always useful. Absolutely. It is all part of the big picture. Albeit probably *the* most important part. Consistency of your data is everything.
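The copies=2 fallback mentioned above is set per dataset and, like compression, only applies to data written after the property is set (dataset name invented):

# zfs set copies=2 tank/important   # keep two copies of every block on the single LUN

That gives ZFS a second copy to self-heal from on checksum errors, although it obviously will not survive the whole LUN disappearing.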
Re: [zfs-discuss] ZFS on SAN?
Hi All, I have been watching this thread for a while and thought it was time I chipped my 2 cents worth in. I have been an aggressive adopter of ZFS here across all of our Solaris systems and have found the benefits have far outweighed any small issues that have arisen. Currently I have many systems that have LUNs provided from SAN-based storage for zpools. All our systems are configured with mirrored vdevs, and the reliability has been as good as, if not greater than, UFS and LVM. My rules of thumb around systems tend to stem from getting the storage infrastructure right, as that generally leads to the best availability. To this end, for every single SAN-attached system we have dual paths to separate switches, and every array has dual controllers dual-pathed to different switches. ZFS may be more or less susceptible to any physical infrastructure problem, but in my experience it is on a par with UFS (and I gave up shelling out for VxFS long ago). The reason for the above configuration is that our storage is evenly split between two sites with dark fibre between them across redundant routes. This forms a ring configuration around 5 km in circumference. We have so much storage that we need this in case of a data center catastrophe. The business recognizes that the time-to-recovery risk would be so great that if we didn't, we would be out of business in the event of one of our data centres burning down or some other disaster. I have seen other people discussing power availability on other threads recently. If you want it, you can have it. You just need the business case for it. I don't buy the comments on UPS unreliability. Quite frequently I have rebooted arrays and removed them from mirrored vdevs and have not had any issues with the LUNs they provided reattaching and resilvering. Scrubs on the pools have always been successful. The largest single mirrored pool is around 11TB, which is formed from two 6140 RAID5's. We also use Loki boxes for very large storage pools which are routinely filled. (I was a beta tester for Loki.) I have two J4500's, one with 48 x 250 GB and one with 48 x 1 TB drives. No issues there either. The 48 x 1 TB is used in a Disk - Disk - Tape config with an SL500 to back up our entire site. It is routinely filled to the brim and it performs admirably, attached to a T5220 which is 10 gig connected. The systems I have mentioned vary from Samba servers to compliance archives, Oracle DB servers, Blackboard content stores, squid web caches, LDAP directory servers, mail stores, mail spools, and calendar server DBs. The list covers 60-plus systems. I have 0% Solaris older than Solaris 10. Why would you? In short, I hope people don't hold back from adoption of ZFS because they are unsure about it. Judge for yourself as I have done and dip your toes in at whatever rate you are happy with. That's what I did. /Scott. I also use it at home, with an old D1000 attached to a v120 with 8 x 320 GB SCSI's in a RAIDZ2, for all our home data and home business (which is a printing outfit that creates a lot of very big files on our Macs). 
-- ___ Scott Lawson Systems Architect Manukau Institute of Technology Information Communication Technology Services Private Bag 94006 Manukau City Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on SAN?
Toby Thain wrote: On 17-Feb-09, at 3:01 PM, Scott Lawson wrote: Hi All, ... I have seen other people discussing power availability on other threads recently. If you want it, you can have it. You just need the business case for it. I don't buy the comments on UPS unreliability.

Hi, I remarked on it. FWIW, my experience is that commercial data centres do not avoid 'unscheduled outages', no matter how many steely-eyed assurances they give. It seems rather imprudent to assume that power is never going to fail. No matter how many diesel generators, rooftop tanks, or pebble-bed reactors you have, somebody is inevitably going to kick out a plug... at least in most of the real world. --Toby

That's why you have two plugs, if not more. I still don't buy your argument. It comes down to procedural issues on the site when it comes to people kicking plugs out. Everything we have has dual power supplies, fed from dual power rails, fed from separate switchboards, through separate very large UPSes, backed by generators, fed by two substations, and then cloned to another data center 3 km away. HA is all about design. (I won't even comment on anything further up the stack than electricity.)

We have secure data centers with strict work practices and qualified staff following best practice for maintenance and risk management. I am far, far more worried about someone with root access typing 'zpool destroy' than I am about the lights going out in the data centers I designed, which house hundreds and hundreds of servers. ;) And no, we don't have unplanned outages. Not in a long time.

Not all people who design data centers know how to design power systems for them. Sometimes the IT people don't convey their requirements exactly enough to the electrical engineers. (I am an electrical engineer who got sidetracked by SunOS around '91 and never went back.)

Anyway, we diverge, I think. Maybe we can agree to disagree? Back to discussions about disk caddies and overpriced hardware... slightly closer to the topic at hand... ;)

-- ___ Scott Lawson Systems Architect Manukau Institute of Technology Information Communication Technology Services Private Bag 94006 Manukau City Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
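As an aside on the 'zpool destroy' nightmare: the on-disk labels usually survive the command itself, so on a reasonably current Solaris release a destroyed pool can often be brought back with zpool import -D, provided the underlying LUNs haven't been reused in the meantime. A minimal sketch, with 'tank' again a placeholder:

    # Someone with root fat-fingers the wrong pool...
    zpool destroy tank

    # ...but with no arguments, -D lists destroyed pools whose
    # labels are still intact and importable.
    zpool import -D

    # Re-import the destroyed pool by name (add -f if it reports
    # the pool as potentially in use elsewhere).
    zpool import -D tank

Not a substitute for access controls and backups, of course, but worth knowing before declaring the data gone.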
Re: [zfs-discuss] ZFS on SAN?
David Magda wrote: On Feb 17, 2009, at 21:35, Scott Lawson wrote: Everything we have has dual power supplies, fed from dual power rails, fed from separate switchboards, through separate very large UPSes, backed by generators, fed by two substations, and then cloned to another data center 3 km away. HA

http://www.geonet.org.nz/earthquake/quakes/recent_quakes.html

Ha. Yeah, that's why we were once known to the British as 'The Shaky Isles'. We do have lots of earthquakes around the Pacific Rim. We are in Auckland, however, which is north of all those little stars on the pic marking where the edge of the Pacific plate meets the Australian plate. So there are not too many earthquakes to worry about in Auckland compared to the rest of NZ, although one of the data centers I built recently was on the second floor of a building and had to be earthquake-restrained because we were potentially creating up to one-ton point loads on the floor.

The rest of NZ gets little and biggish quakes fairly often, so much so that the Aussies next door on the west island see fit to warn their citizens about the potential for earthquakes in NZ when visiting... Our capital city, Wellington, on the other hand, is built on fault lines... think San Francisco...

Now, a volcano in Auckland might be a different story... We have over 50 dormant cones, and a whopping big one in the main harbour called Rangitoto. Translated from the Maori, it means 'Blood Sky'. My UPSes won't protect from that one... ;)

I am far, far more worried about someone with root access typing 'zpool destroy' than I am about the lights going out in the data centers I designed, which house hundreds and hundreds of servers. ;)

Yeah, this is probably more likely.

-- ___ Scott Lawson Systems Architect Manukau Institute of Technology Information Communication Technology Services Private Bag 94006 Manukau City Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss