Re: [zfs-discuss] A few questions
On Sat, Jan 08, 2011 at 12:33:50PM -0500, Edward Ned Harvey wrote:
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Garrett D'Amore
>
> > When you purchase NexentaStor from a top-tier Nexenta Hardware Partner,
> > you get a product that has been through a rigorous qualification process
>
> How do I do this, exactly? I am serious. Before too long, I'm going to need another server, and I would very seriously consider reprovisioning my unstable Dell Solaris server to become a Linux or some other stable machine. The role it's currently fulfilling is the backup server, which basically does nothing except zfs receive from the primary Sun Solaris 10u9 file server. Since the role is just for backups, it's a perfect opportunity for experimentation, hence the Dell hardware with Solaris. I'd be happy to put some other configuration in there experimentally instead ... say ... Nexenta, assuming it will be just as good at zfs receive from the primary server.
>
> Is there some specific hardware configuration you guys sell? Or recommend? How about a Dell R510/R610/R710? Buy the hardware separately and buy NexentaStor as just a software product? Or buy a somehow more certified hardware + software bundle together?
>
> If I do encounter a bug, where the only known fact is that the system keeps crashing intermittently on an approximately weekly basis, and there is absolutely no clue what's wrong in hardware or software... how do you guys handle it?
>
> If you'd like to follow up offlist, that's fine. Then just email me at the email address: nexenta at nedharvey.com (I use disposable email addresses on mailing lists like this, so at any random unknown time, I'll destroy my present alias and start using a new one.)

Hey,

Other OSes have had problems with the Broadcom NICs as well. See for example this RHEL5 bug: https://bugzilla.redhat.com/show_bug.cgi?id=520888 "Host crashing probably due to MSI-X IRQs with bnx2 NIC"
And VMware vSphere ESX/ESXi 4.1 crashing with bnx2x: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1029368

So I guess there are firmware/driver problems affecting not just Solaris but also other operating systems..

-- Pasi

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
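The backup role Edward describes — a second box whose only job is zfs receive from the primary — can be sketched as an incremental replication loop. This is a minimal sketch, not anything from the thread; the pool, dataset, and host names are hypothetical, and it assumes a previously replicated snapshot named `@prev` exists on both sides:

```shell
#!/bin/sh
# Incremental ZFS replication sketch -- hypothetical names throughout.
SRC=tank/data          # dataset on the primary server
DST=backup/data        # dataset on the backup server
HOST=backupserver      # the box that only runs zfs receive

NOW=snap-$(date +%Y%m%d%H%M)
zfs snapshot "$SRC@$NOW"

# Send only the delta since the last replicated snapshot; -F on the
# receiving side rolls the target back to the common snapshot first.
zfs send -i "$SRC@prev" "$SRC@$NOW" | ssh "$HOST" zfs receive -F "$DST"
```

In practice a cron job would also rotate the snapshot names so that each run's `@NOW` becomes the next run's `@prev`.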
Re: [zfs-discuss] A few questions
> From: Pasi Kärkkäinen [mailto:pa...@iki.fi]
>
> Other OSes have had problems with the Broadcom NICs as well..

Yes. The difference is, when I go to support.dell.com and punch in my service tag, I can download updated firmware and drivers for RHEL that (at least supposedly) solve the problem. I haven't tested it, but the Dell support guy told me it has worked for RHEL users. There is nothing available to download for Solaris.

Also, the bcom is not the only problem on that server. After I added on an Intel network card and disabled the bcom, the weekly crashes stopped, but now it's ... I don't know ... once every 3 weeks with a slightly different mode of failure. This is, yet again, rare enough that the system could very well pass a certification test, but not rare enough for me to feel comfortable putting it into production as a primary mission-critical server.

I really think there are only two ways in the world to engineer a good solid server:

(a) Smoke your own crack. Systems engineering teams use the same systems that are sold to customers.

or (b) Sell millions of 'em. So regardless of whether the engineering team uses them, you're still going to have sufficient mass to dedicate engineers to the purpose of post-sales bug solving.

I suppose there is a third way, which has certainly happened in history but is not very applicable to me: simply charge such ridiculously high prices for your servers that you can dedicate engineers to post-sales bug solving, even if you only sold a handful of those systems in the whole world. Things like munitions-strength Cray and AlphaServers have sometimes fit into this category in the past.

I do feel confident assuming that Solaris kernel engineers use Sun servers primarily for their server infrastructure. So I feel safe buying this configuration. The only thing to gain by buying something else is lower prices... or maybe some obscure fringe detail that I can't think of.
Re: [zfs-discuss] A few questions
On Jan 9, 2011, at 4:19 PM, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

> Yes. The difference is, when I go to support.dell.com and punch in my service tag, I can download updated firmware and drivers for RHEL that (at least supposedly) solve the problem. ... There is nothing available to download for Solaris.

The drivers are written by Broadcom and are, AFAIK, closed source. By going through Dell, you are going through a middle-man. See, for example, http://www.broadcom.com/support/ethernet_nic/netxtremeii10.php where you see the release of the Solaris drivers was at the same time as Windows.

> I really think there are only two ways in the world to engineer a good solid server:
>
> (a) Smoke your own crack. Systems engineering teams use the same systems that are sold to customers.

This is rarely practical, not to mention that product development is often not in the systems engineering organization.

> or (b) Sell millions of 'em. So despite whether or not the engineering team uses them, you're still going to have sufficient mass to dedicate engineers to the purpose of post-sales bug solving.

yes, indeed :-)

-- richard
Re: [zfs-discuss] A few questions
Just to add a bit to this; I just love sweeping generalizations...

On 9 Jan 2011, at 19:33, Richard Elling wrote:

> The drivers are written by Broadcom and are, AFAIK, closed source. By going through Dell, you are going through a middle-man. For example, http://www.broadcom.com/support/ethernet_nic/netxtremeii10.php where you see the release of the Solaris drivers was at the same time as Windows.

What Richard says is true. Broadcom have been a source of contention in the Linux world as well as the *BSD world due to the proprietary nature of their firmware. OpenSolaris/Solaris users are not the only ones who have complained about this. There's been much uproar in the FOSS community about Broadcom and their drivers. As a result, I've seen some pretty nasty hacks, like people linking the Windows drivers into their kernel - *gack*. I forget all the gory details, but it was rather disgusting as I recall: bubblegum, baling wire, duct tape and all.

Dell and Red Hat aren't exactly a marriage made in heaven either. I've had problems getting support from both Dell and Red Hat, with them pointing fingers at each other rather than solving the problem. Like most people, I've had to come up with my own work-arounds - like others with the Broadcom issue, using a known-quantity NIC. When dealing with Dell as a corporate buyer, they have always made it quite clear that they are primarily a Windows platform. Linux? Oh yes, we have that too...
> Also, the bcom is not the only problem on that server. After I added on an Intel network card and disabled the bcom, the weekly crashes stopped, but now it's ... I don't know ... once every 3 weeks with a slightly different mode of failure.

I've never been particularly warm and fuzzy with Dell servers. They seem to like to change their chipsets slightly while a model is in production. This can cause all sorts of problems which are difficult to diagnose, since an identical Dell system will have no problems while its mate crashes weekly.

As for certified systems, it's my understanding that Nexenta themselves don't certify anything. They have systems which are recommended and supported by their network of VARs. It just so happens that SuperMicro is one of the brands of choice, but even then one must adhere to a fairly tight HCL. The same holds true for Solaris/OpenSolaris with third-party hardware. SATA controllers and multiplexers are another example of the drivers being written by the manufacturer, and Solaris/OpenSolaris are not a priority over Windows and Linux, in that order. Deviating to items which are not somewhat plain vanilla and are not listed on the HCL is just asking for trouble.
Mike

---
Michael Sullivan
michael.p.sulli...@me.com
http://www.kamiogi.net/
Mobile: +1-662-202-7716
US Phone: +1-561-283-2034
JP Phone: +81-50-5806-6242
Re: [zfs-discuss] A few questions
> As for certified systems, it's my understanding that Nexenta themselves don't certify anything. They have systems which are recommended and supported by their network of VARs.

The certified solutions listed on Nexenta's website were certified by Nexenta.
Re: [zfs-discuss] A few questions
On 01/ 6/11 05:28 AM, Edward Ned Harvey wrote:

> How did you learn about the Broadcom issue for the first time? I had to learn the hard way, and with all the involvement of both Dell and Oracle support teams, nobody could tell me what I needed to change. ... Next time I buy a server, I do not have confidence to simply expect Solaris on Dell to work reliably. The same goes for Solaris derivatives, and all non-Sun hardware. There simply is not an adequate qualification and/or support process.

When you purchase NexentaStor from a top-tier Nexenta Hardware Partner, you get a product that has been through a rigorous qualification process, which includes the hardware and software configuration matched together and tested with an extensive battery. You also can get a higher level of support than is offered to people who build their own systems. Oracle is *not* the only company capable of performing in-depth testing of Solaris.

I also know enough about problems that Oracle customers (or rather Sun customers) faced with Solaris on Sun hardware -- such as the terrible NVIDIA ethernet problems on first-generation U20 and U40 systems, or the Marvell SATA problems on Thumper -- to know that your picture of Oracle isn't nearly as rosy as you believe. Of course, I also lived (as a Sun employee) through the UltraSPARC-II ECC fiasco...

- Garrett
Re: [zfs-discuss] A few questions
On Thu, Jan 6, 2011 at 11:36 PM, Garrett D'Amore garr...@nexenta.com wrote:

> When you purchase NexentaStor from a top-tier Nexenta Hardware Partner, you

Where is the list? Is this the one on http://www.nexenta.com/corp/technology-partners-overview/certified-technology-partners ?

> get a product that has been through a rigorous qualification process which includes the hardware and software configuration matched together, tested with an extensive battery. You also can get a higher level of support than is offered to people who build their own systems. Oracle is *not* the only company capable of performing in depth testing of Solaris.

Does this roughly mean I can expect similar (or even better) hardware compatibility support with NexentaStor on SuperMicro as Solaris on Oracle/Sun hardware?

-- Fajar
Re: [zfs-discuss] A few questions
On 08.01.11 18:33, Edward Ned Harvey wrote:

> How do I do this, exactly? I am serious. Before too long, I'm going to need another server, and I would very seriously consider reprovisioning my unstable Dell Solaris server to become a linux or some other stable machine. ... If I do encounter a bug, where the only known fact is that the system keeps crashing intermittently on an approximately weekly basis, and there is absolutely no clue what's wrong in hardware or software... How do you guys handle it?

Hmm… that'd interest me as well. I have 4 Dell PE R610s running OSol or Sol11Expr. I actually bought a Sun Fire X4170 M2, since I couldn't get my R610s stable, just as Edward points out.

So, if you guys think that NexentaStor avoids these issues, then I'd seriously consider jumping ship - so either please don't continue offlist, or please include me in that conversation. ;)

Cheers,
budy
Re: [zfs-discuss] A few questions
On 01/ 8/11 10:43 AM, Stephan Budach wrote:

> On 08.01.11 18:33, Edward Ned Harvey wrote:
> > If I do encounter a bug, where the only known fact is that the system keeps crashing intermittently on an approximately weekly basis, and there is absolutely no clue what's wrong in hardware or software... How do you guys handle it?

Such problems are handled on a case-by-case basis. Usually we can do some analysis from a crash dump, but not always. My team includes several people who are experienced with such analysis, and when problems like this occur, we are called into action. Ultimately this usually results in a patch, sometimes workaround suggestions, and sometimes even binary relief (which happens faster than a regular patch, but without the deeper QA).

- Garrett
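The crash-dump analysis Garrett mentions typically starts with the Solaris modular debugger, mdb. As a rough sketch (the file names assume the default savecore location and dump numbering; on systems that write compressed vmdump files, the dump must first be expanded with savecore -f), a first-pass triage session might look like this:

```shell
# Hypothetical paths; savecore writes numbered unix.N/vmcore.N pairs here.
cd /var/crash/$(hostname)

mdb unix.0 vmcore.0 <<'EOF'
::status      # panic string and dump summary
::msgbuf      # last kernel messages before the crash
::stack       # stack trace of the panicking thread
::panicinfo   # register state at panic time
EOF
```

The panic string and stack trace from a session like this are usually what a support team asks for first, since they narrow the fault to a driver or subsystem.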
Re: [zfs-discuss] A few questions
On 06/01/2011 00:14, Edward Ned Harvey wrote:

> And guess what solaris engineers don't use? Non-sun hardware. Pretty safe bet you won't find any Dell servers in the server room where solaris developers do their thing.

You would lose that bet: not only would you find Dell, you would find many other big names, as well as white-box hand-built systems too. Solaris developers use a lot of different hardware - Sun never made laptops, so many of us have Apple (running Solaris on the metal and/or under virtualisation) or Toshiba or Fujitsu etc. laptops. There are also many workstations around the company that aren't Sun hardware, as well as servers.

--
Darren J Moffat
Re: [zfs-discuss] A few questions
I've deployed large SANs on both SuperMicro 825/826/846 and Dell R610/R710s and I've not found any issues so far. I always make a point of installing Intel chipset NICs on the Dells and disabling the Broadcom ones, but other than that it's always been plain sailing - hardware-wise anyway. I've always found that the real issue is formulating SOPs to match what the organisation is used to with legacy storage systems, educating the admins who will manage it going forward, and doing the technical hand-over to folks who may not know, or want to know, a whole lot of *nix land. My 2p. YMMV.

---
W. A. Khushil Dep - khushil@gmail.com - 07905374843
Windows - Linux - Solaris - ZFS - Nexenta - Development - Consulting  Contracting
http://www.khushil.com/ - http://www.facebook.com/GlobalOverlord

On 6 January 2011 00:14, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

> From: Richard Elling [mailto:richard.ell...@nexenta.com]
>
> > I'll agree to call Nexenta a major commercial interest, in regards to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware.
>
> NexentaStor is officially supported on Dell, HP, and IBM hardware. The only question is, what is your definition of 'support'? Many NexentaStor

I don't want to argue about this, but I'll just try to clarify what I meant: presently, I have a Dell server with officially supported Solaris, and it's as unreliable as pure junk. It's just the backup server, so I'm free to frequently create and destroy it... and as such, I frequently do recreate and destroy it. It is entirely stable running RHEL (CentOS), because Dell and Red Hat have a partnership with a serious number of human beings and machines looking for and fixing any compatibility issues. For my Solaris instability, I blame the fact that Solaris developers don't do significant quality assurance on non-Sun hardware.
To become officially compatible, the whole qualification process is like this: somebody installs it, doesn't see any problems, and then calls it certified. They reformat with something else, and move on. They don't build their business on that platform, so they don't detect stability issues like the ones reported - system crashes once per week and so forth. Solaris therefore passes the test, and becomes one of the options available on the drop-down menu of OSes for a new server. (Of course that's been discontinued by Oracle, but that's how it was in the past.)

Developers need to eat their own food. Smoke your own crack. Hardware engineers at Dell need to actually use your OS on their hardware, for their development efforts. I would be willing to bet Sun hardware engineers use a significant percentage of Solaris servers for their work... And guess what Solaris engineers don't use? Non-Sun hardware. Pretty safe bet you won't find any Dell servers in the server room where Solaris developers do their thing.

If you want to be taken seriously as an alternative storage option, you've got to at LEAST be listed as a factory-distributed OS that is an option to ship with the new server, and THEN, when people such as myself buy those things, we've got to have a good enough experience that we don't all bitch and flame about it afterward. Nexenta, you need a real and serious partnership with Dell, HP, IBM. Get their developers to run YOUR OS on the servers which they use for development. Get them to sell your product bundled with their product. And dedicate real and serious engineering to bugfixes, working with customers to truly identify root causes of instability, with a real OS development and engineering and support group. It's got to be STABLE; that's the #1 requirement. I previously made the comparison... even closed-source Solaris ZFS is a better alternative to closed-source NetApp WAFL.
So for now, those are the only two enterprise-supportable options I'm willing to stake my career on, and I'll buy Sun hardware with Solaris. But I really wish I could feel confident buying a cheaper Dell server and running ZFS on it.

Nexenta, if you make yourself look like a serious competitor against Solaris, and really truly form an awesome stable partnership with Dell, I will happily buy your stuff instead of Oracle's, even if you are a little behind in feature offering. But I will not buy your stuff if I can't feel perfectly confident in its stability. Ever heard the phrase "Nobody ever got fired for buying IBM"? You're the little guys. If you want to compete against the big guys, you've got to kick ass. And don't get sued into oblivion. Even today's feature set is perfectly adequate for at least a couple of years to come. If you put all your effort into stability and bugfixes, serious partnerships with Dell, HP, IBM, and become extremely professional-looking and stable, with fanatical support... You don't have to
Re: [zfs-discuss] A few questions
> From: Richard Elling [mailto:richard.ell...@nexenta.com]
>
> If I understand correctly, you want Dell, HP, and IBM to run OSes other

I agree, but neither Dell, HP, nor IBM develop Windows...

> I'm not sure of the current state, but many of the Solaris engineers develop on laptops and Sun did not offer a laptop product line. You will find them where Nexenta developers live :-)
>
> Wait a minute... this is patently false. The big storage vendors: NetApp, EMC, Hitachi, Fujitsu, LSI... none run on HP, IBM, or Dell servers.

Like I said, not interested in arguing. This is mostly just a bunch of contradictions to what I said. To each his own. My conclusion is that I am not willing to stake my career on the underdog alternative when I know I can safely buy the Sun hardware and Solaris. I experimented once by buying Solaris on Dell. It was a proven failure, but that's why I did it on a cheap, noncritical backup system experimentally before expecting it to work in production. Haven't seen any underdog proven solid enough for me to deploy in the enterprise yet.
Re: [zfs-discuss] A few questions
This is a silly argument, but...

> Haven't seen any underdog proven solid enough for me to deploy in enterprise yet.

I haven't seen any overdog proven solid enough for me to be able to rely on either. Certainly not Solaris. Don't get me wrong, I like(d) Solaris. But every so often you'd find a bug and they'd take an age to fix it (or to declare that they wouldn't fix it). In one case we had 18 months between reporting a problem and Sun fixing it. In another case it was around 3 months, and because we happened to have the source code, we even told them where the bug was and what a fix could be. Solaris (and the other overdogs) are worth it when you want someone else to do the grunt work and someone else to point at and blame, but let's not romanticize how good it or any of the others are. What made Solaris (10 at least) worth deploying were its features (DTrace, ZFS, SMF, etc.).

Julian

--
Julian King
Computer Officer, University of Cambridge, Unix Support
Re: [zfs-discuss] A few questions
> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
>
> On Wed, 5 Jan 2011, Edward Ned Harvey wrote:
>
> > with regards to ZFS and all the other projects relevant to solaris.) I know in the case of SGE/OGE, it's officially closed source now. As of Dec 31st, sunsource is being decommissioned, and the announcement of officially closing the SGE source and decommissioning the open source community went out on Dec 24th. So all of this leads me to believe, with very little reservation, that the new developments beyond zpool 28 are closed source moving forward. There's very little breathing room remaining for hope of that being open sourced again.
>
> I have no idea what you are talking about. Best I can tell, SGE/OGE is a reference to Sun Grid Engine, which has nothing to do with zfs. The only announcement and discussion I can find via Google is written by you. It was pretty clear even a year ago that Sun Grid Engine was going away.

Agreed, SGE/OGE has nothing to do with ZFS, unless you believe there's an Oracle culture which might apply to both. The only thing written by me, as I recall, included links to the original official announcements. Following those links now, I see the archives have been decommissioned. So there ya go. Since it's still in my inbox, I just saved a copy for you here... It is long-winded, and the main points are: SGE (now called OGE) is officially closed-source, and sunsource.net decommissioned. There is an open source fork, which will not share code development with the closed-source product.

http://dl.dropbox.com/u/543241/SGE_officially_closed/GE%20users%20GE%20announce%20Changes%20for%20a%20Bright%20Future%20at%20Oracle.txt
Re: [zfs-discuss] A few questions
> From: Khushil Dep [mailto:khushil@gmail.com]
>
> I've deployed large SAN's on both SuperMicro 825/826/846 and Dell R610/R710's and I've not found any issues so far. I always make a point of installing Intel chipset NIC's on the DELL's and disabling the Broadcom ones but other than that it's always been plain sailing - hardware-wise anyway.

"Not found any issues" - except the Broadcom one, which causes the system to crash regularly in the default factory configuration. How did you learn about the Broadcom issue for the first time? I had to learn the hard way, and with all the involvement of both Dell and Oracle support teams, nobody could tell me what I needed to change. We literally replaced every component of the server twice over a period of 1 year, and I spent man-days upgrading and downgrading firmwares, randomly trying to find a stable configuration. I scoured the internet to find this little tidbit about replacing the Broadcom NIC, randomly guessed, and replaced my NIC with an Intel card to make the problem go away. The same system doesn't have a problem running RHEL/CentOS. What will be the new problem in the next line of servers? Why, during my internet scouring, did I find a lot of other reports of people who needed to disable C-states (didn't work for me), and lots of false leads indicating a firmware downgrade would fix my Broadcom issue?

See my point? Next time I buy a server, I do not have confidence to simply expect Solaris on Dell to work reliably. The same goes for Solaris derivatives, and all non-Sun hardware. There simply is not an adequate qualification and/or support process.
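The "disable the Broadcom ports" workaround that keeps coming up in this thread can also be done from the OS side on Solaris-family systems. A rough sketch, assuming the Broadcom ports attach as bnx instances (the interface names here are examples; check your own with dladm first, and note that many people instead simply disable the onboard ports in the BIOS, which avoids touching driver configuration at all):

```shell
# List physical NICs and the drivers bound to them,
# e.g. bnx0/bnx1 (Broadcom) vs. e1000g0 (Intel).
dladm show-phys

# Unplumb the Broadcom interfaces so they carry no traffic.
ifconfig bnx0 unplumb
ifconfig bnx1 unplumb

# Optionally unbind the bnx driver entirely so it never attaches
# again at boot (reversible later with add_drv).
rem_drv bnx
```

On older Solaris 10 builds without `dladm show-phys`, `dladm show-dev` gives a similar listing.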
Re: [zfs-discuss] A few questions
Two fold really - firstly I remember the headaches I used to have configuring Broadcom cards properly under Debian/Ubuntu but the sweetness that was using an Intel NIC. Bottom line for me was that I know Intel drivers have been around longer than Broadcom drivers and thus it would make sense to ensure that we had Intel NIC's on the server. Secondly, I asked Andy Bennett from Nexenta who told me it would make sense - always good to get a second opinion :-) There were/are reports all over Google about Broadcom issues with Solaris/OpenSolaris so I didn't want to risk it. For a couple of hundred for a quad port gig NIC - it's worth it when the entire solution is 90K+. Sometimes (like the issue with bus-resets when some brands/firmware-revs of SSD's are used) the knowledge comes from people you work with (Nexenta rode to the rescue here again - plug! plug! plug!) :-) These are deployed in a couple of Universities and a very large data capture/marketing company I used to work for and I know it works really well and (plug! plug! plug) I know the dedicated support I got from the Nexenta guys. The difference as I see it is that OpenSolaris/ZFS/DTrace/FMA allow you to build your own solution to your own problem. Thinking of storage in a completely new way instead of just a block of storage it becomes an integrated part of performance engineering - certainly has been for the last two installs I've been involved in. I know why folks want a Certified solution with the likes of Dell/HP etc but from my point of view (and all points of view are valid here), I know I can deliver a cheaper, more focussed (and when I say that I'm not just doing some marketing bs) solution for the requirement at hand. It's sometimes a struggle to get customers/end-users to think of storage as more than just storage. There's quite a lot of entrenched thinking to get around/over in our field (try getting a Java dev to think clearly about thread handling and massive SMP drawbacks for example). 
Anyway - not trying to engage in an argument but it's always interesting to find out why someone went for certain solutions over others. My 2p. YMMV. *goes off to collect cheque from Nexenta* ;-) --- W. A. Khushil Dep - khushil@gmail.com - 07905374843 Windows - Linux - Solaris - ZFS - Nexenta - Development - Consulting - Contracting http://www.khushil.com/ - http://www.facebook.com/GlobalOverlord On 6 January 2011 13:28, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: Khushil Dep [mailto:khushil@gmail.com] I've deployed large SAN's on both SuperMicro 825/826/846 and Dell R610/R710's and I've not found any issues so far. I always make a point of installing Intel chipset NIC's on the DELL's and disabling the Broadcom ones but other than that it's always been plain sailing - hardware-wise anyway. not found any issues, except the broadcom one which causes the system to crash regularly in the default factory configuration. How did you learn about the broadcom issue for the first time? I had to learn the hard way, and with all the involvement of both Dell and Oracle support teams, nobody could tell me what I needed to change. We literally replaced every component of the server twice over a period of 1 year, and I spent man-days upgrading and downgrading firmwares randomly trying to find a stable configuration. I scoured the internet to find this little tidbit about replacing the broadcom NIC, and randomly guessed, and replaced my nic with an intel card to make the problem go away. The same system doesn't have a problem running RHEL/centos. What will be the new problem in the next line of servers? Why, during my internet scouring, did I find a lot of other reports, of people who needed to disable c-states (didn't work for me) and lots of false leads indicating firmware downgrade would fix my broadcom issue? See my point? Next time I buy a server, I do not have confidence to simply expect solaris on dell to work reliably. 
The same goes for solaris derivatives, and all non-sun hardware. There simply is not an adequate qualification and/or support process.
Re: [zfs-discuss] A few questions
On Jan 5, 2011, at 7:44 AM, Edward Ned Harvey wrote: From: Khushil Dep [mailto:khushil@gmail.com] We do have a major commercial interest - Nexenta. It's been quiet but I do look forward to seeing something come out of that stable this year? :-) I'll agree to call Nexenta a major commercial interest, in regards to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware. NexentaStor is officially supported on Dell, HP, and IBM hardware. The only question is, what is your definition of 'support'? Many NexentaStor customers today appear to be deploying on SuperMicro and Quanta systems, for obvious cost reasons. Nexenta has good working relationships with these major vendors and others. As for investment, Nexenta has hired, and continues to hire, the best engineers and professional services people we can find. We see a lot of demand in the market and have been growing at an astonishing rate. If you'd like to contribute to making software storage solutions rather than whining about what Oracle won't discuss, check us out and send me your resume :-) -- richard
Re: [zfs-discuss] A few questions
On Jan 5, 2011, at 4:14 PM, Edward Ned Harvey wrote: From: Richard Elling [mailto:richard.ell...@nexenta.com] I'll agree to call Nexenta a major commercial interest, in regards to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware. NexentaStor is officially supported on Dell, HP, and IBM hardware. The only question is, what is your definition of 'support'? Many NexentaStor I don't want to argue about this, but I'll just try to clarify what I meant: Presently, I have a dell server with officially supported solaris, and it's as unreliable as pure junk. It's just the backup server, so I'm free to frequently create and destroy it... And as such, I frequently do recreate and destroy it. It is entirely stable running RHEL (centos) because Dell and RedHat have a partnership with a serious number of human beings and machines looking for and fixing any compatibility issues. For my solaris instability, I blame the fact that solaris developers don't do significant quality assurance on non-sun hardware. To become officially compatible, the whole qualification process is like this: Somebody installs it, doesn't see any problems, and then calls it certified. They reformat with something else, and move on. They don't build their business on that platform, so they don't detect stability issues like the ones reported... System crashes once per week and so forth. Solaris therefore passes the test, and becomes one of the options available on the drop-down menu for OSes with a new server. (Of course that's been discontinued by oracle, but that's how it was in the past.) If I understand correctly, you want Dell, HP, and IBM to run OSes other than Microsoft and RHEL. For the thousands of other OSes out there, this is a significant barrier to entry. 
One can argue that the most significant innovations in the past 5 years came from none of those companies -- they came from Google, Apple, Amazon, Facebook, and the other innovators who did not spend their efforts trying to beat Microsoft and get into the manufacturing floor of the big vendors. Developers need to eat their own food. I agree, but neither Dell, HP, nor IBM develops Windows... Smoke your own crack. Hardware engineers at Dell need to actually use your OS on their hardware, for their development efforts. I would be willing to bet Sun hardware engineers use a significant percentage of solaris servers for their work... And guess what solaris engineers don't use? Non-sun hardware. I'm not sure of the current state, but many of the Solaris engineers develop on laptops and Sun did not offer a laptop product line. Pretty safe bet you won't find any Dell servers in the server room where solaris developers do their thing. You will find them where Nexenta developers live :-) If you want to be taken seriously as an alternative storage option, you've got to at LEAST be listed as a factory-distributed OS that is an option to ship with the new server, and THEN, when people such as myself buy those things, we've got to have a good enough experience that we don't all bitch and flame about it afterward. Wait a minute... this is patently false. The big storage vendors: NetApp, EMC, Hitachi, Fujitsu, LSI... none run on HP, IBM, or Dell servers. Nexenta, you need a real and serious partnership with Dell, HP, IBM. Get their developers to run YOUR OS on the servers which they use for development. Get them to sell your product bundled with their product. And dedicate real and serious engineering into bugfixes working with customers, to truly identify root causes of instability, with a real OS development and engineering and support group. It's got to be STABLE, that's the #1 requirement. There are many marketing activities in progress towards this end. 
One of Nexenta's major OEMs (Compellent) is being purchased by Dell. The deal is not done, so there is no public information on future plans, to my knowledge. I previously made the comparison... Even closed-source solaris ZFS is a better alternative to closed-source netapp wafl. So for now, those are the only two enterprise supportable options I'm willing to stake my career on, and I'll buy Sun hardware with Solaris. But I really wish I could feel confident buying a cheaper Dell server and running ZFS on it. Nexenta, if you make yourself look like a serious competitor against solaris, and really truly form an awesome stable partnership with Dell, I will happily buy your stuff instead of Oracle. Even if you are a little behind in feature offering. But I will not buy your stuff if I can't feel perfectly confident in its stability. I can assure you that we take stability very seriously. And since you seem to think the big box vendors are infallible, a sampling of those things we (Nexenta) have to live with:
Re: [zfs-discuss] A few questions
From: Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com To: 'Khushil Dep' khushil@gmail.com Cc: Richard Elling richard.ell...@nexenta.com, zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions From: Khushil Dep [mailto:khushil@gmail.com] I've deployed large SAN's on both SuperMicro 825/826/846 and Dell R610/R710's and I've not found any issues so far. I always make a point of installing Intel chipset NIC's on the DELL's and disabling the Broadcom ones but other than that it's always been plain sailing - hardware-wise anyway. not found any issues, except the broadcom one which causes the system to crash regularly in the default factory configuration. How did you learn about the broadcom issue for the first time? I had to learn the hard way, and with all the involvement of both Dell and Oracle support teams, nobody could tell me what I needed to change. We literally replaced every component of the server twice over a period of 1 year, and I spent man-days upgrading and downgrading firmwares randomly trying to find a stable configuration. I scoured the internet to find this little tidbit about replacing the broadcom NIC, and randomly guessed, and replaced my nic with an intel card to make the problem go away. 20 years of doing this c*(# has taught me that most things only get learned the hard way. I certainly won't bet my career solely on the ability of the vendor to support the product, because they're hardly omniscient. Testing, testing, and generous return policies (and/or R&D budget). The same system doesn't have a problem running RHEL/centos. Then you're not pushing it hard enough, or your stars are just aligned nicely. We have massive piles of Dell hardware, all types. Running CentOS since at least 4.5. Every single one of those Dells has an Intel NIC in it, and the Broadcoms disabled in the BIOS. 
Because every time we do something stupid like let ourselves think "oh, we could maybe use those extra Broadcom ports for X", we get burned. High-volume financial trading system. Blew up on the bcoms. Didn't matter what driver or tweak or fix. Plenty of man-days wasted debugging. Went with the net advice, put in an Intel NIC. No more problems. That was 3 years ago. Thought we could use the bcoms for our fileservers. Nope. Thought we could use the bcoms for the dedicated drbd links for our xen cluster. Nope. And we know we're not alone in this evaluation. We could have spent forever chasing support to get someone to fix it I suppose... but we have better things to do. See my point? Next time I buy a server, I do not have confidence to simply expect solaris on dell to work reliably. The same goes for solaris derivatives, and all non-sun hardware. There simply is not an adequate qualification and/or support process. I'm not convinced ANYONE really has such a thing. Or that it's even necessarily possible. In fact, I'm sure they don't. Cuz that's what it says in the fine print on the support contracts and the purchase agreements - we do not guarantee... I just prefer not to have any confidence for the most part. It's easier and safer. -bacon
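For anyone hitting the same wall, a quick sanity check before blaming (or exonerating) the onboard ports is to confirm which driver each interface is actually bound to. These are standard commands (`dladm` on Solaris/OpenSolaris, `ethtool` on Linux); the interface name `eth0` is just an example and will differ per machine:

```
# Solaris / OpenSolaris: list physical links and the driver behind each.
# Onboard Broadcom NetXtreme II ports typically show the bnx driver;
# an add-in Intel card shows up as a separate link (e.g. e1000g0).
dladm show-phys

# Linux (RHEL/CentOS): report driver and firmware for one interface.
# bnx2/bnx2x means onboard Broadcom; e1000e/igb means Intel.
ethtool -i eth0
```

Once the Intel card is confirmed as the active link, the onboard Broadcom ports can be disabled in the BIOS as several posters above describe.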
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tim Cook The claim was that there are more people contributing code from outside of Oracle than inside to zfs. Your contributions to Illumos do absolutely nothing Guys, please let's just say this much: To all those who are contributing to the open-source ZFS code, freebsd, illumos project, and others, thank you very much. :-) We know certain things are stable and production ready, but there has not yet been much forward development after zpool 28, but the effort is well appreciated, and for whatever comes next, yes we can all be patient. Right now, Oracle is not contributing at all to the open source branches of any of these projects. So right now it's fair to say the non-oracle contributions to the OPEN SOURCE ZFS outweighs the nonexistent oracle contributions. However, Oracle is continuing to develop the closed-source ZFS. I don't know if anyone has real numbers, dollars contributed or number of developer hours etc, but I think it's fair to say that oracle is probably contributing more to the closed source ZFS right now, than the rest of the world is contributing to the open source ZFS right now. Also, we know that the closed source ZFS right now is more advanced than the open source ZFS (zpool 31 vs 28). Oracle closed source ZFS is ahead, and probably developing faster too, than the open source ZFS right now. If anyone has any good way to draw more contributors into the open source tree, that would also be useful and appreciated. Gosh, it would be nice to get major players like Dell, HP, IBM, Apple contributing into that project.
Re: [zfs-discuss] A few questions
Edward Ned Harvey wrote I don't know if anyone has real numbers, dollars contributed or number of developer hours etc, but I think it's fair to say that oracle is probably contributing more to the closed source ZFS right now, than the rest of the world is contributing to the open source ZFS right now. Also, we know that the closed source ZFS right now is more advanced than the open source ZFS (zpool 31 vs 28). Oracle closed source ZFS is ahead, and probably developing faster too, than the open source ZFS right now. If anyone has any good way to draw more contributors into the open source tree, that would also be useful and appreciated. Gosh, it would be nice to get major players like Dell, HP, IBM, Apple contributing into that project. This is something that Illumos/open source ZFS needs to decide: what does it want? Effectively we can't innovate ZFS without breaking compatibility... because our Illumos zpool version 29 (if we innovate) will not be Oracle zpool version 29. If we want open-source ZFS to innovate, we need to make that choice and let everyone know; apart from submitting bug fixes to zpool v28, I'm not sure if other changes would be welcome. So honestly, do we want to innovate ZFS (I do) or do we just want to follow Oracle? Bye, Deano de...@cloudpixies.com
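The compatibility problem Deano describes comes from zpool versions being a single linear number: software can import a pool only if the pool's on-disk version is at or below the highest version that software implements, so two forks that each mint their own "version 29" with different meanings can no longer trust the check. A toy sketch of that rule (illustrative only, not ZFS code):

```python
# Toy model of the linear zpool version gate: software imports a pool
# only when the pool's on-disk version number does not exceed the
# highest version that software implements.

def can_import(pool_version: int, software_max_version: int) -> bool:
    """Return True if software supporting up to software_max_version
    can safely import a pool formatted at pool_version."""
    return pool_version <= software_max_version

# An open-source stack at zpool 28 can import any pool up to v28...
assert can_import(28, 28)
assert can_import(10, 28)
# ...but must refuse a v31 pool written by newer closed-source code,
# because it cannot know what the extra on-disk features mean.
assert not can_import(31, 28)
```

The fork problem follows directly: if Illumos and Oracle each define a different "version 29", the number alone no longer identifies the on-disk format, so a passing version check says nothing about actual feature compatibility.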
Re: [zfs-discuss] A few questions
From: Deano [mailto:de...@rattie.demon.co.uk] Sent: Wednesday, January 05, 2011 9:16 AM So honestly do we want to innovate ZFS (I do) or do we just want to follow Oracle? Well, you can't follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. I am quite sure you'll be sued if you do that. If you want forward development in the open source tree, you basically have only one option: Some major contributor must have a financial interest, and commit to a real concerted development effort, with their own roadmap, which is intentionally designed NOT to overlap with the Oracle roadmap. Otherwise, the code will stagnate. I am rooting for the open source projects, but I'm not optimistic personally. I think all major contributors (IBM, Apple, etc) will not participate for various reasons, and as a result, we'll experience bit rot... As presently evident by lack of zpool advancement beyond 28. So in my mind, Oracle and ZFS are now just like netapp and wafl. Well... I prefer Solaris and ZFS over netapp and wafl... So whenever I would have otherwise bought a netapp, I'll still buy the solaris server instead... But it's no longer a competitor against ubuntu or centos. Just the way Larry wants it.
Re: [zfs-discuss] A few questions
We do have a major commercial interest - Nexenta. It's been quiet but I do look forward to seeing something come out of that stable this year? :-) --- W. A. Khushil Dep - khushil@gmail.com - 07905374843 Visit my blog at http://www.khushil.com/ On 5 January 2011 14:34, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: Deano [mailto:de...@rattie.demon.co.uk] Sent: Wednesday, January 05, 2011 9:16 AM So honestly do we want to innovate ZFS (I do) or do we just want to follow Oracle? Well, you can't follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. I am quite sure you'll be sued if you do that. If you want forward development in the open source tree, you basically have only one option: Some major contributor must have a financial interest, and commit to a real concerted development effort, with their own roadmap, which is intentionally designed NOT to overlap with the Oracle roadmap. Otherwise, the code will stagnate. I am rooting for the open source projects, but I'm not optimistic personally. I think all major contributors (IBM, Apple, etc) will not participate for various reasons, and as a result, we'll experience bit rot... As presently evident by lack of zpool advancement beyond 28. So in my mind, Oracle and ZFS are now just like netapp and wafl. Well... I prefer Solaris and ZFS over netapp and wafl... So whenever I would have otherwise bought a netapp, I'll still buy the solaris server instead... But it's no longer a competitor against ubuntu or centos. Just the way Larry wants it.
Re: [zfs-discuss] A few questions
On Wed, Jan 5, 2011 at 15:34, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: Deano [mailto:de...@rattie.demon.co.uk] Sent: Wednesday, January 05, 2011 9:16 AM So honestly do we want to innovate ZFS (I do) or do we just want to follow Oracle? Well, you can't follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. that's not my understanding - while we will have to wait, oracle is supposed to release *some* source code afterwards to satisfy some claim or other. I agree, some would argue that that should have already happened with S11 express... I don't know it has, but that's not *the* release of S11, is it? And once the code is released, even if after the fact, it's not reverse-engineering anymore, is it? Michael PS: just in case: even while at Oracle, I had no insight into any of these plans, much less do I have now. -- regards/mit freundlichen Grüssen Michael Schuster
Re: [zfs-discuss] A few questions
-Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Michael Schuster Sent: Wednesday, January 05, 2011 9:42 AM To: Edward Ned Harvey Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions On Wed, Jan 5, 2011 at 15:34, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: Deano [mailto:de...@rattie.demon.co.uk] Sent: Wednesday, January 05, 2011 9:16 AM So honestly do we want to innovate ZFS (I do) or do we just want to follow Oracle? Well, you can't follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. that's not my understanding - while we will have to wait, oracle is supposed to release *some* source code afterwards to satisfy some claim or other. I agree, some would argue that that should have already happened with S11 express... I don't know it has, but that's not *the* release of S11, is it? And once the code is released, even if after the fact, it's not reverse-engineering anymore, is it? Not exactly. Oracle hasn't publicly committed to anything like that. There were several news articles last year referencing a leaked internal memo that I believe was more of a proposal than a plan. Even if Oracle did 'commit' to releasing code, they could easily just decide not to. -Will
Re: [zfs-discuss] A few questions
From: Michael Schuster [mailto:michaelspriv...@gmail.com] Well, you can't follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. that's not my understanding - while we will have to wait, oracle is supposed to release *some* source code afterwards to satisfy some Where do you get that from? AFAIK, there is no official word about oracle opening anything moving forward, but there are plenty of unofficial reports that it will not be opened. Nobody in the field is holding any hope for that to change anymore, most importantly illumos and nexenta. (At least with regards to ZFS and all the other projects relevant to solaris.) I know in the case of SGE/OGE, it's officially closed source now. As of Dec 31st, sunsource is being decommissioned, and the announcement of officially closing the SGE source and decommissioning the open source community went out on Dec 24th. So all of this leads me to believe, with very little reservation, that the new developments beyond zpool 28 are closed source moving forward. There's very little breathing room remaining for hope of that being open sourced again.
Re: [zfs-discuss] A few questions
From: Khushil Dep [mailto:khushil@gmail.com] We do have a major commercial interest - Nexenta. It's been quiet but I do look forward to seeing something come out of that stable this year? :-) I'll agree to call Nexenta a major commercial interest, in regards to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware. Otherwise, they're just simply too small to keep up with the rate of development of the closed source ZFS tree, and destined to be left in the dust. And if Nexenta does become a seriously viable competitor against netapp and oracle... Watch out for lawsuits...
Re: [zfs-discuss] A few questions
On 01/ 4/11 11:48 PM, Tim Cook wrote: On Tue, Jan 4, 2011 at 8:21 PM, Garrett D'Amore garr...@nexenta.com mailto:garr...@nexenta.com wrote: On 01/ 4/11 09:15 PM, Tim Cook wrote: On Mon, Jan 3, 2011 at 5:56 AM, Garrett D'Amore garr...@nexenta.com mailto:garr...@nexenta.com wrote: On 01/ 3/11 05:08 AM, Robert Milkowski wrote: On 12/26/10 05:40 AM, Tim Cook wrote: On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com mailto:richard.ell...@gmail.com wrote: There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now. Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS. Exactly my observation as well. I haven't seen any ZFS related development happening at Illumos or Nexenta, at least not yet. Just because you've not seen it yet doesn't imply it isn't happening. Please be patient. - Garrett Or, conversely, don't make claims of all this code contribution prior to having anything to show for your claimed efforts. Duke Nukem Forever was going to be the greatest video game ever created... we were told to be patient... we're still waiting for that too. Um, have you not been paying attention? I've delivered quite a lot of contribution to illumos already, just not in ZFS. Take a close look -- there almost certainly wouldn't *be* an open source version of OS/Net had I not done the work to enable this in libc, kernel crypto, and other bits. This work is still higher priority than ZFS innovation for a variety of reasons -- mostly because we need a viable and supportable illumos upon which to build those ZFS innovations. 
That said, much of the ZFS work I hope to contribute to illumos needs more baking, but some of it is already open source in NexentaStor. (You can for a start look at zfs-monitor, the WORM support, and support for hardware GZIP acceleration all as things that Nexenta has innovated in ZFS, and which are open source today if not part of illumos. Check out http://www.nexenta.org for source code access.) So there, money placed where mouth is. You? - Garrett The claim was that there are more people contributing code from outside of Oracle than inside to zfs. Your contributions to Illumos do absolutely nothing to backup that claim. ZFS-monitor is not ZFS code (it's an FMA module), WORM also isn't ZFS code, it's an OS level operation, and GZIP hardware acceleration is produced by Indra networks, and has absolutely nothing to do with ZFS. Does it help ZFS? Sure, but that's hardly a code contribution to ZFS when it's simply a hardware acceleration card that accelerates ALL gzip code. Um... you have obviously not looked at the code. Our WORM code is not some basic OS guarantees on top of ZFS, but modifications to the ZFS code itself so that ZFS *itself* honors the WORM property, which is implemented as a property on the ZFS filesystem. Likewise, the GZIP hardware acceleration support includes specific modifications to the ZFS kernel filesystem code. Of course, we've not done anything major to change the fundamental way that ZFS stores data... is that what you're talking about? I think you must have a very narrow idea of what constitutes an innovation in ZFS. So, great job picking three projects that are not proof of developers working on ZFS. And great job not providing any proof to the claim there are more developers working on ZFS outside of Oracle than within. Nexenta don't represent that majority actually. A large number of ZFS folks -- people with names like Leventhal, Ahrens, Wilson, and Gregg, are working on ZFS related work at Delphix and Joyent, or so I've been told. 
I don't have first hand knowledge of *what* the details are, but I'm looking forward to seeing the results. This ignores the contributions from people working on ZFS on other platforms as well. Of course, since I no longer work there, I don't really know how many people Oracle still has working on ZFS. They could have tasked 1,000 people with it. Or they could have shut the project down entirely. But of the people who had, up until Oracle shut down the open code, made non-trivial contributions to ZFS, I think the majority of *those* people can be found working outside of Oracle now, and I think most of them are still working on ZFS.
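Garrett's WORM point above is worth making concrete: the claim is that WORM is a property of the ZFS filesystem itself, so the filesystem layer refuses modification once the property is set, rather than relying on OS-level permissions layered on top. A toy model of that behavior (purely illustrative; the class and property names here are invented, not Nexenta's actual code):

```python
# Toy model: a filesystem-level WORM (write once, read many) property.
# Once "worm" is on, existing files become immutable at the filesystem
# layer itself -- writes are rejected regardless of OS permissions.

class ToyFilesystem:
    def __init__(self):
        self.properties = {"worm": "off"}
        self.files = {}

    def set_property(self, name, value):
        self.properties[name] = value

    def write(self, path, data):
        if self.properties["worm"] == "on" and path in self.files:
            # The filesystem, not the OS, enforces immutability.
            raise PermissionError(f"{path} is WORM-protected")
        self.files[path] = data

    def read(self, path):
        return self.files[path]

fs = ToyFilesystem()
fs.write("audit.log", "record 1")
fs.set_property("worm", "on")
assert fs.read("audit.log") == "record 1"   # reads always allowed
fs.write("new.log", "ok")                   # new files may still be created
try:
    fs.write("audit.log", "tampered")
    raise AssertionError("WORM write should have failed")
except PermissionError:
    pass
```

The point of pushing the check into the filesystem is exactly the one argued above: an OS-level guard can be bypassed by anyone with sufficient privilege, while a property honored by the filesystem code applies to every write path.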
Re: [zfs-discuss] A few questions
Edward Ned Harvey wrote From: Deano [mailto:de...@rattie.demon.co.uk] Sent: Wednesday, January 05, 2011 9:16 AM So honestly do we want to innovate ZFS (I do) or do we just want to follow Oracle? Well, you can't follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. I am quite sure you'll be sued if you do that. If you want forward development in the open source tree, you basically have only one option: Some major contributor must have a financial interest, and commit to a real concerted development effort, with their own roadmap, which is intentionally designed NOT to overlap with the Oracle roadmap. Otherwise, the code will stagnate. Why does it need a big backer? Erm ZFS isn't that large or amazingly complex code. It is *good* code but does it take 100s of developers and a fortune to develop? Erm nope (which I'd bet it never had at Sun either). Why not overlap Oracle? What has it got to do with Oracle if we have split into ZFS (Oracle) and OpenZFS in future? OpenZFS will get whatever features developers feel they want or need to develop for it. This is the fundamental choice that open source ZFS, illumos and OpenIndiana (and other distributions) have to decide: what is their purpose? Is it a free compatible (though trailing) version of Oracle Solaris OR a platform that shared an ancestor with Oracle Solaris via Sun OpenSolaris but now is its own evolutionary species, with no more connection than I have with a 15th cousin removed on my great, great, great grandfather's side? This isn't even a theoretical what-if situation for me, I have a major modification to ZFS (still being developed), it has no basis in Oracle's or anybody else's needs, just mine. It is what I felt I needed and ZFS was the right base for it. Now will that go into OpenZFS? 
Honestly I don't know yet, because I'm not sure it would be wanted (it will be incompatible with Oracle ZFS) and personally, commercially I'm not sure if it's the right move to open source the feature. I bet I'm not the only small developer out there in a similar situation, the landscape is very unclear about what the community actually wants to do going forward, and whether we will have or even want OpenZFS and Oracle ZFS or Oracle ZFS and 90% compatibles (always trailing) or Oracle ZFS + DevA ZFS + DevB ZFS + DevC ZFS. Bye, Deano de...@cloudpixies.com
Re: [zfs-discuss] A few questions
From: Richard Elling [mailto:richard.ell...@nexenta.com]
I'll agree to call Nexenta a major commercial interest, with regard to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware.
NexentaStor is officially supported on Dell, HP, and IBM hardware. The only question is, what is your definition of 'support'? Many NexentaStor
I don't want to argue about this, but I'll just try to clarify what I meant: Presently, I have a Dell server with officially supported Solaris, and it's as unreliable as pure junk. It's just the backup server, so I'm free to frequently create and destroy it... And as such, I frequently do recreate and destroy it. It is entirely stable running RHEL (CentOS), because Dell and Red Hat have a partnership with a serious number of human beings and machines looking for and fixing any compatibility issues.

For my Solaris instability, I blame the fact that Solaris developers don't do significant quality assurance on non-Sun hardware. To become officially compatible, the whole qualification process is like this: somebody installs it, doesn't see any problems, and then calls it certified. They reformat with something else, and move on. They don't build their business on that platform, so they don't detect stability issues like the ones reported... system crashes once per week and so forth. Solaris therefore passes the test, and becomes one of the options available on the drop-down menu of OSes for a new server. (Of course that's been discontinued by Oracle, but that's how it was in the past.)

Developers need to eat their own food. Smoke your own crack. Hardware engineers at Dell need to actually use your OS on their hardware, for their development efforts. I would be willing to bet Sun hardware engineers use a significant percentage of Solaris servers for their work... And guess what Solaris engineers don't use? Non-Sun hardware.
Pretty safe bet you won't find any Dell servers in the server room where Solaris developers do their thing. If you want to be taken seriously as an alternative storage option, you've got to at LEAST be listed as a factory-distributed OS that can ship with the new server, and THEN, when people such as myself buy those things, we've got to have a good enough experience that we don't all bitch and flame about it afterward.

Nexenta, you need a real and serious partnership with Dell, HP, IBM. Get their developers to run YOUR OS on the servers which they use for development. Get them to sell your product bundled with their product. And dedicate real and serious engineering to bugfixes, working with customers to truly identify root causes of instability, with a real OS development and engineering and support group. It's got to be STABLE; that's the #1 requirement.

I previously made the comparison... Even closed-source Solaris ZFS is a better alternative to closed-source NetApp WAFL. So for now, those are the only two enterprise-supportable options I'm willing to stake my career on, and I'll buy Sun hardware with Solaris. But I really wish I could feel confident buying a cheaper Dell server and running ZFS on it. Nexenta, if you make yourself look like a serious competitor against Solaris, and really truly form an awesome, stable partnership with Dell, I will happily buy your stuff instead of Oracle's. Even if you are a little behind in feature offering. But I will not buy your stuff if I can't feel perfectly confident in its stability. Ever heard the phrase "Nobody ever got fired for buying IBM"? You're the little guys. If you want to compete against the big guys, you've got to kick ass. And don't get sued into oblivion. Even today's feature set is perfectly adequate for at least a couple of years to come.
If you put all your effort into stability and bugfixes, serious partnerships with Dell, HP, IBM, and become extremely professional-looking and stable, with fanatical support... you don't have to worry about new feature development for a while. Stability is #1, and the risk of you disappearing is a pretty huge threat right now.

Based on my experience, I would not recommend buying Dell with Solaris, even if that were still an option. If you want Solaris, buy Sun/Oracle hardware, because then you can actually expect it to work reliably. And if Solaris isn't stable on Dell... then all the Solaris derivatives, including Nexenta, can't be trusted either, no matter how much you claim it's supported. Show me the HCL, and show me the partnership between your software engineers and Dell's hardware engineers. Make me believe there is a serious and thorough qualification process. Do a huge volume. Your volume must be large enough to justify dedicating some engineers to serious bugfix efforts in the field. Otherwise... when I need to buy something stable... I'm going to buy Solaris on Sun hardware, because I know that's thoroughly tried, tested, and stable.
Re: [zfs-discuss] A few questions
On Wed, 5 Jan 2011, Edward Ned Harvey wrote:
with regards to ZFS and all the other projects relevant to solaris.) I know in the case of SGE/OGE, it's officially closed source now. As of Dec 31st, sunsource is being decommissioned, and the announcement of officially closing the SGE source and decommissioning the open source community went out on Dec 24th. So all of this leads me to believe, with very little reservation, that the new developments beyond zpool 28 are closed source moving forward. There's very little breathing room remaining for hope of that being open sourced again.
I have no idea what you are talking about. Best I can tell, SGE/OGE is a reference to Sun Grid Engine, which has nothing to do with zfs. The only announcement and discussion I can find via Google is written by you. It was pretty clear even a year ago that Sun Grid Engine was going away. Bob
-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] a few questions - Oracle
It is sad that such a lovely file system is now in Oracle's unresponsive hands. I hope someone builds another open file system just like it. I could never find anything like it to protect my data like it does.
Re: [zfs-discuss] a few questions - Oracle
On 01/ 4/11 01:19 PM, webd...@gmail.com wrote:
It is sad that such a lovely file system is now in Oracle's unresponsive hands. I hope someone builds another open file system just like it. I could never find anything like it to protect my data like it does.
I have to reply to this. While Oracle may not seem responsive, they are still innovating on ZFS. I haven't seen it stand still since Oracle took over Sun. Also, if you do your homework, there is a BSD version floating around, and a Linux version also. To boot, Illumos has the last open source release, which brings it to OpenIndiana. So what are you talking about? Paul
Re: [zfs-discuss] A few questions
On 01/ 3/11 04:28 PM, Richard Elling wrote:
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
I am quite sure you understand how pipelines work :-)
Are you suggesting that Nexenta is developing new ZFS features behind closed doors (like Oracle...) and then will share code later on? Somehow I don't think so... but I would love to be proved wrong :)
-- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] A few questions
On 01/ 4/11 11:35 PM, Robert Milkowski wrote:
On 01/ 3/11 04:28 PM, Richard Elling wrote:
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
I am quite sure you understand how pipelines work :-)
Are you suggesting that Nexenta is developing new ZFS features behind closed doors (like Oracle...) and then will share code later on? Somehow I don't think so... but I would love to be proved wrong :)
I mean I would love to see Nexenta start delivering real innovation in the Solaris/Illumos kernel (zfs, networking, ...), not that I would love to see it happening behind closed doors :)
-- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] A few questions
On 01/ 3/11 05:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
Just because you've not seen it yet doesn't imply it isn't happening. Please be patient. - Garrett
-- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] a few questions - Oracle
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Paul Gress
On 01/ 4/11 01:19 PM, webd...@gmail.com wrote:
It is sad that such a lovely file system is now in Oracle's unresponsive hands. I hope someone builds another open file system just like it. I could never find anything like it to protect my data like it does.
I have to reply to this. While Oracle may not seem responsive, they are still innovating on ZFS. I haven't seen it stand still since Oracle took over Sun. Also, if you do your homework, there is a BSD version floating around, and a Linux version also. To boot, Illumos has the last open source release, which brings it to OpenIndiana. So what are you talking about?
Also, "another open file system like it"... "anything like it to protect my data"... Go use Linux, and BTRFS. It is GPL, and guess what: also developed by Oracle. But it's GPL, and it's included by default in many of the latest Linuxes.
Re: [zfs-discuss] A few questions
On Mon, Jan 3, 2011 at 5:56 AM, Garrett D'Amore garr...@nexenta.com wrote:
On 01/ 3/11 05:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
Just because you've not seen it yet doesn't imply it isn't happening. Please be patient. - Garrett
Or, conversely, don't make claims of all this code contribution prior to having anything to show for your claimed efforts. Duke Nukem Forever was going to be the greatest video game ever created... we were told to be patient... we're still waiting for that too. --Tim
Re: [zfs-discuss] A few questions
On Tue, Jan 4, 2011 at 8:21 PM, Garrett D'Amore garr...@nexenta.com wrote:
On 01/ 4/11 09:15 PM, Tim Cook wrote:
On Mon, Jan 3, 2011 at 5:56 AM, Garrett D'Amore garr...@nexenta.com wrote:
On 01/ 3/11 05:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
Just because you've not seen it yet doesn't imply it isn't happening. Please be patient. - Garrett
Or, conversely, don't make claims of all this code contribution prior to having anything to show for your claimed efforts. Duke Nukem Forever was going to be the greatest video game ever created... we were told to be patient... we're still waiting for that too.
Um, have you not been paying attention? I've delivered quite a lot of contribution to illumos already, just not in ZFS. Take a close look -- there almost certainly wouldn't *be* an open source version of OS/Net had I not done the work to enable this in libc, kernel crypto, and other bits. This work is still higher priority than ZFS innovation for a variety of reasons -- mostly because we need a viable and supportable illumos upon which to build those ZFS innovations. That said, much of the ZFS work I hope to contribute to illumos needs more baking, but some of it is already open source in NexentaStor.
(You can for a start look at zfs-monitor, the WORM support, and support for hardware GZIP acceleration, all as things that Nexenta has innovated in ZFS, and which are open source today if not part of illumos. Check out http://www.nexenta.org for source code access.) So there, money placed where mouth is. You? - Garrett
The claim was that there are more people contributing code to zfs from outside of Oracle than inside. Your contributions to Illumos do absolutely nothing to back up that claim. ZFS-monitor is not ZFS code (it's an FMA module), WORM also isn't ZFS code (it's an OS-level operation), and GZIP hardware acceleration is produced by Indra Networks and has absolutely nothing to do with ZFS. Does it help ZFS? Sure, but that's hardly a code contribution to ZFS when it's simply a hardware acceleration card that accelerates ALL gzip code. So, great job picking three projects that are not proof of developers working on ZFS. And great job not providing any proof for the claim that there are more developers working on ZFS outside of Oracle than within. You're going to need a hell of a lot bigger bank account to cash the check than what you've got.
As for me, I don't recall making any claims on this list that I can't back up, so I'm not really sure what you're getting at. I can only assume the defensive tone of your email is because you've been called out and can't back up the claims either. So again: if you've got code in the works, great. Talk about it when it's ready. Stop throwing out baseless claims that you have no proof of and then falling back on "just be patient, it's coming." We've heard that enough from Oracle and Sun already. --Tim
Re: [zfs-discuss] A few questions
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
-- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] A few questions
On Mon, 3 Jan 2011, Robert Milkowski wrote:
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
There seems to be plenty of zfs work on the FreeBSD project, but primarily with porting the latest available sources to FreeBSD (going very well!) rather than with developing zfs itself. Bob
-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] A few questions
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
I am quite sure you understand how pipelines work :-)
-- richard
Re: [zfs-discuss] A few questions
On 1/3/2011 8:28 AM, Richard Elling wrote:
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
I am quite sure you understand how pipelines work :-)
-- richard
I'm getting pretty close to my pain threshold on the BP_rewrite stuff, since not having that feature is holding up a big chunk of work I'd like to push. If anyone outside of Oracle is working on some sort of change to ZFS that will allow arbitrary movement/placement of pre-written slabs, can they please contact me? I'm pretty much at the point where I'm going to start diving into that chunk of the source to see if there's something little old me can do, and I'd far rather help on someone else's implementation than have to do it myself from scratch. I'd prefer a private contact, as I realize that such work may not be ready for public discussion yet. Thanks, folks! Oh, and this is completely just me, not Oracle talking in any way.
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] A few questions
On Jan 3, 2011, at 2:10 PM, Erik Trimble wrote:
On 1/3/2011 8:28 AM, Richard Elling wrote:
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
I am quite sure you understand how pipelines work :-)
-- richard
I'm getting pretty close to my pain threshold on the BP_rewrite stuff, since not having that feature is holding up a big chunk of work I'd like to push. If anyone outside of Oracle is working on some sort of change to ZFS that will allow arbitrary movement/placement of pre-written slabs, can they please contact me? I'm pretty much at the point where I'm going to start diving into that chunk of the source to see if there's something little old me can do, and I'd far rather help on someone else's implementation than have to do it myself from scratch. I'd prefer a private contact, as I realize that such work may not be ready for public discussion yet. Thanks, folks! Oh, and this is completely just me, not Oracle talking in any way.
Oracle doesn't seem to say much at all :-( But for those interested, Nexenta is actively hiring people to work in this area.
-- richard
Re: [zfs-discuss] A few questions
On Dec 21, 2010, at 5:05 AM, Deano wrote:
The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives?
For some definition of similar, yes. But using relatively cheap drives does not mean the overall system cost will be cheap. For example, $250 will buy 8.6K random IOPS @ 4KB in an SSD[1], but to do that with cheap disks might require eighty 7,200 rpm SATA disks.
ZFS is good, but IMHO it is easy to see how it can be improved to better meet this situation. I can’t currently say when this line of thinking and code will move from research to production-level use (tho I have a pretty good idea ;) ), but I wouldn’t bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it’s catalyzed a number of developers from the view that zfs is Oracle-led, to thinking “what can we do with the zfs code as a base?”
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
For example, how about sticking a cheap 80GiB commodity SSD in the storage case. When a resilver or defrag is required, use it as scratch space to give you a block of fast-IOPS storage space to accelerate the slow parts. When it’s done, secure erase and power it down, ready for the next time a resilver needs to happen. The hardware is available, it just needs someone to write the software…
In general, SSDs will not speed resilver unless the resilvering disk is an SSD.
[1] http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/nand/feature/index.htm
-- richard
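Richard's cost comparison above can be sanity-checked with back-of-the-envelope arithmetic. The per-HDD IOPS and price figures below are rule-of-thumb assumptions, not numbers from the thread:

```python
# Rough check of the "$250 SSD vs. eighty SATA disks" random-IOPS claim.
# Assumed figures (rules of thumb, not from the thread):
#   - one consumer SSD: ~8,600 random 4 KB IOPS for ~$250 (per the post)
#   - one 7,200 rpm SATA disk: ~100 random IOPS (seek + rotational latency)
#   - hypothetical HDD price: $80/disk

ssd_iops, ssd_cost = 8600, 250
hdd_iops, hdd_cost = 100, 80

disks_needed = -(-ssd_iops // hdd_iops)  # ceiling division
print(f"7,200 rpm disks to match one SSD: {disks_needed}")
print(f"HDD cost to match: ${disks_needed * hdd_cost}")
print(f"$/IOPS -- SSD: {ssd_cost / ssd_iops:.3f}, HDD: {hdd_cost / hdd_iops:.2f}")
```

With a ~100 IOPS/disk assumption this lands at 86 disks, consistent with the "eighty 7,200 rpm SATA disks" in the post.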
Re: [zfs-discuss] A few questions
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
On Dec 21, 2010, at 5:05 AM, Deano wrote:
The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives?
For some definition of similar, yes. But using relatively cheap drives does not mean the overall system cost will be cheap. For example, $250 will buy 8.6K random IOPS @ 4KB in an SSD[1], but to do that with cheap disks might require eighty 7,200 rpm SATA disks.
ZFS is good, but IMHO it is easy to see how it can be improved to better meet this situation. I can’t currently say when this line of thinking and code will move from research to production-level use (tho I have a pretty good idea ;) ), but I wouldn’t bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it’s catalyzed a number of developers from the view that zfs is Oracle-led, to thinking “what can we do with the zfs code as a base?”
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS. --Tim
Re: [zfs-discuss] A few questions
It's worse on raidzN than on mirrors, because the number of items which must be read is higher in raidzN, assuming you're using larger vdevs and therefore more items exist scattered about inside that vdev. You therefore have a higher number of things which must be randomly read before you reach completion.
In that case, isn't the answer to have a dedicated parity disk (or 2 or 3 depending on what raidz* is used), à la RAID-DP? Wouldn't this effectively be the 'same' as a mirror when resilvering (the only difference being parity vs actual data), as it's doing so from a single disk? RAID-DP protects against failure of the parity disk itself, so the raidz1 analogue (a single dedicated parity disk) probably wouldn't be sensible, since everything is lost if the parity disk fails.
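The read-amplification point above can be made concrete with a toy count. This is a simplified sketch, not from the thread: replacing one side of a 2-way mirror copies the surviving side, while reconstructing a failed raidz1 column reads the matching sector from every surviving column of each stripe.

```python
# Toy count: bytes read per byte rebuilt when resilvering one failed disk.
# Simplified assumptions (not from the thread): full-stripe writes, no
# metadata traversal cost, raidz1 reconstructs each sector by combining
# the n-1 surviving columns of its stripe.

def mirror_reads_per_byte():
    # 2-way mirror: copy the one surviving replica.
    return 1

def raidz1_reads_per_byte(n_disks):
    # raidz1: XOR across all surviving columns of the stripe.
    return n_disks - 1

for n in (5, 9):
    print(f"raidz1 over {n} disks reads "
          f"{raidz1_reads_per_byte(n)}x what a mirror reads per rebuilt byte")
```

The wider the raidz vdev, the more surviving disks must be read (and seeked) per reconstructed block, which is one reason wide raidz resilvers feel so much slower than mirror resilvers.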
Re: [zfs-discuss] A few questions
On 21/12/2010 05:44, Richard Elling wrote:
On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com wrote:
On 20/12/2010 13:59, Richard Elling wrote:
On Dec 20, 2010, at 2:42 AM, Phil Harman phil.har...@gmail.com wrote:
Why does resilvering take so long in raidz anyway? Because it's broken. There were some changes a while back that made it more broken.
"Broken" is the wrong term here. It functions as designed and correctly resilvers devices. Disagreeing with the design is quite different than proving a defect.
It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread.
If you only have a few slow drives, you don't have performance. Like trying to win the Indianapolis 500 with a tricycle...
The context of this thread is a budget home media server (certainly not the Indy 500, but perhaps not as humble as tricycle touring either). And whilst it is a habit of the hardware advocate to blame the software ... and vice versa ... it's not much help to those of us trying to build good-enough systems across the performance and availability spectrum. I think we can agree that ZFS currently doesn't play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions.
... and those conditions are also a strength. For example, most file systems are nowhere near full. With ZFS you only resilver data. For those who recall the resilver throttles in SVM or VxVM, you will appreciate not having to resilver non-data.
I'd love to see the data and analysis for the assertion that most file systems are nowhere near full, discounting, of course, any trivial cases.
In my experience, in any cost-conscious scenario, in the home or the enterprise, the expectation is that I'll get to use the majority of the space I've paid for (generally through the nose from the storage silo team in the enterprise scenario). To borrow your illustration, even Indy 500 teams care about fuel consumption. What I don't appreciate is having to resilver significantly more data than the drive can contain. But when it comes to the crunch, what I'd really appreciate is a bounded resilver time measured in hours, not days or weeks.
For a long time at Sun, the rule was "correctness is a constraint, performance is a goal". However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so also a slow but correct answer can be wrong). Then one brave soul at Sun once ventured that "if Linux is faster, it's a Solaris bug!" and to his surprise, the idea caught on. I later went on to tell people that ZFS delivered RAID where I = inexpensive, so I'm just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it's SATA (and I'm not so sure).
slow doesn't begin with an i :-) Both ZFS and RAID promised to play in the inexpensive space. There has been a lot of discussion, anecdotes and some data on this list. "Slow because I use devices with poor random write(!) performance" is very different than "broken."
Again, context is everything. For example, if someone was building a business-critical NAS appliance from consumer-grade parts, I'd be the first to say "are you nuts?!"
Unfortunately, the math does not support your position...
Actually, the math (e.g. raw drive metrics) doesn't lead me to expect such a disparity.
The resilver doesn't do a single pass of the drives, but uses a smarter temporal algorithm based on metadata. A design that only does a single pass does not handle the temporal changes.
Many RAID implementations use a mix of spatial and temporal resilvering and suffer with that design decision. Actually, it's easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. Please define current. There are many releases of ZFS, and many improvements have been made over time. What has not improved is the random write performance of consumer-grade HDDs. I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. I know of no RAID implementation that bounds resilver times for HDDs. I believe it is not possible. OTOH, whether a resilver takes 10 seconds or 10 hours makes little difference in data availability. Indeed, this is why we often throttle resilvering activity. See previous discussions on this forum regarding the dueling RFEs. I don't share your disbelief or little difference analysis. If it is true that no current implementation succeeds, isn't that a great opportunity to change the rules?
Re: [zfs-discuss] A few questions
On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com wrote: If you only have a few slow drives, you don't have performance. Like trying to win the Indianapolis 500 with a tricycle... Well you can put a jet engine on a tricycle and perhaps win it… Or you can change the race course to only allow a tricycle space to move. In the context of storage we have 2 factors, hardware and software; having faster and more reliable spindles is no reason to suggest that better software can’t be used to beat it. The simple example is ZIL SSD, where using some software and even a cheap commodity SSD will outperform any amount of expensive spindle drives on sync writes. Before ZIL software it was easy to argue that the only way of speeding up writes was more, faster spindles. The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives? ZFS is good but IMHO it's easy to see how it can be improved to better meet this situation. I can’t currently say when this line of thinking and code will move from research to production level use (tho I have a pretty good idea ;) ) but I wouldn’t bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it's catalyzed a number of developers from the view that zfs is Oracle led, to thinking “what can we do with zfs code as a base”? For example, how about sticking a cheap 80GiB commodity SSD in the storage case. When a resilver or defrag is required, use it as a scratch space to give you a block of fast IOPs storage space to accelerate the slow parts. When it's done, secure erase it and power it down, ready for the next time a resilver needs to happen. The hardware is available, just needs someone to write the software… Bye, Deano ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On 21/12/2010 13:05, Deano wrote: On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com mailto:phil.har...@gmail.com wrote: If you only have a few slow drives, you don't have performance. Like trying to win the Indianapolis 500 with a tricycle... Actually, I didn't say that, Richard did :) Well you can put a jet engine on a tricycle and perhaps win it… Or you can change the race course to only allow a tricycle space to move. In the context of storage we have 2 factors, hardware and software; having faster and more reliable spindles is no reason to suggest that better software can’t be used to beat it. The simple example is ZIL SSD, where using some software and even a cheap commodity SSD will outperform any amount of expensive spindle drives on sync writes. Before ZIL software it was easy to argue that the only way of speeding up writes was more, faster spindles. The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives? ZFS is good but IMHO it's easy to see how it can be improved to better meet this situation. I can’t currently say when this line of thinking and code will move from research to production level use (tho I have a pretty good idea ;) ) but I wouldn’t bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it's catalyzed a number of developers from the view that zfs is Oracle led, to thinking “what can we do with zfs code as a base”? For example, how about sticking a cheap 80GiB commodity SSD in the storage case. When a resilver or defrag is required, use it as a scratch space to give you a block of fast IOPs storage space to accelerate the slow parts. When it's done, secure erase it and power it down, ready for the next time a resilver needs to happen.
The hardware is available, just needs someone to write the software… Bye, Deano
Re: [zfs-discuss] A few questions
Doh, sorry about that, the threading got very confused on my mail reader! Bye, Deano From: Phil Harman [mailto:phil.har...@gmail.com] Sent: 21 December 2010 13:12 To: Deano Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions On 21/12/2010 13:05, Deano wrote: On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com wrote: If you only have a few slow drives, you don't have performance. Like trying to win the Indianapolis 500 with a tricycle... Actually, I didn't say that, Richard did :) Well you can put a jet engine on a tricycle and perhaps win it… Or you can change the race course to only allow a tricycle space to move. In the context of storage we have 2 factors, hardware and software; having faster and more reliable spindles is no reason to suggest that better software can’t be used to beat it. The simple example is ZIL SSD, where using some software and even a cheap commodity SSD will outperform any amount of expensive spindle drives on sync writes. Before ZIL software it was easy to argue that the only way of speeding up writes was more, faster spindles. The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives? ZFS is good but IMHO it's easy to see how it can be improved to better meet this situation. I can’t currently say when this line of thinking and code will move from research to production level use (tho I have a pretty good idea ;) ) but I wouldn’t bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it's catalyzed a number of developers from the view that zfs is Oracle led, to thinking “what can we do with zfs code as a base”? For example, how about sticking a cheap 80GiB commodity SSD in the storage case. When a resilver or defrag is required, use it as a scratch space to give you a block of fast IOPs storage space to accelerate the slow parts.
When it's done, secure erase it and power it down, ready for the next time a resilver needs to happen. The hardware is available, just needs someone to write the software… Bye, Deano
Re: [zfs-discuss] A few questions
From: edmud...@mail.bounceswoosh.org [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote: If there is no correlation between on-disk order of blocks for different disks within the same vdev, then all hope is lost; it's essentially impossible to optimize the resilver/scrub order unless the on-disk order of multiple disks is highly correlated or equal by definition. Very little is impossible. Drives have been optimally ordering seeks for 35+ years. I'm guessing Unless your drive is able to queue up a request to read every single used part of the drive... Which is larger than the command queue for any reasonable drive in the world... The point is, in order to be optimal you have to eliminate all those seeks, and perform sequential reads only. The only seeks you should do are to skip over unused space. If you're able to sequentially read the whole drive, skipping all the unused space, then you're guaranteed to complete faster (or equal) than either (a) sequentially reading the whole drive, or (b) seeking all over the drive to read the used parts in random order.
Re: [zfs-discuss] A few questions
From: Richard Elling [mailto:richard.ell...@gmail.com] Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3 is resilvering). You find some way of ordering all the used blocks of disk1... Which means disk1 will be able to read in optimal order and speed. Sounds like prefetching :-) Ok. Prefetch every used sector in the pool. Problem solved. Let the disks sort all the requests into on-disk order. Unless perhaps the number of requests would exceed the limits of what the drive is able to sort ... Which seems ... more than likely.
Re: [zfs-discuss] A few questions
On Tue, Dec 21 at 8:24, Edward Ned Harvey wrote: From: edmud...@mail.bounceswoosh.org [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote: If there is no correlation between on-disk order of blocks for different disks within the same vdev, then all hope is lost; it's essentially impossible to optimize the resilver/scrub order unless the on-disk order of multiple disks is highly correlated or equal by definition. Very little is impossible. Drives have been optimally ordering seeks for 35+ years. I'm guessing Unless your drive is able to queue up a request to read every single used part of the drive... Which is larger than the command queue for any reasonable drive in the world... The point is, in order to be optimal you have to eliminate all those seeks, and perform sequential reads only. The only seeks you should do are to skip over unused space. I don't think you read my whole post. I was saying this seek calculation pre-processing would have to be done by the host server, and while not impossible, is not trivial. Present the next 32 seeks to each device while the pre-processor works on the complete list of future seeks, and the drive will do as well as possible. If you're able to sequentially read the whole drive, skipping all the unused space, then you're guaranteed to complete faster (or equal) than either (a) sequentially reading the whole drive, or (b) seeking all over the drive to read the used parts in random order. Yes, I understand how that works. --eric -- Eric D. Mudama edmud...@mail.bounceswoosh.org
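Eric's scheme — a host-side pre-processor holding the full list of used blocks while only a drive-sized window of requests is outstanding — can be sketched as follows. This is a toy model; the function names and the queue depth of 32 are illustrative, not from any actual resilver code:

```python
import random

def windowed_seek_order(block_lbas, queue_depth=32):
    """Host-side pre-processing: sort the full list of used-block
    addresses once, then feed the drive at most queue_depth requests
    at a time.  Each window is already in ascending LBA order, so the
    drive's own scheduler sees a near-monotonic sweep."""
    pending = sorted(block_lbas)
    while pending:
        window, pending = pending[:queue_depth], pending[queue_depth:]
        yield window  # issue this batch to the device

# 100 used blocks scattered over a ~1e9-sector disk
random.seed(1)
lbas = random.sample(range(10**9), 100)
windows = list(windowed_seek_order(lbas))
assert len(windows) == 4                      # ceil(100 / 32) batches
assert all(w == sorted(w) for w in windows)   # every batch is in disk order
```

In a real implementation the sort would overlap with the I/O, which is Eric's point: the drive does as well as it can with 32 sorted requests while the host prepares the rest.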
Re: [zfs-discuss] A few questions
From: edmud...@mail.bounceswoosh.org [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama Unless your drive is able to queue up a request to read every single used part of the drive... Which is larger than the command queue for any reasonable drive in the world... The point is, in order to be optimal you have to eliminate all those seeks, and perform sequential reads only. The only seeks you should do are to skip over unused space. I don't think you read my whole post. I was saying this seek calculation pre-processing would have to be done by the host server, and while not impossible, is not trivial. Present the next 32 seeks to each device while the pre-processor works on the complete list of future seeks, and the drive will do as well as possible. I did read that, but now I think, perhaps I misunderstand it, or you misunderstood me? I am thinking... If you're just queueing up a few reads at a time (less than infinity, or less than 99% of the pool) ... I would not assume that these 32 seeks are even remotely sequential. I mean ... 32 blocks in a pool of presumably millions of blocks... I would assume they are essentially random, are they not? In my mind, which is likely wrong or at least oversimplified, I think if you want to order the list of blocks to read according to disk order (which should at least be theoretically possible on mirrors, but perhaps not even physically possible on raidz)... You would have to first generate a list of all the blocks to be read, and then sort it. Rough estimate, for any pool of a reasonable size, that sounds like some GB of RAM to me. Maybe there's a less-than-perfect sort algorithm which has a much lower memory footprint? Like a simple hashing algorithm that will guarantee the next few thousand seeks are in disk order... Although they will skip or jump over many blocks that will have to be done later ... 
An algorithm which is not a perfect sort, but given some repetition and multiple passes over the disk, might achieve an acceptable level of performance versus memory footprint...
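The "less-than-perfect sort with a lower memory footprint" Edward is reaching for exists: a coarse bucket (radix) pass. The sketch below uses hypothetical names, and a real version would spill buckets to scratch space rather than keep them all in RAM, but it shows the shape — bin addresses into equal-width LBA ranges, then sort only one small bucket at a time:

```python
def bucketed_disk_order(lbas, disk_size, n_buckets=16):
    """Coarse pass: bin each block address into one of n_buckets equal
    LBA ranges.  Only one bucket ever needs to be sorted at a time, and
    visiting buckets in order still yields a monotonic head sweep."""
    width = disk_size // n_buckets + 1
    buckets = [[] for _ in range(n_buckets)]
    for lba in lbas:
        buckets[lba // width].append(lba)
    for b in buckets:          # in a real version each bucket would be
        b.sort()               # loaded from scratch space, sorted, issued
        yield from b

blocks = [917, 5, 10**8 + 3, 42, 9 * 10**8, 10**8]
ordered = list(bucketed_disk_order(blocks, disk_size=10**9))
assert ordered == sorted(blocks)   # full disk order from bounded working sets
```

Because the buckets partition the LBA space in order, concatenating the sorted buckets reproduces a full sort without ever holding more than one bucket's worth of addresses in memory.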
Re: [zfs-discuss] A few questions
Thanks Edward. I do agree about mirrored rpool (equivalent to Windows OS volume); not doing it goes against one of my principles when building enterprise servers. Is there any argument against using the rpool for all data storage as well as being the install volume? Say for example I chucked 15x 1TB disks in there and created a mirrored rpool during installation, using 2 disks. If I added another 6 mirrors (12 disks) to it that would give me an rpool of 7TB. The 15th disk being a spare. Or, say I selected 3 disks during install, does this create a 3 way mirrored rpool or does it give you the option of creating raidz? If so, I could then create a further 4x 3 drive raidz's, giving me a 10TB rpool. Or, I could use 2 smaller disks (say 80GB) for the rpool, then create 4x 3 drive raidz's, giving me an 8TB rpool. Again this gives me a spare disk. Either of these 3 should keep resilvering times to a minimum, against say one big raidz2 of 13 disks. Why does resilvering take so long in raidz anyway? -- This message posted from opensolaris.org
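The capacities quoted above follow from simple vdev arithmetic — a mirror vdev yields one disk's worth of space, a raidz1 vdev yields (n-1). A quick check of the three layouts, assuming 1 TB drives (the helper name is made up for illustration):

```python
def usable_tb(vdevs, disk_tb=1):
    """Sum usable space over vdevs, each given as ("mirror"|"raidz1", n_disks):
    mirrors contribute one disk per vdev, raidz1 contributes n - 1."""
    per_vdev = {"mirror": lambda n: 1, "raidz1": lambda n: n - 1}
    return sum(per_vdev[kind](n) * disk_tb for kind, n in vdevs)

assert usable_tb([("mirror", 2)] * 7) == 7    # 2-disk rpool + 6 more mirrors
assert usable_tb([("raidz1", 3)] * 5) == 10   # 3-disk raidz rpool + 4 more
assert usable_tb([("raidz1", 3)] * 4) == 8    # small rpool kept separate
```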
Re: [zfs-discuss] A few questions
Oh, does anyone know if resilvering efficiency is improved or fixed in Solaris 11 Express, as that is what I'm using.
Re: [zfs-discuss] A few questions
Why does resilvering take so long in raidz anyway? Because it's broken. There were some changes a while back that made it more broken. There has been a lot of discussion, anecdotes and some data on this list. The resilver doesn't do a single pass of the drives, but uses a smarter temporal algorithm based on metadata. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler. I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us.
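A toy model of the behaviour Phil describes: if blocks come back in birth order (the order a metadata walk yields them) rather than disk order, head travel on a pool aged by random writes is essentially random. The names here are illustrative, not ZFS internals:

```python
import random
from collections import namedtuple

Block = namedtuple("Block", "lba birth_txg")

def head_travel(lbas):
    """Total seek distance when visiting addresses in the given order."""
    return sum(abs(b - a) for a, b in zip(lbas, lbas[1:]))

random.seed(0)
# an aged pool: block birth order uncorrelated with on-disk placement
blocks = [Block(lba=random.randrange(10**9), birth_txg=txg) for txg in range(1000)]
temporal = [b.lba for b in sorted(blocks, key=lambda b: b.birth_txg)]
spatial = sorted(temporal)  # the single sequential sweep people keep asking for

assert head_travel(spatial) < head_travel(temporal)  # one clean pass wins
```

The spatial order's travel is just max minus min LBA (one sweep), while the temporal order pays an average of a third of the disk per hop — which is the gap between the sequential 2-hour bound and the multi-hour reality discussed later in this thread.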
Re: [zfs-discuss] A few questions
Hi, Which brings up an interesting question... IF it were fixed in for example illumos or freebsd, is there a plan for how to handle possible incompatible zfs implementations? Currently the basic version numbering only works because it implies only one stream of development; now, with multiple possible streams, does ZFS need to move to a feature bit system, or are we going to have to have forks or multiple incompatible versions? Thanks, Deano -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Phil Harman Sent: 20 December 2010 10:43 To: Lanky Doodle Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions Why does resilvering take so long in raidz anyway? Because it's broken. There were some changes a while back that made it more broken. There has been a lot of discussion, anecdotes and some data on this list. The resilver doesn't do a single pass of the drives, but uses a smarter temporal algorithm based on metadata. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler. I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us.
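Deano's feature-bit idea boils down to a set-compatibility check: a pool records which features are active, and an implementation may import it if and only if it supports them all, with no single linear version number. Everything below (function name, feature strings) is hypothetical, sketching the idea rather than any shipping format:

```python
def can_import(pool_active_features, impl_supported_features):
    """Import is allowed iff no active pool feature is unknown to the
    implementation; otherwise return the blocking set."""
    missing = set(pool_active_features) - set(impl_supported_features)
    return len(missing) == 0, missing

fork_a = {"lzjb", "raidz", "async_destroy"}    # one fork's feature set
pool = {"lzjb", "raidz"}
ok, missing = can_import(pool, fork_a)
assert ok and not missing

other_pool = {"lzjb", "raidz", "crypto_v1"}    # feature fork_a lacks
ok, missing = can_import(other_pool, fork_a)
assert not ok and missing == {"crypto_v1"}
```

The appeal over Jörg's "version 1..21 + 24" scheme is that forks can add features independently without ever colliding on a version number; only the named features matter.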
Re: [zfs-discuss] A few questions
I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want to commit to a file system that is 'broken' and may not be fully fixed, if at all. Hmnnn...
Re: [zfs-discuss] A few questions
On 20/12/2010 11:03, Deano wrote: Hi, Which brings up an interesting question... IF it were fixed in for example illumos or freebsd, is there a plan for how to handle possible incompatible zfs implementations? Currently the basic version numbering only works because it implies only one stream of development; now, with multiple possible streams, does ZFS need to move to a feature bit system, or are we going to have to have forks or multiple incompatible versions? Thanks, Deano Changes to the resilvering implementation don't necessarily require changes to the on disk format (although they could). Of course, there might be an issue moving a pool mid-resilver from one implementation to another. With arguably considerably more ZFS expertise outside Oracle than in, there's a good chance the community will get to a fix first. It would then be interesting to see whether NIH prevails, or perhaps even a new spirit of share and share alike. You may say I'm a dreamer ...
Re: [zfs-discuss] A few questions
On 20/12/2010 11:29, Lanky Doodle wrote: I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want to commit to a file system that is 'broken' and may not be fully fixed, if at all. Hmnnn... My home server is still running snv_82, and my iMac is running Apple's last public beta release for Leopard. The way I see it, the on-disk format is sound, and the basic always consistent on disk promise seems to be worth something. My files are read-mostly, and performance isn't an issue for me. ZFS has protected my data for several years now in the face of various hardware issues. I'll upgrade my NAS appliance to OpenSolaris snv_134b sometime soon, but as far as I can tell, I can't use Oracle Solaris 11 Express for licensing reasons (I have backups of business data). I'll be watching Illumos with interest, but snv_82 has served me well for 3 years, so I figure snv_134b probably has quite a lot of useful life left in it, and maybe then btrfs will be ready for prime time?
Re: [zfs-discuss] A few questions
Phil Harman phil.har...@gmail.com wrote: Changes to the resilvering implementation don't necessarily require changes to the on disk format (although they could). Of course, there might be an issue moving a pool mid-resilver from one implementation to another. We seem to come to a similar problem as with UFS 20 years ago. At that time, Sun did enhance the UFS on-disk format but the *BSDs did not follow this change even though the format change was documented in the related include files. For a future ZFS development, there may be a need to allow an implementation to implement on-disk version 1..21 + 24 and another implementation to support on-disk version 1..23 + 25. These thoughts of course are void in case Oracle continues the OSS decisions for Solaris and other Solaris variants can import the code related to recent enhancements. Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Re: [zfs-discuss] A few questions
On Dec 20, 2010, at 2:42 AM, Phil Harman phil.har...@gmail.com wrote: Why does resilvering take so long in raidz anyway? Because it's broken. There were some changes a while back that made it more broken. broken is the wrong term here. It functions as designed and correctly resilvers devices. Disagreeing with the design is quite different than proving a defect. There has been a lot of discussion, anecdotes and some data on this list. slow because I use devices with poor random write(!) performance is very different than broken. The resilver doesn't do a single pass of the drives, but uses a smarter temporal algorithm based on metadata. A design that only does a single pass does not handle the temporal changes. Many RAID implementations use a mix of spatial and temporal resilvering and suffer with that design decision. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. Please define current. There are many releases of ZFS, and many improvements have been made over time. What has not improved is the random write performance of consumer-grade HDDs. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. I know of no RAID implementation that bounds resilver times for HDDs. I believe it is not possible. OTOH, whether a resilver takes 10 seconds or 10 hours makes little difference in data availability. Indeed, this is why we often throttle resilvering activity. See previous discussions on this forum regarding the dueling RFEs. The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler. Resilver time is bounded by the random write performance of the resilvering device. Mirroring or raidz make no difference. I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. 
Some improvements were made post-b134 and pre-b148. -- richard
Re: [zfs-discuss] A few questions
Thanks relling. I suppose at the end of the day any file system/volume manager has its flaws so perhaps it's better to look at the positives of each and decide based on them. So, back to my question above, is there a deciding argument [i]against[/i] putting data on the install volume (rpool). Forget about mirroring for a sec; 1) Select 3 disks during install creating raidz1. Create a further 4x 3 drive raidz1's, giving me a 10TB rpool with no spare disks 2) Select 5 disks during install creating raidz1. Create a further 2x 5 drive raidz1's giving me a 12TB rpool with no spare disks 3) Select 7 disks during install creating raidz1. Create a further 7 drive raidz1 giving me 12TB rpool with 1 spare disk As there is no space gain between 2) and 3) there is no point going for 3), other than having a spare disk, but resilver times would be slower. So it becomes between 1) and 2). Neither offer spare disks but 1) would offer faster resilver times with up to 5 simultaneous disk failures and 2) would offer 2TB extra space with up to 3 simultaneous disk failures. FYI, I am using Samsung SpinPoint F2's, which have the variable RPM speeds (http://www.scan.co.uk/products/1tb-samsung-hd103si-ecogreen-f2-sata-3gb-s-32mb-cache-89-ms-ncq) I may wait at least until I get the next 4 drives in (I actually have 6 at the mo, not 5) taking me to 10, before migrating to ZFS so plenty of time to think about it and hopefully time for them to fix resilvering! ;-) Thanks again...
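The trade-off being weighed here — capacity versus failure tolerance versus spares — for 15x 1 TB drives can be tabulated directly. Note that the failure counts quoted are best-case (one loss per vdev); any vdev losing two disks loses the pool. The helper name is illustrative:

```python
def raidz1_layout(disks_per_vdev, n_vdevs, total_disks=15, disk_tb=1):
    """Pool of identical raidz1 vdevs: (n-1) data disks each; survives
    at most one failure per vdev, so n_vdevs losses in the best case."""
    return {"usable_tb": (disks_per_vdev - 1) * n_vdevs * disk_tb,
            "best_case_failures": n_vdevs,
            "spares": total_disks - disks_per_vdev * n_vdevs}

assert raidz1_layout(3, 5) == {"usable_tb": 10, "best_case_failures": 5, "spares": 0}
assert raidz1_layout(5, 3) == {"usable_tb": 12, "best_case_failures": 3, "spares": 0}
assert raidz1_layout(7, 2) == {"usable_tb": 12, "best_case_failures": 2, "spares": 1}
```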
Re: [zfs-discuss] A few questions
On 20/12/2010 13:59, Richard Elling wrote: On Dec 20, 2010, at 2:42 AM, Phil Harman phil.har...@gmail.com mailto:phil.har...@gmail.com wrote: Why does resilvering take so long in raidz anyway? Because it's broken. There were some changes a while back that made it more broken. broken is the wrong term here. It functions as designed and correctly resilvers devices. Disagreeing with the design is quite different than proving a defect. It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread. I think we can agree that ZFS currently doesn't play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions. For a long time at Sun, the rule was correctness is a constraint, performance is a goal. However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so a slow but correct answer can also be wrong). Then one brave soul at Sun once ventured that if Linux is faster, it's a Solaris bug! and to his surprise, the idea caught on. I later went on to tell people that ZFS delivered RAID where I = inexpensive, so I'm just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it's SATA (and I'm not so sure). There has been a lot of discussion, anecdotes and some data on this list. slow because I use devices with poor random write(!) performance is very different than broken. Again, context is everything. For example, if someone was building a business critical NAS appliance from consumer grade parts, I'd be the first to say are you nuts?! The resilver doesn't do a single pass of the drives, but uses a smarter temporal algorithm based on metadata. A design that only does a single pass does not handle the temporal changes. 
Many RAID implementations use a mix of spatial and temporal resilvering and suffer with that design decision. Actually, it's easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. Please define current. There are many releases of ZFS, and many improvements have been made over time. What has not improved is the random write performance of consumer-grade HDDs. I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. I know of no RAID implementation that bounds resilver times for HDDs. I believe it is not possible. OTOH, whether a resilver takes 10 seconds or 10 hours makes little difference in data availability. Indeed, this is why we often throttle resilvering activity. See previous discussions on this forum regarding the dueling RFEs. I don't share your disbelief or little difference analysis. If it is true that no current implementation succeeds, isn't that a great opportunity to change the rules? Wasn't resilver time vs availability a major factor in Adam Leventhal's paper introducing the need for RAIDZ3? The appropriateness or otherwise of resilver throttling depends on the context. If I can tolerate further failures without data loss (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if I can recover business critical data in a timely manner, then great. But there may come a point where I would rather take a short term performance hit to close the window on total data loss. The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler. 
Resilver time is bounded by the random write performance of the resilvering device. Mirroring or raidz make no difference. This only holds in a quiesced system. I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. Some improvements were made post-b134 and pre-b148. That is, indeed, good news. -- richard
Re: [zfs-discuss] A few questions
On Dec 18, 2010, at 12:23 PM, Lanky Doodle wrote: Now this is getting really complex, but can you have server failover in ZFS, much like DFS-R in Windows - you point clients to a clustered ZFS namespace so if a complete server failed nothing is interrupted. This is the purpose of an Amber Road dual-head cluster (7310C, 7410C, etc.) -- not only the storage pool fails over, but also the server IP address fails over, so that NFS, etc. shares remain active, when one storage head goes down. Amber Road uses ZFS, but the clustering and failover are not related to the filesystem type. Mark
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Lanky Doodle I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want to commit to a file system that is 'broken' and may not be fully fixed, if at all. ZFS is not broken. There is, however, a weak spot: resilver is very inefficient. For example: On my server, which is made up of 10krpm SATA drives, 1TB each... My drives can each sustain 1Gbit/sec sequential read/write. This means, if I needed to resilver the entire drive (in a mirror) sequentially, it would take ... 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, and disks are around 70% full, and resilver takes 12-14 hours. So although resilver is broken by some standards, it is bounded, and you can limit it to something which is survivable, by using mirrors instead of raidz. For most people, even using 5-disk, or 7-disk raidzN will still be fine. But you start getting unsustainable if you get up to 21-disk raidz3 for example.
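Edward's back-of-envelope numbers check out — rewriting a whole 1 TB drive at a sustained 1 Gbit/s is the sequential lower bound he describes (function name is illustrative):

```python
def sequential_resilver_seconds(capacity_bytes, bits_per_second):
    """Best case: one full sequential pass over the disk at sustained rate."""
    return capacity_bytes * 8 / bits_per_second

t = sequential_resilver_seconds(10**12, 10**9)  # 1 TB drive, 1 Gbit/s
assert t == 8000                                 # seconds
assert round(t / 60) == 133                      # ~2 hours, as stated
```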
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Lanky Doodle Is there any argument against using the rpool for all data storage as well as being the install volume? Generally speaking, you can't do it. The rpool is only supported on mirrors, not raidz. I believe this is because you need rpool in order to load the kernel, and until the kernel is loaded, there's just no reasonable way to have a fully zfs-aware, supports-every-feature bootloader able to read rpool in order to fetch the kernel. Normally, you'll dedicate 2 disks to the OS, and then you build additional separate data pools. If you absolutely need all the disk space of the OS disks, then you partition the OS into a smaller section of the OS disks and assign the remaining space to some pool. But doing that partitioning scheme can be complex, and if you're not careful, risky. I don't advise it unless you truly have your back against a wall for more disk space. Why does resilvering take so long in raidz anyway? There are some really long and sometimes complex threads in this mailing list discussing that. Fundamentally ... First of all, it's not always true. It depends on your usage behavior and the type of disks you're using. But the typical usage includes reading and writing a lot of files, essentially randomly over time, creating and deleting snapshots, using spindle disks, so the typical usage behavior does have a resilver performance problem. The root cause of the problem is that ZFS does not resilver the whole disk... It only resilvers the used portions of the disk. Sounds like a performance enhancer, right? It would be, if the disks were mostly empty ... or if ZFS were resilvering a partial disk, in order according to disk layout. 
Unfortunately, it's resilvering according to the temporal order blocks were written, and usually a disk is significantly full (say, 50% or more) and as such, the disks have to thrash all around, performing all sorts of random reads, until eventually it can read all the used parts in random order. It's worse on raidzN than on mirrors, because the number of items which must be read is higher in raidzN, assuming you're using larger vdevs and therefore more items exist scattered about inside that vdev. You therefore have a higher number of things which must be randomly read before you reach completion. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
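[Editor's note: a toy model shows why traversing blocks in the order they were written (temporal order) seeks constantly, while traversing them in disk-offset order would stream. The uniform shuffle below is an assumed, crude stand-in for real allocator churn, not a model of the actual ZFS allocator.]

```python
# Reading blocks in birth (temporal) order incurs a seek whenever the
# next-born block is not physically adjacent; reading the same blocks
# in disk-offset order never does.
import random

random.seed(1)
n_blocks = 1000
# physical[i] = disk offset of the i-th block written, scrambled by
# allocate/free cycles (assumption: uniform shuffle).
physical = list(range(n_blocks))
random.shuffle(physical)

def seeks(order):
    """Count non-contiguous transitions in a visit order of offsets."""
    return sum(1 for a, b in zip(order, order[1:]) if b != a + 1)

temporal_seeks = seeks(physical)        # resilver in birth order
disk_seeks = seeks(sorted(physical))    # resilver in offset order
print(temporal_seeks, disk_seeks)       # typically ~999 vs 0
```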
Re: [zfs-discuss] A few questions
-Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey Sent: Monday, December 20, 2010 11:46 AM To: 'Lanky Doodle'; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Lanky Doodle I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I want to committ to a file system that is 'broken' and may not be fully fixed, if at all. ZFS is not broken. It is, however, a weak spot, that resilver is very inefficient. For example: On my server, which is made up of 10krpm SATA drives, 1TB each... My drives can each sustain 1Gbit/sec sequential read/write. This means, if I needed to resilver the entire drive (in a mirror) sequentially, it would take ... 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, and disks are around 70% full, and resilver takes 12-14 hours. So although resilver is broken by some standards, it is bounded, and you can limit it to something which is survivable, by using mirrors instead of raidz. For most people, even using 5-disk, or 7-disk raidzN will still be fine. But you start getting unsustainable if you get up to 21-disk radiz3 for example. This argument keeps coming up on the list, but I don't see where anyone has made a good suggestion about whether this can even be 'fixed' or how it would be done. As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that's easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. 
This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever and thus know a piece of them lives on all drives in the set. I don't think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right? Can someone speculate as to how you could rebuild a variable stripe width array without replaying all the available transactions? I am no filesystem engineer but I can't wrap my head around how this could be handled any better than it already is. I've read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)? Do we know if resilvers on a mirror are actually handled differently from those on a raidz? Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not 'wrong') is clearly NOT the same as with more conventional arrays. -Will ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On 12/20/2010 9:20 AM, Saxon, Will wrote: -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey Sent: Monday, December 20, 2010 11:46 AM To: 'Lanky Doodle'; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Lanky Doodle I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I want to committ to a file system that is 'broken' and may not be fully fixed, if at all. ZFS is not broken. It is, however, a weak spot, that resilver is very inefficient. For example: On my server, which is made up of 10krpm SATA drives, 1TB each... My drives can each sustain 1Gbit/sec sequential read/write. This means, if I needed to resilver the entire drive (in a mirror) sequentially, it would take ... 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, and disks are around 70% full, and resilver takes 12-14 hours. So although resilver is broken by some standards, it is bounded, and you can limit it to something which is survivable, by using mirrors instead of raidz. For most people, even using 5-disk, or 7-disk raidzN will still be fine. But you start getting unsustainable if you get up to 21-disk radiz3 for example. This argument keeps coming up on the list, but I don't see where anyone has made a good suggestion about whether this can even be 'fixed' or how it would be done. As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that's easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. 
This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever and thus know a piece of them lives on all drives in the set. I don't think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right? Can someone speculate as to how you could rebuild a variable stripe width array without replaying all the available transactions? I am no filesystem engineer but I can't wrap my head around how this could be handled any better than it already is. I've read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)? Do we know if resilvers on a mirror are actually handled differently from those on a raidz? Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not 'wrong') is clearly NOT the same as with more conventional arrays. -Will the problem is NOT the checksum/error correction overhead. that's relatively trivial. The problem isn't really even variable width (i.e. variable number of disks one crosses) slabs. The problem boils down to this: When ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things from. That means, it resilvers the very first slab ever written, then the next oldest, etc. The problem here is that slab age has nothing to do with where that data physically resides on the actual disks. If you've used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. 
This means that the probability is high that a slab B, written immediately after slab A, WON'T be physically near slab A. In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver (as you only resilver live data, not every bit on the drive), increases the physical inefficiency for each slab. That is, seek time between cylinders begins to dominate your slab reconstruction time. In RAIDZ, this problem is magnified by both the much larger average vdev size vs mirrors, and the necessity that all drives containing a slab's information return that data before the corrected data can be written to the resilvering drive. Thus, current ZFS resilvering tends to be seek-time limited, NOT throughput limited. This is really the fault of the underlying media, not ZFS. For instance, if you have a raidZ of SSDs (where seek time is negligible, but throughput isn't), they resilver really, really fast. In fact, they resilver at the maximum write throughput rate. However, HDs
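[Editor's note: Erik's seek-limited vs throughput-limited distinction can be made concrete with rough, assumed device numbers (roughly 150 random IOPS for a spindle, tens of thousands for an SSD, 125 MB/s streaming for both). The block size and IOPS figures are illustrative, not measured.]

```python
# A resilver finishes at whichever limit is slower: the random-I/O
# (seek) bound or the sequential-throughput bound.

def resilver_hours(used_bytes, avg_io_bytes, device_iops, seq_mb_per_s):
    ios = used_bytes / avg_io_bytes
    seek_bound = ios / device_iops / 3600              # random-I/O limit
    stream_bound = used_bytes / (seq_mb_per_s * 1e6) / 3600
    return max(seek_bound, stream_bound)               # slower limit wins

used = 700e9  # 70% of a 1 TB disk, as in the earlier example
hd  = resilver_hours(used, avg_io_bytes=64 * 1024, device_iops=150,   seq_mb_per_s=125)
ssd = resilver_hours(used, avg_io_bytes=64 * 1024, device_iops=50000, seq_mb_per_s=125)
print(round(hd, 1), round(ssd, 1))  # 19.8 1.6
```

The spindle lands in the tens of hours (same ballpark as the 12-14 hours reported earlier), while the SSD is pinned at its streaming rate, matching Erik's observation that SSDs resilver at maximum write throughput.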
Re: [zfs-discuss] A few questions
On 12/20/2010 9:20 AM, Saxon, Will wrote: -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey Sent: Monday, December 20, 2010 11:46 AM To: 'Lanky Doodle'; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Lanky Doodle I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I want to committ to a file system that is 'broken' and may not be fully fixed, if at all. ZFS is not broken. It is, however, a weak spot, that resilver is very inefficient. For example: On my server, which is made up of 10krpm SATA drives, 1TB each... My drives can each sustain 1Gbit/sec sequential read/write. This means, if I needed to resilver the entire drive (in a mirror) sequentially, it would take ... 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, and disks are around 70% full, and resilver takes 12-14 hours. So although resilver is broken by some standards, it is bounded, and you can limit it to something which is survivable, by using mirrors instead of raidz. For most people, even using 5-disk, or 7-disk raidzN will still be fine. But you start getting unsustainable if you get up to 21-disk radiz3 for example. This argument keeps coming up on the list, but I don't see where anyone has made a good suggestion about whether this can even be 'fixed' or how it would be done. As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that's easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. 
This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever and thus know a piece of them lives on all drives in the set. I don't think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right? Can someone speculate as to how you could rebuild a variable stripe width array without replaying all the available transactions? I am no filesystem engineer but I can't wrap my head around how this could be handled any better than it already is. I've read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)? Do we know if resilvers on a mirror are actually handled differently from those on a raidz? Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not 'wrong') is clearly NOT the same as with more conventional arrays. -Will As far as a possible fix, here's what I can see: [Note: I'm not a kernel or FS-level developer. I would love to be able to fix this myself, but I have neither the aptitude nor the [extensive] time to learn such skill] We can either (a) change how ZFS does resilvering or (b) repack the zpool layouts to avoid the problem in the first place. In case (a), my vote would be to seriously increase the number of in-flight resilver slabs, AND allow for out-of-time-order slab resilvering. By that, I mean that ZFS would read several disk-sequential slabs, and then mark them as done. This would mean a *lot* of scanning the metadata tree (since leaves all over the place could be done). 
Frankly, I can't say how bad that would be; the problem is that for ANY resilver, ZFS would have to scan the entire metadata tree to see if it had work to do, rather than simply look for the latest completed leaf, then assume everything after that needs to be done. There'd also be the matter of determining *if* one should read a disk sector... In case (b), we need the ability to move slabs around on the physical disk (via the mythical Block Pointer Re-write method). If there is that underlying mechanism, then a defrag utility can be run to repack the zpool to the point where chronological creation time = physical layout, which then substantially mitigates the seek time problem. I can't fix (a) - I don't understand the codebase well enough. Neither can I do the BP-rewrite implementation. However, if I can get BP-rewrite, I've got a prototype defragger that seems to work well (under simulation). I'm sure it could use some performance improvement, but it works reasonably well on a simulated fragmented pool. Please, Santa, can a good little boy get
Re: [zfs-discuss] A few questions
Erik, just a hypothetical what-if ... In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, so long as the original data was unmodified; and random access in dealing with the modified blocks only, right. After the original snapshot had been replicated, a second pass would be done, in order to update the clone to 100% live data. Not knowing enough about the inner workings of ZFS snapshots, I don't know why this would not be doable. (I'm biased towards mirrors for busy filesystems.) I'm supposing that a block-level snapshot is not doable -- or is it? Mark On Dec 20, 2010, at 1:27 PM, Erik Trimble wrote: On 12/20/2010 9:20 AM, Saxon, Will wrote: -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey Sent: Monday, December 20, 2010 11:46 AM To: 'Lanky Doodle'; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Lanky Doodle I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I want to committ to a file system that is 'broken' and may not be fully fixed, if at all. ZFS is not broken. It is, however, a weak spot, that resilver is very inefficient. For example: On my server, which is made up of 10krpm SATA drives, 1TB each... My drives can each sustain 1Gbit/sec sequential read/write. This means, if I needed to resilver the entire drive (in a mirror) sequentially, it would take ... 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, and disks are around 70% full, and resilver takes 12-14 hours. 
So although resilver is broken by some standards, it is bounded, and you can limit it to something which is survivable, by using mirrors instead of raidz. For most people, even using 5-disk, or 7-disk raidzN will still be fine. But you start getting unsustainable if you get up to 21-disk radiz3 for example. This argument keeps coming up on the list, but I don't see where anyone has made a good suggestion about whether this can even be 'fixed' or how it would be done. As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that's easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever and thus know a piece of them lives on all drives in the set. I don't think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right? Can someone speculate as to how you could rebuild a variable stripe width array without replaying all the available transactions? I am no filesystem engineer but I can't wrap my head around how this could be handled any better than it already is. I've read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)? Do we know if resilvers on a mirror are actually handled differently from those on a raidz? Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not 'wrong') is clearly NOT the same as with more conventional arrays. 
-Will the problem is NOT the checksum/error correction overhead. that's relatively trivial. The problem isn't really even variable width (i.e. variable number of disks one crosses) slabs. The problem boils down to this: When ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things from. That means, it resilvers the very first slab ever written, then the next oldest, etc. The problem here is that slab age has nothing to do with where that data physically resides on the actual disks. If you've used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. This means that the probability that for a slab B, written immediately after slab A, it WON'T be physically near slab A. In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver
Re: [zfs-discuss] A few questions
On 12/20/2010 11:56 AM, Mark Sandrock wrote: Erik, just a hypothetical what-if ... In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, so long as the original data was unmodified; and random access in dealing with the modified blocks only, right. After the original snapshot had been replicated, a second pass would be done, in order to update the clone to 100% live data. Not knowing enough about the inner workings of ZFS snapshots, I don't know why this would not be doable. (I'm biased towards mirrors for busy filesystems.) I'm supposing that a block-level snapshot is not doable -- or is it? Mark Snapshots on ZFS are true snapshots - they take a picture of the current state of the system. They DON'T copy any data around when created. So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time. The problem is this: Let's say I write block A, B, C, and D on a clean zpool (what kind, it doesn't matter). I now delete block C. Later on, I write block E. There is a probability (increasing dramatically as times goes on), that the on-disk layout will now look like: A, B, E, D rather than A, B, [space], D, E So, in the first case, I can do a sequential read to get A B, but then must do a seek to get D, and a seek to get E. The fragmentation problem is mainly due to file deletion, NOT to file re-writing. (though, in ZFS, being a C-O-W filesystem, re-writing generally looks like a delete-then-write process, rather than a modify process). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
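[Editor's note: Erik's A/B/C/D example can be sketched with a trivial first-fit allocator. This is a deliberate simplification; the real ZFS allocator is far more sophisticated, but the effect of deletion on layout is the same.]

```python
# First-fit allocation illustrating Erik's example: write A, B, C, D;
# delete C; write E. E reuses C's freed slot, so on-disk order no
# longer matches write order.

disk = [None, None, None, None, None]

def write(name):
    slot = disk.index(None)   # first-fit: earliest free slot
    disk[slot] = name
    return slot

for block in "ABCD":
    write(block)
disk[2] = None                # delete C
write("E")                    # E lands where C was
print(disk)                   # ['A', 'B', 'E', 'D', None]
```

A resilver walking blocks in write order (A, B, D, E) now has to seek past E to reach D, then seek back for E, exactly the pattern described above.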
Re: [zfs-discuss] A few questions
On Mon, 20 Dec 2010 11:27:41 PST Erik Trimble erik.trim...@oracle.com wrote: The problem boils down to this: When ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things from. That means, it resilvers the very first slab ever written, then the next oldest, etc. The problem here is that slab age has nothing to do with where that data physically resides on the actual disks. If you've used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. This means that the probability that for a slab B, written immediately after slab A, it WON'T be physically near slab A. In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver (as you only resilver live data, not every bit on the drive), increases the physical inefficiency for each slab. That is, seek time between cyclinders begins to dominate your slab reconstruction time. In RAIDZ, this problem is magnified by both the much larger average vdev size vs mirrors, and the necessity that all drives containing a slab information return that data before the corrected data can be written to the resilvering drive. Thus, current ZFS resilvering tends to be seek-time limited, NOT throughput limited. This is really the fault of the underlying media, not ZFS. For instance, if you have a raidZ of SSDs (where seek time is negligible, but throughput isn't), they resilver really, really fast. In fact, they resilver at the maximum write throughput rate. However, HDs are severely seek-limited, so that dominates HD resilver time. You guys may be interested in a solution I used in a totally different situation. There an identical tree data structure had to be maintained on every node of a distributed system. 
When a new node was added, it needed to be initialized with an identical copy before it could be put in operation. But this had to be done while the rest of the system was operational and there may even be updates from a central node during the `mirroring' operation. Some of these updates could completely change the tree! Starting at the root was not going to work since a subtree that was being copied may stop existing in the middle and its space reused! In a way this is a similar problem (but worse!). I needed something foolproof and simple. My algorithm started copying sequentially from the start. If N blocks were already copied when an update comes along, updates of any block with block# > N are ignored (since the sequential copy would get to them eventually). Updates of any block# <= N were queued up (further update of the same block would overwrite the old update, to reduce work). Periodically they would be flushed out to the new node. This was paced so as to not affect the normal operation much. I should think a variation would work for active filesystems. You sequentially read some amount of data from all the disks from which data for the new disk is to be prepared and write it out sequentially. Each time read enough data so that reading time dominates any seek time. Handle concurrent updates as above. If you dedicate N% of time to resilvering, the total time to complete resilver will be 100/N times the sequential read time of the whole disk. (For example, 1TB disk, 100MBps io speed, 25% for resilver gives under 12 hours.) How much worse this gets depends on the amount of updates during resilvering. At the time of resilvering your FS is more likely to be near full than near empty so I wouldn't worry about optimizing the mostly empty FS case. Bakul ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
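[Editor's note: Bakul's scheme might be sketched like this, using in-memory lists as a stand-in for disks. Updates beyond the copy frontier are ignored because the sequential pass will reach them; updates behind the frontier are queued, with newer updates overwriting older ones, and flushed at the end (the original paces periodic flushes instead).]

```python
# Sequential block copy that stays correct under concurrent updates.

def mirror(source, updates_by_step):
    """source: list of blocks. updates_by_step[i]: dict of
    {block#: new value} arriving while block i is being copied."""
    n = len(source)
    copy = [None] * n
    pending = {}                          # queued redo work
    for i in range(n):
        for blk, val in updates_by_step.get(i, {}).items():
            source[blk] = val             # live update hits the source
            if blk <= i:                  # already copied: queue a redo
                pending[blk] = val        # newer overwrites older
            # blk > i: ignore, the sequential pass will get it
        copy[i] = source[i]               # the sequential copy step
    for blk, val in pending.items():      # final flush
        copy[blk] = val
    return copy

src = ["a0", "b0", "c0", "d0"]
# While copying block 2, blocks 0 (behind) and 3 (ahead) are updated.
result = mirror(src, {2: {0: "a1", 3: "d1"}})
print(result)  # ['a1', 'b0', 'c0', 'd1'] - both updates land correctly
```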
Re: [zfs-discuss] A few questions
On Dec 20, 2010, at 2:05 PM, Erik Trimble wrote: On 12/20/2010 11:56 AM, Mark Sandrock wrote: Erik, just a hypothetical what-if ... In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, so long as the original data was unmodified; and random access in dealing with the modified blocks only, right. After the original snapshot had been replicated, a second pass would be done, in order to update the clone to 100% live data. Not knowing enough about the inner workings of ZFS snapshots, I don't know why this would not be doable. (I'm biased towards mirrors for busy filesystems.) I'm supposing that a block-level snapshot is not doable -- or is it? Mark Snapshots on ZFS are true snapshots - they take a picture of the current state of the system. They DON'T copy any data around when created. So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time. But if one does a raw (block) copy, there isn't any fragmentation -- except for the COW updates. If there were no updates to the snapshot, then it becomes a 100% sequential block copy operation. But even with COW updates, presumably the large majority of the copy would still be sequential i/o. Maybe for the 2nd pass, the filesystem would have to be locked so that the operation could ever complete, but if this is fairly short in relation to the overall resilvering time, then it could still be a win in many cases. I'm probably not explaining it well, and may be way off, but it seemed an interesting notion. Mark The problem is this: Let's say I write block A, B, C, and D on a clean zpool (what kind, it doesn't matter). I now delete block C. Later on, I write block E. 
There is a probability (increasing dramatically as times goes on), that the on-disk layout will now look like: A, B, E, D rather than A, B, [space], D, E So, in the first case, I can do a sequential read to get A B, but then must do a seek to get D, and a seek to get E. The fragmentation problem is mainly due to file deletion, NOT to file re-writing. (though, in ZFS, being a C-O-W filesystem, re-writing generally looks like a delete-then-write process, rather than a modify process). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
From: Erik Trimble [mailto:erik.trim...@oracle.com] We can either (a) change how ZFS does resilvering or (b) repack the zpool layouts to avoid the problem in the first place. In case (a), my vote would be to seriously increase the number of in-flight resilver slabs, AND allow for out-of-time-order slab resilvering. Question for any clueful person: Suppose you have a mirror to resilver, made of disk1 and disk2, where disk2 failed and is resilvering. If you have an algorithm to create a list of all the used blocks of disk1 in disk order, then you're able to resilver the mirror extremely fast, because all the reads will be sequential in nature, plus you get to skip past all the unused space. Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3 is resilvering). You find some way of ordering all the used blocks of disk1... Which means disk1 will be able to read in optimal order and speed. Does that necessarily imply disk2 will also work well? Does the on-disk order of blocks of disk1 necessarily match the order of blocks on disk2? If there is no correlation between on-disk order of blocks for different disks within the same vdev, then all hope is lost; it's essentially impossible to optimize the resilver/scrub order unless the on-disk order of multiple disks is highly correlated or equal by definition. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Erik Trimble In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time. I think Mark was suggesting something like dd copy device 1 onto device 2, in order to guarantee a first-pass sequential resilver. And my response would be: Creative thinking and suggestions are always a good thing. In fact, the above suggestion is already faster than the present-day solution for what I'm calling typical usage, but there are an awful lot of use cases where the dd solution would be worse... Such as a pool which is largely sequential already, or largely empty, or made of high IOPS devices such as SSD. However, there is a desire to avoid resilvering unused blocks, so I hope a better solution is possible... The fundamental requirement for a better optimized solution would be a way to resilver according to disk ordering... And it's just a question for somebody that actually knows the answer ... How terrible is the idea of figuring out the on-disk order? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote: If there is no correlation between on-disk order of blocks for different disks within the same vdev, then all hope is lost; it's essentially impossible to optimize the resilver/scrub order unless the on-disk order of multiple disks is highly correlated or equal by definition. Very little is impossible. Drives have been optimally ordering seeks for 35+ years. I'm guessing that the trick (difficult, but not impossible) is how to solve a travelling salesman route pathing problem where you have billions or trillions of transactions, and do it fast enough that it is worth doing any extra computation besides just giving the device 32+ queued commands at a time that align with the elements of each ordered transaction ID. Add to that all the complexity of unwinding the error recovery in the event that you fail checksum validation on transaction N-1 after moving past transaction N, which would be a required capability if you wanted to queue more than a single transaction for verification at a time. Oh, and do all of the above without noticeably affecting the throughput of the applications already running on the system. --eric -- Eric D. Mudama edmud...@mail.bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
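[Editor's note: Eric's point that drives have reordered seeks for decades is easy to illustrate. Even a one-pass elevator (SCAN) ordering of a small queue beats FIFO servicing on total head travel. Toy LBAs, not a real scheduler.]

```python
# Total head travel for FIFO vs elevator (SCAN) servicing of a queue.

def total_travel(start, lbas):
    """Sum of absolute head movements visiting lbas in the given order."""
    pos, travel = start, 0
    for lba in lbas:
        travel += abs(lba - pos)
        pos = lba
    return travel

queue = [95, 3, 60, 12, 88, 40]
fifo = total_travel(0, queue)            # service in arrival order
scan = total_travel(0, sorted(queue))    # one sweep across the platter
print(fifo, scan)                        # 416 95
```

Scaling this from a 32-deep command queue to billions of resilver transactions, while keeping error recovery unwindable, is exactly the hard part Eric describes.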
Re: [zfs-discuss] A few questions
It may well be that different methods are optimal for different use cases. Mechanical disk vs. SSD; mirrored vs. raidz[123]; sparse vs. populated; etc. It would be interesting to read more in this area, if papers are available. I'll have to take a look. ... Or does someone have pointers? Mark On Dec 20, 2010, at 6:28 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Erik Trimble In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential. So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time. I think Mark was suggesting something like dd copy device 1 onto device 2, in order to guarantee a first-pass sequential resilver. And my response would be: creative thinking and suggestions are always a good thing. In fact, the above suggestion is already faster than the present-day solution for what I'm calling typical usage, but there are an awful lot of use cases where the dd solution would be worse... such as a pool which is largely sequential already, or largely empty, or made of high-IOPS devices such as SSDs. However, there is a desire to avoid resilvering unused blocks, so I hope a better solution is possible... The fundamental requirement for a better-optimized solution would be a way to resilver according to disk ordering... And it's just a question for somebody that actually knows the answer... How terrible is the idea of figuring out the on-disk order?
Re: [zfs-discuss] A few questions
On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com wrote: On 20/12/2010 13:59, Richard Elling wrote: On Dec 20, 2010, at 2:42 AM, Phil Harman phil.har...@gmail.com wrote: Why does resilvering take so long in raidz anyway? Because it's broken. There were some changes a while back that made it more broken. broken is the wrong term here. It functions as designed and correctly resilvers devices. Disagreeing with the design is quite different from proving a defect. It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread. If you only have a few slow drives, you don't have performance. Like trying to win the Indianapolis 500 with a tricycle... I think we can agree that ZFS currently doesn't play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions. ... and those conditions are also a strength. For example, most file systems are nowhere near full. With ZFS you only resilver data. For those who recall the resilver throttles in SVM or VXVM, you will appreciate not having to resilver non-data. For a long time at Sun, the rule was correctness is a constraint, performance is a goal. However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so also, a slow but correct answer can also be wrong). Then one brave soul at Sun once ventured that if Linux is faster, it's a Solaris bug! and to his surprise, the idea caught on. I later went on to tell people that ZFS delivered RAID where I = inexpensive, so I'm just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it's SATA (and I'm not so sure). slow doesn't begin with an i :-) There has been a lot of discussion, anecdotes and some data on this list. slow because I use devices with poor random write(!) 
performance is very different than broken. Again, context is everything. For example, if someone was building a business critical NAS appliance from consumer grade parts, I'd be the first to say are you nuts?! Unfortunately, the math does not support your position... The resilver doesn't do a single pass of the drives, but uses a smarter temporal algorithm based on metadata. A design that only does a single pass does not handle the temporal changes. Many RAID implementations use a mix of spatial and temporal resilvering and suffer with that design decision. Actually, it's easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. Please define current. There are many releases of ZFS, and many improvements have been made over time. What has not improved is the random write performance of consumer-grade HDDs. I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. I know of no RAID implementation that bounds resilver times for HDDs. I believe it is not possible. OTOH, whether a resilver takes 10 seconds or 10 hours makes little difference in data availability. Indeed, this is why we often throttle resilvering activity. See previous discussions on this forum regarding the dueling RFEs. I don't share your disbelief or little difference analysis. If it is true that no current implementation succeeds, isn't that a great opportunity to change the rules? Wasn't resilver time vs. availability a major factor in Adam Leventhal's paper introducing the need for RAIDZ3? No, it wasn't. There are two failure modes we can model given the data provided by disk vendors: 1. failures by time (MTBF) 2. 
failures by bits read (UER). Over time, the MTBF has improved, but the failure rate by bits read has not. Just a few years ago enterprise class HDDs had an MTBF of around 1 million hours. Today, they are in the range of 1.6 million hours. Just looking at the size of the numbers, the probability that a drive will fail in one hour is on the order of 1e-6. By contrast, the failure rate by bits read has not improved much. Consumer class HDDs are usually spec'ed at 1 error per 1e14 bits read. To put this in perspective, a 2TB disk has around 1.6e13 bits. Or, the probability of an unrecoverable read if you read every bit on a 2TB disk is growing to well above 10%. Some of the better enterprise class HDDs are rated two orders of magnitude better, but the only way to get much better is to use more bits for ECC... hence the move towards 4KB sectors. In other words, the probability of losing data
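The unrecoverable-read arithmetic above can be checked with a quick calculation (same figures as the post; treating bit errors as independent is my assumption):

```shell
# P(at least one unrecoverable read error over a full-disk read) for a 2TB
# consumer drive spec'ed at 1 error per 1e14 bits, assuming independent errors.
awk 'BEGIN {
    bits = 2e12 * 8          # 2 TB in bits, about 1.6e13
    uer  = 1e-14             # one unrecoverable error per 1e14 bits read
    printf "%.3f\n", 1 - exp(-bits * uer)
}'
```

This prints roughly 0.148, i.e. close to a 15% chance per full read of a 2TB consumer drive, consistent with the "well above 10%" figure.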
Re: [zfs-discuss] A few questions
On Dec 20, 2010, at 4:19 PM, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: Erik Trimble [mailto:erik.trim...@oracle.com] We can either (a) change how ZFS does resilvering or (b) repack the zpool layouts to avoid the problem in the first place. In case (a), my vote would be to seriously increase the number of in-flight resilver slabs, AND allow for out-of-time-order slab resilvering. Question for any clueful person: Suppose you have a mirror to resilver, made of disk1 and disk2, where disk2 failed and is resilvering. If you have an algorithm to create a list of all the used blocks of disk1 in disk order, then you're able to resilver the mirror extremely fast, because all the reads will be sequential in nature, plus you get to skip past all the unused space. Sounds like the definition of random access :-) Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3 is resilvering). You find some way of ordering all the used blocks of disk1... Which means disk1 will be able to read in optimal order and speed. Sounds like prefetching :-) Does that necessarily imply disk2 will also work well? Does the on-disk order of blocks of disk1 necessarily match the order of blocks on disk2? This is an interesting question, that will become more interesting as the physical sector size gets bigger... -- richard
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Alexander Lesle On December 17, 2010, 17:48, Lanky Doodle wrote in [1]: By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2 disk mirrors - I am thinking of traditional RAID1 here. Or do you mean 1 massive mirror with all 14 disks? Edward means a set of two-way mirrors. Correct. mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 ... You would normally call this a stripe of mirrors. Even though the ZFS concept of striping is more advanced than traditional raid striping... we still call this a ZFS stripe for lack of any other term. A ZFS stripe has all the good characteristics of raid concatenation and striping, without any of the bad characteristics. It can utilize bandwidth on multiple disks when it wants to, or use a single device when it wants to for small blocks. It can dynamically add randomly sized devices, and it can be done one at a time. With Solaris 11 Express, Oracle announced that you can set the root pool to a mirror during installation. At the moment I'm trying it out in a VM, but I didn't find this option. :-( Actually, even in Solaris 10, I habitually install the root filesystem onto a ZFS mirror. You just select 2 disks, and it's automatically a mirror. zpool create lankyserver mirror vdev1 vdev2 mirror vdev3 vdev4 When you need more space you can add a bundle of two disks to your lankyserver; each pair should have matching capacity. zpool add lankyserver mirror vdev5 vdev6 mirror vdev7 vdev8 ... Correct.
Re: [zfs-discuss] A few questions
From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us] Sent: Friday, December 17, 2010 9:16 PM While I agree that smaller vdevs are more reliable, I find your statement - that the failure is more likely to be in the same vdev if you have only 2 vdevs - rather useless. The probability of vdev failure does not have anything to do with the number of vdevs. However, the probability of vdev failure increases tremendously if there is only one vdev and there is a second disk failure. I'm not sure you got what I meant. I'll rephrase and see if it's clearer: Correct, the number of vdevs doesn't affect the probability of a failure in a specific vdev, but the number of disks in a vdev does. Lanky said he was considering 2x 7-disk raidz versus 3x 5-disk raidz. So when I said he's more likely to have a 2nd disk fail in the same vdev if he only has 2 vdevs... that was meant to be taken in context, not as a generalization about pools in general. Consider a single disk. Let P be the probability of the disk failing within 1 day. If you have 5 disks in a raidz vdev, and one fails, there are 4 remaining. If the resilver will last 8 days, then the probability of a 2nd disk failing is 4*8*P = 32P. If you have 7 disks in a raidz vdev, and one fails, there are 6 remaining. If a resilver will last 12 days, then the probability of a 2nd disk failing is 6*12*P = 72P.
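The back-of-the-envelope numbers above can be reproduced directly (exposure = surviving disks x resilver days x P, using the figures from the post):

```shell
# Second-failure exposure during resilver, per the post's model:
# (number of surviving disks) * (resilver duration in days) * P.
awk 'BEGIN {
    printf "5-disk raidz: %dP\n", 4 * 8    # 4 survivors, ~8-day resilver
    printf "7-disk raidz: %dP\n", 6 * 12   # 6 survivors, ~12-day resilver
}'
```

Both layouts scale with the same per-disk daily failure probability P, so only the 32-vs-72 ratio matters when comparing them.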
Re: [zfs-discuss] A few questions
On the subject of where to install ZFS, I was planning to use either Compact Flash or USB drive (both of which would be mounted internally); using up 2 of the drive bays for a mirrored install is possibly a waste of physical space, considering it's a) a home media server and b) the config can be backed up to a protected ZFS pool - if the CF or USB drive failed I would just replace and restore the config. Can you have an equivalent of a global hot spare in ZFS? If I did go down the mirror route (mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 etc) all the way up to 14 disks that would leave the 15th disk spare. Now this is getting really complex, but can you have server failover in ZFS, much like DFS-R in Windows - you point clients to a clustered ZFS namespace so if a complete server failed nothing is interrupted? I am still undecided as to mirror vs RAID-Z. I am going to be ripping uncompressed Blu-Rays so space is vital. I use RAID-DP in NetApp kit at work and I'm guessing RAID-Z2 is the equivalent? I have 5TB space at the moment so going to the expense of mirroring for only 2TB extra doesn't seem much of a payoff. Maybe a compromise of 2x 7-disk RAID-Z1 with a global hot spare is the way to go? Put it this way, I currently use Windows Home Server, which has no true disk failure protection, so any of ZFS's redundancy schemes is going to be a step up; is there an equivalent system in ZFS where if 1 disk fails you only lose that disk's data, like unRAID? Thanks everyone for your input so far :)
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Lanky Doodle On the subject of where to install ZFS, I was planning to use either Compact Flash or USB drive (both of which would be mounted internally); using up 2 of the drive bays for a mirrored install is possibly a waste of physical space, considering it's a) a home media server and b) the config can be backed up to a protected ZFS pool - if the CF or USB drive failed I would just replace and restore the config. All of the above is correct. One thing you should keep in mind however: if your unmirrored rpool (USB fob) fails... although yes, you can restore, assuming you have been sufficiently backing it up... you will suffer an ungraceful halt. Maybe you can live with that. Can you have an equivalent of a global hot spare in ZFS? If I did go down the mirror route (mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 etc) all the way up to 14 disks that would leave the 15th disk spare. Check the zpool man page for spare, but I know you can have spares assigned to a vdev, and I'm pretty sure you can assign any given spare to multiple vdevs, effectively making it a global hot spare. So the answer is yes. Now this is getting really complex, but can you have server failover in ZFS, much like DFS-R in Windows - you point clients to a clustered ZFS namespace so if a complete server failed nothing is interrupted? If that's somehow possible, it's something I don't know. I don't believe you can do that with ZFS. I am still undecided as to mirror vs RAID-Z. I am going to be ripping uncompressed Blu-Rays so space is vital. For both read and write, raidz works extremely well for sequential operations. It sounds like you're probably going to be doing mostly sequential operations, so raidz should perform very well for you. A lot of people will avoid raidzN because it doesn't perform very well for random reads, so they opt for mirrors instead. But in your case, not so much. 
In your case, the only reason I can think to avoid raidz would be if you're worrying about resilver times. That's a valid concern, but you can choose any number of disks per vdev... you could make raidz vdevs of 3 disks each... it's just a compromise between a mirror and a larger raidz vdev. I use RAID-DP in NetApp kit at work and I'm guessing RAID-Z2 is the equivalent? Yup, RAID-DP and raidz2 are conceptually pretty much the same. Put it this way, I currently use Windows Home Server, which has no true disk failure protection, so any of ZFS's redundancy schemes is going to be a step up; is there an equivalent system in ZFS where if 1 disk fails you only lose that disk's data, like unRAID? No. Not unless you make that many separate volumes.
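The hot-spare answer above can be sketched like this (hypothetical pool and device names; the zpool man page on the target release is authoritative):

```shell
# Add one hot spare to the pool; ZFS pulls it in for whichever vdev
# degrades, which is what makes it effectively "global" within the pool.
zpool add mypool spare c0t15d0
# Verify the spare shows up under the pool's "spares" section:
zpool status mypool
```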
Re: [zfs-discuss] A few questions
Thanks for all the replies. The bit about combining zpools came from this command on the southbrain tutorial; zpool create mail \ mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \ mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0 I admit I was getting confused between zpools and vdevs, thinking in the above command that each mirror was a zpool and not a vdev. Just so I'm correct, a normal command would look like zpool create mypool raidz disk1 disk2 disk3 disk4 disk5 which would result in a zpool called mypool, which is made up of a 5-disk raidz vdev? This means that zpools don't actually 'contain' physical devices, which is what I originally thought.
Re: [zfs-discuss] A few questions
On 12/17/2010 2:12 AM, Lanky Doodle wrote: Thanks for all the replies. The bit about combining zpools came from this command on the southbrain tutorial; zpool create mail \ mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \ mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0 I admit I was getting confused between zpools and vdevs, thinking in the above command that each mirror was a zpool and not a vdev. Just so I'm correct, a normal command would look like zpool create mypool raidz disk1 disk2 disk3 disk4 disk5 which would result in a zpool called mypool, which is made up of a 5-disk raidz vdev? This means that zpools don't actually 'contain' physical devices, which is what I originally thought. You are correct that the above will have a single vdev of 5 disks. Here's a shorthand note: a zpool is made of 1 or more vdevs. Each vdev can be a raidz, mirror, or single device (either a file or disk). So, you *can* have a zpool which has solely physical drives: e.g. zpool create tank disk1 disk2 disk3 will create a pool with 3 disks, with data being striped across the devices as desired. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] A few questions
OK cool. One last question. Reading the Admin Guide for ZFS, it says: [i]A more complex conceptual RAID-Z configuration would look similar to the following: raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0 c7t0d0 raidz c8t0d0 c9t0d0 c10t0d0 c11t0d0 c12t0d0 c13t0d0 c14t0d0 If you are creating a RAID-Z configuration with many disks, as in this example, a RAID-Z configuration with 14 disks is better split into two 7-disk groupings. RAID-Z configurations with single-digit groupings of disks should perform better[/i] This is relevant as my final setup was planned to be 15 disks, so only one more than the example. So, do I drop one disk and go with two 7-drive vdevs, or stick with three 5-drive vdevs? Also, does anyone have anything to add re the security of CIFS when used with Windows clients? Thanks again guys, and gals...
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Lanky Doodle This is relevant as my final setup was planned to be 15 disks, so only one more than the example. So, do I drop one disk and go with two 7-drive vdevs, or stick with three 5-drive vdevs? Both ways are fine. Consider the balance between redundancy and drive space. Also, in the event of a resilver, the 3x5 raidz will be faster. In rough numbers, suppose you have 1TB drives, 70% full. Then your resilver might be 8 days instead of 12 days. That's important when you consider the fact that during that window, you have degraded redundancy. Another failed disk in the same vdev would destroy the entire pool. Also, if a 2nd disk fails during resilver, it's more likely to be in the same vdev if you have only 2 vdevs. Your odds are better with smaller vdevs, both because the resilver completes faster, and because the probability of a 2nd failure in the same vdev is smaller. For both performance and reliability reasons, I recommend nothing except single-drive mirrors, except in extreme data-is-not-important situations. At least, that's my recommendation until someday, when the resilver efficiency is improved, or fixed.
Re: [zfs-discuss] A few questions
Thanks! By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2 disk mirrors - I am thinking of traditional RAID1 here. Or do you mean 1 massive mirror with all 14 disks? This is always a tough one for me. I too prefer RAID1 where redundancy is king, but the trade-off for me would be 5TB of 'wasted' space - a total of 7TB in mirrors versus 12TB in 3x RAIDZ. Decisions, decisions.
Re: [zfs-discuss] A few questions
You should take a look at the ZFS best practices guide for RAIDZ and mirrored configuration recommendations: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide It's easy for me to say because I don't have to buy storage, but mirrored storage pools are currently more flexible, provide good performance, and replacing/resilvering data on disks is faster. Thanks, Cindy On 12/17/10 09:48, Lanky Doodle wrote: Thanks! By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2 disk mirrors - I am thinking of traditional RAID1 here. Or do you mean 1 massive mirror with all 14 disks? This is always a tough one for me. I too prefer RAID1 where redundancy is king, but the trade-off for me would be 5TB of 'wasted' space - a total of 7TB in mirrors versus 12TB in 3x RAIDZ. Decisions, decisions.
Re: [zfs-discuss] A few questions
On December 17, 2010, 17:48, Lanky Doodle wrote in [1]: By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2 disk mirrors - I am thinking of traditional RAID1 here. Or do you mean 1 massive mirror with all 14 disks? Edward means a set of two-way mirrors. Do you remember what he wrote: Also, in the event of a resilver, the 3x5 raidz will be faster. In rough numbers, suppose you have 1TB drives, 70% full. Then your resilver might be 8 days instead of 12 days. That's important when you consider the fact that during that window, you have degraded redundancy. Another failed disk in the same vdev would destroy the entire pool. Also if a 2nd disk fails during resilver, it's more likely to be in the same vdev, if you have only 2 vdevs. Your odds are better with smaller vdevs, both because the resilver completes faster, and the probability of a 2nd failure in the same vdev is smaller. And that scenario is a horrible prospect: while the resilver is running you have to hope that nothing else fails. In his example that's between 192 and 288 hours - a very, very long time. And be aware that a disk will break at some point. This is always a tough one for me. I too prefer RAID1 where redundancy is king, but the trade-off for me would be 5GB of 'wasted' space - total of 7GB in mirror and 12GB in 3x RAIDZ. You lose the most space when you make a pool of mirrors, BUT the I/O is much faster, it's more secure, and you still have all the features of ZFS too. http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance Decisions, decisions. My suggestion: make a two-way mirror of small disks or SSDs for the OS. This is not easy to do after installation; you have to look for a howto. Sorry, I can't find the link at the moment. With Solaris 11 Express, Oracle announced that you can set the root pool to a mirror during installation. At the moment I'm trying it out in a VM, but I didn't find this option. 
:-( zpool create lankyserver mirror vdev1 vdev2 mirror vdev3 vdev4 When you need more space you can add a bundle of two disks to your lankyserver; each pair should have matching capacity. zpool add lankyserver mirror vdev5 vdev6 mirror vdev7 vdev8 ... Consider that it's a good decision to plan for one spare disk. You can use the zpool add command to add a spare disk at a later time. http://docs.sun.com/app/docs/doc/819-2240/zpool-1m?a=view When you build a raidz pool, every disk in the pool must have at least as much space as the smallest disk. A raidz vdev only uses as much space per disk as the smallest disk has; the rest of any bigger disk is wasted. In a mirrored pool only each pair must match, so you can use one pair of 1 TB disks and one pair of 2 TB disks in the same pool. In that case your spare disk _must have_ the biggest capacity. Read this for your decision: http://constantin.glez.de/blog/2010/01/home-server-raid-greed-and-why-mirroring-still-best -- Best Regards Alexander Dezember, 17 2010 [1] mid:382802084.111292604519623.javamail.tweb...@sf-app1
Re: [zfs-discuss] A few questions
On Fri, 17 Dec 2010, Edward Ned Harvey wrote: Also if a 2nd disk fails during resilver, it's more likely to be in the same vdev, if you have only 2 vdev's. Your odds are better with smaller vdev's, both because the resilver completes faster, and the probability of a 2nd failure in the same vdev is smaller. While I agree that smaller vdevs are more reliable, I find your statement - that the failure is more likely to be in the same vdev if you have only 2 vdevs - rather useless. The probability of vdev failure does not have anything to do with the number of vdevs. However, the probability of vdev failure increases tremendously if there is only one vdev and there is a second disk failure. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] A few questions
Also, at present I have 5x 1TB drives to use in my home server so I plan to create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in Jan and another 5 in Feb/March so that I can avoid all disks being from the same batch). I did plan on creating separate zpools with each set of 5 drives; drives 1-5 volume0 zpool drives 6-10 volume1 zpool drives 11-15 volume2 zpool Although this seems a good idea to start with, there are issues with it performance-wise. If you fill up VDEV0 (drives 1-5) and then attach VDEV1 (drives 6-10), new writes will still be initially striped across the two VDEVs, leading to a performance impact on writes. There is currently no way of balancing VDEV fills without manually backing up/restoring, or copying the data from one place to another within the pool and then removing the original data. so that I can sustain 3 simultaneous drive failures, as long as it's one drive from each set. However I think this will mean each zpool will have independent shares which I don't want. I have used this guide - http://southbrain.com/south/tutorials/zpools.html - which says you can combine zpools into a 'parent' zpool, but can this be done in my scenario (staggered) as it looks like the child zpools have to be created before the parent is done. So basically I'd need to be able to; For the scheme to work as above, start with something like # zpool create mypool raidz1 c0t1d0 c0t2d0 c0t3d0 c2t4d0 c2t5d0 Later, you'll add the new vdev # zpool add mypool raidz1 c0t6d0 c0t7d0 c0t8d0 c2t9d0 c2t10d0 This will work as described above. However, I would do this somewhat differently. Start off with, say, 6 1TB drives in RAIDz2 and set autoexpand=on on the pool (remember compression=on on the zfs pool fs too). 
# zpool create mypool raidz2 c0t1d0 c0t2d0 c0t3d0 c2t4d0 c2t5d0 c2t6d0 # zpool set autoexpand=on mypool # zfs set compression=on mypool Compression is lzjb, and it won't compress much for audio or video, but then, it won't hurt much either. When this starts to get somewhat close to full, get new, larger drives and replace the older 1TB drives one by one. Once all are replaced by larger, say 1.5TB drives, whoops, your array is larger. This will scale better performance-wise and you won't need that many controllers. Also, with RAIDz2, you can lose any two drives. Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
Re: [zfs-discuss] A few questions
On Thu, Dec 16, 2010 at 12:59 AM, Lanky Doodle lanky_doo...@hotmail.com wrote: I have been playing with ZFS for a few days now on a test PC, and I plan to use it for my home media server after being very impressed! Works great for that. Have a similar setup at home, using FreeBSD. Also, at present I have 5x 1TB drives to use in my home server so I plan to create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in Jan and another 5 in Feb/March so that I can avoid all disks being from the same batch). I did plan on creating separate zpools with each set of 5 drives; No no no. Create 1 pool. Create the pool initially with a single 5-drive raidz vdev. Later, add the next five drives to the system, and create a new raidz vdev *in the same pool*. Voila. You now have the equivalent of a RAID50, as ZFS will stripe writes to both vdevs, increasing the overall size *and* speed of the pool. Later, add the next five drives to the system, and create a new raidz vdev in the same pool. Voila. You now have a pool with 3 vdevs, with reads/writes being striped across all three. You can still lose 3 drives (1 per vdev) before losing the pool. The commands to do this are along the lines of: # zpool create mypool raidz disk1 disk2 disk3 disk4 disk5 # zpool add mypool raidz disk6 disk7 disk8 disk9 disk10 # zpool add mypool raidz disk11 disk12 disk13 disk14 disk15 Creating 1 pool gives you the best performance and the most flexibility. Use separate filesystems on top of that pool if you want to tweak all the different properties. Going with 1 pool also increases your chances for dedupe, as dedupe is done at the pool level. -- Freddie Cash fjwc...@gmail.com
Re: [zfs-discuss] A few questions
Hi Lanky, Other follow-up posters have given you good advice. I don't see where you are getting the idea that you can combine pools with pools. You can't do this, and I don't see that the southbrain tutorial illustrates this either. All of his examples for creating redundant pools are reasonable. As others have said, you can create a RAIDZ pool with one vdev of say 5 disks, and then later add another 5 disks, and so on. Thanks, Cindy On 12/16/10 01:59, Lanky Doodle wrote: Hiya, I have been playing with ZFS for a few days now on a test PC, and I plan to use it for my home media server after being very impressed! I've got the basics of creating zpools and zfs filesystems with compression and dedup etc, but I'm wondering if there's a better way to handle security. I'm using Windows 7 clients by the way. I have used this 'guide' to do the permissions - http://www.slepicka.net/?p=37 Also, at present I have 5x 1TB drives to use in my home server so I plan to create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in Jan and another 5 in Feb/March so that I can avoid all disks being from the same batch). I did plan on creating separate zpools with each set of 5 drives; drives 1-5 volume0 zpool drives 6-10 volume1 zpool drives 11-15 volume2 zpool so that I can sustain 3 simultaneous drive failures, as long as it's one drive from each set. However I think this will mean each zpool will have independent shares which I don't want. I have used this guide - http://southbrain.com/south/tutorials/zpools.html - which says you can combine zpools into a 'parent' zpool, but can this be done in my scenario (staggered) as it looks like the child zpools have to be created before the parent is done. 
> So basically I'd need to be able to:
> Create volume0 zpool now
> Create volume1 zpool in Jan, then combine volume0 and volume1 into a parent zpool
> Create volume2 in Feb/March and add to the parent zpool
> I know I could just add each disk to the volume0 zpool, but I've read it's a bugger to do and that creating separate zpools with new disks is a much better way to go.
> I think that's it for now. Sorry for the mammoth first post! Thanks
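The staggered timeline above maps directly onto a single pool, with no "parent pool" needed — a sketch using the poster's own volume naming, with hypothetical device names:

```shell
# Now: one pool, one 5-drive raidz vdev (instead of a "volume0" pool)
zpool create tank raidz c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0

# January: grow the *same* pool with a second raidz vdev
# (instead of a separate "volume1" pool)
zpool add tank raidz c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0

# Feb/March: third vdev; the pool now stripes across all three,
# and existing shares simply see more free space
zpool add tank raidz c0t11d0 c0t12d0 c0t13d0 c0t14d0 c0t15d0
```

The pool can survive one failed drive per raidz1 vdev, matching the three-pool plan's failure tolerance, while keeping a single namespace for the shares.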
Re: [zfs-discuss] A few questions
Thanks for the reply.

In that case, wouldn't it be better to, as you say, start with a 6-drive Z2, then just keep adding drives until the case is full, for a single Z2 zpool? Or even Z3, if that's available now?

I have an 11x 5.25" bay case, with 3x 5-in-3 hot-swap caddies giving me 15 drive bays. Hence the plan to start with 5, then 10, then all the way to 15. This seems a more logical (and cheaper) solution than replacing with bigger drives as they come to market.

-- This message posted from opensolaris.org
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Lanky Doodle

> In that case, wouldn't it be better to, as you say, start with a 6-drive Z2, then just keep adding drives until the case is full, for a single Z2 zpool?

It doesn't work that way. You can create a vdev, and later you can add more vdevs. So you can create a raidz now, and later you can add another raidz. But you cannot create a raidz now and later just add disks onesy-twosy to increase its size incrementally.

> Or even Z3, if that's available now?

Raidz3 is available now. There is only one thing to be aware of: ZFS resilvering is very inefficient for typical usage scenarios. The time to resilver divides by the number of vdevs in the pool (meaning 10 mirrors will resilver 10x faster than an equivalently sized raidzN), and it grows with the number of disks within each vdev. Due to this inefficiency, we're talking about 12 hours (on my server) to resilver a 1TB disk which is around 70% used. This would have been ~3 weeks if I had one big raidz3. So it matters. Your multiple raidz vdevs of 5-6 disks each are a reasonable compromise.

> I have an 11x 5.25" bay case, with 3x 5-in-3 hot-swap caddies giving me 15 drive bays. Hence the plan to start with 5, then 10, then all the way to 15. This seems a more logical (and cheaper) solution than replacing with bigger drives as they come to market.

'Course, you can also replace with bigger drives as they come to market, too. ;-) If you've got 5 disks in a raidz: first scrub it. Then replace one disk with a larger disk and wait for the resilver. Replace each disk, one by one, with larger disks, and when you finish the last one, your pool becomes larger. (Depending on your defaults, manual intervention may be required to make the pool autoexpand once all the devices have been upgraded.)
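The disk-by-disk upgrade path just described can be sketched as follows. Pool and device names are hypothetical, and the autoexpand property assumes a reasonably recent zpool version (on older releases, an export/import of the pool after the last replacement serves the same purpose):

```shell
# Check pool health before touching anything
zpool scrub mypool
zpool status mypool      # confirm the scrub finished clean

# Let the pool grow automatically once every device is larger
zpool set autoexpand=on mypool

# Swap one 1TB disk for a larger one; wait for resilver to complete
zpool replace mypool c0t1d0 c1t1d0
zpool status mypool      # watch resilver progress before the next swap
# ...repeat replace-and-wait for each remaining disk in the vdev...

# If autoexpand was off during the replacements, expand a device
# explicitly after the fact:
zpool online -e mypool c1t1d0
```

The key point is the one-at-a-time sequencing: replacing a second disk before the first resilver completes leaves a raidz1 vdev with no redundancy.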