Re: [zfs-discuss] A few questions

2011-01-09 Thread Pasi Kärkkäinen
On Sat, Jan 08, 2011 at 12:33:50PM -0500, Edward Ned Harvey wrote:
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Garrett D'Amore
  
  When you purchase NexentaStor from a top-tier Nexenta Hardware Partner,
  you get a product that has been through a rigorous qualification process
 
 How do I do this, exactly?  I am serious.  Before too long, I'm going to
 need another server, and I would very seriously consider reprovisioning my
 unstable Dell Solaris server to become a linux or some other stable machine.
 The role it's currently fulfilling is the backup server, which basically
 does nothing except zfs receive from the primary Sun solaris 10u9 file
 server.  Since the role is just for backups, it's a perfect opportunity for
 experimentation, hence the Dell hardware with solaris.  I'd be happy to put
 some other configuration in there experimentally instead ... say ...
 nexenta.  Assuming it will be just as good at zfs receive from the primary
 server.
 
 Is there some specific hardware configuration you guys sell?  Or recommend?
 How about a Dell R510/R610/R710?  Buy the hardware separately and buy
 NexentaStor as just a software product?  Or buy a somehow more certified
 hardware & software bundle together?
 
 If I do encounter a bug, where the only known fact is that the system keeps
 crashing intermittently on an approximately weekly basis, and there is
 absolutely no clue what's wrong in hardware or software...  How do you guys
 handle it?
 
 If you'd like to follow up offlist, that's fine.  Then just email me at the
 email address:  nexenta at nedharvey.com
 (I use disposable email addresses on mailing lists like this, so at any
 random unknown time, I'll destroy my present alias and start using a new
 one.)
 

Hey,

Other OSes have had problems with the Broadcom NICs as well.

See, for example, this RHEL5 bug: 
https://bugzilla.redhat.com/show_bug.cgi?id=520888
(host crashing, probably due to MSI-X IRQs with the bnx2 NIC).

And VMware vSphere ESX/ESXi 4.1 crashing with bnx2x: 
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1029368
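
For what it's worth, the workaround usually suggested for that class of report
is to turn MSI/MSI-X off for the Broadcom driver.  A minimal sketch for a
RHEL 5 box, assuming the onboard ports really do use the bnx2 driver (check
with "modinfo bnx2" and "ethtool -i eth0" first):

    # /etc/modprobe.conf -- tell bnx2 not to use Message Signaled Interrupts
    options bnx2 disable_msi=1

    # reload the driver (or reboot) for the option to take effect
    modprobe -r bnx2 && modprobe bnx2

That is only a sketch of the commonly suggested workaround, not a verified fix
for the bug above.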

So I guess there are firmware/driver problems affecting not just Solaris
but also other operating systems..

-- Pasi

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-09 Thread Edward Ned Harvey
 From: Pasi Kärkkäinen [mailto:pa...@iki.fi]
 
 Other OSes have had problems with the Broadcom NICs as well.

Yes.  The difference is, when I go to support.dell.com and punch in my
service tag, I can download updated firmware and drivers for RHEL that (at
least supposedly) solve the problem.  I haven't tested it, but the dell
support guy told me it has worked for RHEL users.  There is nothing
available to download for solaris.

Also, the bcom is not the only problem on that server.  After I added an
Intel network card and disabled the bcom, the weekly crashes stopped, but
now it's ...  I don't know ... once every 3 weeks with a slightly different
mode of failure.  This, yet again, is rare enough that the system could very
well pass a certification test, but not rare enough for me to feel
comfortable putting it into production as a primary mission-critical server.

I really think there are only two ways in the world to engineer a good solid
server:
(a) Smoke your own crack.  Systems engineering teams use the same systems
that are sold to customers.
or
(b) Sell millions of 'em.  So despite whether or not the engineering team
uses them, you're still going to have sufficient mass to dedicate engineers
to the purpose of post-sales bug solving.

I suppose there is a third way, which has certainly happened in history but is
not very applicable to me:  simply charge such ridiculously high prices for
your servers that you can dedicate engineers to post-sales bug solving, even
if you only sold a handful of those systems in the whole world.  Things like
munitions-strength Cray and AlphaServer machines in the past have sometimes fit
into this category.

I do feel confident assuming that solaris kernel engineers use sun servers
primarily for their server infrastructure.  So I feel safe buying this
configuration.  The only thing there is to gain by buying something else is
lower prices... or maybe some obscure fringe detail that I can't think of.



Re: [zfs-discuss] A few questions

2011-01-09 Thread Richard Elling
On Jan 9, 2011, at 4:19 PM, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: Pasi Kärkkäinen [mailto:pa...@iki.fi]
 
 Other OSes have had problems with the Broadcom NICs as well.
 
 Yes.  The difference is, when I go to support.dell.com and punch in my
 service tag, I can download updated firmware and drivers for RHEL that (at
 least supposedly) solve the problem.  I haven't tested it, but the dell
 support guy told me it has worked for RHEL users.  There is nothing
 available to download for solaris.

The drivers are written by Broadcom and are, AFAIK, closed source.
By going through Dell, you are going through a middle-man. For example,

http://www.broadcom.com/support/ethernet_nic/netxtremeii10.php

where you can see that the Solaris drivers were released at the same time
as the Windows drivers.

 
 Also, the bcom is not the only problem on that server.  After I added-on an
 intel network card and disabled the bcom, the weekly crashes stopped, but
 now it's ...  I don't know ... once every 3 weeks with a slightly different
 mode of failure.  This is yet again, rare enough that the system could very
 well pass a certification test, but not rare enough for me to feel
 comfortable putting into production as a primary mission critical server.
 
 I really think there are only two ways in the world to engineer a good solid
 server:
 (a) Smoke your own crack.  Systems engineering teams use the same systems
 that are sold to customers.

This is rarely practical, not to mention that product development
is often not in the systems engineering organization.

 or
 (b) Sell millions of 'em.  So despite whether or not the engineering team
 uses them, you're still going to have sufficient mass to dedicate engineers
 to the purpose of post-sales bug solving.

yes, indeed :-)
 -- richard
 




Re: [zfs-discuss] A few questions

2011-01-09 Thread Michael Sullivan
Just to add a bit to this, I just love sweeping generalizations...

On 9 Jan 2011, at 19:33 , Richard Elling wrote:

 On Jan 9, 2011, at 4:19 PM, Edward Ned Harvey 
 opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 
 From: Pasi Kärkkäinen [mailto:pa...@iki.fi]
 
 Other OSes have had problems with the Broadcom NICs as well.
 
 Yes.  The difference is, when I go to support.dell.com and punch in my
 service tag, I can download updated firmware and drivers for RHEL that (at
 least supposedly) solve the problem.  I haven't tested it, but the dell
 support guy told me it has worked for RHEL users.  There is nothing
 available to download for solaris.
 
 The drivers are written by Broadcom and are, AFAIK, closed source.
 By going through Dell, you are going through a middle-man. For example,
 
 http://www.broadcom.com/support/ethernet_nic/netxtremeii10.php
 
 where you see the release of the Solaris drivers was at the same time
 as Windows.
 

What Richard says is true.

Broadcom has been a source of contention in the Linux world as well as the 
*BSD world due to the proprietary nature of their firmware.  
OpenSolaris/Solaris users are not the only ones who have complained about this. 
 There's been much uproar in the FOSS community about Broadcom and their 
drivers.  As a result, I've seen some pretty nasty hacks, like people linking the 
Windows drivers into their kernel - *gack*.  I forget all the gory 
details, but it was rather disgusting as I recall: bubblegum, baling wire, 
duct tape and all.

Dell and Red Hat aren't exactly a marriage made in heaven either.  I've had 
problems getting support from both Dell and Red Hat, with each pointing fingers 
at the other rather than solving the problem.  Like most people, I've had to come 
up with my own work-arounds - like others with the Broadcom issue, using a 
known-quantity NIC.

When dealing with Dell as a corporate buyer, they have always made it quite 
clear that they are primarily a Windows platform.  Linux?  Oh yes, we have that 
too...

 Also, the bcom is not the only problem on that server.  After I added-on an
 intel network card and disabled the bcom, the weekly crashes stopped, but
 now it's ...  I don't know ... once every 3 weeks with a slightly different
 mode of failure.  This is yet again, rare enough that the system could very
 well pass a certification test, but not rare enough for me to feel
 comfortable putting into production as a primary mission critical server.

I've never been particularly warm and fuzzy with Dell servers.  They seem to 
like to change their chipsets slightly while a model is in production.  This 
can cause all sorts of problems which are difficult to diagnose since an 
identical Dell system will have no problems, and its mate will crash weekly.

 
 I really think there are only two ways in the world to engineer a good solid
 server:
 (a) Smoke your own crack.  Systems engineering teams use the same systems
 that are sold to customers.
 
 This is rarely practical, not to mention that product development
 is often not in the systems engineering organization.
 
 or
 (b) Sell millions of 'em.  So despite whether or not the engineering team
 uses them, you're still going to have sufficient mass to dedicate engineers
 to the purpose of post-sales bug solving.
 
 yes, indeed :-)
 -- richard

As for certified systems, it's my understanding that Nexenta themselves don't 
certify anything.  They have systems which are recommended and supported by 
their network of VARs.  It just so happens that SuperMicro is one of the 
brands of choice, but even then one must adhere to a fairly tight HCL.  The 
same holds true for Solaris/OpenSolaris with third-party hardware.

SATA controllers and multiplexers are another example of drivers being 
written by the manufacturer, where Solaris/OpenSolaris is a lower priority than 
Windows and Linux, in that order.

Deviating to items which are not fairly plain vanilla and are not listed 
on the HCL is just asking for trouble.

Mike

---
Michael Sullivan   
michael.p.sulli...@me.com
http://www.kamiogi.net/
Mobile: +1-662-202-7716
US Phone: +1-561-283-2034
JP Phone: +81-50-5806-6242





Re: [zfs-discuss] A few questions

2011-01-09 Thread Brad Stone
 As for certified systems, It's my understanding that Nexenta themselves don't 
 certify anything.  They have systems which are recommended and supported by 
 their network of VAR's.

The certified solutions listed on Nexenta's website were certified by Nexenta.


Re: [zfs-discuss] A few questions

2011-01-08 Thread Garrett D'Amore

On 01/ 6/11 05:28 AM, Edward Ned Harvey wrote:

From: Khushil Dep [mailto:khushil@gmail.com]

I've deployed large SAN's on both SuperMicro 825/826/846 and Dell
R610/R710's and I've not found any issues so far. I always make a point of
installing Intel chipset NIC's on the DELL's and disabling the Broadcom ones
but other than that it's always been plain sailing - hardware-wise anyway.
 

not found any issues, except the broadcom one which causes the system to crash 
regularly in the default factory configuration.

How did you learn about the broadcom issue for the first time?  I had to learn 
the hard way, and with all the involvement of both Dell and Oracle support 
teams, nobody could tell me what I needed to change.  We literally replaced 
every component of the server twice over a period of 1 year, and I spent 
mandays upgrading and downgrading firmwares randomly trying to find a stable 
configuration.  I scoured the internet to find this little tidbit about 
replacing the broadcom NIC, and randomly guessed, and replaced my nic with an 
intel card to make the problem go away.

The same system doesn't have a problem running RHEL/centos.

What will be the new problem in the next line of servers?  Why, during my 
internet scouring, did I find a lot of other reports, of people who needed to 
disable c-states (didn't work for me) and lots of false leads indicating 
firmware downgrade would fix my broadcom issue?

See my point?  Next time I buy a server, I do not have confidence to simply 
expect solaris on dell to work reliably.  The same goes for solaris 
derivatives, and all non-sun hardware.  There simply is not an adequate 
qualification and/or support process.
   


When you purchase NexentaStor from a top-tier Nexenta Hardware Partner, 
you get a product that has been through a rigorous qualification process 
which includes the hardware and software configuration matched together, 
tested with an extensive battery.  You also can get a higher level of 
support than is offered to people who build their own systems.


Oracle is *not* the only company capable of performing in depth testing 
of Solaris.


I also know enough about the problems that Oracle customers (or rather 
Sun customers) faced with Solaris on Sun hardware -- such as the 
terrible nvidia ethernet problems on the first-generation U20 and U40 
machines, or the Marvell SATA problems on Thumper -- to know that 
your picture of Oracle isn't nearly as rosy as you believe.  Of course, 
I also lived (as a Sun employee) through the UltraSPARC-II ECC fiasco...


  - Garrett



Re: [zfs-discuss] A few questions

2011-01-08 Thread Fajar A. Nugraha
On Thu, Jan 6, 2011 at 11:36 PM, Garrett D'Amore garr...@nexenta.com wrote:
 On 01/ 6/11 05:28 AM, Edward Ned Harvey wrote:
 See my point?  Next time I buy a server, I do not have confidence to
 simply expect solaris on dell to work reliably.  The same goes for solaris
 derivatives, and all non-sun hardware.  There simply is not an adequate
 qualification and/or support process.


 When you purchase NexentaStor from a top-tier Nexenta Hardware Partner, you

Where is the list? Is this the one on
http://www.nexenta.com/corp/technology-partners-overview/certified-technology-partners
?

 get a product that has been through a rigorous qualification process which
 includes the hardware and software configuration matched together, tested
 with an extensive battery.  You also can get a higher level of support than
 is offered to people who build their own systems.

 Oracle is *not* the only company capable of performing in depth testing of
 Solaris.

Does this roughly mean I can expect similar (or even better) hardware
compatibility and support with NexentaStor on SuperMicro as with Solaris on
Oracle/Sun hardware?

-- 
Fajar


Re: [zfs-discuss] A few questions

2011-01-08 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Garrett D'Amore
 
 When you purchase NexentaStor from a top-tier Nexenta Hardware Partner,
 you get a product that has been through a rigorous qualification process

How do I do this, exactly?  I am serious.  Before too long, I'm going to
need another server, and I would very seriously consider reprovisioning my
unstable Dell Solaris server to become a linux or some other stable machine.
The role it's currently fulfilling is the backup server, which basically
does nothing except zfs receive from the primary Sun solaris 10u9 file
server.  Since the role is just for backups, it's a perfect opportunity for
experimentation, hence the Dell hardware with solaris.  I'd be happy to put
some other configuration in there experimentally instead ... say ...
nexenta.  Assuming it will be just as good at zfs receive from the primary
server.
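
(For context, the replication itself is nothing fancy -- just zfs send piped
into zfs receive.  A minimal sketch, assuming the stream goes over ssh and
using hypothetical pool/dataset/snapshot names: "tank/data" on the primary and
a "backup" pool on this box:

    # on the primary: snapshot, then send the full stream the first time
    zfs snapshot -r tank/data@2011-01-08
    zfs send -R tank/data@2011-01-08 | ssh backuphost zfs receive -dF backup

    # on later runs, send only the changes since the previous snapshot
    zfs send -R -I tank/data@2011-01-07 tank/data@2011-01-08 | \
        ssh backuphost zfs receive -dF backup

So the workload on this machine is not exotic; it just has to stay up.)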

Is there some specific hardware configuration you guys sell?  Or recommend?
How about a Dell R510/R610/R710?  Buy the hardware separately and buy
NexentaStor as just a software product?  Or buy a somehow more certified
hardware & software bundle together?

If I do encounter a bug, where the only known fact is that the system keeps
crashing intermittently on an approximately weekly basis, and there is
absolutely no clue what's wrong in hardware or software...  How do you guys
handle it?

If you'd like to follow up offlist, that's fine.  Then just email me at the
email address:  nexenta at nedharvey.com
(I use disposable email addresses on mailing lists like this, so at any
random unknown time, I'll destroy my present alias and start using a new
one.)



Re: [zfs-discuss] A few questions

2011-01-08 Thread Stephan Budach

Am 08.01.11 18:33, schrieb Edward Ned Harvey:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Garrett D'Amore

When you purchase NexentaStor from a top-tier Nexenta Hardware Partner,
you get a product that has been through a rigorous qualification process

How do I do this, exactly?  I am serious.  Before too long, I'm going to
need another server, and I would very seriously consider reprovisioning my
unstable Dell Solaris server to become a linux or some other stable machine.
The role it's currently fulfilling is the backup server, which basically
does nothing except zfs receive from the primary Sun solaris 10u9 file
server.  Since the role is just for backups, it's a perfect opportunity for
experimentation, hence the Dell hardware with solaris.  I'd be happy to put
some other configuration in there experimentally instead ... say ...
nexenta.  Assuming it will be just as good at zfs receive from the primary
server.

Is there some specific hardware configuration you guys sell?  Or recommend?
How about a Dell R510/R610/R710?  Buy the hardware separately and buy
NexentaStor as just a software product?  Or buy a somehow more certified
hardware & software bundle together?

If I do encounter a bug, where the only known fact is that the system keeps
crashing intermittently on an approximately weekly basis, and there is
absolutely no clue what's wrong in hardware or software...  How do you guys
handle it?

If you'd like to follow up offlist, that's fine.  Then just email me at the
email address:  nexenta at nedharvey.com
(I use disposable email addresses on mailing lists like this, so at any
random unknown time, I'll destroy my present alias and start using a new
one.)

Hmm… that'd interest me as well - I have 4 Dell PE R610s that are 
running OSol or Sol11Expr. I actually bought a Sun Fire X4170 M2, since 
I couldn't get my R610s stable, just as Edward points out.


So, if you guys think that NexentaStor avoids these issues, then I'd 
seriously consider jumping ship - so either please don't continue 
offlist, or please include me in that conversation. ;)


Cheers,
budy



Re: [zfs-discuss] A few questions

2011-01-08 Thread Garrett D'Amore

On 01/ 8/11 10:43 AM, Stephan Budach wrote:

Am 08.01.11 18:33, schrieb Edward Ned Harvey:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Garrett D'Amore

When you purchase NexentaStor from a top-tier Nexenta Hardware Partner,
you get a product that has been through a rigorous qualification 
process

How do I do this, exactly?  I am serious.  Before too long, I'm going to
need another server, and I would very seriously consider 
reprovisioning my
unstable Dell Solaris server to become a linux or some other stable 
machine.
The role it's currently fulfilling is the backup server, which 
basically

does nothing except zfs receive from the primary Sun solaris 10u9 file
server.  Since the role is just for backups, it's a perfect 
opportunity for
experimentation, hence the Dell hardware with solaris.  I'd be happy 
to put

some other configuration in there experimentally instead ... say ...
nexenta.  Assuming it will be just as good at zfs receive from the 
primary

server.

Is there some specific hardware configuration you guys sell?  Or 
recommend?

How about a Dell R510/R610/R710?  Buy the hardware separately and buy
NexentaStor as just a software product?  Or buy a somehow more certified
hardware & software bundle together?

If I do encounter a bug, where the only known fact is that the system 
keeps

crashing intermittently on an approximately weekly basis, and there is
absolutely no clue what's wrong in hardware or software...  How do 
you guys

handle it?


Such problems are handled on a case-by-case basis.  Usually we can do 
some analysis from a crash dump, but not always.  My team includes 
several people who are experienced with such analysis, and when problems 
like this occur, we are called into action.
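
(For anyone who wants to do a first pass themselves before opening a case: if
savecore is enabled, the dump lands under /var/crash/`hostname`, and mdb can
pull out the basics.  A rough sketch -- the dump file numbers are whatever
savecore happened to assign:

    dumpadm                       # confirm dump device and savecore directory
    cd /var/crash/`hostname`
    mdb -k unix.0 vmcore.0        # open the saved crash dump
    > ::status                    # panic string and dump summary
    > ::stack                     # stack trace of the panicking thread
    > ::msgbuf                    # console messages leading up to the panic

That output is usually the minimum a support engineer will ask for anyway.)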


Ultimately this usually results in a patch, sometimes workaround 
suggestions, and sometimes even binary relief (which happens faster than 
a regular patch, but without the deeper QA).


  - Garrett


Re: [zfs-discuss] A few questions

2011-01-06 Thread Darren J Moffat

On 06/01/2011 00:14, Edward Ned Harvey wrote:

solaris engineers don't use?  Non-sun hardware.  Pretty safe bet you won't
find any Dell servers in the server room where solaris developers do their
thing.


You would lose that bet: not only would you find Dell, you would find many 
other big names, as well as white-box hand-built systems too.


Solaris developers use a lot of different hardware - Sun never made 
laptops, so many of us have Apple (running Solaris on the metal and/or 
under virtualisation), Toshiba, or Fujitsu laptops.  There are also 
many workstations around the company that aren't Sun hardware, as well as 
servers.


--
Darren J Moffat


Re: [zfs-discuss] A few questions

2011-01-06 Thread Khushil Dep
I've deployed large SANs on both SuperMicro 825/826/846 and Dell
R610/R710s and I've not found any issues so far. I always make a point of
installing Intel-chipset NICs in the Dells and disabling the Broadcom ones,
but other than that it's always been plain sailing - hardware-wise anyway.
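
(If it helps anyone doing the same swap, the Solaris side of it is simple.  A
rough sketch, with made-up interface names and addresses -- the onboard
Broadcom ports usually show up as bnx*, the add-in Intel ones as e1000g*:

    dladm show-link                  # list the datalinks the box actually has
    ifconfig bnx0 unplumb            # take the Broadcom port down
    ifconfig e1000g0 plumb 192.168.10.5 netmask 255.255.255.0 up
    echo "192.168.10.5 netmask 255.255.255.0 up" > /etc/hostname.e1000g0   # persist across reboots

The BIOS is still the right place to disable the onboard ports outright.)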

I've always found that the real issue is formulating SOPs to match what the
organisation is used to with legacy storage systems, educating the admins
who will manage it going forward, and doing the technical hand-over to folks
who may not know, or want to know, a whole lot about *nix land.

My 2p. YMMV.

---
W. A. Khushil Dep - khushil@gmail.com -  07905374843
Windows - Linux - Solaris - ZFS - Nexenta - Development - Consulting 
Contracting
http://www.khushil.com/ - http://www.facebook.com/GlobalOverlord





On 6 January 2011 00:14, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

  From: Richard Elling [mailto:richard.ell...@nexenta.com]
 
   I'll agree to call Nexenta a major commercial interest, in regards to
  contribution to the open source ZFS tree, if they become an officially
  supported OS on Dell, HP, and/or IBM hardware.
 
  NexentaStor is officially supported on Dell, HP, and IBM hardware.  The
 only
  question is, what is your definition of 'support'?  Many NexentaStor

 I don't want to argue about this, but I'll just try to clarify what I
 meant:

 Presently, I have a dell server with officially supported solaris, and it's
 as unreliable as pure junk.  It's just the backup server, so I'm free to
 frequently create & destroy it... And as such, I frequently do recreate and
 destroy it.  It is entirely stable running RHEL (centos) because Dell and
 RedHat have a partnership with a serious number of human beings and
 machines
 looking for and fixing any compatibility issues.  For my solaris
 instability, I blame the fact that solaris developers don't do significant
 quality assurance on non-sun hardware.  To become officially compatible,
 the whole qualification process is like this:  Somebody installs it,
 doesn't
 see any problems, and then calls it certified.  They reformat with
 something else, and move on.  They don't build their business on that
 platform, so they don't detect stability issues like the ones reported...
 System crashes once per week and so forth.  Solaris therefore passes the
 test, and becomes one of the options available on the drop-down menu for
 OSes with a new server.  (Of course that's been discontinued by oracle, but
 that's how it was in the past.)

 Developers need to eat their own food.  Smoke your own crack.  Hardware
 engineers at Dell need to actually use your OS on their hardware, for their
 development efforts.  I would be willing to bet Sun hardware engineers use
 a
 significant percentage of solaris servers for their work...  And guess what
 solaris engineers don't use?  Non-sun hardware.  Pretty safe bet you won't
 find any Dell servers in the server room where solaris developers do their
 thing.

 If you want to be taken seriously as an alternative storage option, you've
 got to at LEAST be listed as a factory-distributed OS that is an option to
 ship with the new server, and THEN, when people such as myself buy those
 things, we've got to have a good enough experience that we don't all bitch
 and flame about it afterward.

 Nexenta, you need a real and serious partnership with Dell, HP, IBM.  Get
 their developers to run YOUR OS on the servers which they use for
 development.  Get them to sell your product bundled with their product.
  And
 dedicate real and serious engineering into bugfixes working with customers,
 to truly identify root causes of instability, with a real OS development
 and
 engineering and support group.  It's got to be STABLE, that's the #1
 requirement.

 I previously made the comparison...  Even closed-source solaris & ZFS is a
 better alternative to closed-source netapp & wafl.  So for now, those are
 the
 only two enterprise supportable options I'm willing to stake my career on,
 and I'll buy Sun hardware with Solaris.  But I really wish I could feel
 confident buying a cheaper Dell server and running ZFS on it.  Nexenta, if
 you make yourself look like a serious competitor against solaris, and
 really
 truly form an awesome stable partnership with Dell, I will happily buy your
 stuff instead of Oracle.  Even if you are a little behind in feature
 offering.  But I will not buy your stuff if I can't feel perfectly
 confident
 in its stability.

 Ever heard the phrase "Nobody ever got fired for buying IBM"?  You're the
 little guys.  If you want to compete against the big guys, you've got to
 kick ass.  And don't get sued into oblivion.

 Even today's feature set is perfectly adequate for at least a couple of
 years to come.  If you put all your effort into stability and bugfixes,
 serious partnerships with Dell, HP, IBM, and become extremely professional
 looking and stable, with fanatical support...  You don't have to 

Re: [zfs-discuss] A few questions

2011-01-06 Thread Edward Ned Harvey
 From: Richard Elling [mailto:richard.ell...@nexenta.com]
 
 If I understand correctly, you want Dell, HP, and IBM to run OSes other
 
 I agree, but neither Dell, HP, nor IBM develop Windows...
 
 I'm not sure of the current state, but many of the Solaris engineers
develop
 on laptops and Sun did not offer a laptop product line.
 
 You will find them where Nexenta developers live :-)
 
 Wait a minute... this is patently false.  The big storage vendors: NetApp,
 EMC, Hitachi, Fujitsu, LSI... none run on HP, IBM, or Dell servers.

Like I said, not interested in arguing.  This is mostly just a bunch of
contradictions to what I said.

To each his own.  My conclusion is that I am not willing to stake my career
on the underdog alternative, when I know I can safely buy the sun hardware
and solaris.  I experimented once by buying solaris on dell.  It was a
proven failure, but that's why I did it on a cheap noncritical backup system
experimentally before expecting it to work in production.  Haven't seen any
underdog proven solid enough for me to deploy in enterprise yet.



Re: [zfs-discuss] A few questions

2011-01-06 Thread J.P. King


This is a silly argument, but...

Haven't seen any underdog proven solid enough for me to deploy in 
enterprise yet.


I haven't seen any overdog proven solid enough for me to be able to rely 
on either.  Certainly not Solaris.  Don't get me wrong, I like(d) Solaris.
But every so often you'd find a bug and they'd take an age to fix it (or 
to declare that they wouldn't fix it).  In one case we had 18 months 
between reporting a problem and Sun fixing it.  In another case it was 
around 3 months and because we happened to have the source code we even 
told them where the bug was and what a fix could be.


Solaris (and the other overdogs) are worth it when you want someone else 
to do the grunt work and someone else to point at and blame, but let's not 
romanticize how good it or any of the others are.  What made Solaris (10 
at least) worth deploying were its features (dtrace, zfs, SMF, etc).


Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support


Re: [zfs-discuss] A few questions

2011-01-06 Thread Edward Ned Harvey
 From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
 
 On Wed, 5 Jan 2011, Edward Ned Harvey wrote:
  with regards to ZFS and all the other projects relevant to solaris.)
 
  I know in the case of SGE/OGE, it's officially closed source now.  As of
Dec
  31st, sunsource is being decomissioned, and the announcement of
officially
  closing the SGE source and decomissioning the open source community
 went out
  on Dec 24th.  So all of this leads me to believe, with very little
  reservation, that the new developments beyond zpool 28 are closed
 source
  moving forward.  There's very little breathing room remaining for hope
of
  that being open sourced again.
 
 I have no idea what you are talking about.  Best I can tell, SGE/OGE
 is a reference to Sun Grid Engine, which has nothing to do with zfs.
 The only annoucement and discussion I can find via Google is written
 by you.  It was pretty clear even a year ago that Sun Grid Engine was
 going away.

Agreed, SGE/OGE has nothing to do with ZFS, unless you believe there's an
oracle culture which might apply to both.

The only thing written by me, as I recall, included links to the original
official announcements.  Following those links now, I see the archives have
been decommissioned.  So there ya go.  Since it's still in my inbox, I just
saved a copy for you here...  It is long-winded, and the main points are:
SGE (now called OGE) is officially closed-source, and sunsource.net is
decommissioned.  There is an open source fork, which will not share code
development with the closed-source product.
http://dl.dropbox.com/u/543241/SGE_officially_closed/GE%20users%20GE%20announce%20Changes%20for%20a%20Bright%20Future%20at%20Oracle.txt




Re: [zfs-discuss] A few questions

2011-01-06 Thread Edward Ned Harvey
 From: Khushil Dep [mailto:khushil@gmail.com]
 
 I've deployed large SAN's on both SuperMicro 825/826/846 and Dell
 R610/R710's and I've not found any issues so far. I always make a point of
 installing Intel chipset NIC's on the DELL's and disabling the Broadcom ones
 but other than that it's always been plain sailing - hardware-wise anyway.

"not found any issues", except the broadcom one which causes the system to 
crash regularly in the default factory configuration.

How did you learn about the broadcom issue for the first time?  I had to learn 
the hard way, and with all the involvement of both Dell and Oracle support 
teams, nobody could tell me what I needed to change.  We literally replaced 
every component of the server twice over a period of a year, and I spent 
man-days upgrading and downgrading firmware randomly, trying to find a stable 
configuration.  I scoured the internet to find this little tidbit about 
replacing the broadcom NIC, randomly guessed, and replaced my NIC with an 
Intel card to make the problem go away.

The same system doesn't have a problem running RHEL/centos.

What will be the new problem in the next line of servers?  Why, during my 
internet scouring, did I find a lot of other reports of people who needed to 
disable C-states (didn't work for me), and lots of false leads indicating a 
firmware downgrade would fix my broadcom issue?
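
(For the record, "disabling C-states" usually meant either turning them off in
the BIOS setup or disabling CPU power management in /etc/power.conf -- a rough
sketch of the latter, on the assumption your Solaris build supports these
keywords:

    # /etc/power.conf
    cpupm disable            # or, less drastic: "cpu-deep-idle disable"

    # apply the change
    pmconfig

As noted, it made no difference for my crashes.)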

See my point?  Next time I buy a server, I do not have confidence to simply 
expect solaris on dell to work reliably.  The same goes for solaris 
derivatives, and all non-sun hardware.  There simply is not an adequate 
qualification and/or support process.



Re: [zfs-discuss] A few questions

2011-01-06 Thread Khushil Dep
Twofold, really - firstly, I remember the headaches I used to have
configuring Broadcom cards properly under Debian/Ubuntu, and the sweetness
that was using an Intel NIC. The bottom line for me was that I know Intel
drivers have been around longer than Broadcom drivers, so it made sense
to ensure that we had Intel NICs in the server. Secondly, I asked
Andy Bennett from Nexenta, who told me it would make sense - always good to
get a second opinion :-)

There were/are reports all over Google about Broadcom issues with
Solaris/OpenSolaris so I didn't want to risk it. For a couple of hundred for
a quad port gig NIC - it's worth it when the entire solution is 90K+.

Sometimes (like the issue with bus resets when some brands/firmware revs of
SSDs are used) the knowledge comes from people you work with (Nexenta rode
to the rescue here again - plug! plug! plug!) :-)

These are deployed at a couple of universities and at a very large data
capture/marketing company I used to work for, and I know it works really well,
and (plug! plug! plug!) I know the dedicated support I got from the Nexenta
guys.

The difference as I see it is that OpenSolaris/ZFS/DTrace/FMA allow you to
build your own solution to your own problem. Thinking of storage in a
completely new way, instead of as just a block of storage, it becomes an
integrated part of performance engineering - it certainly has been for the last
two installs I've been involved in.

I know why folks want a Certified solution with the likes of Dell/HP etc
but from my point of view (and all points of view are valid here), I know I
can deliver a cheaper, more focussed (and when I say that I'm not just doing
some marketing bs) solution for the requirement at hand. It's sometimes a
struggle to get customers/end-users to think of storage as more than just
storage. There's quite a lot of entrenched thinking to get around/over in
our field (try getting a Java dev to think clearly about thread handling and
massive SMP drawbacks for example).

Anyway - not trying to engage in an argument but it's always interesting to
find out why someone went for certain solutions over others.

My 2p. YMMV.

*goes off to collect cheque from Nexenta* ;-)

---
W. A. Khushil Dep - khushil@gmail.com -  07905374843
Windows - Linux - Solaris - ZFS - Nexenta - Development - Consulting 
Contracting
http://www.khushil.com/ - http://www.facebook.com/GlobalOverlord





On 6 January 2011 13:28, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

  From: Khushil Dep [mailto:khushil@gmail.com]
 
  I've deployed large SAN's on both SuperMicro 825/826/846 and Dell
  R610/R710's and I've not found any issues so far. I always make a point
 of
  installing Intel chipset NIC's on the DELL's and disabling the Broadcom
 ones
  but other than that it's always been plain sailing - hardware-wise
 anyway.

 not found any issues, except the broadcom one which causes the system to
 crash regularly in the default factory configuration.

 How did you learn about the broadcom issue for the first time?  I had to
 learn the hard way, and with all the involvement of both Dell and Oracle
 support teams, nobody could tell me what I needed to change.  We literally
 replaced every component of the server twice over a period of 1 year, and I
 spent mandays upgrading and downgrading firmwares randomly trying to find a
 stable configuration.  I scoured the internet to find this little tidbit
 about replacing the broadcom NIC, and randomly guessed, and replaced my nic
 with an intel card to make the problem go away.

 The same system doesn't have a problem running RHEL/centos.

 What will be the new problem in the next line of servers?  Why, during my
 internet scouring, did I find a lot of other reports, of people who needed
 to disable c-states (didn't work for me) and lots of false leads indicating
 firmware downgrade would fix my broadcom issue?

 See my point?  Next time I buy a server, I do not have confidence to simply
 expect solaris on dell to work reliably.  The same goes for solaris
 derivatives, and all non-sun hardware.  There simply is not an adequate
 qualification and/or support process.




Re: [zfs-discuss] A few questions

2011-01-06 Thread Richard Elling
On Jan 5, 2011, at 7:44 AM, Edward Ned Harvey wrote:

 From: Khushil Dep [mailto:khushil@gmail.com]
 
 We do have a major commercial interest - Nexenta. It's been quiet but I do
 look forward to seeing something come out of that stable this year? :-)
 
 I'll agree to call Nexenta a major commercial interest, in regards to 
 contribution to the open source ZFS tree, if they become an officially 
 supported OS on Dell, HP, and/or IBM hardware.  

NexentaStor is officially supported on Dell, HP, and IBM hardware.  The only
question is, what is your definition of 'support'?  Many NexentaStor customers
today appear to be deploying on SuperMicro and Quanta systems, for obvious
cost reasons. Nexenta has good working relationships with these major vendors
and others.

As for investment, Nexenta has hired, and continues to hire, the best engineers
and professional services people we can find. We see a lot of demand in the 
market and have been growing at an astonishing rate. If you'd like to contribute
to making software storage solutions rather than whining about what Oracle won't
discuss, check us out and send me your resume :-)
 -- richard



Re: [zfs-discuss] A few questions

2011-01-06 Thread Richard Elling
On Jan 5, 2011, at 4:14 PM, Edward Ned Harvey wrote:

 From: Richard Elling [mailto:richard.ell...@nexenta.com]
 
 I'll agree to call Nexenta a major commercial interest, in regards to
 contribution to the open source ZFS tree, if they become an officially
 supported OS on Dell, HP, and/or IBM hardware.
 
 NexentaStor is officially supported on Dell, HP, and IBM hardware.  The
 only
 question is, what is your definition of 'support'?  Many NexentaStor
 
 I don't want to argue about this, but I'll just try to clarify what I meant:
 
 Presently, I have a dell server with officially supported solaris, and it's
 as unreliable as pure junk.  It's just the backup server, so I'm free to
 frequently create & destroy it... And as such, I frequently do recreate and
 destroy it.  It is entirely stable running RHEL (centos) because Dell and
 RedHat have a partnership with a serious number of human beings and machines
 looking for and fixing any compatibility issues.  For my solaris
 instability, I blame the fact that solaris developers don't do significant
 quality assurance on non-sun hardware.  To become officially compatible,
 the whole qualification process is like this:  Somebody installs it, doesn't
 see any problems, and then calls it certified.  They reformat with
 something else, and move on.  They don't build their business on that
 platform, so they don't detect stability issues like the ones reported...
 System crashes once per week and so forth.  Solaris therefore passes the
 test, and becomes one of the options available on the drop-down menu for
 OSes with a new server.  (Of course that's been discontinued by oracle, but
 that's how it was in the past.)

If I understand correctly, you want Dell, HP, and IBM to run OSes other
than Microsoft Windows and RHEL.  For the thousands of other OSes out there,
this is a significant barrier to entry. One can argue that the most significant
innovations in the past 5 years came from none of those companies -- they
came from Google, Apple, Amazon, Facebook, and the other innovators
who did not spend their efforts trying to beat Microsoft and get onto the 
manufacturing floor of the big vendors.

 Developers need to eat their own food.  

I agree, but neither Dell, HP, nor IBM develop Windows...

 Smoke your own crack.  Hardware
 engineers at Dell need to actually use your OS on their hardware, for their
 development efforts.  I would be willing to bet Sun hardware engineers use a
 significant percentage of solaris servers for their work...  And guess what
 solaris engineers don't use?  Non-sun hardware.  

I'm not sure of the current state, but many of the Solaris engineers develop
on laptops and Sun did not offer a laptop product line.

 Pretty safe bet you won't
 find any Dell servers in the server room where solaris developers do their
 thing.

You will find them where Nexenta developers live :-)

 If you want to be taken seriously as an alternative storage option, you've
 got to at LEAST be listed as a factory-distributed OS that is an option to
 ship with the new server, and THEN, when people such as myself buy those
 things, we've got to have a good enough experience that we don't all bitch
 and flame about it afterward.

Wait a minute... this is patently false.  The big storage vendors: NetApp,
EMC, Hitachi, Fujitsu, LSI... none run on HP, IBM, or Dell servers.

 Nexenta, you need a real and serious partnership with Dell, HP, IBM.  Get
 their developers to run YOUR OS on the servers which they use for
 development.  Get them to sell your product bundled with their product.  And
 dedicate real and serious engineering into bugfixes working with customers,
 to truly identify root causes of instability, with a real OS development and
 engineering and support group.  It's got to be STABLE, that's the #1
 requirement.

There are many marketing activities in progress towards this end.
One of Nexenta's major OEMs (Compellent) is being purchased by Dell. 
The deal is not done, so there is no public information on future plans,
to my knowledge.

 I previously made the comparison...  Even closed-source solaris & ZFS is a
 better alternative to closed-source netapp & wafl.  So for now, those are the
 only two enterprise supportable options I'm willing to stake my career on,
 and I'll buy Sun hardware with Solaris.  But I really wish I could feel
 confident buying a cheaper Dell server and running ZFS on it.  Nexenta, if
 you make yourself look like a serious competitor against solaris, and really
 truly form an awesome stable partnership with Dell, I will happily buy your
 stuff instead of Oracle.  Even if you are a little behind in feature
 offering.  But I will not buy your stuff if I can't feel perfectly confident
 in its stability.

I can assure you that we take stability very seriously.  And since you seem
to think the big-box vendors are infallible, here is a sampling of the things we
(Nexenta) have to live with:

Re: [zfs-discuss] A few questions

2011-01-06 Thread Jeff Bacon
 From: Edward Ned Harvey
   opensolarisisdeadlongliveopensola...@nedharvey.com
 To: 'Khushil Dep' khushil@gmail.com
 Cc: Richard Elling richard.ell...@nexenta.com,
   zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] A few questions
 
  From: Khushil Dep [mailto:khushil@gmail.com]
 
  I've deployed large SAN's on both SuperMicro 825/826/846 and Dell
  R610/R710's and I've not found any issues so far. I always make a
 point of
  installing Intel chipset NIC's on the DELL's and disabling the
 Broadcom ones
  but other than that it's always been plain sailing - hardware-wise
 anyway.
 
 not found any issues, except the broadcom one which causes the
 system to crash regularly in the default factory configuration.
 
 How did you learn about the broadcom issue for the first time?  I had
 to learn the hard way, and with all the involvement of both Dell and
 Oracle support teams, nobody could tell me what I needed to change.
We
 literally replaced every component of the server twice over a period
of
 1 year, and I spent mandays upgrading and downgrading firmwares
 randomly trying to find a stable configuration.  I scoured the
internet
 to find this little tidbit about replacing the broadcom NIC, and
 randomly guessed, and replaced my nic with an intel card to make the
 problem go away.

20 years of doing this c*(# has taught me that most things only
get learned the hard way. I certainly won't bet my career solely
on the ability of the vendor to support the product, because they're
hardly omniscient. Testing, testing, and generous return policies
(and/or R&D budget)...

 The same system doesn't have a problem running RHEL/centos.

Then you're not pushing it hard enough, or your stars are just
aligned nicely.

We have massive piles of Dell hardware, all types. Running CentOS
since at least 4.5. Every single one of those Dells has an Intel
NIC in it, and the Broadcoms disabled in the BIOS. Because every
time we do something stupid like let ourselves think "oh, we could
maybe use those extra Broadcom ports for X", we get burned.

High-volume financial trading system. Blew up on the bcoms.
Didn't matter what driver or tweak or fix. Plenty of man-days 
wasted debugging. Went with the net's advice, put in an Intel NIC.
No more problems. That was 3 years ago.  

Thought we could use the bcoms for our fileservers. Nope.

Thought we could use the bcoms for the dedicated drbd links
for our xen cluster. Nope. 

And we know we're not alone in this evaluation.

We could have spent forever chasing support to get someone
to fix it I suppose... but we have better things to do. 

 See my point?  Next time I buy a server, I do not have confidence to
 simply expect solaris on dell to work reliably.  The same goes for
 solaris derivatives, and all non-sun hardware.  There simply is not an
 adequate qualification and/or support process.

I'm not convinced ANYONE really has such a thing. Or that it's even
necessarily possible. 

In fact, I'm sure they don't. Cuz that's what it says in the fine
print on the support contracts and the purchase agreements - "we do
not guarantee..."

I just prefer not to have any confidence for the most part.
It's easier and safer.

-bacon


Re: [zfs-discuss] A few questions

2011-01-05 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Tim Cook
 
 
 The claim was that there are more people contributing code from outside of
 Oracle than inside to zfs.  Your contributions to Illumos do absolutely
nothing

Guys, please let's just say this much:

To all those who are contributing to the open-source ZFS code -- FreeBSD,
the Illumos project, and others -- thank you very much.  :-)  We know certain
things are stable and production-ready.  There has not yet been much
forward development after zpool 28, but the effort is well appreciated, and
for whatever comes next, yes, we can all be patient.

Right now, Oracle is not contributing at all to the open source branches of
any of these projects.  So right now it's fair to say the non-oracle
contributions to the OPEN SOURCE ZFS outweigh the nonexistent oracle
contributions.  However, Oracle is continuing to develop the closed-source
ZFS.  

I don't know if anyone has real numbers, dollars contributed or number of
developer hours etc, but I think it's fair to say that oracle is probably
contributing more to the closed source ZFS right now, than the rest of the
world is contributing to the open source ZFS right now.  Also, we know that
the closed source ZFS right now is more advanced than the open source ZFS
(zpool 31 vs 28).  Oracle closed source ZFS is ahead, and probably
developing faster too, than the open source ZFS right now.
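
(If you want to check where a given system sits, the pool version is easy to
query -- "tank" here is just a placeholder pool name:

    zpool upgrade -v         # list every pool version this build supports
    zpool get version tank   # show the version a particular pool is running at

A pool left at version 28 remains importable by the open-source
implementations; upgrade it past that on Solaris 11 and only Oracle's code can
read it.)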

If anyone has any good way to draw more contributors into the open source
tree, that would also be useful and appreciated.  Gosh, it would be nice to
get major players like Dell, HP, IBM, Apple contributing into that project.



Re: [zfs-discuss] A few questions

2011-01-05 Thread Deano
Edward Ned Harvey wrote
 I don't know if anyone has real numbers, dollars contributed or number of
 developer hours etc, but I think it's fair to say that oracle is probably
 contributing more to the closed source ZFS right now, than the rest of the
 world is contributing to the open source ZFS right now.  Also, we know
that
 the closed source ZFS right now is more advanced than the open source ZFS
 (zpool 31 vs 28).  Oracle closed source ZFS is ahead, and probably
 developing faster too, than the open source ZFS right now.

 If anyone has any good way to draw more contributors into the open source
 tree, that would also be useful and appreciated.  Gosh, it would be nice
to
 get major players like Dell, HP, IBM, Apple contributing into that
project.

This is something that Illumos/open-source ZFS needs to decide for itself:
what does it want?  Effectively, we can't innovate ZFS without breaking
compatibility... because our Illumos zpool version 29 (if we innovate) will
not be Oracle's zpool version 29.

If we want open-source ZFS to innovate, we need to make that choice and let
everyone know; apart from submitting bug fixes to zpool v28, I'm not sure
whether other changes would be welcome.

So honestly do we want to innovate ZFS (I do) or do we just want to follow
Oracle?

Bye,
Deano

de...@cloudpixies.com




Re: [zfs-discuss] A few questions

2011-01-05 Thread Edward Ned Harvey
 From: Deano [mailto:de...@rattie.demon.co.uk]
 Sent: Wednesday, January 05, 2011 9:16 AM
 
 So honestly do we want to innovate ZFS (I do) or do we just want to follow
 Oracle?

Well, you can't follow Oracle.  Unless you wait till they release something,
reverse engineer it, and attempt to reimplement it.  I am quite sure you'll
be sued if you do that.

If you want forward development in the open source tree, you basically have
only one option:  Some major contributor must have a financial interest, and
commit to a real concerted development effort, with their own roadmap, which
is intentionally designed NOT to overlap with the Oracle roadmap.
Otherwise, the code will stagnate.

I am rooting for the open source projects, but I'm not optimistic
personally.  I think all major contributors (IBM, Apple, etc) will not
participate for various reasons, and as a result, we'll experience bit
rot...  As presently evident by lack of zpool advancement beyond 28.

So in my mind, Oracle and ZFS are now just like netapp and wafl.  Well...  I
prefer Solaris and ZFS over netapp and wafl...  So whenever I would have
otherwise bought a netapp, I'll still buy the solaris server instead...  But
it's no longer a competitor against ubuntu or centos.

Just the way Larry wants it.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-05 Thread Khushil Dep
We do have a major commercial interest - Nexenta. It's been quiet but I do
look forward to seeing something come out of that stable this year? :-)

---
W. A. Khushil Dep - khushil@gmail.com -  07905374843

Visit my blog at http://www.khushil.com/






On 5 January 2011 14:34, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

  From: Deano [mailto:de...@rattie.demon.co.uk]
  Sent: Wednesday, January 05, 2011 9:16 AM
 
  So honestly do we want to innovate ZFS (I do) or do we just want to
 follow
  Oracle?

 Well, you can't follow Oracle.  Unless you wait till they release
 something,
 reverse engineer it, and attempt to reimplement it.  I am quite sure you'll
 be sued if you do that.

 If you want forward development in the open source tree, you basically have
 only one option:  Some major contributor must have a financial interest,
 and
 commit to a real concerted development effort, with their own roadmap,
 which
 is intentionally designed NOT to overlap with the Oracle roadmap.
 Otherwise, the code will stagnate.

 I am rooting for the open source projects, but I'm not optimistic
 personally.  I think all major contributors (IBM, Apple, etc) will not
 participate for various reasons, and as a result, we'll experience bit
 rot...  As presently evident by lack of zpool advancement beyond 28.

 So in my mind, Oracle and ZFS are now just like netapp and wafl.  Well...
  I
 prefer Solaris and ZFS over netapp and wafl...  So whenever I would have
 otherwise bought a netapp, I'll still buy the solaris server instead...
  But
 it's no longer a competitor against ubuntu or centos.

 Just the way Larry wants it.




Re: [zfs-discuss] A few questions

2011-01-05 Thread Michael Schuster
On Wed, Jan 5, 2011 at 15:34, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 From: Deano [mailto:de...@rattie.demon.co.uk]
 Sent: Wednesday, January 05, 2011 9:16 AM

 So honestly do we want to innovate ZFS (I do) or do we just want to follow
 Oracle?

 Well, you can't follow Oracle.  Unless you wait till they release something,
 reverse engineer it, and attempt to reimplement it.

that's not my understanding - while we will have to wait, oracle is
supposed to release *some* source code afterwards to satisfy some
claim or other. I agree, some would argue that that should have
already happened with S11 Express... I don't know whether it has, but that's
not *the* release of S11, is it? And once the code is released, even
if after the fact, it's not reverse-engineering anymore, is it?

Michael
PS: just in case: even while at Oracle, I had no insight into any of
these plans, much less do I have now.
-- 
regards/mit freundlichen Grüssen
Michael Schuster


Re: [zfs-discuss] A few questions

2011-01-05 Thread Saxon, Will
 -Original Message-
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Michael Schuster
 Sent: Wednesday, January 05, 2011 9:42 AM
 To: Edward Ned Harvey
 Cc: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] A few questions
 
 On Wed, Jan 5, 2011 at 15:34, Edward Ned Harvey
 opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
  From: Deano [mailto:de...@rattie.demon.co.uk]
  Sent: Wednesday, January 05, 2011 9:16 AM
 
  So honestly do we want to innovate ZFS (I do) or do we just want to
 follow
  Oracle?
 
  Well, you can't follow Oracle.  Unless you wait till they release something,
  reverse engineer it, and attempt to reimplement it.
 
 that's not my understanding - while we will have to wait, oracle is
 supposed to release *some* source code afterwards to satisfy some
 claim or other. I agree, some would argue that that should have
 already happened with S11 express... I don't know it has, but that's
 not *the* release of S11, is it? And once the code is released, even
 if after the fact, it's not reverse-engineering anymore, is it?

Not exactly. Oracle hasn't publicly committed to anything like that. There were 
several news articles last year referencing a leaked internal memo that I 
believe was more of a proposal than a plan. 

Even if Oracle did 'commit' to releasing code, they could easily just decide 
not to. 

-Will
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-05 Thread Edward Ned Harvey
 From: Michael Schuster [mailto:michaelspriv...@gmail.com]
 
  Well, you can't follow Oracle.  Unless you wait till they release
something,
  reverse engineer it, and attempt to reimplement it.
 
 that's not my understanding - while we will have to wait, oracle is
 supposed to release *some* source code afterwards to satisfy some

Where do you get that from?  AFAIK, there is no official word about oracle
opening anything moving forward, but there are plenty of unofficial reports
that it will not be opened.  Nobody in the field is holding any hope for
that to change anymore, most importantly illumos and nexenta.  (At least
with regards to ZFS and all the other projects relevant to solaris.)

I know in the case of SGE/OGE, it's officially closed source now.  As of Dec
31st, sunsource is being decommissioned, and the announcement of officially
closing the SGE source and decommissioning the open source community went out
on Dec 24th.  So all of this leads me to believe, with very little
reservation, that the new developments beyond zpool 28 are closed source
moving forward.  There's very little breathing room remaining for hope of
that being open sourced again.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-05 Thread Edward Ned Harvey
 From: Khushil Dep [mailto:khushil@gmail.com]
 
 We do have a major commercial interest - Nexenta. It's been quiet but I do
 look forward to seeing something come out of that stable this year? :-)

I'll agree to call Nexenta a major commercial interest, with regard to 
contribution to the open source ZFS tree, if they become an officially 
supported OS on Dell, HP, and/or IBM hardware.  Otherwise, they're simply 
too small to keep up with the rate of development of the closed source ZFS 
tree, and destined to be left in the dust.

And if Nexenta does become a seriously viable competitor against netapp and 
oracle...  Watch out for lawsuits...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-05 Thread Garrett D'Amore

On 01/ 4/11 11:48 PM, Tim Cook wrote:



On Tue, Jan 4, 2011 at 8:21 PM, Garrett D'Amore garr...@nexenta.com 
mailto:garr...@nexenta.com wrote:


On 01/ 4/11 09:15 PM, Tim Cook wrote:



On Mon, Jan 3, 2011 at 5:56 AM, Garrett D'Amore
garr...@nexenta.com mailto:garr...@nexenta.com wrote:

On 01/ 3/11 05:08 AM, Robert Milkowski wrote:

On 12/26/10 05:40 AM, Tim Cook wrote:



On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling
richard.ell...@gmail.com
mailto:richard.ell...@gmail.com wrote:


There are more people outside of Oracle developing for
ZFS than inside Oracle.
This has been true for some time now.




Pardon my skepticism, but where is the proof of this claim
(I'm quite certain you know I mean no disrespect)?
 Solaris11 Express was a massive leap in functionality and
bugfixes to ZFS.  I've seen exactly nothing out of outside
of Oracle in the time since it went closed.  We used to
see updates bi-weekly out of Sun.  Nexenta spending
hundreds of man-hours on a GUI and userland apps isn't work
on ZFS.




Exactly my observation as well. I haven't seen any ZFS
related development happening at Illumos or Nexenta, at least
not yet.


Just because you've not seen it yet doesn't imply it isn't
happening.  Please be patient.

   - Garrett



Or, conversely, don't make claims of all this code contribution
prior to having anything to show for your claimed efforts.  Duke
Nukem Forever was going to be the greatest video game ever
created... we were told to be patient... we're still waiting
for that too.



Um, have you not been paying attention?  I've delivered quite a
lot of contribution to illumos already, just not in ZFS.   Take a
close look -- there almost certainly wouldn't *be* an open source
version of OS/Net had I not done the work to enable this in libc,
kernel crypto, and other bits.  This work is still higher priority
than ZFS innovation for a variety of reasons -- mostly because we
need a viable and supportable illumos upon which to build those
ZFS innovations.

That said, much of the ZFS work I hope to contribute to illumos
needs more baking, but some of it is already open source in
NexentaStor.  (You can for a start look at zfs-monitor, the WORM
support, and support for hardware GZIP acceleration all as things
that Nexenta has innovated in ZFS, and which are open source today
if not part of illumos.  Check out http://www.nexenta.org for
source code access.)

So there, money placed where mouth is.  You?

   - Garrett



The claim was that there are more people contributing code from 
outside of Oracle than inside to zfs.  Your contributions to Illumos 
do absolutely nothing to backup that claim.  ZFS-monitor is not ZFS 
code (it's an FMA module), WORM also isn't ZFS code, it's an OS level 
operation, and GZIP hardware acceleration is produced by Indra 
networks, and has absolutely nothing to do with ZFS.  Does it help 
ZFS?  Sure, but that's hardly a code contribution to ZFS when it's 
simply a hardware acceleration card that accelerates ALL gzip code.


Um... you have obviously not looked at the code.

Our WORM code is not some basic OS guarantees on top of ZFS, but 
modifications to the ZFS code itself so that ZFS *itself* honors the 
WORM property, which is implemented as a property on the ZFS filesystem.


Likewise, the GZIP hardware acceleration support includes specific 
modifications to the ZFS kernel filesystem code.


Of course, we've not done anything major to change the fundamental way 
that ZFS stores data... is that what you're talking about?


I think you must have a very narrow idea of what constitutes an 
innovation in ZFS.




So, great job picking three projects that are not proof of developers 
working on ZFS.  And great job not providing any proof to the claim 
there are more developers working on ZFS outside of Oracle than within.


Nexenta doesn't represent that majority, actually.  A large number of ZFS 
folks -- people with names like Leventhal, Ahrens, Wilson, and Gregg -- 
are doing ZFS-related work at Delphix and Joyent, or so I've been 
told.  I don't have first-hand knowledge of *what* the details are, but 
I'm looking forward to seeing the results.


This ignores the contributions from people working on ZFS on other 
platforms as well.


Of course, since I no longer work there, I don't really know how many 
people Oracle still has working on ZFS.  They could have tasked 1,000 
people with it.  Or they could have shut the project down entirely.  But 
of the people who had, up until Oracle shut down the open code, made 
non-trivial contributions to ZFS, I think the majority of *those* people 
can be found working outside of Oracle now, and I think most of them are 
still working on ZFS 

Re: [zfs-discuss] A few questions

2011-01-05 Thread Deano

Edward Ned Harvey wrote
 From: Deano [mailto:de...@rattie.demon.co.uk]
 Sent: Wednesday, January 05, 2011 9:16 AM
 
 So honestly do we want to innovate ZFS (I do) or do we just want to follow
 Oracle?

 Well, you can't follow Oracle.  Unless you wait till they release
something,
 reverse engineer it, and attempt to reimplement it.  I am quite sure
you'll
 be sued if you do that.

 If you want forward development in the open source tree, you basically
have
 only one option:  Some major contributor must have a financial interest,
and
 commit to a real concerted development effort, with their own roadmap,
which
 is intentionally designed NOT to overlap with the Oracle roadmap.
 Otherwise, the code will stagnate.

Why does it need a big backer? Erm, ZFS isn't that large or amazingly complex
code. It is *good* code, but does it take 100s of developers and a fortune to
develop? Erm, nope (and I'd bet it never had that at Sun either).

Why not overlap Oracle? What has it got to do with Oracle if we have split
into ZFS (Oracle) and OpenZFS in the future? OpenZFS will get whatever
features developers feel they want or need to develop for it.

This is the fundamental choice that open source ZFS, illumos and OpenIndiana
(and other distributions) have to make: what is their purpose? Is it a
free, compatible (though trailing) version of Oracle Solaris, OR a platform
that shared an ancestor with Oracle Solaris via Sun OpenSolaris but now is
its own evolutionary species, with no more connection than I have with a
15th cousin removed on my great, great, great grandfather's side.

This isn't even a theoretical what-if situation for me. I have a major
modification to ZFS (still being developed); it is based on nobody's needs
but mine, not Oracle's or anybody else's. It is what I felt I needed and ZFS
was the right base for it. Now, will that go into OpenZFS? Honestly I don't
know yet, because I'm not sure it would be wanted (it will be incompatible with
Oracle ZFS) and, personally and commercially, I'm not sure if it's the right
move to open source the feature.

I bet I'm not the only small developer out there in a similar situation. The
landscape is very unclear about what the community actually wants to do
going forward, and whether we will have (or even want) OpenZFS and Oracle
ZFS, or Oracle ZFS and 90% compatibles (always trailing), or Oracle ZFS + DevA
ZFS + DevB ZFS + DevC ZFS.

Bye,
Deano

de...@cloudpixies.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-05 Thread Edward Ned Harvey
 From: Richard Elling [mailto:richard.ell...@nexenta.com]
 
  I'll agree to call Nexenta a major commerical interest, in regards to
 contribution to the open source ZFS tree, if they become an officially
 supported OS on Dell, HP, and/or IBM hardware.
 
 NexentaStor is officially supported on Dell, HP, and IBM hardware.  The
only
 question is, what is your definition of 'support'?  Many NexentaStor

I don't want to argue about this, but I'll just try to clarify what I meant:

Presently, I have a dell server with officially supported solaris, and it's
as unreliable as pure junk.  It's just the backup server, so I'm free to
frequently create  destroy it... And as such, I frequently do recreate and
destroy it.  It is entirely stable running RHEL (centos) because Dell and
RedHat have a partnership with a serious number of human beings and machines
looking for and fixing any compatibility issues.  For my solaris
instability, I blame the fact that solaris developers don't do significant
quality assurance on non-sun hardware.  To become officially compatible,
the whole qualification process is like this:  Somebody installs it, doesn't
see any problems, and then calls it certified.  They reformat with
something else, and move on.  They don't build their business on that
platform, so they don't detect stability issues like the ones reported...
System crashes once per week and so forth.  Solaris therefore passes the
test, and becomes one of the options available on the drop-down menu for
OSes with a new server.  (Of course that's been discontinued by oracle, but
that's how it was in the past.)

Developers need to eat their own food.  Smoke your own crack.  Hardware
engineers at Dell need to actually use your OS on their hardware, for their
development efforts.  I would be willing to bet Sun hardware engineers use a
significant percentage of solaris servers for their work...  And guess what
solaris engineers don't use?  Non-sun hardware.  Pretty safe bet you won't
find any Dell servers in the server room where solaris developers do their
thing.

If you want to be taken seriously as an alternative storage option, you've
got to at LEAST be listed as a factory-distributed OS that is an option to
ship with the new server, and THEN, when people such as myself buy those
things, we've got to have a good enough experience that we don't all bitch
and flame about it afterward.

Nexenta, you need a real and serious partnership with Dell, HP, IBM.  Get
their developers to run YOUR OS on the servers which they use for
development.  Get them to sell your product bundled with their product.  And
dedicate real and serious engineering into bugfixes working with customers,
to truly identify root causes of instability, with a real OS development and
engineering and support group.  It's got to be STABLE, that's the #1
requirement.

I previously made the comparison...  Even close-source solaris  ZFS is a
better alternative to close-source netapp  wafl.  So for now, those are the
only two enterprise supportable options I'm willing to stake my career on,
and I'll buy Sun hardware with Solaris.  But I really wish I could feel
confident buying a cheaper Dell server and running ZFS on it.  Nexenta, if
you make yourself look like a serious competitor against solaris, and really
truly form an awesome stable partnership with Dell, I will happily buy your
stuff instead of Oracle.  Even if you are a little behind in feature
offering.  But I will not buy your stuff if I can't feel perfectly confident
in its stability.

Ever heard the phrase Nobody ever got fired for buying IBM.  You're the
little guys.  If you want to compete against the big guys, you've got to
kick ass.  And don't get sued into oblivion.

Even today's feature set is perfectly adequate for at least a couple of
years to come.  If you put all your effort into stability and bugfixes,
serious partnerships with Dell, HP, IBM, and become extremely professional
looking and stable, with fanatical support...  You don't have to worry about
new feature development for some while.  Stability is #1 and not
disappearing is a pretty huge threat right now.

Based on my experience, I would not recommend buying Dell with Solaris, even
if that were still an option.  If you want solaris, buy sun/oracle hardware,
because then you can actually expect it to work reliably.  And if solaris
isn't stable on dell ... then all the solaris derivatives including nexenta
can't be trusted either, no matter how much you claim it's supported.

Show me the HCL, and show me the partnership between your software engineers
and Dell's hardware engineers.  Make me believe there is a serious and
thorough qualification process.  Do a huge volume.  Your volume must be
large enough to justify dedicating some engineers to serious bugfix efforts
in the field.  Otherwise...  When I need to buy something stable...  I'm
going to buy solaris on sun hardware, because I know that's thoroughly
tried, tested, and stable.


Re: [zfs-discuss] A few questions

2011-01-05 Thread Bob Friesenhahn

On Wed, 5 Jan 2011, Edward Ned Harvey wrote:

with regards to ZFS and all the other projects relevant to solaris.)

I know in the case of SGE/OGE, it's officially closed source now.  As of Dec
31st, sunsource is being decommissioned, and the announcement of officially
closing the SGE source and decommissioning the open source community went out
on Dec 24th.  So all of this leads me to believe, with very little
reservation, that the new developments beyond zpool 28 are closed source
moving forward.  There's very little breathing room remaining for hope of
that being open sourced again.


I have no idea what you are talking about.  Best I can tell, SGE/OGE 
is a reference to Sun Grid Engine, which has nothing to do with zfs. 
The only announcement and discussion I can find via Google is written 
by you.  It was pretty clear even a year ago that Sun Grid Engine was 
going away.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] a few questions - Oracle

2011-01-04 Thread webdawg
It is sad that such a lovely file system is now in Oracle's unresponsive hands. 
 I hope someone builds another open file system just like it.  I could never 
find anything like it to protect my data like it does.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] a few questions - Oracle

2011-01-04 Thread Paul Gress

On 01/ 4/11 01:19 PM, webd...@gmail.com wrote:

It is sad that such a lovely file system is now in Oracle's unresponsive hands. 
 I hope someone builds another open file system just like it.  I could never 
find anything like it to protect my data like it does.

___


I have to reply to this.

While Oracle may not seem responsive, they are still innovating on ZFS.  I 
haven't seen it stand still since Oracle took over Sun.

Also, if you do your homework, there is a BSD version floating around, and a 
Linux version also.  To boot, Illumos has the last open source release, which 
brings it to OpenIndiana.

So what are you talking about?


Paul
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-04 Thread Robert Milkowski

 On 01/ 3/11 04:28 PM, Richard Elling wrote:

On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:


On 12/26/10 05:40 AM, Tim Cook wrote:



On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling 
richard.ell...@gmail.com mailto:richard.ell...@gmail.com wrote:



There are more people outside of Oracle developing for ZFS than
inside Oracle.
This has been true for some time now.




Pardon my skepticism, but where is the proof of this claim (I'm 
quite certain you know I mean no disrespect)?  Solaris11 Express was 
a massive leap in functionality and bugfixes to ZFS.  I've seen 
exactly nothing out of outside of Oracle in the time since it went 
closed.  We used to see updates bi-weekly out of Sun.  Nexenta 
spending hundreds of man-hours on a GUI and userland apps isn't work 
on ZFS.





Exactly my observation as well. I haven't seen any ZFS related 
development happening at Illumos or Nexenta, at least not yet.


I am quite sure you understand how pipelines work :-)



Are you suggesting that Nexenta is developing new ZFS features behind 
closed doors (like Oracle...) and then will share code later-on? Somehow 
I don't think so... but I would love to be proved wrong :)


--
Robert Milkowski
http://milek.blogspot.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-04 Thread Robert Milkowski

 On 01/ 4/11 11:35 PM, Robert Milkowski wrote:

On 01/ 3/11 04:28 PM, Richard Elling wrote:

On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:


On 12/26/10 05:40 AM, Tim Cook wrote:



On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling 
richard.ell...@gmail.com mailto:richard.ell...@gmail.com wrote:



There are more people outside of Oracle developing for ZFS than
inside Oracle.
This has been true for some time now.




Pardon my skepticism, but where is the proof of this claim (I'm 
quite certain you know I mean no disrespect)?  Solaris11 Express 
was a massive leap in functionality and bugfixes to ZFS.  I've seen 
exactly nothing out of outside of Oracle in the time since it 
went closed.  We used to see updates bi-weekly out of Sun.  Nexenta 
spending hundreds of man-hours on a GUI and userland apps isn't 
work on ZFS.





Exactly my observation as well. I haven't seen any ZFS related 
development happening at Illumos or Nexenta, at least not yet.


I am quite sure you understand how pipelines work :-)



Are you suggesting that Nexenta is developing new ZFS features behind 
closed doors (like Oracle...) and then will share code later-on? 
Somehow I don't think so... but I would love to be proved wrong :)


I mean I would love to see Nexenta start delivering real innovation in the 
Solaris/Illumos kernel (zfs, networking, ...), not that I would love to 
see it happening behind closed doors :)


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-04 Thread Garrett D'Amore

On 01/ 3/11 05:08 AM, Robert Milkowski wrote:

On 12/26/10 05:40 AM, Tim Cook wrote:



On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling 
richard.ell...@gmail.com mailto:richard.ell...@gmail.com wrote:



There are more people outside of Oracle developing for ZFS than
inside Oracle.
This has been true for some time now.




Pardon my skepticism, but where is the proof of this claim (I'm quite 
certain you know I mean no disrespect)?  Solaris11 Express was a 
massive leap in functionality and bugfixes to ZFS.  I've seen exactly 
nothing out of outside of Oracle in the time since it went closed. 
 We used to see updates bi-weekly out of Sun.  Nexenta spending 
hundreds of man-hours on a GUI and userland apps isn't work on ZFS.





Exactly my observation as well. I haven't seen any ZFS related 
development happening at Illumos or Nexenta, at least not yet.


Just because you've not seen it yet doesn't imply it isn't happening.  
Please be patient.


   - Garrett



--
Robert Milkowski
http://milek.blogspot.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] a few questions - Oracle

2011-01-04 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Paul Gress
 
 On 01/ 4/11 01:19 PM, webd...@gmail.com wrote:
 It is sad that such a lovely file system is now in Oracle's unresponsive
hands.  I
 hope someone builds another open file system just like it.  I could never
find
 anything like it to protect my data like it does.
 
 I have to reply to this.
 
 While Oracle may not seem responsive, they are innovating on the zfs
still.  I
 haven't seen it stand still when Oracle took over Sun.
 
 Also, if you do your homework, there is a BSD version floating around, and
a
 Linux version also.  To boot, Illumos has the last open source release
which
 brings it to Openindania.
 
 So what are you talking about?

Also, another open file system like it ... anything like it to protect my
data...

Go use Linux, and BTRFS.  It is GPL, and guess what: also developed by
Oracle.  But it's GPL, and it's included by default in many of the latest
Linux distributions.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-04 Thread Tim Cook
On Mon, Jan 3, 2011 at 5:56 AM, Garrett D'Amore garr...@nexenta.com wrote:

  On 01/ 3/11 05:08 AM, Robert Milkowski wrote:

 On 12/26/10 05:40 AM, Tim Cook wrote:



 On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com
  wrote:


 There are more people outside of Oracle developing for ZFS than inside
 Oracle.
 This has been true for some time now.




  Pardon my skepticism, but where is the proof of this claim (I'm quite
 certain you know I mean no disrespect)?  Solaris11 Express was a massive
 leap in functionality and bugfixes to ZFS.  I've seen exactly nothing out of
 outside of Oracle in the time since it went closed.  We used to see
 updates bi-weekly out of Sun.  Nexenta spending hundreds of man-hours on a
 GUI and userland apps isn't work on ZFS.



 Exactly my observation as well. I haven't seen any ZFS related development
 happening at Illumos or Nexenta, at least not yet.


 Just because you've not seen it yet doesn't imply it isn't happening.
 Please be patient.

- Garrett



Or, conversely, don't make claims of all this code contribution prior to
having anything to show for your claimed efforts.  Duke Nukem Forever was
going to be the greatest video game ever created... we were told to be
patient... we're still waiting for that too.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-04 Thread Tim Cook
On Tue, Jan 4, 2011 at 8:21 PM, Garrett D'Amore garr...@nexenta.com wrote:

  On 01/ 4/11 09:15 PM, Tim Cook wrote:



 On Mon, Jan 3, 2011 at 5:56 AM, Garrett D'Amore garr...@nexenta.comwrote:

  On 01/ 3/11 05:08 AM, Robert Milkowski wrote:

 On 12/26/10 05:40 AM, Tim Cook wrote:



 On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling 
 richard.ell...@gmail.com wrote:


 There are more people outside of Oracle developing for ZFS than inside
 Oracle.
 This has been true for some time now.




  Pardon my skepticism, but where is the proof of this claim (I'm quite
 certain you know I mean no disrespect)?  Solaris11 Express was a massive
 leap in functionality and bugfixes to ZFS.  I've seen exactly nothing out of
 outside of Oracle in the time since it went closed.  We used to see
 updates bi-weekly out of Sun.  Nexenta spending hundreds of man-hours on a
 GUI and userland apps isn't work on ZFS.



 Exactly my observation as well. I haven't seen any ZFS related development
 happening at Illumos or Nexenta, at least not yet.


  Just because you've not seen it yet doesn't imply it isn't happening.
 Please be patient.

- Garrett



  Or, conversely, don't make claims of all this code contribution prior to
 having anything to show for your claimed efforts.  Duke Nukem Forever was
 going to be the greatest video game ever created... we were told to be
 patient... we're still waiting for that too.



 Um, have you not been paying attention?  I've delivered quite a lot of
 contribution to illumos already, just not in ZFS.   Take a close look --
 there almost certainly wouldn't *be* an open source version of OS/Net had I
 not done the work to enable this in libc, kernel crypto, and other bits.
 This work is still higher priority than ZFS innovation for a variety of
 reasons -- mostly because we need a viable and supportable illumos upon
 which to build those ZFS innovations.

 That said, much of the ZFS work I hope to contribute to illumos needs more
 baking, but some of it is already open source in NexentaStor.  (You can for
 a start look at zfs-monitor, the WORM support, and support for hardware GZIP
 acceleration all as things that Nexenta has innovated in ZFS, and which are
 open source today if not part of illumos.  Check out
 http://www.nexenta.org for source code access.)

 So there, money placed where mouth is.  You?

- Garrett



The claim was that there are more people contributing code from outside of
Oracle than inside to zfs.  Your contributions to Illumos do absolutely
nothing to backup that claim.  ZFS-monitor is not ZFS code (it's an FMA
module), WORM also isn't ZFS code, it's an OS level operation, and GZIP
hardware acceleration is produced by Indra networks, and has absolutely
nothing to do with ZFS.  Does it help ZFS?  Sure, but that's hardly a code
contribution to ZFS when it's simply a hardware acceleration card that
accelerates ALL gzip code.

So, great job picking three projects that are not proof of developers
working on ZFS.  And great job not providing any proof to the claim there
are more developers working on ZFS outside of Oracle than within.

You're going to need a hell of a lot bigger bank account to cash the check
than what you've got.  As for me, I don't recall making any claims on this
list that I can't back up, so I'm not really sure what you're getting at.  I
can only assume the defensive tone of your email is because you've been
called out and can't backup the claims either.

So again: if you've got code in the works, great.  Talk about it when it's
ready.  Stop throwing out baseless claims that you have no proof of and then
fall back on just be patient, it's coming.  We've heard that enough from
Oracle and Sun already.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-03 Thread Robert Milkowski

 On 12/26/10 05:40 AM, Tim Cook wrote:



On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling 
richard.ell...@gmail.com mailto:richard.ell...@gmail.com wrote:



There are more people outside of Oracle developing for ZFS than
inside Oracle.
This has been true for some time now.




Pardon my skepticism, but where is the proof of this claim (I'm quite 
certain you know I mean no disrespect)?  Solaris11 Express was a 
massive leap in functionality and bugfixes to ZFS.  I've seen exactly 
nothing out of outside of Oracle in the time since it went closed. 
 We used to see updates bi-weekly out of Sun.  Nexenta spending 
hundreds of man-hours on a GUI and userland apps isn't work on ZFS.





Exactly my observation as well. I haven't seen any ZFS related 
development happening at Illumos or Nexenta, at least not yet.


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-03 Thread Bob Friesenhahn

On Mon, 3 Jan 2011, Robert Milkowski wrote:


Exactly my observation as well. I haven't seen any ZFS related 
development happening at Illumos or Nexenta, at least not yet.


There seems to be plenty of zfs work on the FreeBSD project, but 
primarily with porting the latest available sources to FreeBSD (going 
very well!) rather than with developing zfs itself.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-03 Thread Richard Elling
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:

 On 12/26/10 05:40 AM, Tim Cook wrote:
 
 
 
 On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com 
 wrote:
 
 There are more people outside of Oracle developing for ZFS than inside 
 Oracle.
 This has been true for some time now.
 
 
 
 
 
 Pardon my skepticism, but where is the proof of this claim (I'm quite 
 certain you know I mean no disrespect)?  Solaris11 Express was a massive 
 leap in functionality and bugfixes to ZFS.  I've seen exactly nothing out of 
 outside of Oracle in the time since it went closed.  We used to see 
 updates bi-weekly out of Sun.  Nexenta spending hundreds of man-hours on a 
 GUI and userland apps isn't work on ZFS.
 
 
 
 Exactly my observation as well. I haven't seen any ZFS related development 
 happening at Illumos or Nexenta, at least not yet.

I am quite sure you understand how pipelines work :-)
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-03 Thread Erik Trimble

On 1/3/2011 8:28 AM, Richard Elling wrote:

On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:

On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling 
richard.ell...@gmail.com mailto:richard.ell...@gmail.com wrote:



There are more people outside of Oracle developing for ZFS than
inside Oracle.
This has been true for some time now.


Pardon my skepticism, but where is the proof of this claim (I'm 
quite certain you know I mean no disrespect)?  Solaris11 Express was 
a massive leap in functionality and bugfixes to ZFS.  I've seen 
exactly nothing out of outside of Oracle in the time since it went 
closed.  We used to see updates bi-weekly out of Sun.  Nexenta 
spending hundreds of man-hours on a GUI and userland apps isn't work 
on ZFS.





Exactly my observation as well. I haven't seen any ZFS related 
development happening at Illumos or Nexenta, at least not yet.


I am quite sure you understand how pipelines work :-)
 -- richard




I'm getting pretty close to my pain threshold on the BP_rewrite stuff, 
since not having that feature's holding up a big chunk of work I'd like 
to push.


If anyone outside of Oracle is working on some sort of change to ZFS 
that will allow arbitrary movement/placement of pre-written slabs, can 
they please contact me?  I'm pretty much at the point where I'm going to 
start diving into that chunk of the source to see if there's something 
little old me can do, and I'd far rather help on someone else's 
implementation than have to do it myself from scratch.


I'd prefer a private contact, as I realize that such work may not be 
ready for public discussion yet.


Thanks, folks!


Oh, and this is completely just me, not Oracle talking in any way.

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2011-01-03 Thread Richard Elling
On Jan 3, 2011, at 2:10 PM, Erik Trimble wrote
 On 1/3/2011 8:28 AM, Richard Elling wrote:
 
 On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
 On 12/26/10 05:40 AM, Tim Cook wrote:
 On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling 
 richard.ell...@gmail.com wrote:
 
 There are more people outside of Oracle developing for ZFS than inside 
 Oracle.
 This has been true for some time now.
 
 
 Pardon my skepticism, but where is the proof of this claim (I'm quite 
 certain you know I mean no disrespect)?  Solaris11 Express was a massive 
 leap in functionality and bugfixes to ZFS.  I've seen exactly nothing out 
 of outside of Oracle in the time since it went closed.  We used to see 
 updates bi-weekly out of Sun.  Nexenta spending hundreds of man-hours on a 
 GUI and userland apps isn't work on ZFS.
 
 
 
 Exactly my observation as well. I haven't seen any ZFS related development 
 happening at Illumos or Nexenta, at least not yet.
 
 I am quite sure you understand how pipelines work :-)
  -- richard
 
 I'm getting pretty close to my pain threshold on the BP_rewrite stuff, since 
 not having that feature's holding up a big chunk of work I'd like to push.
 
 If anyone outside of Oracle is working on some sort of change to ZFS that 
 will allow arbitrary movement/placement of pre-written slabs, can they please 
 contact me?  I'm pretty much at the point where I'm going to start diving 
 into that chunk of the source to see if there's something little old me can 
 do, and I'd far rather help on someone else's implementation than have to do 
 it myself from scratch.
 
 I'd prefer a private contact, as I realize that such work may not be ready 
 for public discussion yet.
 
 Thanks, folks!
 
 Oh, and this is completely just me, not Oracle talking in any way.

Oracle doesn't seem to say much at all :-(

But for those interested, Nexenta is actively hiring people to work in this 
area.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-25 Thread Richard Elling
On Dec 21, 2010, at 5:05 AM, Deano wrote:
 
 The question therefore is, is there room in the software implementation to 
 achieve performance and reliability numbers similar to expensive drives 
 whilst using relative cheap drives?

For some definition of similar, yes. But using relatively cheap drives does
not mean the overall system cost will be cheap.  For example, $250 will buy
8.6K random IOPS @ 4KB in an SSD[1], but to do that with cheap disks might
require eighty 7,200 rpm SATA disks.
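
As a rough sanity check of that comparison (assuming, purely for illustration,
that a 7,200 rpm SATA disk sustains on the order of 100 random 4KB IOPS -- an
assumed figure, not one given here):

# Back-of-the-envelope check of the SSD-vs-spindles comparison above (Python).
ssd_iops = 8600    # the 8.6K random IOPS @ 4KB figure cited above
sata_iops = 100    # assumed random IOPS for one 7,200 rpm SATA disk
print(round(ssd_iops / sata_iops))   # ~86 disks, the same ballpark as "eighty"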

 ZFS is good but IMHO easy to see how it can be improved to better meet this 
 situation, I can’t currently say when this line of thinking and code will 
 move from research to production level use (tho I have a pretty good idea ;) 
 ) but I wouldn’t bet on the status quo lasting much longer. In some ways the 
 removal of OpenSolaris may actually be a good thing, as it's catalyzed a 
 number of developers from the view that zfs is Oracle led, to thinking “what 
 can we do with zfs code as a base”?

There are more people outside of Oracle developing for ZFS than inside Oracle.
This has been true for some time now.

 For example, how about sticking a cheap 80GiB commodity SSD in the storage 
 case. When a resilver or defrag is required, use it as a scratch space to 
 give you a block of fast IOPs storage space to accelerate the slow parts. 
 When its done secure erase and power it down, ready for the next time a 
 resilver needs to happen. The hardware is available, just needs someone to 
 write the software…

In general, SSDs will not speed resilver unless the resilvering disk is an SSD.

[1] 
http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/nand/feature/index.htm
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-25 Thread Tim Cook
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling
richard.ell...@gmail.comwrote:

 On Dec 21, 2010, at 5:05 AM, Deano wrote:


 The question therefore is, is there room in the software implementation to
 achieve performance and reliability numbers similar to expensive drives
 whilst using relative cheap drives?


 For some definition of similar, yes. But using relatively cheap drives
 does
 not mean the overall system cost will be cheap.  For example, $250 will buy
 8.6K random IOPS @ 4KB in an SSD[1], but to do that with cheap disks
 might
 require eighty 7,200 rpm SATA disks.

 ZFS is good but IMHO easy to see how it can be improved to better meet this
 situation, I can’t currently say when this line of thinking and code will
 move from research to production level use (tho I have a pretty good idea ;)
 ) but I wouldn’t bet on the status quo lasting much longer. In some ways the
  removal of OpenSolaris may actually be a good thing, as it's catalyzed a
 number of developers from the view that zfs is Oracle led, to thinking “what
 can we do with zfs code as a base”?


 There are more people outside of Oracle developing for ZFS than inside
 Oracle.
 This has been true for some time now.




Pardon my skepticism, but where is the proof of this claim (I'm quite
certain you know I mean no disrespect)?  Solaris11 Express was a massive
leap in functionality and bugfixes to ZFS.  I've seen exactly nothing out of
outside of Oracle in the time since it went closed.  We used to see
updates bi-weekly out of Sun.  Nexenta spending hundreds of man-hours on a
GUI and userland apps isn't work on ZFS.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-21 Thread Lanky Doodle
 It's worse on raidzN than on mirrors, because the
 number of items which must
 be read is higher in radizN, assuming you're using
 larger vdev's and
 therefore more items exist scattered about inside
 that vdev.  You therefore
 have a higher number of things which must be randomly
 read before you reach
 completion.

In that case, isn't the answer to have a dedicated parity disk (or 2 or 3 
depending on which raidz* is used), a la RAID-DP? Wouldn't this effectively be 
the 'same' as a mirror when resilvering (the only difference being parity vs 
actual data), as it's doing so from a single disk?

RAID-DP protects the parity disk against failure, so raidz1 probably wouldn't 
be sensible, given what happens if its single parity disk fails.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-21 Thread Phil Harman

On 21/12/2010 05:44, Richard Elling wrote:
On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com 
mailto:phil.har...@gmail.com wrote:

On 20/12/2010 13:59, Richard Elling wrote:
On Dec 20, 2010, at 2:42 AM, Phil Harman phil.har...@gmail.com 
mailto:phil.har...@gmail.com wrote:

Why does resilvering take so long in raidz anyway?
Because it's broken. There were some changes a while back that made 
it more broken.

broken is the wrong term here. It functions as designed and correctly
resilvers devices. Disagreeing with the design is quite different than
proving a defect.
It might be the wrong term in general, but I think it does apply in 
the budget home media server context of this thread.

If you only have a few slow drives, you don't have performance.
Like trying to win the Indianapolis 500 with a tricycle...


The context of this thread is a budget home media server (certainly not 
the Indy 500, but perhaps not as humble as tricycle touring either). And 
whilst it is a habit of the hardware advocate to blame the software ... 
and vice versa ... it's not much help to those of us trying to build 
good enough systems across the performance and availability spectrum.


I think we can agree that ZFS currently doesn't play well on cheap 
disks. I think we can also agree that the performance of ZFS 
resilvering is known to be suboptimal under certain conditions.

... and those conditions are also a strength. For example, most file
systems are nowhere near full. With ZFS you only resilver data. For those
who recall the resilver throttles in SVM or VXVM, you will appreciate not
having to resilver non-data.


I'd love to see the data and analysis for the assertion that most file 
systems are nowhere near full, discounting, of course, any trivial 
cases. In my experience, in any cost conscious scenario, in the home or 
the enterprise, the expectation is that I'll get to use the majority of 
the space I've paid for (generally through the nose from the storage 
silo team in the enterprise scenario). To borrow your illustration, even 
Indy 500 teams care about fuel consumption.


What I don't appreciate is having to resilver significantly more data 
than the drive can contain. But when it comes to the crunch, what I'd 
really appreciate was a bounded resilver time measured in hours not days 
or weeks.


For a long time at Sun, the rule was correctness is a constraint, 
performance is a goal. However, in the real world, performance is 
often also a constraint (just as a quick but erroneous answer is a 
wrong answer, so also, a slow but correct answer can also be wrong).


Then one brave soul at Sun once ventured that if Linux is faster, 
it's a Solaris bug! and to his surprise, the idea caught on. I later 
went on to tell people that ZFS delivered RAID where I = 
inexpensive, so I'm just a little frustrated when that promise 
becomes less respected over time. First it was USB drives (which I 
agreed with), now it's SATA (and I'm not so sure).

slow doesn't begin with an i :-)


Both ZFS and RAID promised to play in the inexpensive space.

There has been a lot of discussion, anecdotes and some data on this 
list.

slow because I use devices with poor random write(!) performance
is very different than broken.
Again, context is everything. For example, if someone was building a 
business critical NAS appliance from consumer grade parts, I'd be the 
first to say are you nuts?!

Unfortunately, the math does not support your position...


Actually, the math (e.g. raw drive metrics) doesn't lead me to expect 
such a disparity.


The resilver doesn't do a single pass of the drives, but uses a 
smarter temporal algorithm based on metadata.

A design that only does a single pass does not handle the temporal
changes. Many RAID implementations use a mix of spatial and temporal
resilvering and suffer with that design decision.
Actually, it's easy to see how a combined spatial and temporal 
approach could be implemented to an advantage for mirrored vdevs.
However, the current implementation has difficulty finishing the job 
if there's a steady flow of updates to the pool.

Please define current. There are many releases of ZFS, and
many improvements have been made over time. What has not
improved is the random write performance of consumer-grade
HDDs.
I was led to believe this was not yet fixed in Solaris 11, and that 
there are therefore doubts about what Solaris 10 update may see the 
fix, if any.
As far as I'm aware, the only way to get bounded resilver times is 
to stop the workload until resilvering is completed.

I know of no RAID implementation that bounds resilver times
for HDDs. I believe it is not possible. OTOH, whether a resilver
takes 10 seconds or 10 hours makes little difference in data
availability. Indeed, this is why we often throttle resilvering
activity. See previous discussions on this forum regarding the
dueling RFEs.
I don't share your disbelief or little difference analysis. If it 
is true that no 

Re: [zfs-discuss] A few questions

2010-12-21 Thread Deano
On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com wrote:

 If you only have a few slow drives, you don't have performance.

 Like trying to win the Indianapolis 500 with a tricycle...

 

Well, you can put a jet engine on a tricycle and perhaps win it… Or you can 
change the race course to only allow a tricycle space to move. In the context 
of storage we have 2 factors, hardware and software; having faster and more 
reliable spindles is no reason to suggest that better software can’t be used to 
beat it. The simple example is the ZIL SSD, where using some software and even a 
cheap commodity SSD will outperform any amount of expensive spindle drives on 
sync writes. Before the ZIL software it was easy to argue that the only way of 
speeding up writes was more, faster spindles.

 

The question therefore is, is there room in the software implementation to 
achieve performance and reliability numbers similar to expensive drives whilst 
using relatively cheap drives?

 

ZFS is good but IMHO easy to see how it can be improved to better meet this 
situation, I can’t currently say when this line of thinking and code will move 
from research to production level use (tho I have a pretty good idea ;) ) but I 
wouldn’t bet on the status quo lasting much longer. In some ways the removal of 
OpenSolaris may actually be a good thing, as it's catalyzed a number of 
developers from the view that ZFS is Oracle-led to thinking “what can we do 
with ZFS code as a base”?

 

For example, how about sticking a cheap 80GiB commodity SSD in the storage 
case. When a resilver or defrag is required, use it as a scratch space to give 
you a block of fast-IOPS storage space to accelerate the slow parts. When it's 
done, secure-erase it and power it down, ready for the next time a resilver 
needs to happen. The hardware is available, it just needs someone to write the software…

 

 

Bye,

Deano

 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-21 Thread Phil Harman

On 21/12/2010 13:05, Deano wrote:


On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com 
mailto:phil.har...@gmail.com wrote:


 If you only have a few slow drives, you don't have performance.

 Like trying to win the Indianapolis 500 with a tricycle...



Actually, I didn't say that, Richard did :)

Well you can put a jet engine on a tricycle and perhaps win it… Or you 
can change the race course to only allow a tricycle space to move. In 
the context of storage we have 2 factors hardware and software, having 
faster and more reliable spindles is no reason to suggest that better 
software can’t be used to beat it. The simple example is ZIL SSD, 
where using some software and  even a cheap commodity SSD will 
outperform sync writes than any amount of expensive spindle drives. 
Before ZIL software it was easy to argue that the only way of speeding 
up writes was more faster spindles.


The question therefore is, is there room in the software 
implementation to achieve performance and reliability numbers similar 
to expensive drives whilst using relative cheap drives?


ZFS is good but IMHO easy to see how it can be improved to better meet 
this situation, I can’t currently say when this line of thinking and 
code will move from research to production level use (tho I have a 
pretty good idea ;) ) but I wouldn’t bet on the status quo lasting 
much longer. In some ways the removal of OpenSolaris may actually be a 
good thing, as it's catalyzed a number of developers from the view 
that zfs is Oracle led, to thinking “what can we do with zfs code as a 
base”?


For example, how about sticking a cheap 80GiB commodity SSD in the 
storage case. When a resilver or defrag is required, use it as a 
scratch space to give you a block of fast IOPs storage space to 
accelerate the slow parts. When its done secure erase and power it 
down, ready for the next time a resilver needs to happen. The hardware 
is available, just needs someone to write the software…


Bye,

Deano


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-21 Thread Deano
Doh sorry about that, the threading got very confused on my mail reader!

 

Bye,

Deano

 

From: Phil Harman [mailto:phil.har...@gmail.com] 
Sent: 21 December 2010 13:12
To: Deano
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] A few questions

 

On 21/12/2010 13:05, Deano wrote: 

On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com wrote:

 If you only have a few slow drives, you don't have performance.

 Like trying to win the Indianapolis 500 with a tricycle...


Actually, I didn't say that, Richard did :)




Well you can put a jet engine on a tricycle and perhaps win it… Or you can 
change the race course to only allow a tricycle space to move. In the context 
of storage we have 2 factors hardware and software, having faster and more 
reliable spindles is no reason to suggest that better software can’t be used to 
beat it. The simple example is ZIL SSD, where using some software and  even a 
cheap commodity SSD will outperform sync writes than any amount of expensive 
spindle drives. Before ZIL software it was easy to argue that the only way of 
speeding up writes was more faster spindles.

 

The question therefore is, is there room in the software implementation to 
achieve performance and reliability numbers similar to expensive drives whilst 
using relative cheap drives?

 

ZFS is good but IMHO easy to see how it can be improved to better meet this 
situation, I can’t currently say when this line of thinking and code will move 
from research to production level use (tho I have a pretty good idea ;) ) but I 
wouldn’t bet on the status quo lasting much longer. In some ways the removal of 
OpenSolaris may actually be a good thing, as it's catalyzed a number of 
developers from the view that zfs is Oracle led, to thinking “what can we do 
with zfs code as a base”?

 

For example, how about sticking a cheap 80GiB commodity SSD in the storage 
case. When a resilver or defrag is required, use it as a scratch space to give 
you a block of fast IOPs storage space to accelerate the slow parts. When its 
done secure erase and power it down, ready for the next time a resilver needs 
to happen. The hardware is available, just needs someone to write the software…

 

 

Bye,

Deano

 

 
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-21 Thread Edward Ned Harvey
 From: edmud...@mail.bounceswoosh.org
 [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama
 
 On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote:
 If there is no correlation between on-disk order of blocks for different
 disks within the same vdev, then all hope is lost; it's essentially
 impossible to optimize the resilver/scrub order unless the on-disk order
of
 multiple disks is highly correlated or equal by definition.
 
 Very little is impossible.
 
 Drives have been optimally ordering seeks for 35+ years.  I'm guessing

Unless your drive is able to queue up a request to read every single used
part of the drive...  Which is larger than the command queue for any
reasonable drive in the world...  The point is, in order to be optimal you
have to eliminate all those seeks, and perform sequential reads only.  The
only seeks you should do are to skip over unused space.

If you're able to sequentially read the whole drive, skipping all the unused
space, then you're guaranteed to complete faster (or equal) than either (a)
sequentially reading the whole drive, or (b) seeking all over the drive to
read the used parts in random order.
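
To make the ordering argument concrete, here is a minimal sketch in Python (a
toy model, not ZFS code; the extent list is invented for the example): given
the allocated extents of one disk, reading them in ascending offset order
turns the pass into a single forward sweep whose only seeks are the jumps
across unallocated gaps.

# Allocated (offset, length) extents for one disk, in arbitrary discovery order.
extents = [(900_000, 128), (10_000, 256), (500_000, 64), (12_000, 128)]

# Sorting by starting offset yields one sequential pass over the used space.
for offset, length in sorted(extents):
    print("read", length, "blocks at offset", offset)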

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-21 Thread Edward Ned Harvey
 From: Richard Elling [mailto:richard.ell...@gmail.com]
 
  Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where
disk3
  is resilvering).  You find some way of ordering all the used blocks of
  disk1...  Which means disk1 will be able to read in optimal order and
speed.
 
 Sounds like prefetching :-)

Ok.  Prefetch every used sector in the pool.  Problem solved.  Let the disks
sort all the requests into on-disk order.  Unless perhaps the number of
requests would exceed the limits of what the drive is able to sort ...
Which seems ... more than likely.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-21 Thread Eric D. Mudama

On Tue, Dec 21 at  8:24, Edward Ned Harvey wrote:

From: edmud...@mail.bounceswoosh.org
[mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama

On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote:
If there is no correlation between on-disk order of blocks for different
disks within the same vdev, then all hope is lost; it's essentially
impossible to optimize the resilver/scrub order unless the on-disk order

of

multiple disks is highly correlated or equal by definition.

Very little is impossible.

Drives have been optimally ordering seeks for 35+ years.  I'm guessing


Unless your drive is able to queue up a request to read every single used
part of the drive...  Which is larger than the command queue for any
reasonable drive in the world...  The point is, in order to be optimal you
have to eliminate all those seeks, and perform sequential reads only.  The
only seeks you should do are to skip over unused space.


I don't think you read my whole post.  I was saying this seek
calculation pre-processing would have to be done by the host server,
and while not impossible, is not trivial.  Present the next 32 seeks
to each device while the pre-processor works on the complete list of
future seeks, and the drive will do as well as possible.


If you're able to sequentially read the whole drive, skipping all the unused
space, then you're guaranteed to complete faster (or equal) than either (a)
sequentially reading the whole drive, or (b) seeking all over the drive to
read the used parts in random order.


Yes, I understand how that works.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-21 Thread Edward Ned Harvey
 From: edmud...@mail.bounceswoosh.org
 [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama
 
 Unless your drive is able to queue up a request to read every single used
 part of the drive...  Which is larger than the command queue for any
 reasonable drive in the world...  The point is, in order to be optimal
you
 have to eliminate all those seeks, and perform sequential reads only.
The
 only seeks you should do are to skip over unused space.
 
 I don't think you read my whole post.  I was saying this seek
 calculation pre-processing would have to be done by the host server,
 and while not impossible, is not trivial.  Present the next 32 seeks
 to each device while the pre-processor works on the complete list of
 future seeks, and the drive will do as well as possible.

I did read that, but now I think, perhaps I misunderstand it, or you
misunderstood me?  I am thinking...  If you're just queueing up a few reads
at a time (less than infinity, or less than 99% of the pool) ...  I would
not assume that these 32 seeks are even remotely sequential ...  I mean ...
32 blocks in a pool of presumably millions of blocks ...  I would assume they
are essentially random, are they not?

In my mind, which is likely wrong or at least oversimplified, I think if you
want to order the list of blocks to read according to disk order (which
should at least be theoretically possible on mirrors, but perhaps not even
physically possible on raidz)...  You would have to first generate a list of
all the blocks to be read, and then sort it.  Rough estimate, for any pool
of a reasonable size, that sounds like some GB of ram to me.

Maybe there's a less-than-perfect sort algorithm which has a much lower
memory footprint?  Like a simple hashing algorithm that will guarantee the
next few thousand seeks are in disk order...  Although they will skip or
jump over many blocks that will have to be done later ... An algorithm which
is not a perfect sort, but given some repetition and multiple passes over
the disk, might achieve an acceptable level of performance versus memory
footprint...
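
Something like that bucketed idea could look like the sketch below (made-up offsets; a real implementation would re-walk the metadata for each region rather than holding every offset in RAM):

  import random

  # Sketch: approximate disk-order resilver reads with bounded memory.
  # Offsets are binned by coarse LBA region; each region is sorted and
  # issued before the next, so only one region's list is "hot" at a time.
  def region_ordered(block_offsets, disk_size, n_regions=1024):
      regions = [[] for _ in range(n_regions)]
      for off in block_offsets:                  # cheap binning pass
          idx = min(n_regions - 1, off * n_regions // disk_size)
          regions[idx].append(off)
      for r in regions:                          # one pass per region
          r.sort()                               # small, cheap sort
          for off in r:
              yield off                          # issue read at 'off'

  fake_offsets = [random.randrange(10**12) for _ in range(100000)]
  ordered = list(region_ordered(fake_offsets, disk_size=10**12))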

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Lanky Doodle
Thanks Edward.

I do agree about mirrored rpool (equivalent to Windows OS volume); not doing it 
goes against one of my principles when building enterprise servers.

Is there any argument against using the rpool for all data storage as well as 
being the install volume?

Say for example I chucked 15x 1TB disks in there and created a mirrored rpool 
during installation, using 2 disks. If I added another 6 mirrors (12 disks) to 
it that would give me an rpool of 7TB. The 15th disk being a spare.

Or, say I selected 3 disks during install, does this create a 3 way mirrored 
rpool or does it give you the option of creating raidz? If so, I could then 
create a further 4x 3 drive raidz's, giving me a 10TB rpool.

Or, I could use 2 smaller disks (say 80GB) for the rpool, then create 4x 3 
drive raidz's, giving me an 8TB rpool. Again this gives me a spare disk.

Any of these three should keep resilvering times to a minimum, compared with, say,
one big raidz2 of 13 disks.

Why does resilvering take so long in raidz anyway?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Lanky Doodle
Oh, does anyone know if resilvering efficiency is improved or fixed in Solaris 
11 Express, as that is what I'm using.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Phil Harman
 Why does resilvering take so long in raidz anyway?

Because it's broken. There were some changes a while back that made it more 
broken.

There has been a lot of discussion, anecdotes and some data on this list. 

The resilver doesn't do a single pass of the drives, but uses a smarter 
temporal algorithm based on metadata.

However, the current implementation has difficulty finishing the job if there's a 
steady flow of updates to the pool.

As far as I'm aware, the only way to get bounded resilver times is to stop the 
workload until resilvering is completed.

The problem exists for mirrors too, but is not as marked because mirror 
reconstruction is inherently simpler.

I believe Oracle is aware of the problem, but most of the core ZFS team has 
left. And of course, a fix for Oracle Solaris no longer means a fix for the 
rest of us.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Deano
Hi,
Which brings up an interesting question... 

IF it were fixed in for example illumos or freebsd is there a plan for how
to handle possible incompatible zfs implementations?

Currently the basic version numbering only works because it implies a single
stream of development. Now, with multiple possible streams, does ZFS need to
move to a feature-bit system, or are we going to end up with forks or
multiple incompatible versions?

Thanks,
Deano

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Phil Harman
Sent: 20 December 2010 10:43
To: Lanky Doodle
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] A few questions

 Why does resilvering take so long in raidz anyway?

Because it's broken. There were some changes a while back that made it more
broken.

There has been a lot of discussion, anecdotes and some data on this list. 

The resilver doesn't do a single pass of the drives, but uses a smarter
temporal algorithm based on metadata.

However, the current implentation has difficulty finishing the job if
there's a steady flow of updates to the pool.

As far as I'm aware, the only way to get bounded resilver times is to stop
the workload until resilvering is completed.

The problem exists for mirrors too, but is not as marked because mirror
reconstruction is inherently simpler.

I believe Oracle is aware of the problem, but most of the core ZFS team has
left. And of course, a fix for Oracle Solaris no longer means a fix for the
rest of us.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Lanky Doodle
 I believe Oracle is aware of the problem, but most of
 the core ZFS team has left. And of course, a fix for
 Oracle Solaris no longer means a fix for the rest of
 us.

OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want 
to commit to a file system that is 'broken' and may not be fully fixed, if at 
all.

Hmnnn...
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Phil Harman

On 20/12/2010 11:03, Deano wrote:

Hi,
Which brings up an interesting question...

IF it were fixed in for example illumos or freebsd is there a plan for how
to handle possible incompatible zfs implementations?

Currently the basic version numbering only works as it implies only one
stream of development, now with multiple possible stream does ZFS need to
move to a feature bit system or are we going to have to have forks or
multiple incompatible versions?

Thanks,
Deano


Changes to the resilvering implementation don't necessarily require 
changes to the on disk format (although they could). Of course, there 
might be an issue moving a pool mid-resilver from one implementation to 
another.


With arguably considerably more ZFS expertise outside Oracle than in, 
there's a good chance the community will get to a fix first. It would 
then be interesting to see whether NIH prevails, or perhaps even a new 
spirit of share and share alike.


You may say I'm a dreamer ...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Phil Harman

On 20/12/2010 11:29, Lanky Doodle wrote:

I believe Oracle is aware of the problem, but most of
the core ZFS team has left. And of course, a fix for
Oracle Solaris no longer means a fix for the rest of
us.

OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I want 
to committ to a file system that is 'broken' and may not be fully fixed, if at 
all.

Hmnnn...


My home server is still running snv_82, and my iMac is running Apple's 
last public beta release for Leopard. The way I see it, the on-disk 
format is sound, and the basic always consistent on disk promise seems 
to be worth something. My files are read-mostly, and performance isn't 
an issue for me. ZFS has protected my data for several years now in the 
face of various hardware issues. I'll upgrade my NAS appliance to 
OpenSolaris snv_134b sometime soon, but as far as I can tell, I can't 
use Oracle Solaris 11 Express for licensing reasons (I have backups of 
business data). I'll be watching Illumos with interest, but snv_82 has 
served me well for 3 years, so I figure snv_134b probably has quite a 
lot of useful life left in it, and maybe then btrfs will be ready for 
prime time?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Joerg Schilling
Phil Harman phil.har...@gmail.com wrote:

 Changes to the resilvering implementation don't necessarily require 
 changes to the on disk format (although they could). Of course, there 
 might be an issue moving a pool mid-resilver from one implementation to 
 another.

We seem to come to a similar problem as with UFS 20 years ago. At that time,
Sun did enhance the UFS on-disk format but the *BSDs did not follow this change 
even though the format change was documented in the related include files.

For a future ZFS development, there may be a need to allow an implementation to 
implement on-disk version 1..21 + 24 and another implementation to support 
on-disk version 1..23 + 25.

These thoughts of course are void in case that Oracle continues the OSS 
decisions for Solaris and other Solaris variants can import the code related to
recent enhancements.



Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Richard Elling

On Dec 20, 2010, at 2:42 AM, Phil Harman phil.har...@gmail.com wrote:

 Why does resilvering take so long in raidz anyway?
 
 Because it's broken. There were some changes a while back that made it more 
 broken.

"Broken" is the wrong term here. It functions as designed and correctly 
resilvers devices. Disagreeing with the design is quite different from
proving a defect.

 There has been a lot of discussion, anecdotes and some data on this list. 

"Slow because I use devices with poor random write(!) performance"
is very different from "broken".

 The resilver doesn't do a single pass of the drives, but uses a smarter 
 temporal algorithm based on metadata.

A design that only does a single pass does not handle the temporal
changes. Many RAID implementations use a mix of spatial and temporal
resilvering and suffer with that design decision.

 However, the current implentation has difficulty finishing the job if there's 
 a steady flow of updates to the pool.

Please define current. There are many releases of ZFS, and
many improvements have been made over time. What has not
improved is the random write performance of consumer-grade
HDDs.

 As far as I'm aware, the only way to get bounded resilver times is to stop 
 the workload until resilvering is completed.

I know of no RAID implementation that bounds resilver times
for HDDs. I believe it is not possible. OTOH, whether a resilver
takes 10 seconds or 10 hours makes little difference in data
availability. Indeed, this is why we often throttle resilvering
activity. See previous discussions on this forum regarding the
dueling RFEs.

 The problem exists for mirrors too, but is not as marked because mirror 
 reconstruction is inherently simpler.

Resilver time is bounded by the random write performance of
the resilvering device. Mirroring or raidz make no difference.

 I believe Oracle is aware of the problem, but most of the core ZFS team has 
 left. And of course, a fix for Oracle Solaris no longer means a fix for the 
 rest of us.

Some improvements were made post-b134 and pre-b148.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Lanky Doodle
Thanks relling.

I suppose at the end of the day any file system/volume manager has it's flaws 
so perhaps it's better to look at the positives of each and decide based on 
them.

So, back to my question above, is there a deciding argument [i]against[/i] 
putting data on the install volume (rpool). Forget about mirroring for a sec;

1) Select 3 disks during install creating raidz1. Create a further 4x 3 drive 
raidz1's, giving me a 10TB rpool with no spare disks

2) Select 5 disks during install creating raidz1. Create a further 2x 5 drive 
raidsz1's giving me a 12TB rpool with no spare disks

3) Select 7 disks during install creating raidz1. Create a further 7 drive 
raidz1 giving me 12TB rpool with 1 spare disk

As there is no space gain between 2) and 3) there is no point going for 3), 
other than having a spare disk, but resilver times would be slower.

So it comes down to 1) and 2). Neither offers spare disks, but 1) would offer 
faster resilver times and survive up to 5 simultaneous disk failures (one per 
vdev), while 2) would offer 2TB extra space and survive up to 3 simultaneous 
disk failures.
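
For reference, the raw-capacity arithmetic behind those options (1TB drives, raidz1 giving up one disk per vdev, ignoring metadata and formatting overhead):

  # Usable space for N raidz1 vdevs of W disks each, built from 1TB drives.
  def raidz1_tb(vdevs, width, disk_tb=1):
      return vdevs * (width - 1) * disk_tb

  print(raidz1_tb(5, 3))   # option 1: 5x 3-disk raidz1 -> 10TB, 15 disks, no spare
  print(raidz1_tb(3, 5))   # option 2: 3x 5-disk raidz1 -> 12TB, 15 disks, no spare
  print(raidz1_tb(2, 7))   # option 3: 2x 7-disk raidz1 -> 12TB, 14 disks + 1 spare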

FYI, I am using Samsung SpinPoint F2's, which have the variable RPM speeds 
(http://www.scan.co.uk/products/1tb-samsung-hd103si-ecogreen-f2-sata-3gb-s-32mb-cache-89-ms-ncq)

I may wait at least until I get the next 4 drives in (I actually have 6 at the 
mo, not 5) taking me to 10, before migrating to ZFS so plenty of time to think 
about it and hopefully time for them to fix resilvering! ;-)

Thanks again...
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Phil Harman

On 20/12/2010 13:59, Richard Elling wrote:
On Dec 20, 2010, at 2:42 AM, Phil Harman phil.har...@gmail.com 
mailto:phil.har...@gmail.com wrote:



Why does resilvering take so long in raidz anyway?
Because it's broken. There were some changes a while back that made 
it more broken.


broken is the wrong term here. It functions as designed and correctly
resilvers devices. Disagreeing with the design is quite different than
proving a defect.


It might be the wrong term in general, but I think it does apply in the 
budget home media server context of this thread. I think we can agree 
that ZFS currently doesn't play well on cheap disks. I think we can also 
agree that the performance of ZFS resilvering is known to be suboptimal 
under certain conditions.


For a long time at Sun, the rule was correctness is a constraint, 
performance is a goal. However, in the real world, performance is often 
also a constraint (just as a quick but erroneous answer is a wrong 
answer, so also, a slow but correct answer can also be wrong).


Then one brave soul at Sun once ventured that "if Linux is faster, it's 
a Solaris bug!" and to his surprise, the idea caught on. I later went on 
to tell people that ZFS delivered RAID where I = inexpensive, so I'm 
just a little frustrated when that promise becomes less respected over 
time. First it was USB drives (which I agreed with), now it's SATA (and 
I'm not so sure).


There has been a lot of discussion, anecdotes and some data on this 
list.


slow because I use devices with poor random write(!) performance
is very different than broken.


Again, context is everything. For example, if someone was building a 
business critical NAS appliance from consumer grade parts, I'd be the 
first to say "are you nuts?!"


The resilver doesn't do a single pass of the drives, but uses a 
smarter temporal algorithm based on metadata.


A design that only does a single pass does not handle the temporal
changes. Many RAID implementations use a mix of spatial and temporal
resilvering and suffer with that design decision.


Actually, it's easy to see how a combined spatial and temporal approach 
could be implemented to an advantage for mirrored vdevs.


However, the current implentation has difficulty finishing the job if 
there's a steady flow of updates to the pool.


Please define current. There are many releases of ZFS, and
many improvements have been made over time. What has not
improved is the random write performance of consumer-grade
HDDs.


I was led to believe this was not yet fixed in Solaris 11, and that 
there are therefore doubts about what Solaris 10 update may see the fix, 
if any.


As far as I'm aware, the only way to get bounded resilver times is to 
stop the workload until resilvering is completed.


I know of no RAID implementation that bounds resilver times
for HDDs. I believe it is not possible. OTOH, whether a resilver
takes 10 seconds or 10 hours makes little difference in data
availability. Indeed, this is why we often throttle resilvering
activity. See previous discussions on this forum regarding the
dueling RFEs.


I don't share your disbelief or your "little difference" analysis. If it is 
true that no current implementation succeeds, isn't that a great 
opportunity to change the rules? Wasn't resilver time vs. availability 
a major factor in Adam Leventhal's paper introducing the need for 
RAIDZ3?


The appropriateness or otherwise of resilver throttling depends on the 
context. If I can tolerate further failures without data loss (e.g. 
RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if 
I can recover business critical data in a timely manner, then great. But 
there may come a point where I would rather take a short term 
performance hit to close the window on total data loss.


The problem exists for mirrors too, but is not as marked because 
mirror reconstruction is inherently simpler.


Resilver time is bounded by the random write performance of
the resilvering device. Mirroring or raidz make no difference.


This only holds in a quiesced system.

I believe Oracle is aware of the problem, but most of the core ZFS 
team has left. And of course, a fix for Oracle Solaris no longer 
means a fix for the rest of us.


Some improvements were made post-b134 and pre-b148.


That is, indeed, good news.


 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Mark Sandrock

On Dec 18, 2010, at 12:23 PM, Lanky Doodle wrote:

 Now this is getting really complex, but can you have server failover in ZFS, 
 much like DFS-R in Windows - you point clients to a clustered ZFS namespace 
 so if a complete server failed nothing is interrupted.

This is the purpose of an Amber Road dual-head cluster (7310C, 7410C, etc.) -- 
not only the storage pool fails over,
but also the server IP address fails over, so that NFS, etc. shares remain 
active, when one storage head goes down.

Amber Road uses ZFS, but the clustering and failover are not related to the 
filesystem type.

Mark
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Lanky Doodle
 
  I believe Oracle is aware of the problem, but most of
  the core ZFS team has left. And of course, a fix for
  Oracle Solaris no longer means a fix for the rest of
  us.
 
 OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I
want
 to committ to a file system that is 'broken' and may not be fully fixed,
if at all.

ZFS is not broken.  It is, however, a weak spot, that resilver is very
inefficient.  For example:

On my server, which is made up of 10krpm SATA drives, 1TB each...  My drives
can each sustain 1Gbit/sec sequential read/write.  This means, if I needed
to resilver the entire drive (in a mirror) sequentially, it would take ...
8,000 sec = 133 minutes.  About 2 hours.  In reality, I have ZFS mirrors,
and disks are around 70% full, and resilver takes 12-14 hours.

So although resilver is broken by some standards, it is bounded, and you
can limit it to something which is survivable, by using mirrors instead of
raidz.  For most people, even using 5-disk, or 7-disk raidzN will still be
fine.  But you start getting unsustainable if you get up to 21-disk raidz3,
for example.
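
The arithmetic behind those numbers, spelled out (1 Gbit/s taken as ~125 MB/s sustained; purely idealized):

  # Idealized sequential resilver of a 1TB mirror member vs. reality.
  disk_bytes = 1e12                  # 1TB drive
  seq_rate   = 125e6                 # ~1 Gbit/s sustained = 125 MB/s
  ideal_s    = disk_bytes / seq_rate
  print("ideal sequential: %.0f s (~%.0f min)" % (ideal_s, ideal_s / 60))
  # -> 8000 s, about 133 minutes.  Observed on ~70%-full mirrors: 12-14
  #    hours, i.e. the real resilver is seek-bound, not bandwidth-bound.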

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Lanky Doodle
 
 Is there any argument against using the rpool for all data storage as well
as
 being the install volume?

Generally speaking, you can't do it.
The rpool is only supported on mirrors, not raidz.  I believe this is
because you need rpool in order to load the kernel, and until the kernel is
loaded, there's just no reasonable way to have a fully zfs-aware,
supports-every-feature bootloader able to read rpool in order to fetch the
kernel.

Normally, you'll dedicate 2 disks to the OS, and then you build additional
separate data pools.  If you absolutely need all the disk space of the OS
disks, then you partition the OS into a smaller section of the OS disks and
assign the remaining space to some pool.  But doing that partitioning scheme
can be complex, and if you're not careful, risky.  I don't advise it unless
you truly have your back against a wall for more disk space.


 Why does resilvering take so long in raidz anyway?

There are some really long and sometimes complex threads in this mailing
list discussing that.  Fundamentally ... First of all, it's not always true.
It depends on your usage behavior and the type of disks you're using.  But
the typical usage includes reading & writing a lot of files, essentially
randomly over time, creating and deleting snapshots, using spindle disks, so
the typical usage behavior does have a resilver performance problem.

The root cause of the problem is that ZFS does not resilver the whole
disk...  It only resilvers the used portions of the disk.  Sounds like a
performance enhancer, right?  It would be, if the disks were mostly empty
... or if ZFS were resilvering a partial disk, in order according to disk
layout.  Unfortunately, it's resilvering according to the temporal order
blocks were written, and usually a disk is significantly full (say, 50% or
more) and as such, the disks have to thrash all around, performing all sorts
of random reads, until eventually it can read all the used parts in random
order.

It's worse on raidzN than on mirrors, because the number of items which must
be read is higher in radizN, assuming you're using larger vdev's and
therefore more items exist scattered about inside that vdev.  You therefore
have a higher number of things which must be randomly read before you reach
completion.
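
A toy model of that thrashing (all numbers assumed: ~1 ms settle plus up to ~9 ms of stroke per seek, 100 MB/s transfer, 128K blocks), comparing the same set of reads issued in block-birth order versus sorted by disk offset:

  import random

  BLK      = 128 * 1024              # assumed 128K blocks
  XFER_BPS = 100e6                   # assumed 100 MB/s sustained transfer
  DISK     = 10**12                  # 1 TB disk

  def seek_s(frm, to):
      # crude seek model: ~1 ms settle + up to ~9 ms for a full stroke
      return 0.001 + 0.009 * abs(to - frm) / DISK

  def read_time(offsets):
      t, head = 0.0, 0.0
      for off in offsets:
          if off != head:
              t += seek_s(head, off)     # non-contiguous block -> seek
          t += BLK / XFER_BPS            # transfer the block
          head = off + BLK
      return t

  offs = [random.randrange(0, DISK, BLK) for _ in range(200000)]
  print("temporal (random) order: %.0f s" % read_time(offs))
  print("LBA-sorted order       : %.0f s" % read_time(sorted(offs)))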

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Saxon, Will
 -Original Message-
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 Sent: Monday, December 20, 2010 11:46 AM
 To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] A few questions
 
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Lanky Doodle
 
   I believe Oracle is aware of the problem, but most of
   the core ZFS team has left. And of course, a fix for
   Oracle Solaris no longer means a fix for the rest of
   us.
 
  OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I
 want
  to committ to a file system that is 'broken' and may not be fully fixed,
 if at all.
 
 ZFS is not broken.  It is, however, a weak spot, that resilver is very
 inefficient.  For example:
 
 On my server, which is made up of 10krpm SATA drives, 1TB each...  My
 drives
 can each sustain 1Gbit/sec sequential read/write.  This means, if I needed
 to resilver the entire drive (in a mirror) sequentially, it would take ...
 8,000 sec = 133 minutes.  About 2 hours.  In reality, I have ZFS mirrors,
 and disks are around 70% full, and resilver takes 12-14 hours.
 
 So although resilver is broken by some standards, it is bounded, and you
 can limit it to something which is survivable, by using mirrors instead of
 raidz.  For most people, even using 5-disk, or 7-disk raidzN will still be
 fine.  But you start getting unsustainable if you get up to 21-disk radiz3
 for example.

This argument keeps coming up on the list, but I don't see where anyone has 
made a good suggestion about whether this can even be 'fixed' or how it would 
be done.

As I understand it, you have two basic types of array reconstruction: in a 
mirror you can make a block-by-block copy and that's easy, but in a parity 
array you have to perform a calculation on the existing data and/or existing 
parity to reconstruct the missing piece. This is pretty easy when you can 
guarantee that all your stripes are the same width, start/end on the same 
sectors/boundaries/whatever and thus know a piece of them lives on all drives 
in the set. I don't think this is possible with ZFS since we have variable 
stripe width. A failed disk d may or may not contain data from stripe s (or 
transaction t). This information has to be discovered by looking at the 
transaction records. Right?
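
For the fixed-width case the reconstruction itself is just parity arithmetic; here is a toy single-parity example (plain XOR over equal-width columns, which deliberately ignores ZFS's variable stripe width):

  # Toy RAID-5-style reconstruction: parity is the XOR of the data
  # columns, so any one missing column is the XOR of all the others.
  def xor_bytes(a, b):
      return bytes(x ^ y for x, y in zip(a, b))

  d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"        # data columns
  parity = xor_bytes(xor_bytes(d0, d1), d2)     # written at stripe time

  # the disk holding d1 dies; rebuild it from the survivors plus parity
  rebuilt = xor_bytes(xor_bytes(d0, d2), parity)
  assert rebuilt == d1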

Can someone speculate as to how you could rebuild a variable stripe width array 
without replaying all the available transactions? I am no filesystem engineer 
but I can't wrap my head around how this could be handled any better than it 
already is. I've read that resilvering is throttled - presumably to keep 
performance degradation to a minimum during the process - maybe this could be a 
tunable (e.g. priority: low, normal, high)? 

Do we know if resilvers on a mirror are actually handled differently from those 
on a raidz?

Sorry if this has already been explained. I think this is an issue that 
everyone who uses ZFS should understand completely before jumping in, because 
the behavior (while not 'wrong') is clearly NOT the same as with more 
conventional arrays.

-Will
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Erik Trimble

On 12/20/2010 9:20 AM, Saxon, Will wrote:

-Original Message-
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
Sent: Monday, December 20, 2010 11:46 AM
To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] A few questions


From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Lanky Doodle


I believe Oracle is aware of the problem, but most of
the core ZFS team has left. And of course, a fix for
Oracle Solaris no longer means a fix for the rest of
us.

OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I

want

to committ to a file system that is 'broken' and may not be fully fixed,

if at all.

ZFS is not broken.  It is, however, a weak spot, that resilver is very
inefficient.  For example:

On my server, which is made up of 10krpm SATA drives, 1TB each...  My
drives
can each sustain 1Gbit/sec sequential read/write.  This means, if I needed
to resilver the entire drive (in a mirror) sequentially, it would take ...
8,000 sec = 133 minutes.  About 2 hours.  In reality, I have ZFS mirrors,
and disks are around 70% full, and resilver takes 12-14 hours.

So although resilver is broken by some standards, it is bounded, and you
can limit it to something which is survivable, by using mirrors instead of
raidz.  For most people, even using 5-disk, or 7-disk raidzN will still be
fine.  But you start getting unsustainable if you get up to 21-disk radiz3
for example.

This argument keeps coming up on the list, but I don't see where anyone has 
made a good suggestion about whether this can even be 'fixed' or how it would 
be done.

As I understand it, you have two basic types of array reconstruction: in a 
mirror you can make a block-by-block copy and that's easy, but in a parity 
array you have to perform a calculation on the existing data and/or existing 
parity to reconstruct the missing piece. This is pretty easy when you can 
guarantee that all your stripes are the same width, start/end on the same 
sectors/boundaries/whatever and thus know a piece of them lives on all drives 
in the set. I don't think this is possible with ZFS since we have variable 
stripe width. A failed disk d may or may not contain data from stripe s (or 
transaction t). This information has to be discovered by looking at the 
transaction records. Right?

Can someone speculate as to how you could rebuild a variable stripe width array 
without replaying all the available transactions? I am no filesystem engineer 
but I can't wrap my head around how this could be handled any better than it 
already is. I've read that resilvering is throttled - presumably to keep 
performance degradation to a minimum during the process - maybe this could be a 
tunable (e.g. priority: low, normal, high)?

Do we know if resilvers on a mirror are actually handled differently from those 
on a raidz?

Sorry if this has already been explained. I think this is an issue that 
everyone who uses ZFS should understand completely before jumping in, because 
the behavior (while not 'wrong') is clearly NOT the same as with more 
conventional arrays.

-Will
the problem is NOT the checksum/error correction overhead. that's 
relatively trivial.  The problem isn't really even variable width (i.e. 
variable number of disks one crosses) slabs.


The problem boils down to this:

When ZFS does a resilver, it walks the METADATA tree to determine what 
order to rebuild things from. That means, it resilvers the very first 
slab ever written, then the next oldest, etc.   The problem here is that 
slab age has nothing to do with where that data physically resides on 
the actual disks. If you've used the zpool as a WORM device, then, sure, 
there should be a strict correlation between increasing slab age and 
locality on the disk.  However, in any reasonable case, files get 
deleted regularly. This means there is a high probability that a slab B, 
written immediately after slab A, WON'T be physically near slab A.


In the end, the problem is that using metadata order, while reducing the 
total amount of work to do in the resilver (as you only resilver live 
data, not every bit on the drive), increases the physical inefficiency 
for each slab.  That is, seek time between cylinders begins to dominate 
your slab reconstruction time.  In RAIDZ, this problem is magnified by 
both the much larger average vdev size vs mirrors, and the necessity 
that all drives containing a slab information return that data before 
the corrected data can be written to the resilvering drive.


Thus, current ZFS resilvering tends to be seek-time limited, NOT 
throughput limited.  This is really the fault of the underlying media, 
not ZFS.  For instance, if you have a raidZ of SSDs (where seek time is 
negligible, but throughput isn't),  they resilver really, really fast. 
In fact, they resilver at the maximum write throughput rate.   However, 
HDs are severely seek-limited, so that dominates HD resilver time.
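
Putting rough (assumed, not measured) numbers on "seek-limited vs. throughput-limited":

  # Per-block resilver cost model: each live block costs one seek plus
  # one transfer.  For HDDs the seek term dominates; for SSDs it vanishes.
  def resilver_hours(n_blocks, blk=128 * 1024, seek_s=0.008, xfer_bps=100e6):
      return n_blocks * (seek_s + blk / xfer_bps) / 3600.0

  live_blocks = 5 * 10**6                     # ~650 GB of 128K blocks
  print("HDD-ish: %.1f h" % resilver_hours(live_blocks))              # seek-bound
  print("SSD-ish: %.1f h" % resilver_hours(live_blocks, seek_s=0.0))  # bandwidth-bound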

Re: [zfs-discuss] A few questions

2010-12-20 Thread Erik Trimble

On 12/20/2010 9:20 AM, Saxon, Will wrote:

-Original Message-
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
Sent: Monday, December 20, 2010 11:46 AM
To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] A few questions


From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Lanky Doodle


I believe Oracle is aware of the problem, but most of
the core ZFS team has left. And of course, a fix for
Oracle Solaris no longer means a fix for the rest of
us.

OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I

want

to committ to a file system that is 'broken' and may not be fully fixed,

if at all.

ZFS is not broken.  It is, however, a weak spot, that resilver is very
inefficient.  For example:

On my server, which is made up of 10krpm SATA drives, 1TB each...  My
drives
can each sustain 1Gbit/sec sequential read/write.  This means, if I needed
to resilver the entire drive (in a mirror) sequentially, it would take ...
8,000 sec = 133 minutes.  About 2 hours.  In reality, I have ZFS mirrors,
and disks are around 70% full, and resilver takes 12-14 hours.

So although resilver is broken by some standards, it is bounded, and you
can limit it to something which is survivable, by using mirrors instead of
raidz.  For most people, even using 5-disk, or 7-disk raidzN will still be
fine.  But you start getting unsustainable if you get up to 21-disk radiz3
for example.

This argument keeps coming up on the list, but I don't see where anyone has 
made a good suggestion about whether this can even be 'fixed' or how it would 
be done.

As I understand it, you have two basic types of array reconstruction: in a 
mirror you can make a block-by-block copy and that's easy, but in a parity 
array you have to perform a calculation on the existing data and/or existing 
parity to reconstruct the missing piece. This is pretty easy when you can 
guarantee that all your stripes are the same width, start/end on the same 
sectors/boundaries/whatever and thus know a piece of them lives on all drives 
in the set. I don't think this is possible with ZFS since we have variable 
stripe width. A failed disk d may or may not contain data from stripe s (or 
transaction t). This information has to be discovered by looking at the 
transaction records. Right?

Can someone speculate as to how you could rebuild a variable stripe width array 
without replaying all the available transactions? I am no filesystem engineer 
but I can't wrap my head around how this could be handled any better than it 
already is. I've read that resilvering is throttled - presumably to keep 
performance degradation to a minimum during the process - maybe this could be a 
tunable (e.g. priority: low, normal, high)?

Do we know if resilvers on a mirror are actually handled differently from those 
on a raidz?

Sorry if this has already been explained. I think this is an issue that 
everyone who uses ZFS should understand completely before jumping in, because 
the behavior (while not 'wrong') is clearly NOT the same as with more 
conventional arrays.

-Will


As far as a possible fix, here's what I can see:

[Note:  I'm not a kernel or FS-level developer. I would love to be able 
to fix this myself, but I have neither the aptitude nor the [extensive] 
time to learn such skill]


We can either (a) change how ZFS does resilvering or (b) repack the 
zpool layouts to avoid the problem in the first place.


In case (a), my vote would be to seriously increase the number of 
in-flight resilver slabs, AND allow for out-of-time-order slab 
resilvering.  By that, I mean that ZFS would read several 
disk-sequential slabs, and then mark them as done. This would mean a 
*lot* of scanning the metadata tree (since leaves all over the place 
could be done).   Frankly, I can't say how bad that would be; the 
problem is that for ANY resilver, ZFS would have to scan the entire 
metadata tree to see if it had work to do, rather than simply look for 
the latest completed leaf, then assume everything after that needs to 
be done.  There'd also be the matter of determining *if* one should read 
a disk sector...


In case (b), we need the ability to move slabs around on the physical 
disk (via the mythical Block Pointer Re-write method).  If there is 
that underlying mechanism, then a defrag utility can be run to repack 
the zpool to the point where chronological creation time = physical 
layout.  Which then substantially mitigates the seek time problem.



I can't fix (a) - I don't understand the codebase well enough. Neither 
can I do the BP-rewrite implementation.  However, if I can get 
BP-rewrite, I've got a prototype defragger that seems to work well 
(under simulation). I'm sure it could use some performance improvement, 
but it works reasonably well on a simulated fragmented pool.
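
To make option (a) slightly more concrete, the bookkeeping change is roughly "replace the single resume point with a record of which regions are already done" (a sketch only, nothing like the real resilver code):

  # Sketch of option (a): track completed work as a set of finished
  # disk regions instead of a single "last completed" watermark, so
  # slabs can be resilvered in disk order rather than birth order.
  class OutOfOrderProgress:
      def __init__(self):
          self.done = set()              # finished region ids

      def mark_done(self, region_id):
          self.done.add(region_id)

      def needs_work(self, region_id):
          return region_id not in self.done

  # A restarted resilver would re-walk the metadata tree and skip any
  # block whose region is already marked done -- the extra cost flagged
  # above is exactly this full metadata re-walk on every restart.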



Please, Santa, can a good little boy get

Re: [zfs-discuss] A few questions

2010-12-20 Thread Mark Sandrock
Erik,

just a hypothetical what-if ...

In the case of resilvering on a mirrored disk, why not take a snapshot, and then
resilver by doing a pure block copy from the snapshot? It would be sequential,
so long as the original data was unmodified; and random access in dealing with
the modified blocks only, right.

After the original snapshot had been replicated, a second pass would be done,
in order to update the clone to 100% live data.

Not knowing enough about the inner workings of ZFS snapshots, I don't know why
this would not be doable. (I'm biased towards mirrors for busy filesystems.)

I'm supposing that a block-level snapshot is not doable -- or is it?

Mark

On Dec 20, 2010, at 1:27 PM, Erik Trimble wrote:

 On 12/20/2010 9:20 AM, Saxon, Will wrote:
 -Original Message-
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 Sent: Monday, December 20, 2010 11:46 AM
 To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] A few questions
 
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Lanky Doodle
 
 I believe Oracle is aware of the problem, but most of
 the core ZFS team has left. And of course, a fix for
 Oracle Solaris no longer means a fix for the rest of
 us.
 OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I
 want
 to committ to a file system that is 'broken' and may not be fully fixed,
 if at all.
 
 ZFS is not broken.  It is, however, a weak spot, that resilver is very
 inefficient.  For example:
 
 On my server, which is made up of 10krpm SATA drives, 1TB each...  My
 drives
 can each sustain 1Gbit/sec sequential read/write.  This means, if I needed
 to resilver the entire drive (in a mirror) sequentially, it would take ...
 8,000 sec = 133 minutes.  About 2 hours.  In reality, I have ZFS mirrors,
 and disks are around 70% full, and resilver takes 12-14 hours.
 
 So although resilver is broken by some standards, it is bounded, and you
 can limit it to something which is survivable, by using mirrors instead of
 raidz.  For most people, even using 5-disk, or 7-disk raidzN will still be
 fine.  But you start getting unsustainable if you get up to 21-disk radiz3
 for example.
 This argument keeps coming up on the list, but I don't see where anyone has 
 made a good suggestion about whether this can even be 'fixed' or how it 
 would be done.
 
 As I understand it, you have two basic types of array reconstruction: in a 
 mirror you can make a block-by-block copy and that's easy, but in a parity 
 array you have to perform a calculation on the existing data and/or existing 
 parity to reconstruct the missing piece. This is pretty easy when you can 
 guarantee that all your stripes are the same width, start/end on the same 
 sectors/boundaries/whatever and thus know a piece of them lives on all 
 drives in the set. I don't think this is possible with ZFS since we have 
 variable stripe width. A failed disk d may or may not contain data from 
 stripe s (or transaction t). This information has to be discovered by 
 looking at the transaction records. Right?
 
 Can someone speculate as to how you could rebuild a variable stripe width 
 array without replaying all the available transactions? I am no filesystem 
 engineer but I can't wrap my head around how this could be handled any 
 better than it already is. I've read that resilvering is throttled - 
 presumably to keep performance degradation to a minimum during the process - 
 maybe this could be a tunable (e.g. priority: low, normal, high)?
 
 Do we know if resilvers on a mirror are actually handled differently from 
 those on a raidz?
 
 Sorry if this has already been explained. I think this is an issue that 
 everyone who uses ZFS should understand completely before jumping in, 
 because the behavior (while not 'wrong') is clearly NOT the same as with 
 more conventional arrays.
 
 -Will
 the problem is NOT the checksum/error correction overhead. that's 
 relatively trivial.  The problem isn't really even variable width (i.e. 
 variable number of disks one crosses) slabs.
 
 The problem boils down to this:
 
 When ZFS does a resilver, it walks the METADATA tree to determine what order 
 to rebuild things from. That means, it resilvers the very first slab ever 
 written, then the next oldest, etc.   The problem here is that slab age has 
 nothing to do with where that data physically resides on the actual disks. If 
 you've used the zpool as a WORM device, then, sure, there should be a strict 
 correlation between increasing slab age and locality on the disk.  However, 
 in any reasonable case, files get deleted regularly. This means that the 
 probability that for a slab B, written immediately after slab A, it WON'T be 
 physically near slab A.
 
 In the end, the problem is that using metadata order, while reducing the 
 total amount of work to do in the resilver (as you only resilver live 
 data, not every bit on the drive), increases the physical inefficiency 
 for each slab.

Re: [zfs-discuss] A few questions

2010-12-20 Thread Erik Trimble

On 12/20/2010 11:56 AM, Mark Sandrock wrote:

Erik,

just a hypothetical what-if ...

In the case of resilvering on a mirrored disk, why not take a snapshot, and then
resilver by doing a pure block copy from the snapshot? It would be sequential,
so long as the original data was unmodified; and random access in dealing with
the modified blocks only, right.

After the original snapshot had been replicated, a second pass would be done,
in order to update the clone to 100% live data.

Not knowing enough about the inner workings of ZFS snapshots, I don't know why
this would not be doable. (I'm biased towards mirrors for busy filesystems.)

I'm supposing that a block-level snapshot is not doable -- or is it?

Mark
Snapshots on ZFS are true snapshots - they take a picture of the current 
state of the system. They DON'T copy any data around when created. So, a 
ZFS snapshot would be just as fragmented as the ZFS filesystem was at 
the time.



The problem is this:

Let's say I write block A, B, C, and D on a clean zpool (what kind, it 
doesn't matter).  I now delete block C.  Later on, I write block E.   
There is a probability (increasing dramatically as times goes on), that 
the on-disk layout will now look like:


A, B, E, D

rather than

A, B, [space], D, E


So, in the first case, I can do a sequential read to get A & B, but then 
must do a seek to get D, and a seek to get E.


The fragmentation problem is mainly due to file deletion, NOT to file 
re-writing.  (though, in ZFS, being a C-O-W filesystem, re-writing 
generally looks like a delete-then-write process, rather than a modify 
process).
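
The same A/B/C/D example as a tiny first-fit toy, just to show where the freed slot gets reused (illustrative only):

  # Tiny first-fit allocator: deleting C frees its slot, and the next
  # write (E) lands in that hole, producing the A, B, E, D layout.
  disk = ["A", "B", "C", "D"]
  disk[disk.index("C")] = None          # delete C -> A, B, _, D
  disk[disk.index(None)] = "E"          # write E first-fit -> A, B, E, D
  print(disk)                           # ['A', 'B', 'E', 'D']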



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Bakul Shah
On Mon, 20 Dec 2010 11:27:41 PST Erik Trimble erik.trim...@oracle.com  wrote:
 
 The problem boils down to this:
 
 When ZFS does a resilver, it walks the METADATA tree to determine what 
 order to rebuild things from. That means, it resilvers the very first 
 slab ever written, then the next oldest, etc.   The problem here is that 
 slab age has nothing to do with where that data physically resides on 
 the actual disks. If you've used the zpool as a WORM device, then, sure, 
 there should be a strict correlation between increasing slab age and 
 locality on the disk.  However, in any reasonable case, files get 
 deleted regularly. This means that the probability that for a slab B, 
 written immediately after slab A, it WON'T be physically near slab A.
 
 In the end, the problem is that using metadata order, while reducing the 
 total amount of work to do in the resilver (as you only resilver live 
 data, not every bit on the drive), increases the physical inefficiency 
 for each slab.  That is, seek time between cyclinders begins to dominate 
 your slab reconstruction time.  In RAIDZ, this problem is magnified by 
 both the much larger average vdev size vs mirrors, and the necessity 
 that all drives containing a slab information return that data before 
 the corrected data can be written to the resilvering drive.
 
 Thus, current ZFS resilvering tends to be seek-time limited, NOT 
 throughput limited.  This is really the fault of the underlying media, 
 not ZFS.  For instance, if you have a raidZ of SSDs (where seek time is 
 negligible, but throughput isn't),  they resilver really, really fast. 
 In fact, they resilver at the maximum write throughput rate.   However, 
 HDs are severely seek-limited, so that dominates HD resilver time.

You guys may be interested in a solution I used in a totally
different situation.  There an identical tree data structure
had to be maintained on every node of a distributed system.
When a new node was added, it needed to be initialized with
an identical copy before it could be put in operation. But
this had to be done while the rest of the system was
operational and there may even be updates from a central node
during the `mirroring' operation. Some of these updates could
completely change the tree!  Starting at the root was not
going to work since a subtree that was being copied may stop
existing in the middle and its space reused! In a way this is
a similar problem (but worse!). I needed something foolproof
and simple.

My algorithm started copying sequentially from the start.  If
N blocks were already copied when an update comes along,
updates of any block with block# > N are ignored (since the
sequential copy would get to them eventually).  Updates of
any block# <= N were queued up (further update of the same
block would overwrite the old update, to reduce work).
Periodically they would be flushed out to the new node. This
was paced so at to not affect the normal operation much.

I should think a variation would work for active filesystems.
You sequentially read some amount of data from all the disks
from which data for the new disk to be prepared and write it
out sequentially. Each time read enough data so that reading
time dominates any seek time. Handle concurrent updates as
above. If you dedicate N% of time to resilvering, the total
time to complete resilver will be 100/N times sequential read
time of the whole disk. (For example, 1TB disk, 100MBps io
speed, 25% for resilver = under 12 hours).  How much worse
this gets depends on the amount of updates during
resilvering.
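
A minimal sketch of that scheme (the watermark counts blocks already copied; updates to already-copied blocks are coalesced and flushed in batches, updates to not-yet-copied blocks are left for the sequential copy to pick up; incoming_updates is a hypothetical callback returning updates received since the last call):

  # Sketch of the watermark copy.  Blocks [0, n) are already on dst, so
  # updates there must be re-applied; blocks >= n will still be visited
  # by the sequential copy, so updates there can be ignored.
  def mirror_copy(src, dst, incoming_updates, flush_every=1024):
      pending = {}                              # blk -> latest data (blk < n)
      n = 0                                     # watermark
      while n < len(src):
          dst[n] = src[n]                       # sequential bulk copy
          n += 1
          for blk, data in incoming_updates():  # updates since last check
              if blk < n:
                  pending[blk] = data           # newest wins, apply later
              # blk >= n: ignore, the copy loop will reach it anyway
          if n % flush_every == 0 and pending:
              for blk, data in pending.items(): # paced catch-up writes
                  dst[blk] = data
              pending.clear()
      for blk, data in pending.items():         # final flush
          dst[blk] = data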

At the time of resilvering your FS is more likely to be near
full than near empty so I wouldn't worry about optimizing the
mostly empty FS case.

Bakul
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Mark Sandrock

On Dec 20, 2010, at 2:05 PM, Erik Trimble wrote:

 On 12/20/2010 11:56 AM, Mark Sandrock wrote:
 Erik,
 
  just a hypothetical what-if ...
 
 In the case of resilvering on a mirrored disk, why not take a snapshot, and 
 then
 resilver by doing a pure block copy from the snapshot? It would be 
 sequential,
 so long as the original data was unmodified; and random access in dealing 
 with
 the modified blocks only, right.
 
 After the original snapshot had been replicated, a second pass would be done,
 in order to update the clone to 100% live data.
 
 Not knowing enough about the inner workings of ZFS snapshots, I don't know 
 why
 this would not be doable. (I'm biased towards mirrors for busy filesystems.)
 
 I'm supposing that a block-level snapshot is not doable -- or is it?
 
 Mark
 Snapshots on ZFS are true snapshots - they take a picture of the current 
 state of the system. They DON'T copy any data around when created. So, a ZFS 
 snapshot would be just as fragmented as the ZFS filesystem was at the time.

But if one does a raw (block) copy, there isn't any fragmentation -- except for 
the COW updates.

If there were no updates to the snapshot, then it becomes a 100% sequential 
block copy operation.

But even with COW updates, presumably the large majority of the copy would 
still be sequential i/o.

Maybe for the 2nd pass, the filesystem would have to be locked, so that the 
operation could actually complete,
but if this is fairly short in relation to the overall resilvering time, then 
it could still be a win in many cases.

I'm probably not explaining it well, and may be way off, but it seemed an 
interesting notion.

Mark

 
 
 The problem is this:
 
 Let's say I write block A, B, C, and D on a clean zpool (what kind, it 
 doesn't matter).  I now delete block C.  Later on, I write block E.   There 
 is a probability (increasing dramatically as times goes on), that the on-disk 
 layout will now look like:
 
 A, B, E, D
 
 rather than
 
 A, B, [space], D, E
 
 
 So, in the first case, I can do a sequential read to get A  B, but then must 
 do a seek to get D, and a seek to get E.
 
 The fragmentation problem is mainly due to file deletion, NOT to file 
 re-writing.  (though, in ZFS, being a C-O-W filesystem, re-writing generally 
 looks like a delete-then-write process, rather than a modify process).
 
 
 -- 
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA
 Timezone: US/Pacific (GMT-0800)
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Edward Ned Harvey
 From: Erik Trimble [mailto:erik.trim...@oracle.com]
 
 We can either (a) change how ZFS does resilvering or (b) repack the
 zpool layouts to avoid the problem in the first place.
 
 In case (a), my vote would be to seriously increase the number of
 in-flight resilver slabs, AND allow for out-of-time-order slab
 resilvering.  

Question for any clueful person:

Suppose you have a mirror to resilver, made of disk1 and disk2, where disk2
failed and is resilvering.  If you have an algorithm to create a list of all
the used blocks of disk1 in disk order, then you're able to resilver the
mirror extremely fast, because all the reads will be sequential in nature,
plus you get to skip past all the unused space.

Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3
is resilvering).  You find some way of ordering all the used blocks of
disk1...  Which means disk1 will be able to read in optimal order and speed.
Does that necessarily imply disk2 will also work well?  Does the on-disk
order of blocks of disk1 necessarily match the order of blocks on disk2?

If there is no correlation between on-disk order of blocks for different
disks within the same vdev, then all hope is lost; it's essentially
impossible to optimize the resilver/scrub order unless the on-disk order of
multiple disks is highly correlated or equal by definition.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Erik Trimble
 
  In the case of resilvering on a mirrored disk, why not take a snapshot,
and
 then
  resilver by doing a pure block copy from the snapshot? It would be
 sequential,

 So, a
 ZFS snapshot would be just as fragmented as the ZFS filesystem was at
 the time.

I think Mark was suggesting something like dd copy device 1 onto device 2,
in order to guarantee a first-pass sequential resilver.  And my response
would be:  Creative thinking and suggestions are always a good thing.  In
fact, the above suggestion is already faster than the present-day solution
for what I'm calling typical usage, but there are an awful lot of use
cases where the dd solution would be worse... Such as a pool which is
largely sequential already, or largely empty, or made of high IOPS devices
such as SSD.  However, there is a desire to avoid resilvering unused blocks,
so I hope a better solution is possible... 

The fundamental requirement for a better optimized solution would be a way
to resilver according to disk ordering...  And it's just a question for
somebody that actually knows the answer ... How terrible is the idea of
figuring out the on-disk order?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Eric D. Mudama

On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote:

If there is no correlation between on-disk order of blocks for different
disks within the same vdev, then all hope is lost; it's essentially
impossible to optimize the resilver/scrub order unless the on-disk order of
multiple disks is highly correlated or equal by definition.


Very little is impossible.

Drives have been optimally ordering seeks for 35+ years.  I'm guessing
that the trick (difficult, but not impossible) is how to solve a
travelling salesman route pathing problem where you have billions or
trillions of transactions, and do it fast enough that it was worth
doing any extra computation besides just giving the device 32+ queued
commands at a time that align with the elements of each ordered
transaction ID.

Add to that all the complexity of unwinding the error recovery in the
event that you fail checksum validation on transaction N-1 after
moving past transaction N, which would be a required capability if you
wanted to queue more than a single transaction for verification at a
time.

Oh, and do all of the above without noticably affecting the throughput
of the applications already running on the system.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Mark Sandrock
It well may be that different methods are optimal for different use cases.

Mechanical disk vs. SSD; mirrored vs. raidz[123]; sparse vs. populated; etc.

It would be interesting to read more in this area, if papers are available.

I'll have to take a look. ... Or does someone have pointers?

Mark


On Dec 20, 2010, at 6:28 PM, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Erik Trimble
 
 In the case of resilvering on a mirrored disk, why not take a snapshot,
 and
 then
 resilver by doing a pure block copy from the snapshot? It would be
 sequential,
 
 So, a
 ZFS snapshot would be just as fragmented as the ZFS filesystem was at
 the time.
 
 I think Mark was suggesting something like a dd copy of device 1 onto device 2,
 in order to guarantee a first-pass sequential resilver.  And my response
 would be:  Creative thinking and suggestions are always a good thing.  In
 fact, the above suggestion is already faster than the present-day solution
 for what I'm calling typical usage, but there are an awful lot of use
 cases where the dd solution would be worse, such as a pool which is
 largely sequential already, or largely empty, or made of high-IOPS devices
 such as SSDs.  However, there is a desire to avoid resilvering unused blocks,
 so I hope a better solution is possible...
 
 The fundamental requirement for a better-optimized solution would be a way
 to resilver according to disk ordering...  And there's a question for
 somebody who actually knows the answer:  how terrible is the idea of
 figuring out the on-disk order?
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-20 Thread Richard Elling
On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com wrote:

 On 20/12/2010 13:59, Richard Elling wrote:
 
 On Dec 20, 2010, at 2:42 AM, Phil Harman phil.har...@gmail.com wrote:
 
 
 Why does resilvering take so long in raidz anyway?
 Because it's broken. There were some changes a while back that made it more 
 broken.
 
 broken is the wrong term here. It functions as designed and correctly 
 resilvers devices. Disagreeing with the design is quite different than
 proving a defect.
 
 It might be the wrong term in general, but I think it does apply in the 
 budget home media server context of this thread.

If you only have a few slow drives, you don't have performance.
Like trying to win the Indianapolis 500 with a tricycle...

 I think we can agree that ZFS currently doesn't play well on cheap disks. I 
 think we can also agree that the performance of ZFS resilvering is known to 
 be suboptimal under certain conditions.

... and those conditions are also a strength. For example, most file
systems are nowhere near full. With ZFS you only resilver data. For those
who recall the resilver throttles in SVM or VXVM, you will appreciate not
having to resilver non-data.

 For a long time at Sun, the rule was correctness is a constraint, 
 performance is a goal. However, in the real world, performance is often also 
 a constraint (just as a quick but erroneous answer is a wrong answer, a 
 correct but too-slow answer can also be wrong).
 
 Then one brave soul at Sun once ventured that if Linux is faster, it's a 
 Solaris bug! and to his surprise, the idea caught on. I later went on to 
 tell people that ZFS delivered RAID where I = inexpensive, so I'm just a 
 little frustrated when that promise becomes less respected over time. 
 First it was USB drives (which I agreed with), now it's SATA (and I'm not so 
 sure).

slow doesn't begin with an i :-)

 
 
 There has been a lot of discussion, anecdotes and some data on this list. 
 
 Slow because I use devices with poor random write(!) performance
 is very different than broken.
 
 Again, context is everything. For example, if someone was building a 
 business-critical NAS appliance from consumer-grade parts, I'd be the first 
 to say are you nuts?!

Unfortunately, the math does not support your position...

 
 
 The resilver doesn't do a single pass of the drives, but uses a smarter 
 temporal algorithm based on metadata.
 
 A design that only does a single pass does not handle the temporal
 changes. Many RAID implementations use a mix of spatial and temporal
 resilvering and suffer with that design decision.
 
 Actually, it's easy to see how a combined spatial and temporal approach could 
 be implemented to an advantage for mirrored vdevs.
 
 
 However, the current implementation has difficulty finishing the job if 
 there's a steady flow of updates to the pool.
 
 Please define current. There are many releases of ZFS, and
 many improvements have been made over time. What has not
 improved is the random write performance of consumer-grade
 HDDs.
 
 I was led to believe this was not yet fixed in Solaris 11, and that there are 
 therefore doubts about what Solaris 10 update may see the fix, if any.
 
 
 As far as I'm aware, the only way to get bounded resilver times is to stop 
 the workload until resilvering is completed.
 
 I know of no RAID implementation that bounds resilver times
 for HDDs. I believe it is not possible. OTOH, whether a resilver
 takes 10 seconds or 10 hours makes little difference in data
 availability. Indeed, this is why we often throttle resilvering
 activity. See previous discussions on this forum regarding the
 dueling RFEs.
 
 I don't share your disbelief or your little difference analysis. If it is true 
 that no current implementation succeeds, isn't that a great opportunity to 
 change the rules? Wasn't resilver time vs. availability a major factor in 
 Adam Leventhal's paper introducing the need for RAIDZ3?

No, it wasn't. There are two failure modes we can model given the data
provided by disk vendors:
1. failures by time (MTBF)
2. failures by bits read (UER)

Over time, the MTBF has improved, but the failure rate by bits read has not.
Just a few years ago enterprise-class HDDs had an MTBF
of around 1 million hours. Today, they are in the range of 1.6 million
hours. Just looking at the size of the numbers, the probability that a
drive will fail in any given hour is on the order of 10^-6.

By contrast, the failure rate by bits read has not improved much.
Consumer-class HDDs are usually spec'ed at 1 unrecoverable error per 1e14
bits read.  To put this in perspective, a 2TB disk holds around 1.6e13
bits, so the probability of an unrecoverable read if you read every bit
on a 2TB drive is roughly 1.6e13 / 1e14, i.e. about 16%, well above 10%.
Some of the better enterprise-class HDDs are rated two orders of magnitude
better, but the only way to get much better is to use more bits for ECC...
hence the move towards 4KB sectors.

In other words, the probability of losing data ...

Re: [zfs-discuss] A few questions

2010-12-20 Thread Richard Elling
On Dec 20, 2010, at 4:19 PM, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: Erik Trimble [mailto:erik.trim...@oracle.com]
 
 We can either (a) change how ZFS does resilvering or (b) repack the
 zpool layouts to avoid the problem in the first place.
 
 In case (a), my vote would be to seriously increase the number of
 in-flight resilver slabs, AND allow for out-of-time-order slab
 resilvering.  
 
 Question for any clueful person:
 
 Suppose you have a mirror to resilver, made of disk1 and disk2, where disk2
 failed and is resilvering.  If you have an algorithm to create a list of all
 the used blocks of disk1 in disk order, then you're able to resilver the
 mirror extremely fast, because all the reads will be sequential in nature,
 plus you get to skip past all the unused space.

Sounds like the definition of random access :-) 

 
 Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3
 is resilvering).  You find some way of ordering all the used blocks of
 disk1...  Which means disk1 will be able to read in optimal order and speed.

Sounds like prefetching :-)

 Does that necessarily imply disk2 will also work well?  Does the on-disk
 order of blocks of disk1 necessarily match the order of blocks on disk2?

This is an interesting question, that will become more interesting
as the physical sector size gets bigger...
 -- richard

 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-18 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Alexander Lesle
 
 at Dezember, 17 2010, 17:48 Lanky Doodle wrote in [1]:
 
  By single drive mirrors, I assume, in a 14 disk setup, you mean 7
  sets of 2 disk mirrors - I am thinking of traditional RAID1 here.
 
  Or do you mean 1 massive mirror with all 14 disks?
 
 Edward means a set of two-way-mirrors.

Correct.
mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 ...
You would normally call this a stripe of mirrors.  The ZFS concept of
striping is more advanced than traditional RAID striping, but we still call
this a ZFS stripe for lack of any other term.  A ZFS stripe has all the good
characteristics of traditional RAID striping and concatenation, without any
of the bad ones:  it can utilize the bandwidth of multiple disks when it
wants to, or use a single device for small blocks, and it can dynamically
add differently sized mirrors, one at a time.


 At Sol11 Express, Oracle announced that at TestInstall you can set the
 RootPool to mirror during installation. At the moment I am trying it out
 in a VM but I didn't find this option. :-(

Actually, even in Solaris 10, I habitually install the root filesystem onto
a ZFS mirror.  You just select 2 disks during install, and it's automatically a mirror.
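
And if you didn't do it at install time, you can attach a second disk to the
root pool afterwards.  Roughly like this, from memory and untested, with made-up
disk/slice names (x86; SPARC uses installboot instead of installgrub):

 # zpool attach rpool c0t0d0s0 c0t1d0s0
 # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0

Let the resilver finish (watch zpool status rpool) before you trust the second disk.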


 zpool create lankyserver mirror vdev1 vdev2 mirror vdev3 vdev4
 
 When you need more space you can add another pair of disks to your
 lankyserver. Each pair should have matching capacity.
 
 zpool add lankyserver mirror vdev5 vdev6 mirror vdev7 vdev8  ...

Correct.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-18 Thread Edward Ned Harvey
 From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
 Sent: Friday, December 17, 2010 9:16 PM
 
 While I agree that smaller vdevs are more reliable, I find your statement
 that a failure is more likely to be in the same vdev if you have
 only 2 vdevs to be rather useless.  The probability of
 vdev failure does not have anything to do with the number of vdevs.
 However, the probability of vdev failure increases tremendously if
 there is only one vdev and there is a second disk failure.

I'm not sure you got what I meant.  I'll rephrase and see if it's more
clear:

Correct, the number of vdevs doesn't affect the probability of a failure in
a specific vdev, but the number of disks in a vdev does.  Lanky said he was
considering 2x 7-disk raidz versus 3x 5-disk raidz.  So when I said he's more
likely to have a 2nd disk fail in the same vdev if he only has 2 vdevs ...
that was meant to be taken in context, not as a generalization about all
pools.

Consider a single disk.  Let P be the probability of that disk failing
within one day.

If you have 5 disks in a raidz vdev, and one fails, there are 4 remaining.
If the resilver lasts 8 days, then the probability of a 2nd disk failing is
roughly 4*8*P = 32P.

If you have 7 disks in a raidz vdev, and one fails, there are 6 remaining.
If the resilver lasts 12 days, then the probability of a 2nd disk failing
is roughly 6*12*P = 72P.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-18 Thread Lanky Doodle
On the subject of where to install ZFS, I was planning to use either Compact 
Flash or USB drive (both of which would be mounted internally); using up 2 of 
the drive bays for a mirrored install is possibly a waste of physical space, 
considering it's a) a home media server and b) the config can be backed up to a 
protected ZFS pool - if the CF or USB drive failed I would just replace and 
restore the config.

Can you have an equivalent of a global hot spare in ZFS. If I did go down the 
mirror route (mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 etc) all 
the way up to 14 disks that would leave the 15th disk spare.

Now this is getting really complex, but can you have server failover in ZFS, 
much like DFS-R in Windows - you point clients to a clustered ZFS namespace so 
if a complete server failed nothing is interrupted.

I am still undecided as to mirror vs RAID Z. I am going to be ripping 
uncompressed Blu-Rays so space is vital. I use RAID DP in NetApp kit at work 
and I'm guessing RAID Z2 is the equivalent? I have 5TB space at the moment so 
going to the expense of mirroring for only 2TB extra doesn't seem much of a pay 
off.

Maybe a compromise of 2x 7-disk RAID Z1 with global hotspare is the way to go?

Put it this way, I currently use Windows Home Server, which has no true disk 
failure protection, so any of ZFS's redundancy schemes is going to be a step 
up; is there an equivalent system in ZFS where if 1 disk fails you only lose 
that disk's data, like unRAID?

Thanks everyone for your input so far :)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-18 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Lanky Doodle
 
 On the subject of where to install ZFS, I was planning to use either Compact
 Flash or USB drive (both of which would be mounted internally); using up 2 of
 the drive bays for a mirrored install is possibly a waste of physical space,
 considering it's a) a home media server and b) the config can be backed up to
 a protected ZFS pool - if the CF or USB drive failed I would just replace and
 restore the config.

All of the above is correct.  One thing you should keep in mind, however:  if
your unmirrored rpool (USB fob) fails, then although you can restore it
(assuming you have been backing it up sufficiently), you will suffer an
ungraceful halt.  Maybe you can live with that.


 Can you have an equivalent of a global hot spare in ZFS. If I did go down the
 mirror route (mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 etc)
 all the way up to 14 disks that would leave the 15th disk spare.

Check the zpool man page for spare, but I know you can have spares
assigned to a pool, and I'm pretty sure you can assign the same spare to
multiple pools, effectively making it a global hot spare.  So yes is the answer.
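
Something like this should do it (disk name is a placeholder; the spare sits at
the pool level and covers every vdev in that pool):

 # zpool add mypool spare c0t14d0
 # zpool status mypool        # the spare shows up in its own "spares" section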


 Now this is getting really complex, but can you have server failover in ZFS,
 much like DFS-R in Windows - you point clients to a clustered ZFS namespace
 so if a complete server failed nothing is interrupted.

If that's somehow possible, it's something I don't know.  I don't believe
you can do that with ZFS.


 I am still undecided as to mirror vs RAID Z. I am going to be ripping
 uncompressed Blu-Rays so space is vital. 

For both read and write, raidz works extremely well for sequential
operations.  It sounds like you're probably going to be doing mostly
sequential operations, so raidz should perform very well for you.  A lot of
people will avoid raidzN because it doesn't perform very well for random
reads, so they opt for mirrors instead.  But in your case, not so much.

In your case, the only reason I can think of to avoid raidz would be if you're
worried about resilver times.  That's a valid concern, but you can choose
any vdev width you want ... You could make raidz vdevs of 3 disks each,
which is just a compromise between the mirror and the larger raidz vdev.
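
For instance, 15 drives could be laid out as five 3-disk raidz1 vdevs in a
single pool, something like this (disk names invented):

 # zpool create tank raidz disk1 disk2 disk3 raidz disk4 disk5 disk6 \
     raidz disk7 disk8 disk9 raidz disk10 disk11 disk12 raidz disk13 disk14 disk15

That gives you 10 disks worth of usable space, one-disk redundancy per vdev, and
shorter resilvers than a couple of wide raidz vdevs.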


 I use RAID DP in NetApp kit at work
 and I'm guessing RAID Z2 is the equivalent? 

Yup, raid-dp and raidz2 are conceptually pretty much the same.


 Put it this way, I currently use Windows Home Server, which has no true disk
 failure protection, so any of ZFS's redundancy schemes is going to be a step
 up; is there an equivalent system in ZFS where if 1 disk fails you only lose
 that disk's data, like unRAID?

No.  Not unless you make that many separate volumes.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-17 Thread Lanky Doodle
Thanks for all the replies.

The bit about combining zpools came from this command on the southbrain 
tutorial;

zpool create mail \
 mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \
 mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0

I admit I was getting confused between zpools and vdevs, thinking in the above 
command that each mirror was a zpool and not a vdev.

Just so I'm correct, a normal command would look like

zpool create mypool raidz disk1 disk2 disk3 disk4 disk5

which would result in a zpool called mypool, which is made up of a single 
5-disk raidz vdev? This means that zpools don't directly 'contain' physical 
devices, as I originally thought they did.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-17 Thread Erik Trimble

On 12/17/2010 2:12 AM, Lanky Doodle wrote:

Thanks for all the replies.

The bit about combining zpools came from this command on the southbrain 
tutorial;

zpool create mail \
  mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \
  mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0

I admit I was getting confused between zpools and vdevs, thinking in the above 
command that each mirror was a zpool and not a vdev.

Just so I'm correct, a normal command would look like

zpool create mypool raidz disk1 disk2 disk3 disk4 disk5

which would result in a zpool called mypool, which is made up of a single 
5-disk raidz vdev? This means that zpools don't directly 'contain' physical 
devices, as I originally thought they did.

You are correct that the above will have a single vdev of 5 disks.

Here's a shorthand note:

A zpool is made of 1 or more vdevs.

Each vdev can be a raidz, mirror, or single device (either a file or 
disk).  So, you *can* have a zpool which has solely physical drives:


e.g.

zpool create tank disk1 disk2 disk3

will create a pool with 3 disks, with data being striped across the 
devices as desired.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-17 Thread Lanky Doodle
OK cool.

One last question. Reading the Admin Guid for ZFS, it says:

[i]A more complex conceptual RAID-Z configuration would look similar to the 
following:

raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0 c7t0d0 raidz c8t0d0 c9t0d0 c10t0d0 c11t0d0 c12t0d0 c13t0d0 c14t0d0

If you are creating a RAID-Z configuration with many disks, as in this example, 
a RAID-Z configuration with 14 disks is better split into two 7-disk 
groupings. RAID-Z configurations with single-digit groupings of disks should 
perform better.[/i]

This is relevant as my final setup was planned to be 15 disks, so only one more 
than the example.

So, do I drop one disk and go with 2x 7-drive vdevs, or stick with 3x 5-drive vdevs?

Also, does anyone have anything to add re the security of CIFS when used with 
Windows clients?

Thanks again guys, and gals...
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-17 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Lanky Doodle
 
 This is relevant as my final setup was planned to be 15 disks, so only one
 more than the example.
 
 So, do I drop one disk and go with 2x 7-drive vdevs, or stick with 3x 5-drive
 vdevs?

Both ways are fine.  Consider the balance between redundancy and drive
space.

Also, in the event of a resilver, the 3x5 raidz will be faster.  In rough
numbers, suppose you have 1TB drives, 70% full.  Then your resilver might be
8 days instead of 12 days.  That's important when you consider the fact that
during that window, you have degraded redundancy.  Another failed disk in
the same vdev would destroy the entire pool.

Also if a 2nd disk fails during resilver, it's more likely to be in the same
vdev, if you have only 2 vdev's.  Your odds are better with smaller vdev's,
both because the resilver completes faster, and the probability of a 2nd
failure in the same vdev is smaller.

For both performance and reliability reasons, I recommend nothing except
single-drive mirrors, except in extreme data-is-not-important situations.
At least, that's my recommendation until someday, when the resilver
efficiency is improved, or fixed.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-17 Thread Lanky Doodle
Thanks!

By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2 
disk mirrors - I am thinking of traditional RAID1 here.

Or do you mean 1 massive mirror with all 14 disks?

This is always a tough one for me. I too prefer RAID1 where redundancy is king, 
but the trade off for me would be 5TB of 'wasted' space - a total of 7TB in 
mirror vs 12TB in 3x RAIDZ.

Decisions, decisions.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-17 Thread Cindy Swearingen

You should take a look at the ZFS best practices guide for RAIDZ and
mirrored configuration recommendations:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

It's easy for me to say because I don't have to buy storage, but
mirrored storage pools are currently more flexible, provide good
performance, and replacing/resilvering data on disks is faster.

Thanks,

Cindy



On 12/17/10 09:48, Lanky Doodle wrote:

Thanks!

By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2 
disk mirrors - I am thinking of traditional RAID1 here.

Or do you mean 1 massive mirror with all 14 disks?

This is always a tough one for me. I too prefer RAID1 where redundancy is king, 
but the trade off for me would be 5TB of 'wasted' space - a total of 7TB in 
mirror vs 12TB in 3x RAIDZ.

Decisions, decisions.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-17 Thread Alexander Lesle
at Dezember, 17 2010, 17:48 Lanky Doodle wrote in [1]:

 By single drive mirrors, I assume, in a 14 disk setup, you mean 7
 sets of 2 disk mirrors - I am thinking of traditional RAID1 here.

 Or do you mean 1 massive mirror with all 14 disks?

Edward means a set of two-way-mirrors.

Do you remember what he wrote:
 Also, in the event of a resilver, the 3x5 radiz will be faster.  In rough
 numbers, suppose you have 1TB drives, 70% full.  Then your resilver might be
 8 days instead of 12 days.  That's important when you consider the fact that
 during that window, you have degraded redundancy.  Another failed disk in
 the same vdev would destroy the entire pool.

 Also if a 2nd disk fails during resilver, it's more likely to be in the same
 vdev, if you have only 2 vdev's.  Your odds are better with smaller vdev's,
 both because the resilver completes faster, and the probability of a 2nd
 failure in the same vdev is smaller.

And that scenario is a horrible notion. While the resilver is
running you have to hope that nothing else fails. In his example that is
between 192 and 288 hours - a long, very long time.
And be aware that a disk will break at some point.

 This is always a tough one for me. I too prefer RAID1 where
 redundancy is king, but the trade off for me would be 5TB of
 'wasted' space - a total of 7TB in mirror vs 12TB in 3x RAIDZ.

You lose the most space when you make a pool of mirrors, BUT
the I/O is much faster, it is safer, and you have
all the features of zfs too.
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance

 Decisions, decisions.

My suggestion is:
make a two-way mirror of small disks or SSDs for the OS. This is not
easy to do after installation; you have to look for a howto.
Sorry, I can't find the link at the moment.

At Sol11 Express, Oracle announced that at TestInstall you can set the
RootPool to mirror during installation. At the moment I am trying it out
in a VM but I didn't find this option. :-(

zpool create lankyserver mirror vdev1 vdev2 mirror vdev3 vdev4

When you need more space you can add another pair of disks to your
lankyserver. Each pair should have matching capacity.

zpool add lankyserver mirror vdev5 vdev6 mirror vdev7 vdev8  ...

Consider that it is a good decision to plan for one spare disk.
You can use the zpool add command when you want to add a
spare disk at a later time.
http://docs.sun.com/app/docs/doc/819-2240/zpool-1m?a=view

When you build a raidz pool, every disk in the vdev only contributes as much
space as the smallest disk has; the rest of any bigger disk is
wasted.
In a mirrored pool only the disks within a pair must be the same size, so you
can use a pair of 1 TB disks and a pair of 2 TB disks in the same pool. In
this scenario your spare disk _must have_ the biggest capacity.

Read this for your decision:
http://constantin.glez.de/blog/2010/01/home-server-raid-greed-and-why-mirroring-still-best

-- 
Best Regards
Alexander
Dezember, 17 2010

[1] mid:382802084.111292604519623.javamail.tweb...@sf-app1


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-17 Thread Bob Friesenhahn

On Fri, 17 Dec 2010, Edward Ned Harvey wrote:


Also if a 2nd disk fails during resilver, it's more likely to be in the same
vdev, if you have only 2 vdev's.  Your odds are better with smaller vdev's,
both because the resilver completes faster, and the probability of a 2nd
failure in the same vdev is smaller.


While I agree that smaller vdevs are more reliable, I find your statement 
that a failure is more likely to be in the same vdev if you have 
only 2 vdevs to be rather useless.  The probability of 
vdev failure does not have anything to do with the number of vdevs. 
However, the probability of vdev failure increases tremendously if 
there is only one vdev and there is a second disk failure.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-16 Thread Roy Sigurd Karlsbakk
 Also, at present I have 5x 1TB drives to use in my home server so I
 plan to create a RAID-Z1 pool which will have my shares on it (Movies,
 Music, Pictures etc). I then plan to increase this in sets of 5 (so
 another 5x 1TB drives in Jan and nother 5 in Feb/March so that I can
 avoid all disks being from the same batch). I did plan on creating
 seperate zpoolz with each set of 5 drives;
 
 drives 1-5 volume0 zpool
 drives 6-10 volume1 zpool
 drives 11-15 volume2 zpool

Although this seems a good idea to start with, there are issues with it 
performance-wise. If you fill up VDEV0 (drives 1-5) and then attach VDEV1 
(drives 6-10), new writes will still initially be striped across the two VDEVs, 
leading to a performance impact on writes. There is currently no way of 
rebalancing how full each VDEV is without manually backing up and restoring, 
or copying the data from one place to another within the pool and then 
removing the original data.
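
If you do end up with one VDEV much fuller than the other, you can at least watch 
how ZFS spreads capacity and I/O across them with something like (pool name assumed):

 # zpool iostat -v mypool 10

which prints per-vdev capacity and bandwidth every 10 seconds.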

 so that I can sustain 3 simultaneous drive failures, as long as it's
 one drive from each set. However I think this will mean each zpool
 will have independent shares which I don't want. I have used this
 guide - http://southbrain.com/south/tutorials/zpools.html - which says
 you can combine zpools into a 'parent' zpool, but can this be done in
 my scenario (staggered) as it looks like the child zpools have to be
 created before the parent is done. So basically I'd need to be able
 to;

For the scheme to work as above, start with something like

 # zpool create mypool raidz1 c0t1d0 c0t2d0 c0t3d0 c2t4d0 c2t5d0

Later, you'll add the new vdev

 # zpool add mypool raidz1 c0t6d0 c0t7d0 c0t8d0 c2t9d0 c2t10d0

This will work as described above. However, I would do this somewhat 
differently. Start off with, say, 6 1TB drives in RAIDz2 and set autoexpand=on 
on the pool (remember compression=on on the pool's top-level filesystem too).

 # zpool create mypool raidz2 c0t1d0 c0t2d0 c0t3d0 c2t4d0 c2t5d0 c2t6d0
 # zpool set autoexpand=on mypool
 # zfs set compression=on mypool

Compression is lzjb, and it won't compress much for audio or video, but then, 
it won't hurt much either. When this starts to get close to full, get 
new, larger drives and replace the older 1TB drives one by one. Once 
all are replaced by larger, say 1.5TB drives, whoops, your array is larger. This 
will scale better performance-wise and you won't need that many controllers. 
Also, with RAIDz2, you can lose any two drives.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is 
an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and 
relevant synonyms exist in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-16 Thread Freddie Cash
On Thu, Dec 16, 2010 at 12:59 AM, Lanky Doodle lanky_doo...@hotmail.com wrote:
 I have been playing with ZFS for a few days now on a test PC, and I plan to 
 use it for my home media server after being very impressed!

Works great for that.  Have a similar setup at home, using FreeBSD.

 Also, at present I have 5x 1TB drives to use in my home server so I plan to 
 create a RAID-Z1 pool which will have my shares on it (Movies, Music, 
 Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB 
 drives in Jan and another 5 in Feb/March so that I can avoid all disks being 
 from the same batch). I did plan on creating separate zpools with each set of 
 5 drives;

No no no.  Create 1 pool.

Create the pool initially with a single 5-drive raidz vdev.

Later, add the next five drives to the system, and create a new raidz
vdev *in the same pool*.  Voila.  You now have the equivalent of a
RAID50, as ZFS will stripe writes to both vdevs, increasing the
overall size *and* speed of the pool.

Later, add the next five drives to the system, and create a new raidz
vdev in the same pool.  Voila.  You now have a pool with 3 vdevs, with
read/writes being striped across all three.

You can still lose 3 drives (1 per vdev) before losing the pool.

The commands to do this are along the lines of:

# zpool create mypool raidz disk1 disk2 disk3 disk4 disk5

# zpool add mypool raidz disk6 disk7 disk8 disk9 disk10

# zpool add mypool raidz disk11 disk12 disk13 disk14 disk15

Creating 1 pool gives you the best performance and the most
flexibility.  Use separate filesystems on top of that pool if you want
to tweak all the different properties.
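
For example (the filesystem names are just an idea, and sharesmb assumes the
in-kernel CIFS service is installed):

# zfs create mypool/movies
# zfs create mypool/music
# zfs set compression=on mypool/music
# zfs set sharesmb=on mypool/movies

Each filesystem inherits from the pool but can override compression, sharing,
quotas, and so on individually.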

Going with 1 pool also increases your chances for dedupe, as dedupe is
done at the pool level.

-- 
Freddie Cash
fjwc...@gmail.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-16 Thread Cindy Swearingen

Hi Lanky,

Other follow-up posters have given you good advice.

I don't see where you are getting the idea that you can combine
pools with pools. You can't do this and I don't see that the
southbrain tutorial illustrates this either. All of his examples
for creating redundant pools are reasonable.

As others have said, you can create a RAIDZ pool with one vdev
of say 5 disks, and then later add another 5 disks, and so on.

Thanks,

Cindy

On 12/16/10 01:59, Lanky Doodle wrote:

Hiya,

I have been playing with ZFS for a few days now on a test PC, and I plan to use 
it for my home media server after being very impressed!

I've got the basics of creating zpools and zfs filesystems with compression and 
dedup etc, but I'm wondering if there's a better way to handle security. I'm 
using Windows 7 clients by the way.

I have used this 'guide' to do the permissions - http://www.slepicka.net/?p=37

Also, at present I have 5x 1TB drives to use in my home server so I plan to 
create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures 
etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in 
Jan and another 5 in Feb/March so that I can avoid all disks being from the same 
batch). I did plan on creating separate zpools with each set of 5 drives;

drives 1-5 volume0 zpool
drives 6-10 volume1 zpool
drives 11-15 volume2 zpool

so that I can sustain 3 simultaneous drive failures, as long as it's one drive 
from each set. However I think this will mean each zpool will have independent 
shares which I don't want. I have used this guide - 
http://southbrain.com/south/tutorials/zpools.html - which says you can combine 
zpools into a 'parent' zpool, but can this be done in my scenario (staggered) 
as it looks like the child zpools have to be created before the parent is done. 
So basically I'd need to be able to:

Create volume0 zpool now
Create volume1 zpool in Jan, then combine volume0 and volume1 into a parent 
zpool
Create volume2 in Feb/March and add to parent zpool

I know I could just add each disk to the volume0 zpool but I've read it's a bugger 
to do and that creating separate zpools with new disks is a much better way to 
go.

I think that's it for now. Sorry for the mammoth first post!

Thanks

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-16 Thread Lanky Doodle
Thanks for the reply.

In that case, wouldn't it be better to, as you say, start with a 6 drive Z2, 
then just keep adding drives until the case is full, for a single Z2 zpool?

Or even Z3, if that's available now?

I have an 11x 5.25" bay case, with 3x 5-in-3 hot swap caddies giving me 15 
drive bays. Hence the plan to start with 5, then 10, then all the way to 15.

This seems a more logical (and cheaper) solution than keep replacing with 
bigger drives as they come to market.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A few questions

2010-12-16 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Lanky Doodle
 
 In that case, wouldn't it be better to, as you say, start with a 6 drive Z2,
 then just keep adding drives until the case is full, for a single Z2 zpool?

Doesn't work that way.

You can create a vdev, and later, you can add more vdev's.  So you can
create a raidz now, and later you can add another raidz.  But you cannot
create a raidz now, and later just add onesy-twosy disks to increase the
size incrementally.
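
In other words (device names made up), this works:

 # zpool add tank raidz disk6 disk7 disk8 disk9 disk10

but a plain zpool add tank disk6 will complain about mismatched redundancy
and only go through with -f, and then that lone disk has no redundancy at all.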


 Or even Z3, if that's available now?

Raidz3 is available now.  There is only one thing to be aware of:  ZFS
resilvering is very inefficient for typical usage scenarios.  The time to
resilver is divided by the number of vdevs in the pool (meaning 10 mirrors
will resilver roughly 10x faster than an equivalently sized raidzN), and it
grows again as you put more disks into a single vdev.  Due to that
inefficiency, we're talking about 12 hours (on my server) to resilver a 1TB
disk which is around 70% used.  This would have been ~3 weeks if I had one
big raidz3.  So it matters.

Your multiple raidz vdev's of each 5-6 disks is a reasonable compromise.


 I have an 11x 5.25" bay case, with 3x 5-in-3 hot swap caddies giving me 15
 drive bays. Hence the plan to start with 5, then 10, then all the way to 15.
 
 This seems a more logical (and cheaper) solution than keep replacing with
 bigger drives as they come to market.

'Course, you can also replace with bigger drives as they come to market.
;-)

If you've got 5 disks in a raidz...  First scrub it.  Then, replace one disk
with a larger disk, and wait for resilver.  Replace each disk, one by one,
with larger disks.  And eventually when you do the last one ... Your pool
becomes larger.  (Depending on your defaults, manual intervention may be
required to make the pool autoexpand when the devices have all been
upgraded.)
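
In zpool terms that's roughly (device names are placeholders, and autoexpand
assumes a reasonably recent zpool version):

 # zpool scrub tank
 # zpool replace tank c0t1d0 c0t6d0    # old disk, new larger disk; repeat per disk
 # zpool status tank                   # let each resilver finish before the next
 # zpool set autoexpand=on tank        # or: zpool online -e tank <disk> per disk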

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

