Re: [zfs-discuss] A few questions
On Sat, Jan 08, 2011 at 12:33:50PM -0500, Edward Ned Harvey wrote:
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Garrett D'Amore
>
> > When you purchase NexentaStor from a top-tier Nexenta Hardware Partner,
> > you get a product that has been through a rigorous qualification process
>
> How do I do this, exactly? I am serious. Before too long, I'm going to need another server, and I would very seriously consider reprovisioning my unstable Dell Solaris server to become a Linux or some other stable machine. The role it's currently fulfilling is the backup server, which basically does nothing except zfs receive from the primary Sun Solaris 10u9 file server. Since the role is just for backups, it's a perfect opportunity for experimentation, hence the Dell hardware with Solaris. I'd be happy to put some other configuration in there experimentally instead ... say ... Nexenta, assuming it will be just as good at zfs receive from the primary server.
>
> Is there some specific hardware configuration you guys sell? Or recommend? How about a Dell R510/R610/R710? Buy the hardware separately and buy NexentaStor as just a software product? Or buy a somehow more certified hardware + software bundle together?
>
> If I do encounter a bug, where the only known fact is that the system keeps crashing intermittently on an approximately weekly basis, and there is absolutely no clue what's wrong in hardware or software... how do you guys handle it?
>
> If you'd like to follow up offlist, that's fine. Then just email me at the email address: nexenta at nedharvey.com (I use disposable email addresses on mailing lists like this, so at any random unknown time, I'll destroy my present alias and start using a new one.)

Hey,

Other OSes have had problems with the Broadcom NICs as well. See for example this RHEL5 bug: https://bugzilla.redhat.com/show_bug.cgi?id=520888 "Host crashing probably due to MSI-X IRQs with bnx2 NIC"
And VMware vSphere ESX/ESXi 4.1 crashing with bnx2x: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1029368

So I guess there are firmware/driver problems affecting not just Solaris but also other operating systems..

-- Pasi

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
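The backup role Edward describes — a second box whose only job is zfs receive from the primary — can be sketched as an incremental replication loop. This is a minimal sketch, not anything from the thread; the pool, dataset, and host names are hypothetical, and it assumes a previously replicated snapshot named `@prev` exists on both sides:

```shell
#!/bin/sh
# Incremental ZFS replication sketch -- hypothetical names throughout.
SRC=tank/data          # dataset on the primary server
DST=backup/data        # dataset on the backup server
HOST=backupserver      # the box that only runs zfs receive

NOW=snap-$(date +%Y%m%d%H%M)
zfs snapshot "$SRC@$NOW"

# Send only the delta since the last replicated snapshot; -F on the
# receiving side rolls the target back to the common snapshot first.
zfs send -i "$SRC@prev" "$SRC@$NOW" | ssh "$HOST" zfs receive -F "$DST"
```

In practice a cron job would also rotate the snapshot names so that each run's `@NOW` becomes the next run's `@prev`.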
Re: [zfs-discuss] A few questions
> From: Pasi Kärkkäinen [mailto:pa...@iki.fi]
>
> Other OSes have had problems with the Broadcom NICs as well..

Yes. The difference is, when I go to support.dell.com and punch in my service tag, I can download updated firmware and drivers for RHEL that (at least supposedly) solve the problem. I haven't tested it, but the Dell support guy told me it has worked for RHEL users. There is nothing available to download for Solaris.

Also, the bcom is not the only problem on that server. After I added on an Intel network card and disabled the bcom, the weekly crashes stopped, but now it's ... I don't know ... once every 3 weeks with a slightly different mode of failure. This is, yet again, rare enough that the system could very well pass a certification test, but not rare enough for me to feel comfortable putting it into production as a primary mission-critical server.

I really think there are only two ways in the world to engineer a good solid server:

(a) Smoke your own crack. Systems engineering teams use the same systems that are sold to customers.

or (b) Sell millions of 'em. So regardless of whether the engineering team uses them, you're still going to have sufficient mass to dedicate engineers to the purpose of post-sales bug solving.

I suppose there is a third way, which has certainly happened in history but is not very applicable to me: simply charge such ridiculously high prices for your servers that you can dedicate engineers to post-sales bug solving, even if you only sold a handful of those systems in the whole world. Things like munitions-strength Cray and AlphaServers have sometimes fit into this category in the past.

I do feel confident assuming that Solaris kernel engineers use Sun servers primarily for their server infrastructure. So I feel safe buying this configuration. The only thing to gain by buying something else is lower prices... or maybe some obscure fringe detail that I can't think of.
Re: [zfs-discuss] A few questions
On Jan 9, 2011, at 4:19 PM, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

> Yes. The difference is, when I go to support.dell.com and punch in my service tag, I can download updated firmware and drivers for RHEL that (at least supposedly) solve the problem. ... There is nothing available to download for Solaris.

The drivers are written by Broadcom and are, AFAIK, closed source. By going through Dell, you are going through a middle-man. See, for example, http://www.broadcom.com/support/ethernet_nic/netxtremeii10.php where you see the release of the Solaris drivers was at the same time as Windows.

> I really think there are only two ways in the world to engineer a good solid server:
>
> (a) Smoke your own crack. Systems engineering teams use the same systems that are sold to customers.

This is rarely practical, not to mention that product development is often not in the systems engineering organization.

> or (b) Sell millions of 'em. So despite whether or not the engineering team uses them, you're still going to have sufficient mass to dedicate engineers to the purpose of post-sales bug solving.

yes, indeed :-)

-- richard
Re: [zfs-discuss] A few questions
Just to add a bit to this; I just love sweeping generalizations...

On 9 Jan 2011, at 19:33, Richard Elling wrote:

> The drivers are written by Broadcom and are, AFAIK, closed source. By going through Dell, you are going through a middle-man. For example, http://www.broadcom.com/support/ethernet_nic/netxtremeii10.php where you see the release of the Solaris drivers was at the same time as Windows.

What Richard says is true. Broadcom have been a source of contention in the Linux world as well as the *BSD world due to the proprietary nature of their firmware. OpenSolaris/Solaris users are not the only ones who have complained about this. There's been much uproar in the FOSS community about Broadcom and their drivers. As a result, I've seen some pretty nasty hacks, like people linking the Windows drivers into their kernel - *gack*. I forget all the gory details, but it was rather disgusting as I recall: bubblegum, baling wire, duct tape and all.

Dell and Red Hat aren't exactly a marriage made in heaven either. I've had problems getting support from both Dell and Red Hat, with them pointing fingers at each other rather than solving the problem. Like most people, I've had to come up with my own work-arounds - like others with the Broadcom issue, using a known-quantity NIC. When dealing with Dell as a corporate buyer, they have always made it quite clear that they are primarily a Windows platform. Linux? Oh yes, we have that too...
> Also, the bcom is not the only problem on that server. After I added on an Intel network card and disabled the bcom, the weekly crashes stopped, but now it's ... I don't know ... once every 3 weeks with a slightly different mode of failure.

I've never been particularly warm and fuzzy with Dell servers. They seem to like to change their chipsets slightly while a model is in production. This can cause all sorts of problems which are difficult to diagnose, since an identical Dell system will have no problems while its mate crashes weekly.

As for certified systems, it's my understanding that Nexenta themselves don't certify anything. They have systems which are recommended and supported by their network of VARs. It just so happens that SuperMicro is one of the brands of choice, but even then one must adhere to a fairly tight HCL. The same holds true for Solaris/OpenSolaris with third-party hardware. SATA controllers and multiplexers are another example of the drivers being written by the manufacturer, and Solaris/OpenSolaris are not a priority over Windows and Linux, in that order. Deviating to items which are not somewhat plain vanilla and are not listed on the HCL is just asking for trouble.
Mike

---
Michael Sullivan
michael.p.sulli...@me.com
http://www.kamiogi.net/
Mobile: +1-662-202-7716
US Phone: +1-561-283-2034
JP Phone: +81-50-5806-6242
Re: [zfs-discuss] A few questions
> As for certified systems, it's my understanding that Nexenta themselves don't certify anything. They have systems which are recommended and supported by their network of VARs.

The certified solutions listed on Nexenta's website were certified by Nexenta.
Re: [zfs-discuss] A few questions
On 01/ 6/11 05:28 AM, Edward Ned Harvey wrote:

> How did you learn about the Broadcom issue for the first time? I had to learn the hard way, and with all the involvement of both Dell and Oracle support teams, nobody could tell me what I needed to change. ... Next time I buy a server, I do not have confidence to simply expect Solaris on Dell to work reliably. The same goes for Solaris derivatives, and all non-Sun hardware. There simply is not an adequate qualification and/or support process.

When you purchase NexentaStor from a top-tier Nexenta Hardware Partner, you get a product that has been through a rigorous qualification process, which includes the hardware and software configuration matched together and tested with an extensive battery. You also can get a higher level of support than is offered to people who build their own systems. Oracle is *not* the only company capable of performing in-depth testing of Solaris.

I also know enough about problems that Oracle customers (or rather Sun customers) faced with Solaris on Sun hardware -- such as the terrible NVIDIA ethernet problems on first-generation U20 and U40 systems, or the Marvell SATA problems on Thumper -- to know that your picture of Oracle isn't nearly as rosy as you believe. Of course, I also lived (as a Sun employee) through the UltraSPARC-II ECC fiasco...

- Garrett
Re: [zfs-discuss] A few questions
On Thu, Jan 6, 2011 at 11:36 PM, Garrett D'Amore garr...@nexenta.com wrote:

> When you purchase NexentaStor from a top-tier Nexenta Hardware Partner, you

Where is the list? Is this the one on http://www.nexenta.com/corp/technology-partners-overview/certified-technology-partners ?

> get a product that has been through a rigorous qualification process which includes the hardware and software configuration matched together, tested with an extensive battery. You also can get a higher level of support than is offered to people who build their own systems. Oracle is *not* the only company capable of performing in depth testing of Solaris.

Does this roughly mean I can expect similar (or even better) hardware compatibility support with NexentaStor on SuperMicro as Solaris on Oracle/Sun hardware?

-- Fajar
Re: [zfs-discuss] A few questions
On 08.01.11 18:33, Edward Ned Harvey wrote:

> How do I do this, exactly? I am serious. Before too long, I'm going to need another server, and I would very seriously consider reprovisioning my unstable Dell Solaris server to become a linux or some other stable machine. ... If I do encounter a bug, where the only known fact is that the system keeps crashing intermittently on an approximately weekly basis, and there is absolutely no clue what's wrong in hardware or software... How do you guys handle it?

Hmm… that'd interest me as well. I have 4 Dell PE R610s running OSol or Sol11Expr. I actually bought a Sun Fire X4170 M2, since I couldn't get my R610s stable, just as Edward points out.

So, if you guys think that NexentaStor avoids these issues, then I'd seriously consider jumping ship - so either please don't continue offlist, or please include me in that conversation. ;)

Cheers,
budy
Re: [zfs-discuss] A few questions
On 01/ 8/11 10:43 AM, Stephan Budach wrote:

> On 08.01.11 18:33, Edward Ned Harvey wrote:
> > If I do encounter a bug, where the only known fact is that the system keeps crashing intermittently on an approximately weekly basis, and there is absolutely no clue what's wrong in hardware or software... How do you guys handle it?

Such problems are handled on a case-by-case basis. Usually we can do some analysis from a crash dump, but not always. My team includes several people who are experienced with such analysis, and when problems like this occur, we are called into action. Ultimately this usually results in a patch, sometimes workaround suggestions, and sometimes even binary relief (which happens faster than a regular patch, but without the deeper QA).

- Garrett
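The crash-dump analysis Garrett mentions typically starts with the Solaris modular debugger, mdb. As a rough sketch (the file names assume the default savecore location and dump numbering; on systems that write compressed vmdump files, the dump must first be expanded with savecore -f), a first-pass triage session might look like this:

```shell
# Hypothetical paths; savecore writes numbered unix.N/vmcore.N pairs here.
cd /var/crash/$(hostname)

mdb unix.0 vmcore.0 <<'EOF'
::status      # panic string and dump summary
::msgbuf      # last kernel messages before the crash
::stack       # stack trace of the panicking thread
::panicinfo   # register state at panic time
EOF
```

The panic string and stack trace from a session like this are usually what a support team asks for first, since they narrow the fault to a driver or subsystem.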
Re: [zfs-discuss] A few questions
On 06/01/2011 00:14, Edward Ned Harvey wrote:

> And guess what solaris engineers don't use? Non-sun hardware. Pretty safe bet you won't find any Dell servers in the server room where solaris developers do their thing.

You would lose that bet: not only would you find Dell, you would find many other big names, as well as white-box hand-built systems too. Solaris developers use a lot of different hardware - Sun never made laptops, so many of us have Apple (running Solaris on the metal and/or under virtualisation) or Toshiba or Fujitsu etc. laptops. There are also many workstations around the company that aren't Sun hardware, as well as servers.

--
Darren J Moffat
Re: [zfs-discuss] A few questions
I've deployed large SANs on both SuperMicro 825/826/846 and Dell R610/R710s and I've not found any issues so far. I always make a point of installing Intel chipset NICs on the Dells and disabling the Broadcom ones, but other than that it's always been plain sailing - hardware-wise anyway. I've always found that the real issue is formulating SOPs to match what the organisation is used to with legacy storage systems, educating the admins who will manage it going forward, and doing the technical hand-over to folks who may not know, or want to know, a whole lot of *nix land. My 2p. YMMV.

---
W. A. Khushil Dep - khushil@gmail.com - 07905374843
Windows - Linux - Solaris - ZFS - Nexenta - Development - Consulting  Contracting
http://www.khushil.com/ - http://www.facebook.com/GlobalOverlord

On 6 January 2011 00:14, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

> From: Richard Elling [mailto:richard.ell...@nexenta.com]
>
> > I'll agree to call Nexenta a major commercial interest, in regards to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware.
>
> NexentaStor is officially supported on Dell, HP, and IBM hardware. The only question is, what is your definition of 'support'? Many NexentaStor

I don't want to argue about this, but I'll just try to clarify what I meant: presently, I have a Dell server with officially supported Solaris, and it's as unreliable as pure junk. It's just the backup server, so I'm free to frequently create and destroy it... and as such, I frequently do recreate and destroy it. It is entirely stable running RHEL (CentOS), because Dell and Red Hat have a partnership with a serious number of human beings and machines looking for and fixing any compatibility issues. For my Solaris instability, I blame the fact that Solaris developers don't do significant quality assurance on non-Sun hardware.
To become officially compatible, the whole qualification process is like this: somebody installs it, doesn't see any problems, and then calls it certified. They reformat with something else, and move on. They don't build their business on that platform, so they don't detect stability issues like the ones reported - system crashes once per week and so forth. Solaris therefore passes the test, and becomes one of the options available on the drop-down menu of OSes for a new server. (Of course that's been discontinued by Oracle, but that's how it was in the past.)

Developers need to eat their own food. Smoke your own crack. Hardware engineers at Dell need to actually use your OS on their hardware, for their development efforts. I would be willing to bet Sun hardware engineers use a significant percentage of Solaris servers for their work... And guess what Solaris engineers don't use? Non-Sun hardware. Pretty safe bet you won't find any Dell servers in the server room where Solaris developers do their thing.

If you want to be taken seriously as an alternative storage option, you've got to at LEAST be listed as a factory-distributed OS that is an option to ship with the new server, and THEN, when people such as myself buy those things, we've got to have a good enough experience that we don't all bitch and flame about it afterward. Nexenta, you need a real and serious partnership with Dell, HP, IBM. Get their developers to run YOUR OS on the servers which they use for development. Get them to sell your product bundled with their product. And dedicate real and serious engineering to bugfixes, working with customers to truly identify root causes of instability, with a real OS development and engineering and support group. It's got to be STABLE; that's the #1 requirement. I previously made the comparison... even closed-source Solaris ZFS is a better alternative to closed-source NetApp WAFL.
So for now, those are the only two enterprise-supportable options I'm willing to stake my career on, and I'll buy Sun hardware with Solaris. But I really wish I could feel confident buying a cheaper Dell server and running ZFS on it.

Nexenta, if you make yourself look like a serious competitor against Solaris, and really truly form an awesome stable partnership with Dell, I will happily buy your stuff instead of Oracle's, even if you are a little behind in feature offering. But I will not buy your stuff if I can't feel perfectly confident in its stability. Ever heard the phrase "Nobody ever got fired for buying IBM"? You're the little guys. If you want to compete against the big guys, you've got to kick ass. And don't get sued into oblivion. Even today's feature set is perfectly adequate for at least a couple of years to come. If you put all your effort into stability and bugfixes, serious partnerships with Dell, HP, IBM, and become extremely professional-looking and stable, with fanatical support... You don't have to
Re: [zfs-discuss] A few questions
> From: Richard Elling [mailto:richard.ell...@nexenta.com]
>
> If I understand correctly, you want Dell, HP, and IBM to run OSes other

I agree, but neither Dell, HP, nor IBM develop Windows...

> I'm not sure of the current state, but many of the Solaris engineers develop on laptops and Sun did not offer a laptop product line. You will find them where Nexenta developers live :-)
>
> Wait a minute... this is patently false. The big storage vendors: NetApp, EMC, Hitachi, Fujitsu, LSI... none run on HP, IBM, or Dell servers.

Like I said, not interested in arguing. This is mostly just a bunch of contradictions to what I said. To each his own. My conclusion is that I am not willing to stake my career on the underdog alternative when I know I can safely buy the Sun hardware and Solaris. I experimented once by buying Solaris on Dell. It was a proven failure, but that's why I did it on a cheap, noncritical backup system experimentally before expecting it to work in production. Haven't seen any underdog proven solid enough for me to deploy in the enterprise yet.
Re: [zfs-discuss] A few questions
This is a silly argument, but...

> Haven't seen any underdog proven solid enough for me to deploy in enterprise yet.

I haven't seen any overdog proven solid enough for me to be able to rely on either. Certainly not Solaris. Don't get me wrong, I like(d) Solaris. But every so often you'd find a bug and they'd take an age to fix it (or to declare that they wouldn't fix it). In one case we had 18 months between reporting a problem and Sun fixing it. In another case it was around 3 months, and because we happened to have the source code, we even told them where the bug was and what a fix could be. Solaris (and the other overdogs) are worth it when you want someone else to do the grunt work and someone else to point at and blame, but let's not romanticize how good it or any of the others are. What made Solaris (10 at least) worth deploying were its features (DTrace, ZFS, SMF, etc.).

Julian

--
Julian King
Computer Officer, University of Cambridge, Unix Support
Re: [zfs-discuss] A few questions
> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
>
> On Wed, 5 Jan 2011, Edward Ned Harvey wrote:
>
> > with regards to ZFS and all the other projects relevant to solaris.) I know in the case of SGE/OGE, it's officially closed source now. As of Dec 31st, sunsource is being decommissioned, and the announcement of officially closing the SGE source and decommissioning the open source community went out on Dec 24th. So all of this leads me to believe, with very little reservation, that the new developments beyond zpool 28 are closed source moving forward. There's very little breathing room remaining for hope of that being open sourced again.
>
> I have no idea what you are talking about. Best I can tell, SGE/OGE is a reference to Sun Grid Engine, which has nothing to do with zfs. The only announcement and discussion I can find via Google is written by you. It was pretty clear even a year ago that Sun Grid Engine was going away.

Agreed, SGE/OGE has nothing to do with ZFS, unless you believe there's an Oracle culture which might apply to both. The only thing written by me, as I recall, included links to the original official announcements. Following those links now, I see the archives have been decommissioned. So there ya go. Since it's still in my inbox, I just saved a copy for you here... It is long-winded, and the main points are: SGE (now called OGE) is officially closed-source, and sunsource.net decommissioned. There is an open source fork, which will not share code development with the closed-source product.

http://dl.dropbox.com/u/543241/SGE_officially_closed/GE%20users%20GE%20announce%20Changes%20for%20a%20Bright%20Future%20at%20Oracle.txt
Re: [zfs-discuss] A few questions
> From: Khushil Dep [mailto:khushil@gmail.com]
>
> I've deployed large SAN's on both SuperMicro 825/826/846 and Dell R610/R710's and I've not found any issues so far. I always make a point of installing Intel chipset NIC's on the DELL's and disabling the Broadcom ones but other than that it's always been plain sailing - hardware-wise anyway.

"Not found any issues" - except the Broadcom one, which causes the system to crash regularly in the default factory configuration. How did you learn about the Broadcom issue for the first time? I had to learn the hard way, and with all the involvement of both Dell and Oracle support teams, nobody could tell me what I needed to change. We literally replaced every component of the server twice over a period of 1 year, and I spent man-days upgrading and downgrading firmwares, randomly trying to find a stable configuration. I scoured the internet to find this little tidbit about replacing the Broadcom NIC, randomly guessed, and replaced my NIC with an Intel card to make the problem go away. The same system doesn't have a problem running RHEL/CentOS. What will be the new problem in the next line of servers? Why, during my internet scouring, did I find a lot of other reports of people who needed to disable C-states (didn't work for me), and lots of false leads indicating a firmware downgrade would fix my Broadcom issue?

See my point? Next time I buy a server, I do not have confidence to simply expect Solaris on Dell to work reliably. The same goes for Solaris derivatives, and all non-Sun hardware. There simply is not an adequate qualification and/or support process.
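The "disable the Broadcom ports" workaround that keeps coming up in this thread can also be done from the OS side on Solaris-family systems. A rough sketch, assuming the Broadcom ports attach as bnx instances (the interface names here are examples; check your own with dladm first, and note that many people instead simply disable the onboard ports in the BIOS, which avoids touching driver configuration at all):

```shell
# List physical NICs and the drivers bound to them,
# e.g. bnx0/bnx1 (Broadcom) vs. e1000g0 (Intel).
dladm show-phys

# Unplumb the Broadcom interfaces so they carry no traffic.
ifconfig bnx0 unplumb
ifconfig bnx1 unplumb

# Optionally unbind the bnx driver entirely so it never attaches
# again at boot (reversible later with add_drv).
rem_drv bnx
```

On older Solaris 10 builds without `dladm show-phys`, `dladm show-dev` gives a similar listing.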
Re: [zfs-discuss] A few questions
Two fold really - firstly I remember the headaches I used to have configuring Broadcom cards properly under Debian/Ubuntu but the sweetness that was using an Intel NIC. Bottom line for me was that I know Intel drivers have been around longer than Broadcom drivers and thus it would make sense to ensure that we had Intel NIC's on the server. Secondly, I asked Andy Bennett from Nexenta who told me it would make sense - always good to get a second opinion :-) There were/are reports all over Google about Broadcom issues with Solaris/OpenSolaris so I didn't want to risk it. For a couple of hundred for a quad port gig NIC - it's worth it when the entire solution is 90K+. Sometimes (like the issue with bus-resets when some brands/firmware-revs of SSD's are used) the knowledge comes from people you work with (Nexenta rode to the rescue here again - plug! plug! plug!) :-) These are deployed in a couple of Universities and a very large data capture/marketing company I used to work for and I know it works really well and (plug! plug! plug) I know the dedicated support I got from the Nexenta guys. The difference as I see it is that OpenSolaris/ZFS/DTrace/FMA allow you to build your own solution to your own problem. Thinking of storage in a completely new way instead of just a block of storage it becomes an integrated part of performance engineering - certainly has been for the last two installs I've been involved in. I know why folks want a Certified solution with the likes of Dell/HP etc but from my point of view (and all points of view are valid here), I know I can deliver a cheaper, more focussed (and when I say that I'm not just doing some marketing bs) solution for the requirement at hand. It's sometimes a struggle to get customers/end-users to think of storage as more than just storage. There's quite a lot of entrenched thinking to get around/over in our field (try getting a Java dev to think clearly about thread handling and massive SMP drawbacks for example). 
Anyway - not trying to engage in an argument but it's always interesting to find out why someone went for certain solutions over others. My 2p. YMMV. *goes off to collect cheque from Nexenta* ;-) --- W. A. Khushil Dep - khushil@gmail.com - 07905374843 Windows - Linux - Solaris - ZFS - Nexenta - Development - Consulting - Contracting http://www.khushil.com/ - http://www.facebook.com/GlobalOverlord On 6 January 2011 13:28, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: Khushil Dep [mailto:khushil@gmail.com] I've deployed large SAN's on both SuperMicro 825/826/846 and Dell R610/R710's and I've not found any issues so far. I always make a point of installing Intel chipset NIC's on the DELL's and disabling the Broadcom ones but other than that it's always been plain sailing - hardware-wise anyway. not found any issues, except the broadcom one which causes the system to crash regularly in the default factory configuration. How did you learn about the broadcom issue for the first time? I had to learn the hard way, and with all the involvement of both Dell and Oracle support teams, nobody could tell me what I needed to change. We literally replaced every component of the server twice over a period of 1 year, and I spent man-days upgrading and downgrading firmwares randomly trying to find a stable configuration. I scoured the internet to find this little tidbit about replacing the broadcom NIC, and randomly guessed, and replaced my nic with an intel card to make the problem go away. The same system doesn't have a problem running RHEL/centos. What will be the new problem in the next line of servers? Why, during my internet scouring, did I find a lot of other reports, of people who needed to disable c-states (didn't work for me) and lots of false leads indicating firmware downgrade would fix my broadcom issue? See my point? Next time I buy a server, I do not have confidence to simply expect solaris on dell to work reliably. 
The same goes for solaris derivatives, and all non-sun hardware. There simply is not an adequate qualification and/or support process.
Re: [zfs-discuss] A few questions
On Jan 5, 2011, at 7:44 AM, Edward Ned Harvey wrote: From: Khushil Dep [mailto:khushil@gmail.com] We do have a major commercial interest - Nexenta. It's been quiet but I do look forward to seeing something come out of that stable this year? :-) I'll agree to call Nexenta a major commercial interest, in regards to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware. NexentaStor is officially supported on Dell, HP, and IBM hardware. The only question is, what is your definition of 'support'? Many NexentaStor customers today appear to be deploying on SuperMicro and Quanta systems, for obvious cost reasons. Nexenta has good working relationships with these major vendors and others. As for investment, Nexenta has hired, and continues to hire, the best engineers and professional services people we can find. We see a lot of demand in the market and have been growing at an astonishing rate. If you'd like to contribute to making software storage solutions rather than whining about what Oracle won't discuss, check us out and send me your resume :-) -- richard
Re: [zfs-discuss] A few questions
On Jan 5, 2011, at 4:14 PM, Edward Ned Harvey wrote: From: Richard Elling [mailto:richard.ell...@nexenta.com] I'll agree to call Nexenta a major commercial interest, in regards to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware. NexentaStor is officially supported on Dell, HP, and IBM hardware. The only question is, what is your definition of 'support'? Many NexentaStor I don't want to argue about this, but I'll just try to clarify what I meant: Presently, I have a dell server with officially supported solaris, and it's as unreliable as pure junk. It's just the backup server, so I'm free to frequently create and destroy it... And as such, I frequently do recreate and destroy it. It is entirely stable running RHEL (centos) because Dell and RedHat have a partnership with a serious number of human beings and machines looking for and fixing any compatibility issues. For my solaris instability, I blame the fact that solaris developers don't do significant quality assurance on non-sun hardware. To become officially compatible, the whole qualification process is like this: Somebody installs it, doesn't see any problems, and then calls it certified. They reformat with something else, and move on. They don't build their business on that platform, so they don't detect stability issues like the ones reported... System crashes once per week and so forth. Solaris therefore passes the test, and becomes one of the options available on the drop-down menu for OSes with a new server. (Of course that's been discontinued by oracle, but that's how it was in the past.) If I understand correctly, you want Dell, HP, and IBM to run OSes other than Microsoft and RHEL. For the thousands of other OSes out there, this is a significant barrier to entry. 
One can argue that the most significant innovations in the past 5 years came from none of those companies -- they came from Google, Apple, Amazon, Facebook, and the other innovators who did not spend their efforts trying to beat Microsoft and get into the manufacturing floor of the big vendors. Developers need to eat their own food. I agree, but neither Dell, HP, nor IBM develops Windows... Smoke your own crack. Hardware engineers at Dell need to actually use your OS on their hardware, for their development efforts. I would be willing to bet Sun hardware engineers use a significant percentage of solaris servers for their work... And guess what solaris engineers don't use? Non-sun hardware. I'm not sure of the current state, but many of the Solaris engineers develop on laptops and Sun did not offer a laptop product line. Pretty safe bet you won't find any Dell servers in the server room where solaris developers do their thing. You will find them where Nexenta developers live :-) If you want to be taken seriously as an alternative storage option, you've got to at LEAST be listed as a factory-distributed OS that is an option to ship with the new server, and THEN, when people such as myself buy those things, we've got to have a good enough experience that we don't all bitch and flame about it afterward. Wait a minute... this is patently false. The big storage vendors: NetApp, EMC, Hitachi, Fujitsu, LSI... none run on HP, IBM, or Dell servers. Nexenta, you need a real and serious partnership with Dell, HP, IBM. Get their developers to run YOUR OS on the servers which they use for development. Get them to sell your product bundled with their product. And dedicate real and serious engineering into bugfixes working with customers, to truly identify root causes of instability, with a real OS development and engineering and support group. It's got to be STABLE, that's the #1 requirement. There are many marketing activities in progress towards this end. 
One of Nexenta's major OEMs (Compellent) is being purchased by Dell. The deal is not done, so there is no public information on future plans, to my knowledge. I previously made the comparison... Even closed-source solaris ZFS is a better alternative to closed-source netapp wafl. So for now, those are the only two enterprise supportable options I'm willing to stake my career on, and I'll buy Sun hardware with Solaris. But I really wish I could feel confident buying a cheaper Dell server and running ZFS on it. Nexenta, if you make yourself look like a serious competitor against solaris, and really truly form an awesome stable partnership with Dell, I will happily buy your stuff instead of Oracle. Even if you are a little behind in feature offering. But I will not buy your stuff if I can't feel perfectly confident in its stability. I can assure you that we take stability very seriously. And since you seem to think the big box vendors are infallible, a sampling of those things we (Nexenta) have to live with:
Re: [zfs-discuss] A few questions
From: Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com To: 'Khushil Dep' khushil@gmail.com Cc: Richard Elling richard.ell...@nexenta.com, zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions From: Khushil Dep [mailto:khushil@gmail.com] I've deployed large SAN's on both SuperMicro 825/826/846 and Dell R610/R710's and I've not found any issues so far. I always make a point of installing Intel chipset NIC's on the DELL's and disabling the Broadcom ones but other than that it's always been plain sailing - hardware-wise anyway. not found any issues, except the broadcom one which causes the system to crash regularly in the default factory configuration. How did you learn about the broadcom issue for the first time? I had to learn the hard way, and with all the involvement of both Dell and Oracle support teams, nobody could tell me what I needed to change. We literally replaced every component of the server twice over a period of 1 year, and I spent man-days upgrading and downgrading firmwares randomly trying to find a stable configuration. I scoured the internet to find this little tidbit about replacing the broadcom NIC, and randomly guessed, and replaced my nic with an intel card to make the problem go away. 20 years of doing this c*(# has taught me that most things only get learned the hard way. I certainly won't bet my career solely on the ability of the vendor to support the product, because they're hardly omniscient. Testing, testing, and generous return policies (and/or R&D budget). The same system doesn't have a problem running RHEL/centos. Then you're not pushing it hard enough, or your stars are just aligned nicely. We have massive piles of Dell hardware, all types. Running CentOS since at least 4.5. Every single one of those Dells has an Intel NIC in it, and the Broadcoms disabled in the BIOS. 
Because every time we do something stupid like let ourselves think "oh, we could maybe use those extra Broadcom ports for X", we get burned. High-volume financial trading system. Blew up on the bcoms. Didn't matter what driver or tweak or fix. Plenty of man-days wasted debugging. Went with the net advice, put in an Intel NIC. No more problems. That was 3 years ago. Thought we could use the bcoms for our fileservers. Nope. Thought we could use the bcoms for the dedicated drbd links for our xen cluster. Nope. And we know we're not alone in this evaluation. We could have spent forever chasing support to get someone to fix it I suppose... but we have better things to do. See my point? Next time I buy a server, I do not have confidence to simply expect solaris on dell to work reliably. The same goes for solaris derivatives, and all non-sun hardware. There simply is not an adequate qualification and/or support process. I'm not convinced ANYONE really has such a thing. Or that it's even necessarily possible. In fact, I'm sure they don't. Cuz that's what it says in the fine print on the support contracts and the purchase agreements - we do not guarantee... I just prefer not to have any confidence for the most part. It's easier and safer. -bacon
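For anyone hitting the same wall, a quick sanity check before blaming (or exonerating) the onboard ports is to confirm which driver each interface is actually bound to. These are standard commands (`dladm` on Solaris/OpenSolaris, `ethtool` on Linux); the interface name `eth0` is just an example and will differ per machine:

```
# Solaris / OpenSolaris: list physical links and the driver behind each.
# Onboard Broadcom NetXtreme II ports typically show the bnx driver;
# an add-in Intel card shows up as a separate link (e.g. e1000g0).
dladm show-phys

# Linux (RHEL/CentOS): report driver and firmware for one interface.
# bnx2/bnx2x means onboard Broadcom; e1000e/igb means Intel.
ethtool -i eth0
```

Once the Intel card is confirmed as the active link, the onboard Broadcom ports can be disabled in the BIOS as several posters above describe.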
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tim Cook The claim was that there are more people contributing code from outside of Oracle than inside to zfs. Your contributions to Illumos do absolutely nothing Guys, please let's just say this much: To all those who are contributing to the open-source ZFS code, freebsd, illumos project, and others, thank you very much. :-) We know certain things are stable and production ready, but there has not yet been much forward development after zpool 28, but the effort is well appreciated, and for whatever comes next, yes we can all be patient. Right now, Oracle is not contributing at all to the open source branches of any of these projects. So right now it's fair to say the non-oracle contributions to the OPEN SOURCE ZFS outweighs the nonexistent oracle contributions. However, Oracle is continuing to develop the closed-source ZFS. I don't know if anyone has real numbers, dollars contributed or number of developer hours etc, but I think it's fair to say that oracle is probably contributing more to the closed source ZFS right now, than the rest of the world is contributing to the open source ZFS right now. Also, we know that the closed source ZFS right now is more advanced than the open source ZFS (zpool 31 vs 28). Oracle closed source ZFS is ahead, and probably developing faster too, than the open source ZFS right now. If anyone has any good way to draw more contributors into the open source tree, that would also be useful and appreciated. Gosh, it would be nice to get major players like Dell, HP, IBM, Apple contributing into that project.
Re: [zfs-discuss] A few questions
Edward Ned Harvey wrote I don't know if anyone has real numbers, dollars contributed or number of developer hours etc, but I think it's fair to say that oracle is probably contributing more to the closed source ZFS right now, than the rest of the world is contributing to the open source ZFS right now. Also, we know that the closed source ZFS right now is more advanced than the open source ZFS (zpool 31 vs 28). Oracle closed source ZFS is ahead, and probably developing faster too, than the open source ZFS right now. If anyone has any good way to draw more contributors into the open source tree, that would also be useful and appreciated. Gosh, it would be nice to get major players like Dell, HP, IBM, Apple contributing into that project. This is something that Illumos/open source ZFS needs to decide: what does it want? Effectively we can't innovate ZFS without breaking compatibility... because our Illumos zpool version 29 (if we innovate) will not be Oracle zpool version 29. If we want open-source ZFS to innovate, we need to make that choice and let everyone know; apart from submitting bug fixes to zpool v28, I'm not sure if other changes would be welcome. So honestly, do we want to innovate ZFS (I do) or do we just want to follow Oracle? Bye, Deano de...@cloudpixies.com
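The compatibility problem Deano describes comes from zpool versions being a single linear number: software can import a pool only if the pool's on-disk version is at or below the highest version that software implements, so two forks that each mint their own "version 29" with different meanings can no longer trust the check. A toy sketch of that rule (illustrative only, not ZFS code):

```python
# Toy model of the linear zpool version gate: software imports a pool
# only when the pool's on-disk version number does not exceed the
# highest version that software implements.

def can_import(pool_version: int, software_max_version: int) -> bool:
    """Return True if software supporting up to software_max_version
    can safely import a pool formatted at pool_version."""
    return pool_version <= software_max_version

# An open-source stack at zpool 28 can import any pool up to v28...
assert can_import(28, 28)
assert can_import(10, 28)
# ...but must refuse a v31 pool written by newer closed-source code,
# because it cannot know what the extra on-disk features mean.
assert not can_import(31, 28)
```

The fork problem follows directly: if Illumos and Oracle each define a different "version 29", the number alone no longer identifies the on-disk format, so a passing version check says nothing about actual feature compatibility.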
Re: [zfs-discuss] A few questions
From: Deano [mailto:de...@rattie.demon.co.uk] Sent: Wednesday, January 05, 2011 9:16 AM So honestly do we want to innovate ZFS (I do) or do we just want to follow Oracle? Well, you can't follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. I am quite sure you'll be sued if you do that. If you want forward development in the open source tree, you basically have only one option: Some major contributor must have a financial interest, and commit to a real concerted development effort, with their own roadmap, which is intentionally designed NOT to overlap with the Oracle roadmap. Otherwise, the code will stagnate. I am rooting for the open source projects, but I'm not optimistic personally. I think all major contributors (IBM, Apple, etc) will not participate for various reasons, and as a result, we'll experience bit rot... As presently evident by lack of zpool advancement beyond 28. So in my mind, Oracle and ZFS are now just like netapp and wafl. Well... I prefer Solaris and ZFS over netapp and wafl... So whenever I would have otherwise bought a netapp, I'll still buy the solaris server instead... But it's no longer a competitor against ubuntu or centos. Just the way Larry wants it.
Re: [zfs-discuss] A few questions
We do have a major commercial interest - Nexenta. It's been quiet but I do look forward to seeing something come out of that stable this year? :-) --- W. A. Khushil Dep - khushil@gmail.com - 07905374843 Visit my blog at http://www.khushil.com/ On 5 January 2011 14:34, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: Deano [mailto:de...@rattie.demon.co.uk] Sent: Wednesday, January 05, 2011 9:16 AM So honestly do we want to innovate ZFS (I do) or do we just want to follow Oracle? Well, you can't follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. I am quite sure you'll be sued if you do that. If you want forward development in the open source tree, you basically have only one option: Some major contributor must have a financial interest, and commit to a real concerted development effort, with their own roadmap, which is intentionally designed NOT to overlap with the Oracle roadmap. Otherwise, the code will stagnate. I am rooting for the open source projects, but I'm not optimistic personally. I think all major contributors (IBM, Apple, etc) will not participate for various reasons, and as a result, we'll experience bit rot... As presently evident by lack of zpool advancement beyond 28. So in my mind, Oracle and ZFS are now just like netapp and wafl. Well... I prefer Solaris and ZFS over netapp and wafl... So whenever I would have otherwise bought a netapp, I'll still buy the solaris server instead... But it's no longer a competitor against ubuntu or centos. Just the way Larry wants it.
Re: [zfs-discuss] A few questions
On Wed, Jan 5, 2011 at 15:34, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: Deano [mailto:de...@rattie.demon.co.uk] Sent: Wednesday, January 05, 2011 9:16 AM So honestly do we want to innovate ZFS (I do) or do we just want to follow Oracle? Well, you can't follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. that's not my understanding - while we will have to wait, oracle is supposed to release *some* source code afterwards to satisfy some claim or other. I agree, some would argue that that should have already happened with S11 express... I don't know it has, but that's not *the* release of S11, is it? And once the code is released, even if after the fact, it's not reverse-engineering anymore, is it? Michael PS: just in case: even while at Oracle, I had no insight into any of these plans, much less do I have now. -- regards/mit freundlichen Grüssen Michael Schuster
Re: [zfs-discuss] A few questions
-Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Michael Schuster Sent: Wednesday, January 05, 2011 9:42 AM To: Edward Ned Harvey Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions On Wed, Jan 5, 2011 at 15:34, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: Deano [mailto:de...@rattie.demon.co.uk] Sent: Wednesday, January 05, 2011 9:16 AM So honestly do we want to innovate ZFS (I do) or do we just want to follow Oracle? Well, you can't follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. that's not my understanding - while we will have to wait, oracle is supposed to release *some* source code afterwards to satisfy some claim or other. I agree, some would argue that that should have already happened with S11 express... I don't know it has, but that's not *the* release of S11, is it? And once the code is released, even if after the fact, it's not reverse-engineering anymore, is it? Not exactly. Oracle hasn't publicly committed to anything like that. There were several news articles last year referencing a leaked internal memo that I believe was more of a proposal than a plan. Even if Oracle did 'commit' to releasing code, they could easily just decide not to. -Will
Re: [zfs-discuss] A few questions
From: Michael Schuster [mailto:michaelspriv...@gmail.com] Well, you can't follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. that's not my understanding - while we will have to wait, oracle is supposed to release *some* source code afterwards to satisfy some Where do you get that from? AFAIK, there is no official word about oracle opening anything moving forward, but there are plenty of unofficial reports that it will not be opened. Nobody in the field is holding any hope for that to change anymore, most importantly illumos and nexenta. (At least with regards to ZFS and all the other projects relevant to solaris.) I know in the case of SGE/OGE, it's officially closed source now. As of Dec 31st, sunsource is being decommissioned, and the announcement of officially closing the SGE source and decommissioning the open source community went out on Dec 24th. So all of this leads me to believe, with very little reservation, that the new developments beyond zpool 28 are closed source moving forward. There's very little breathing room remaining for hope of that being open sourced again.
Re: [zfs-discuss] A few questions
From: Khushil Dep [mailto:khushil@gmail.com] We do have a major commercial interest - Nexenta. It's been quiet but I do look forward to seeing something come out of that stable this year? :-) I'll agree to call Nexenta a major commercial interest, in regards to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware. Otherwise, they're just simply too small to keep up with the rate of development of the closed source ZFS tree, and destined to be left in the dust. And if Nexenta does become a seriously viable competitor against netapp and oracle... Watch out for lawsuits...
Re: [zfs-discuss] A few questions
On 01/ 4/11 11:48 PM, Tim Cook wrote: On Tue, Jan 4, 2011 at 8:21 PM, Garrett D'Amore garr...@nexenta.com mailto:garr...@nexenta.com wrote: On 01/ 4/11 09:15 PM, Tim Cook wrote: On Mon, Jan 3, 2011 at 5:56 AM, Garrett D'Amore garr...@nexenta.com mailto:garr...@nexenta.com wrote: On 01/ 3/11 05:08 AM, Robert Milkowski wrote: On 12/26/10 05:40 AM, Tim Cook wrote: On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com mailto:richard.ell...@gmail.com wrote: There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now. Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS. Exactly my observation as well. I haven't seen any ZFS related development happening at Illumos or Nexenta, at least not yet. Just because you've not seen it yet doesn't imply it isn't happening. Please be patient. - Garrett Or, conversely, don't make claims of all this code contribution prior to having anything to show for your claimed efforts. Duke Nukem Forever was going to be the greatest video game ever created... we were told to be patient... we're still waiting for that too. Um, have you not been paying attention? I've delivered quite a lot of contribution to illumos already, just not in ZFS. Take a close look -- there almost certainly wouldn't *be* an open source version of OS/Net had I not done the work to enable this in libc, kernel crypto, and other bits. This work is still higher priority than ZFS innovation for a variety of reasons -- mostly because we need a viable and supportable illumos upon which to build those ZFS innovations. 
That said, much of the ZFS work I hope to contribute to illumos needs more baking, but some of it is already open source in NexentaStor. (You can for a start look at zfs-monitor, the WORM support, and support for hardware GZIP acceleration all as things that Nexenta has innovated in ZFS, and which are open source today if not part of illumos. Check out http://www.nexenta.org for source code access.) So there, money placed where mouth is. You? - Garrett The claim was that there are more people contributing code from outside of Oracle than inside to zfs. Your contributions to Illumos do absolutely nothing to backup that claim. ZFS-monitor is not ZFS code (it's an FMA module), WORM also isn't ZFS code, it's an OS level operation, and GZIP hardware acceleration is produced by Indra networks, and has absolutely nothing to do with ZFS. Does it help ZFS? Sure, but that's hardly a code contribution to ZFS when it's simply a hardware acceleration card that accelerates ALL gzip code. Um... you have obviously not looked at the code. Our WORM code is not some basic OS guarantees on top of ZFS, but modifications to the ZFS code itself so that ZFS *itself* honors the WORM property, which is implemented as a property on the ZFS filesystem. Likewise, the GZIP hardware acceleration support includes specific modifications to the ZFS kernel filesystem code. Of course, we've not done anything major to change the fundamental way that ZFS stores data... is that what you're talking about? I think you must have a very narrow idea of what constitutes an innovation in ZFS. So, great job picking three projects that are not proof of developers working on ZFS. And great job not providing any proof to the claim there are more developers working on ZFS outside of Oracle than within. Nexenta don't represent that majority actually. A large number of ZFS folks -- people with names like Leventhal, Ahrens, Wilson, and Gregg, are working on ZFS related work at Delphix and Joyent, or so I've been told. 
I don't have first hand knowledge of *what* the details are, but I'm looking forward to seeing the results. This ignores the contributions from people working on ZFS on other platforms as well. Of course, since I no longer work there, I don't really know how many people Oracle still has working on ZFS. They could have tasked 1,000 people with it. Or they could have shut the project down entirely. But of the people who had, up until Oracle shut down the open code, made non-trivial contributions to ZFS, I think the majority of *those* people can be found working outside of Oracle now, and I think most of them are still working on ZFS.
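Garrett's WORM point above is worth making concrete: the claim is that WORM is a property of the ZFS filesystem itself, so the filesystem layer refuses modification once the property is set, rather than relying on OS-level permissions layered on top. A toy model of that behavior (purely illustrative; the class and property names here are invented, not Nexenta's actual code):

```python
# Toy model: a filesystem-level WORM (write once, read many) property.
# Once "worm" is on, existing files become immutable at the filesystem
# layer itself -- writes are rejected regardless of OS permissions.

class ToyFilesystem:
    def __init__(self):
        self.properties = {"worm": "off"}
        self.files = {}

    def set_property(self, name, value):
        self.properties[name] = value

    def write(self, path, data):
        if self.properties["worm"] == "on" and path in self.files:
            # The filesystem, not the OS, enforces immutability.
            raise PermissionError(f"{path} is WORM-protected")
        self.files[path] = data

    def read(self, path):
        return self.files[path]

fs = ToyFilesystem()
fs.write("audit.log", "record 1")
fs.set_property("worm", "on")
assert fs.read("audit.log") == "record 1"   # reads always allowed
fs.write("new.log", "ok")                   # new files may still be created
try:
    fs.write("audit.log", "tampered")
    raise AssertionError("WORM write should have failed")
except PermissionError:
    pass
```

The point of pushing the check into the filesystem is exactly the one argued above: an OS-level guard can be bypassed by anyone with sufficient privilege, while a property honored by the filesystem code applies to every write path.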
Re: [zfs-discuss] A few questions
Edward Ned Harvey wrote From: Deano [mailto:de...@rattie.demon.co.uk] Sent: Wednesday, January 05, 2011 9:16 AM So honestly do we want to innovate ZFS (I do) or do we just want to follow Oracle? Well, you can't follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. I am quite sure you'll be sued if you do that. If you want forward development in the open source tree, you basically have only one option: Some major contributor must have a financial interest, and commit to a real concerted development effort, with their own roadmap, which is intentionally designed NOT to overlap with the Oracle roadmap. Otherwise, the code will stagnate. Why does it need a big backer? Erm ZFS isn't that large or amazingly complex code. It is *good* code but does it take 100s of developers and a fortune to develop? Erm nope (which I'd bet it never had at Sun either). Why not overlap Oracle? What has it got to do with Oracle if we have split into ZFS (Oracle) and OpenZFS in future? OpenZFS will get whatever features developers feel they want or need to develop for it. This is the fundamental choice that open source ZFS, illumos and OpenIndiana (and other distributions) have to decide: what is their purpose? Is it a free compatible (though trailing) version of Oracle Solaris OR a platform that shared an ancestor with Oracle Solaris via Sun OpenSolaris but now is its own evolutionary species, with no more connection than I have with a 15th cousin removed on my great, great, great grandfather's side? This isn't even a theoretical what-if situation for me, I have a major modification to ZFS (still being developed), it has no basis in Oracle's or anybody else's needs, just mine. It is what I felt I needed and ZFS was the right base for it. Now will that go into OpenZFS? 
Honestly I don't know yet, because I'm not sure it would be wanted (it will be incompatible with Oracle ZFS) and personally, commercially I'm not sure if it's the right move to open source the feature. I bet I'm not the only small developer out there in a similar situation, the landscape is very unclear about what the community actually wants to do going forward, and whether we will have or even want OpenZFS and Oracle ZFS or Oracle ZFS and 90% compatibles (always trailing) or Oracle ZFS + DevA ZFS + DevB ZFS + DevC ZFS. Bye, Deano de...@cloudpixies.com
Re: [zfs-discuss] A few questions
From: Richard Elling [mailto:richard.ell...@nexenta.com]
I'll agree to call Nexenta a major commercial interest, with regard to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware.
NexentaStor is officially supported on Dell, HP, and IBM hardware. The only question is, what is your definition of 'support'? Many NexentaStor
I don't want to argue about this, but I'll just try to clarify what I meant: Presently, I have a Dell server with officially supported Solaris, and it's as unreliable as pure junk. It's just the backup server, so I'm free to frequently create and destroy it... And as such, I frequently do recreate and destroy it. It is entirely stable running RHEL (CentOS), because Dell and Red Hat have a partnership with a serious number of human beings and machines looking for and fixing any compatibility issues.

For my Solaris instability, I blame the fact that Solaris developers don't do significant quality assurance on non-Sun hardware. To become officially compatible, the whole qualification process is like this: somebody installs it, doesn't see any problems, and then calls it certified. They reformat with something else, and move on. They don't build their business on that platform, so they don't detect stability issues like the ones reported... system crashes once per week and so forth. Solaris therefore passes the test, and becomes one of the options available on the drop-down menu of OSes for a new server. (Of course that's been discontinued by Oracle, but that's how it was in the past.)

Developers need to eat their own food. Smoke your own crack. Hardware engineers at Dell need to actually use your OS on their hardware, for their development efforts. I would be willing to bet Sun hardware engineers use a significant percentage of Solaris servers for their work... And guess what Solaris engineers don't use? Non-Sun hardware.
Pretty safe bet you won't find any Dell servers in the server room where Solaris developers do their thing. If you want to be taken seriously as an alternative storage option, you've got to at LEAST be listed as a factory-distributed OS that can ship with the new server, and THEN, when people such as myself buy those things, we've got to have a good enough experience that we don't all bitch and flame about it afterward.

Nexenta, you need a real and serious partnership with Dell, HP, IBM. Get their developers to run YOUR OS on the servers which they use for development. Get them to sell your product bundled with their product. And dedicate real and serious engineering to bugfixes, working with customers to truly identify root causes of instability, with a real OS development and engineering and support group. It's got to be STABLE; that's the #1 requirement.

I previously made the comparison... Even closed-source Solaris ZFS is a better alternative to closed-source NetApp WAFL. So for now, those are the only two enterprise-supportable options I'm willing to stake my career on, and I'll buy Sun hardware with Solaris. But I really wish I could feel confident buying a cheaper Dell server and running ZFS on it. Nexenta, if you make yourself look like a serious competitor against Solaris, and really truly form an awesome, stable partnership with Dell, I will happily buy your stuff instead of Oracle's. Even if you are a little behind in feature offering. But I will not buy your stuff if I can't feel perfectly confident in its stability. Ever heard the phrase "Nobody ever got fired for buying IBM"? You're the little guys. If you want to compete against the big guys, you've got to kick ass. And don't get sued into oblivion. Even today's feature set is perfectly adequate for at least a couple of years to come.
If you put all your effort into stability and bugfixes, serious partnerships with Dell, HP, IBM, and become extremely professional-looking and stable, with fanatical support... you don't have to worry about new feature development for a while. Stability is #1, and the risk of you disappearing is a pretty huge threat right now.

Based on my experience, I would not recommend buying Dell with Solaris, even if that were still an option. If you want Solaris, buy Sun/Oracle hardware, because then you can actually expect it to work reliably. And if Solaris isn't stable on Dell... then all the Solaris derivatives, including Nexenta, can't be trusted either, no matter how much you claim it's supported. Show me the HCL, and show me the partnership between your software engineers and Dell's hardware engineers. Make me believe there is a serious and thorough qualification process. Do a huge volume. Your volume must be large enough to justify dedicating some engineers to serious bugfix efforts in the field. Otherwise... when I need to buy something stable... I'm going to buy Solaris on Sun hardware, because I know that's thoroughly tried, tested, and stable.
Re: [zfs-discuss] A few questions
On Wed, 5 Jan 2011, Edward Ned Harvey wrote:
with regards to ZFS and all the other projects relevant to solaris.) I know in the case of SGE/OGE, it's officially closed source now. As of Dec 31st, sunsource is being decommissioned, and the announcement of officially closing the SGE source and decommissioning the open source community went out on Dec 24th. So all of this leads me to believe, with very little reservation, that the new developments beyond zpool 28 are closed source moving forward. There's very little breathing room remaining for hope of that being open sourced again.
I have no idea what you are talking about. Best I can tell, SGE/OGE is a reference to Sun Grid Engine, which has nothing to do with zfs. The only announcement and discussion I can find via Google is written by you. It was pretty clear even a year ago that Sun Grid Engine was going away. Bob
-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] a few questions - Oracle
It is sad that such a lovely file system is now in Oracle's unresponsive hands. I hope someone builds another open file system just like it. I could never find anything like it to protect my data like it does.
Re: [zfs-discuss] a few questions - Oracle
On 01/ 4/11 01:19 PM, webd...@gmail.com wrote:
It is sad that such a lovely file system is now in Oracle's unresponsive hands. I hope someone builds another open file system just like it. I could never find anything like it to protect my data like it does.
I have to reply to this. While Oracle may not seem responsive, they are still innovating on ZFS. I haven't seen it stand still since Oracle took over Sun. Also, if you do your homework, there is a BSD version floating around, and a Linux version also. To boot, Illumos has the last open source release, which brings it to OpenIndiana. So what are you talking about? Paul
Re: [zfs-discuss] A few questions
On 01/ 3/11 04:28 PM, Richard Elling wrote:
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
I am quite sure you understand how pipelines work :-)
Are you suggesting that Nexenta is developing new ZFS features behind closed doors (like Oracle...) and then will share code later on? Somehow I don't think so... but I would love to be proved wrong :)
-- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] A few questions
On 01/ 4/11 11:35 PM, Robert Milkowski wrote:
On 01/ 3/11 04:28 PM, Richard Elling wrote:
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
I am quite sure you understand how pipelines work :-)
Are you suggesting that Nexenta is developing new ZFS features behind closed doors (like Oracle...) and then will share code later on? Somehow I don't think so... but I would love to be proved wrong :)
I mean I would love to see Nexenta start delivering real innovation in the Solaris/Illumos kernel (zfs, networking, ...), not that I would love to see it happening behind closed doors :)
-- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] A few questions
On 01/ 3/11 05:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
Just because you've not seen it yet doesn't imply it isn't happening. Please be patient. - Garrett
-- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] a few questions - Oracle
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Paul Gress
On 01/ 4/11 01:19 PM, webd...@gmail.com wrote:
It is sad that such a lovely file system is now in Oracle's unresponsive hands. I hope someone builds another open file system just like it. I could never find anything like it to protect my data like it does.
I have to reply to this. While Oracle may not seem responsive, they are still innovating on ZFS. I haven't seen it stand still since Oracle took over Sun. Also, if you do your homework, there is a BSD version floating around, and a Linux version also. To boot, Illumos has the last open source release, which brings it to OpenIndiana. So what are you talking about?
Also, "another open file system like it"... "anything like it to protect my data"... Go use Linux, and BTRFS. It is GPL, and guess what: also developed by Oracle. But it's GPL, and it's included by default in many of the latest Linuxes.
Re: [zfs-discuss] A few questions
On Mon, Jan 3, 2011 at 5:56 AM, Garrett D'Amore garr...@nexenta.com wrote:
On 01/ 3/11 05:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
Just because you've not seen it yet doesn't imply it isn't happening. Please be patient. - Garrett
Or, conversely, don't make claims of all this code contribution prior to having anything to show for your claimed efforts. Duke Nukem Forever was going to be the greatest video game ever created... we were told to be patient... we're still waiting for that too. --Tim
Re: [zfs-discuss] A few questions
On Tue, Jan 4, 2011 at 8:21 PM, Garrett D'Amore garr...@nexenta.com wrote:
On 01/ 4/11 09:15 PM, Tim Cook wrote:
On Mon, Jan 3, 2011 at 5:56 AM, Garrett D'Amore garr...@nexenta.com wrote:
On 01/ 3/11 05:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
Just because you've not seen it yet doesn't imply it isn't happening. Please be patient. - Garrett
Or, conversely, don't make claims of all this code contribution prior to having anything to show for your claimed efforts. Duke Nukem Forever was going to be the greatest video game ever created... we were told to be patient... we're still waiting for that too.
Um, have you not been paying attention? I've delivered quite a lot of contribution to illumos already, just not in ZFS. Take a close look -- there almost certainly wouldn't *be* an open source version of OS/Net had I not done the work to enable this in libc, kernel crypto, and other bits. This work is still higher priority than ZFS innovation for a variety of reasons -- mostly because we need a viable and supportable illumos upon which to build those ZFS innovations. That said, much of the ZFS work I hope to contribute to illumos needs more baking, but some of it is already open source in NexentaStor.
(You can for a start look at zfs-monitor, the WORM support, and support for hardware GZIP acceleration, all as things that Nexenta has innovated in ZFS, and which are open source today if not part of illumos. Check out http://www.nexenta.org for source code access.) So there, money placed where mouth is. You? - Garrett
The claim was that there are more people contributing code to zfs from outside of Oracle than inside. Your contributions to Illumos do absolutely nothing to back up that claim. ZFS-monitor is not ZFS code (it's an FMA module), WORM also isn't ZFS code (it's an OS-level operation), and GZIP hardware acceleration is produced by Indra Networks and has absolutely nothing to do with ZFS. Does it help ZFS? Sure, but that's hardly a code contribution to ZFS when it's simply a hardware acceleration card that accelerates ALL gzip code. So, great job picking three projects that are not proof of developers working on ZFS. And great job not providing any proof for the claim that there are more developers working on ZFS outside of Oracle than within. You're going to need a hell of a lot bigger bank account to cash the check than what you've got.
As for me, I don't recall making any claims on this list that I can't back up, so I'm not really sure what you're getting at. I can only assume the defensive tone of your email is because you've been called out and can't back up the claims either. So again: if you've got code in the works, great. Talk about it when it's ready. Stop throwing out baseless claims that you have no proof of and then falling back on "just be patient, it's coming." We've heard that enough from Oracle and Sun already. --Tim
Re: [zfs-discuss] A few questions
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
-- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] A few questions
On Mon, 3 Jan 2011, Robert Milkowski wrote:
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
There seems to be plenty of zfs work on the FreeBSD project, but primarily with porting the latest available sources to FreeBSD (going very well!) rather than with developing zfs itself. Bob
-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] A few questions
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
I am quite sure you understand how pipelines work :-)
-- richard
Re: [zfs-discuss] A few questions
On 1/3/2011 8:28 AM, Richard Elling wrote:
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
I am quite sure you understand how pipelines work :-)
-- richard
I'm getting pretty close to my pain threshold on the BP_rewrite stuff, since not having that feature is holding up a big chunk of work I'd like to push. If anyone outside of Oracle is working on some sort of change to ZFS that will allow arbitrary movement/placement of pre-written slabs, can they please contact me? I'm pretty much at the point where I'm going to start diving into that chunk of the source to see if there's something little old me can do, and I'd far rather help on someone else's implementation than have to do it myself from scratch. I'd prefer a private contact, as I realize that such work may not be ready for public discussion yet. Thanks, folks! Oh, and this is completely just me, not Oracle talking in any way.
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] A few questions
On Jan 3, 2011, at 2:10 PM, Erik Trimble wrote:
On 1/3/2011 8:28 AM, Richard Elling wrote:
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
On 12/26/10 05:40 AM, Tim Cook wrote:
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
Exactly my observation as well. I haven't seen any ZFS-related development happening at Illumos or Nexenta, at least not yet.
I am quite sure you understand how pipelines work :-)
-- richard
I'm getting pretty close to my pain threshold on the BP_rewrite stuff, since not having that feature is holding up a big chunk of work I'd like to push. If anyone outside of Oracle is working on some sort of change to ZFS that will allow arbitrary movement/placement of pre-written slabs, can they please contact me? I'm pretty much at the point where I'm going to start diving into that chunk of the source to see if there's something little old me can do, and I'd far rather help on someone else's implementation than have to do it myself from scratch. I'd prefer a private contact, as I realize that such work may not be ready for public discussion yet. Thanks, folks! Oh, and this is completely just me, not Oracle talking in any way.
Oracle doesn't seem to say much at all :-( But for those interested, Nexenta is actively hiring people to work in this area.
-- richard
Re: [zfs-discuss] A few questions
On Dec 21, 2010, at 5:05 AM, Deano wrote:
The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives?
For some definition of similar, yes. But using relatively cheap drives does not mean the overall system cost will be cheap. For example, $250 will buy 8.6K random IOPS @ 4KB in an SSD[1], but to do that with cheap disks might require eighty 7,200 rpm SATA disks.
ZFS is good, but IMHO it is easy to see how it can be improved to better meet this situation. I can’t currently say when this line of thinking and code will move from research to production-level use (tho I have a pretty good idea ;) ), but I wouldn’t bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it’s catalyzed a number of developers from the view that zfs is Oracle-led, to thinking “what can we do with the zfs code as a base?”
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
For example, how about sticking a cheap 80GiB commodity SSD in the storage case. When a resilver or defrag is required, use it as scratch space to give you a block of fast-IOPS storage space to accelerate the slow parts. When it’s done, secure erase and power it down, ready for the next time a resilver needs to happen. The hardware is available, it just needs someone to write the software…
In general, SSDs will not speed resilver unless the resilvering disk is an SSD.
[1] http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/nand/feature/index.htm
-- richard
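Richard's cost comparison above can be sanity-checked with back-of-the-envelope arithmetic. The per-HDD IOPS and price figures below are rule-of-thumb assumptions, not numbers from the thread:

```python
# Rough check of the "$250 SSD vs. eighty SATA disks" random-IOPS claim.
# Assumed figures (rules of thumb, not from the thread):
#   - one consumer SSD: ~8,600 random 4 KB IOPS for ~$250 (per the post)
#   - one 7,200 rpm SATA disk: ~100 random IOPS (seek + rotational latency)
#   - hypothetical HDD price: $80/disk

ssd_iops, ssd_cost = 8600, 250
hdd_iops, hdd_cost = 100, 80

disks_needed = -(-ssd_iops // hdd_iops)  # ceiling division
print(f"7,200 rpm disks to match one SSD: {disks_needed}")
print(f"HDD cost to match: ${disks_needed * hdd_cost}")
print(f"$/IOPS -- SSD: {ssd_cost / ssd_iops:.3f}, HDD: {hdd_cost / hdd_iops:.2f}")
```

With a ~100 IOPS/disk assumption this lands at 86 disks, consistent with the "eighty 7,200 rpm SATA disks" in the post.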
Re: [zfs-discuss] A few questions
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote:
On Dec 21, 2010, at 5:05 AM, Deano wrote:
The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives?
For some definition of similar, yes. But using relatively cheap drives does not mean the overall system cost will be cheap. For example, $250 will buy 8.6K random IOPS @ 4KB in an SSD[1], but to do that with cheap disks might require eighty 7,200 rpm SATA disks.
ZFS is good, but IMHO it is easy to see how it can be improved to better meet this situation. I can’t currently say when this line of thinking and code will move from research to production-level use (tho I have a pretty good idea ;) ), but I wouldn’t bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it’s catalyzed a number of developers from the view that zfs is Oracle-led, to thinking “what can we do with the zfs code as a base?”
There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS. --Tim
Re: [zfs-discuss] A few questions
It's worse on raidzN than on mirrors, because the number of items which must be read is higher in raidzN, assuming you're using larger vdevs and therefore more items exist scattered about inside that vdev. You therefore have a higher number of things which must be randomly read before you reach completion.
In that case, isn't the answer to have a dedicated parity disk (or 2 or 3 depending on what raidz* is used), à la RAID-DP? Wouldn't this effectively be the 'same' as a mirror when resilvering (the only difference being parity vs actual data), as it's doing so from a single disk? RAID-DP protects against failure of the parity disk itself, so the raidz1 analogue (a single dedicated parity disk) probably wouldn't be sensible, since everything is lost if the parity disk fails.
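The read-amplification point above can be made concrete with a toy count. This is a simplified sketch, not from the thread: replacing one side of a 2-way mirror copies the surviving side, while reconstructing a failed raidz1 column reads the matching sector from every surviving column of each stripe.

```python
# Toy count: bytes read per byte rebuilt when resilvering one failed disk.
# Simplified assumptions (not from the thread): full-stripe writes, no
# metadata traversal cost, raidz1 reconstructs each sector by combining
# the n-1 surviving columns of its stripe.

def mirror_reads_per_byte():
    # 2-way mirror: copy the one surviving replica.
    return 1

def raidz1_reads_per_byte(n_disks):
    # raidz1: XOR across all surviving columns of the stripe.
    return n_disks - 1

for n in (5, 9):
    print(f"raidz1 over {n} disks reads "
          f"{raidz1_reads_per_byte(n)}x what a mirror reads per rebuilt byte")
```

The wider the raidz vdev, the more surviving disks must be read (and seeked) per reconstructed block, which is one reason wide raidz resilvers feel so much slower than mirror resilvers.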
Re: [zfs-discuss] A few questions
On 21/12/2010 05:44, Richard Elling wrote:
On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com wrote:
On 20/12/2010 13:59, Richard Elling wrote:
On Dec 20, 2010, at 2:42 AM, Phil Harman phil.har...@gmail.com wrote:
Why does resilvering take so long in raidz anyway? Because it's broken. There were some changes a while back that made it more broken.
"Broken" is the wrong term here. It functions as designed and correctly resilvers devices. Disagreeing with the design is quite different than proving a defect.
It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread.
If you only have a few slow drives, you don't have performance. Like trying to win the Indianapolis 500 with a tricycle...
The context of this thread is a budget home media server (certainly not the Indy 500, but perhaps not as humble as tricycle touring either). And whilst it is a habit of the hardware advocate to blame the software ... and vice versa ... it's not much help to those of us trying to build good-enough systems across the performance and availability spectrum. I think we can agree that ZFS currently doesn't play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions.
... and those conditions are also a strength. For example, most file systems are nowhere near full. With ZFS you only resilver data. For those who recall the resilver throttles in SVM or VxVM, you will appreciate not having to resilver non-data.
I'd love to see the data and analysis for the assertion that most file systems are nowhere near full, discounting, of course, any trivial cases.
In my experience, in any cost-conscious scenario, in the home or the enterprise, the expectation is that I'll get to use the majority of the space I've paid for (generally through the nose from the storage silo team in the enterprise scenario). To borrow your illustration, even Indy 500 teams care about fuel consumption. What I don't appreciate is having to resilver significantly more data than the drive can contain. But when it comes to the crunch, what I'd really appreciate is a bounded resilver time measured in hours, not days or weeks.
For a long time at Sun, the rule was "correctness is a constraint, performance is a goal". However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so also a slow but correct answer can be wrong). Then one brave soul at Sun once ventured that "if Linux is faster, it's a Solaris bug!" and to his surprise, the idea caught on. I later went on to tell people that ZFS delivered RAID where I = inexpensive, so I'm just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it's SATA (and I'm not so sure).
slow doesn't begin with an i :-) Both ZFS and RAID promised to play in the inexpensive space. There has been a lot of discussion, anecdotes and some data on this list. "Slow because I use devices with poor random write(!) performance" is very different than "broken."
Again, context is everything. For example, if someone was building a business-critical NAS appliance from consumer-grade parts, I'd be the first to say "are you nuts?!"
Unfortunately, the math does not support your position...
Actually, the math (e.g. raw drive metrics) doesn't lead me to expect such a disparity.
The resilver doesn't do a single pass of the drives, but uses a smarter temporal algorithm based on metadata. A design that only does a single pass does not handle the temporal changes.
Many RAID implementations use a mix of spatial and temporal resilvering and suffer with that design decision. Actually, it's easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. Please define current. There are many releases of ZFS, and many improvements have been made over time. What has not improved is the random write performance of consumer-grade HDDs. I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. I know of no RAID implementation that bounds resilver times for HDDs. I believe it is not possible. OTOH, whether a resilver takes 10 seconds or 10 hours makes little difference in data availability. Indeed, this is why we often throttle resilvering activity. See previous discussions on this forum regarding the dueling RFEs. I don't share your disbelief or little difference analysis. If it is true that no current implementation succeeds, isn't that a great opportunity to change the rules?
Re: [zfs-discuss] A few questions
On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com wrote: If you only have a few slow drives, you don't have performance. Like trying to win the Indianapolis 500 with a tricycle... Well you can put a jet engine on a tricycle and perhaps win it… Or you can change the race course to only allow a tricycle space to move. In the context of storage we have 2 factors, hardware and software; having faster and more reliable spindles is no reason to suggest that better software can’t be used to beat it. The simple example is ZIL SSD, where using some software and even a cheap commodity SSD will outperform any amount of expensive spindle drives on sync writes. Before ZIL software it was easy to argue that the only way of speeding up writes was more, faster spindles. The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives? ZFS is good but IMHO it's easy to see how it can be improved to better meet this situation. I can’t currently say when this line of thinking and code will move from research to production level use (tho I have a pretty good idea ;) ) but I wouldn’t bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it's catalyzed a number of developers from the view that zfs is Oracle led, to thinking “what can we do with zfs code as a base”? For example, how about sticking a cheap 80GiB commodity SSD in the storage case. When a resilver or defrag is required, use it as a scratch space to give you a block of fast IOPs storage space to accelerate the slow parts. When it's done, secure erase it and power it down, ready for the next time a resilver needs to happen. The hardware is available, just needs someone to write the software… Bye, Deano ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On 21/12/2010 13:05, Deano wrote: On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com mailto:phil.har...@gmail.com wrote: If you only have a few slow drives, you don't have performance. Like trying to win the Indianapolis 500 with a tricycle... Actually, I didn't say that, Richard did :) Well you can put a jet engine on a tricycle and perhaps win it… Or you can change the race course to only allow a tricycle space to move. In the context of storage we have 2 factors, hardware and software; having faster and more reliable spindles is no reason to suggest that better software can’t be used to beat it. The simple example is ZIL SSD, where using some software and even a cheap commodity SSD will outperform any amount of expensive spindle drives on sync writes. Before ZIL software it was easy to argue that the only way of speeding up writes was more, faster spindles. The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives? ZFS is good but IMHO it's easy to see how it can be improved to better meet this situation. I can’t currently say when this line of thinking and code will move from research to production level use (tho I have a pretty good idea ;) ) but I wouldn’t bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it's catalyzed a number of developers from the view that zfs is Oracle led, to thinking “what can we do with zfs code as a base”? For example, how about sticking a cheap 80GiB commodity SSD in the storage case. When a resilver or defrag is required, use it as a scratch space to give you a block of fast IOPs storage space to accelerate the slow parts. When it's done, secure erase it and power it down, ready for the next time a resilver needs to happen.
The hardware is available, just needs someone to write the software… Bye, Deano
Re: [zfs-discuss] A few questions
Doh, sorry about that, the threading got very confused on my mail reader! Bye, Deano From: Phil Harman [mailto:phil.har...@gmail.com] Sent: 21 December 2010 13:12 To: Deano Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions On 21/12/2010 13:05, Deano wrote: On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com wrote: If you only have a few slow drives, you don't have performance. Like trying to win the Indianapolis 500 with a tricycle... Actually, I didn't say that, Richard did :) Well you can put a jet engine on a tricycle and perhaps win it… Or you can change the race course to only allow a tricycle space to move. In the context of storage we have 2 factors, hardware and software; having faster and more reliable spindles is no reason to suggest that better software can’t be used to beat it. The simple example is ZIL SSD, where using some software and even a cheap commodity SSD will outperform any amount of expensive spindle drives on sync writes. Before ZIL software it was easy to argue that the only way of speeding up writes was more, faster spindles. The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives? ZFS is good but IMHO it's easy to see how it can be improved to better meet this situation. I can’t currently say when this line of thinking and code will move from research to production level use (tho I have a pretty good idea ;) ) but I wouldn’t bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it's catalyzed a number of developers from the view that zfs is Oracle led, to thinking “what can we do with zfs code as a base”? For example, how about sticking a cheap 80GiB commodity SSD in the storage case. When a resilver or defrag is required, use it as a scratch space to give you a block of fast IOPs storage space to accelerate the slow parts.
When it's done, secure erase it and power it down, ready for the next time a resilver needs to happen. The hardware is available, just needs someone to write the software… Bye, Deano
Re: [zfs-discuss] A few questions
From: edmud...@mail.bounceswoosh.org [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote: If there is no correlation between on-disk order of blocks for different disks within the same vdev, then all hope is lost; it's essentially impossible to optimize the resilver/scrub order unless the on-disk order of multiple disks is highly correlated or equal by definition. Very little is impossible. Drives have been optimally ordering seeks for 35+ years. I'm guessing Unless your drive is able to queue up a request to read every single used part of the drive... Which is larger than the command queue for any reasonable drive in the world... The point is, in order to be optimal you have to eliminate all those seeks, and perform sequential reads only. The only seeks you should do are to skip over unused space. If you're able to sequentially read the whole drive, skipping all the unused space, then you're guaranteed to complete faster (or equal) than either (a) sequentially reading the whole drive, or (b) seeking all over the drive to read the used parts in random order.
Re: [zfs-discuss] A few questions
From: Richard Elling [mailto:richard.ell...@gmail.com] Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3 is resilvering). You find some way of ordering all the used blocks of disk1... Which means disk1 will be able to read in optimal order and speed. Sounds like prefetching :-) Ok. Prefetch every used sector in the pool. Problem solved. Let the disks sort all the requests into on-disk order. Unless perhaps the number of requests would exceed the limits of what the drive is able to sort ... Which seems ... more than likely.
Re: [zfs-discuss] A few questions
On Tue, Dec 21 at 8:24, Edward Ned Harvey wrote: From: edmud...@mail.bounceswoosh.org [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote: If there is no correlation between on-disk order of blocks for different disks within the same vdev, then all hope is lost; it's essentially impossible to optimize the resilver/scrub order unless the on-disk order of multiple disks is highly correlated or equal by definition. Very little is impossible. Drives have been optimally ordering seeks for 35+ years. I'm guessing Unless your drive is able to queue up a request to read every single used part of the drive... Which is larger than the command queue for any reasonable drive in the world... The point is, in order to be optimal you have to eliminate all those seeks, and perform sequential reads only. The only seeks you should do are to skip over unused space. I don't think you read my whole post. I was saying this seek calculation pre-processing would have to be done by the host server, and while not impossible, is not trivial. Present the next 32 seeks to each device while the pre-processor works on the complete list of future seeks, and the drive will do as well as possible. If you're able to sequentially read the whole drive, skipping all the unused space, then you're guaranteed to complete faster (or equal) than either (a) sequentially reading the whole drive, or (b) seeking all over the drive to read the used parts in random order. Yes, I understand how that works. --eric -- Eric D. Mudama edmud...@mail.bounceswoosh.org
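Eric's scheme — a host-side pre-processor holding the full list of used blocks while only a drive-sized window of requests is outstanding — can be sketched as follows. This is a toy model; the function names and the queue depth of 32 are illustrative, not from any actual resilver code:

```python
import random

def windowed_seek_order(block_lbas, queue_depth=32):
    """Host-side pre-processing: sort the full list of used-block
    addresses once, then feed the drive at most queue_depth requests
    at a time.  Each window is already in ascending LBA order, so the
    drive's own scheduler sees a near-monotonic sweep."""
    pending = sorted(block_lbas)
    while pending:
        window, pending = pending[:queue_depth], pending[queue_depth:]
        yield window  # issue this batch to the device

# 100 used blocks scattered over a ~1e9-sector disk
random.seed(1)
lbas = random.sample(range(10**9), 100)
windows = list(windowed_seek_order(lbas))
assert len(windows) == 4                      # ceil(100 / 32) batches
assert all(w == sorted(w) for w in windows)   # every batch is in disk order
```

In a real implementation the sort would overlap with the I/O, which is Eric's point: the drive does as well as it can with 32 sorted requests while the host prepares the rest.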
Re: [zfs-discuss] A few questions
From: edmud...@mail.bounceswoosh.org [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama Unless your drive is able to queue up a request to read every single used part of the drive... Which is larger than the command queue for any reasonable drive in the world... The point is, in order to be optimal you have to eliminate all those seeks, and perform sequential reads only. The only seeks you should do are to skip over unused space. I don't think you read my whole post. I was saying this seek calculation pre-processing would have to be done by the host server, and while not impossible, is not trivial. Present the next 32 seeks to each device while the pre-processor works on the complete list of future seeks, and the drive will do as well as possible. I did read that, but now I think, perhaps I misunderstand it, or you misunderstood me? I am thinking... If you're just queueing up a few reads at a time (less than infinity, or less than 99% of the pool) ... I would not assume that these 32 seeks are even remotely sequential. I mean ... 32 blocks in a pool of presumably millions of blocks... I would assume they are essentially random, are they not? In my mind, which is likely wrong or at least oversimplified, I think if you want to order the list of blocks to read according to disk order (which should at least be theoretically possible on mirrors, but perhaps not even physically possible on raidz)... You would have to first generate a list of all the blocks to be read, and then sort it. Rough estimate, for any pool of a reasonable size, that sounds like some GB of RAM to me. Maybe there's a less-than-perfect sort algorithm which has a much lower memory footprint? Like a simple hashing algorithm that will guarantee the next few thousand seeks are in disk order... Although they will skip or jump over many blocks that will have to be done later ... 
An algorithm which is not a perfect sort, but given some repetition and multiple passes over the disk, might achieve an acceptable level of performance versus memory footprint...
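The "less-than-perfect sort with a lower memory footprint" Edward is reaching for exists: a coarse bucket (radix) pass. The sketch below uses hypothetical names, and a real version would spill buckets to scratch space rather than keep them all in RAM, but it shows the shape — bin addresses into equal-width LBA ranges, then sort only one small bucket at a time:

```python
def bucketed_disk_order(lbas, disk_size, n_buckets=16):
    """Coarse pass: bin each block address into one of n_buckets equal
    LBA ranges.  Only one bucket ever needs to be sorted at a time, and
    visiting buckets in order still yields a monotonic head sweep."""
    width = disk_size // n_buckets + 1
    buckets = [[] for _ in range(n_buckets)]
    for lba in lbas:
        buckets[lba // width].append(lba)
    for b in buckets:          # in a real version each bucket would be
        b.sort()               # loaded from scratch space, sorted, issued
        yield from b

blocks = [917, 5, 10**8 + 3, 42, 9 * 10**8, 10**8]
ordered = list(bucketed_disk_order(blocks, disk_size=10**9))
assert ordered == sorted(blocks)   # full disk order from bounded working sets
```

Because the buckets partition the LBA space in order, concatenating the sorted buckets reproduces a full sort without ever holding more than one bucket's worth of addresses in memory.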
Re: [zfs-discuss] A few questions
Thanks Edward. I do agree about mirrored rpool (equivalent to Windows OS volume); not doing it goes against one of my principles when building enterprise servers. Is there any argument against using the rpool for all data storage as well as being the install volume? Say for example I chucked 15x 1TB disks in there and created a mirrored rpool during installation, using 2 disks. If I added another 6 mirrors (12 disks) to it that would give me an rpool of 7TB. The 15th disk being a spare. Or, say I selected 3 disks during install, does this create a 3 way mirrored rpool or does it give you the option of creating raidz? If so, I could then create a further 4x 3 drive raidz's, giving me a 10TB rpool. Or, I could use 2 smaller disks (say 80GB) for the rpool, then create 4x 3 drive raidz's, giving me an 8TB rpool. Again this gives me a spare disk. Either of these 3 should keep resilvering times to a minimum, against say one big raidz2 of 13 disks. Why does resilvering take so long in raidz anyway? -- This message posted from opensolaris.org
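The capacities quoted above follow from simple vdev arithmetic — a mirror vdev yields one disk's worth of space, a raidz1 vdev yields (n-1). A quick check of the three layouts, assuming 1 TB drives (the helper name is made up for illustration):

```python
def usable_tb(vdevs, disk_tb=1):
    """Sum usable space over vdevs, each given as ("mirror"|"raidz1", n_disks):
    mirrors contribute one disk per vdev, raidz1 contributes n - 1."""
    per_vdev = {"mirror": lambda n: 1, "raidz1": lambda n: n - 1}
    return sum(per_vdev[kind](n) * disk_tb for kind, n in vdevs)

assert usable_tb([("mirror", 2)] * 7) == 7    # 2-disk rpool + 6 more mirrors
assert usable_tb([("raidz1", 3)] * 5) == 10   # 3-disk raidz rpool + 4 more
assert usable_tb([("raidz1", 3)] * 4) == 8    # small rpool kept separate
```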
Re: [zfs-discuss] A few questions
Oh, does anyone know if resilvering efficiency is improved or fixed in Solaris 11 Express, as that is what I'm using.
Re: [zfs-discuss] A few questions
Why does resilvering take so long in raidz anyway? Because it's broken. There were some changes a while back that made it more broken. There has been a lot of discussion, anecdotes and some data on this list. The resilver doesn't do a single pass of the drives, but uses a smarter temporal algorithm based on metadata. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler. I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us.
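A toy model of the behaviour Phil describes: if blocks come back in birth order (the order a metadata walk yields them) rather than disk order, head travel on a pool aged by random writes is essentially random. The names here are illustrative, not ZFS internals:

```python
import random
from collections import namedtuple

Block = namedtuple("Block", "lba birth_txg")

def head_travel(lbas):
    """Total seek distance when visiting addresses in the given order."""
    return sum(abs(b - a) for a, b in zip(lbas, lbas[1:]))

random.seed(0)
# an aged pool: block birth order uncorrelated with on-disk placement
blocks = [Block(lba=random.randrange(10**9), birth_txg=txg) for txg in range(1000)]
temporal = [b.lba for b in sorted(blocks, key=lambda b: b.birth_txg)]
spatial = sorted(temporal)  # the single sequential sweep people keep asking for

assert head_travel(spatial) < head_travel(temporal)  # one clean pass wins
```

The spatial order's travel is just max minus min LBA (one sweep), while the temporal order pays an average of a third of the disk per hop — which is the gap between the sequential 2-hour bound and the multi-hour reality discussed later in this thread.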
Re: [zfs-discuss] A few questions
Hi, Which brings up an interesting question... IF it were fixed in for example illumos or freebsd, is there a plan for how to handle possible incompatible zfs implementations? Currently the basic version numbering only works because it implies only one stream of development; now, with multiple possible streams, does ZFS need to move to a feature bit system, or are we going to have to have forks or multiple incompatible versions? Thanks, Deano -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Phil Harman Sent: 20 December 2010 10:43 To: Lanky Doodle Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions Why does resilvering take so long in raidz anyway? Because it's broken. There were some changes a while back that made it more broken. There has been a lot of discussion, anecdotes and some data on this list. The resilver doesn't do a single pass of the drives, but uses a smarter temporal algorithm based on metadata. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler. I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us.
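Deano's feature-bit idea boils down to a set-compatibility check: a pool records which features are active, and an implementation may import it if and only if it supports them all, with no single linear version number. Everything below (function name, feature strings) is hypothetical, sketching the idea rather than any shipping format:

```python
def can_import(pool_active_features, impl_supported_features):
    """Import is allowed iff no active pool feature is unknown to the
    implementation; otherwise return the blocking set."""
    missing = set(pool_active_features) - set(impl_supported_features)
    return len(missing) == 0, missing

fork_a = {"lzjb", "raidz", "async_destroy"}    # one fork's feature set
pool = {"lzjb", "raidz"}
ok, missing = can_import(pool, fork_a)
assert ok and not missing

other_pool = {"lzjb", "raidz", "crypto_v1"}    # feature fork_a lacks
ok, missing = can_import(other_pool, fork_a)
assert not ok and missing == {"crypto_v1"}
```

The appeal over Jörg's "version 1..21 + 24" scheme is that forks can add features independently without ever colliding on a version number; only the named features matter.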
Re: [zfs-discuss] A few questions
I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want to commit to a file system that is 'broken' and may not be fully fixed, if at all. Hmnnn...
Re: [zfs-discuss] A few questions
On 20/12/2010 11:03, Deano wrote: Hi, Which brings up an interesting question... IF it were fixed in for example illumos or freebsd, is there a plan for how to handle possible incompatible zfs implementations? Currently the basic version numbering only works because it implies only one stream of development; now, with multiple possible streams, does ZFS need to move to a feature bit system, or are we going to have to have forks or multiple incompatible versions? Thanks, Deano Changes to the resilvering implementation don't necessarily require changes to the on disk format (although they could). Of course, there might be an issue moving a pool mid-resilver from one implementation to another. With arguably considerably more ZFS expertise outside Oracle than in, there's a good chance the community will get to a fix first. It would then be interesting to see whether NIH prevails, or perhaps even a new spirit of share and share alike. You may say I'm a dreamer ...
Re: [zfs-discuss] A few questions
On 20/12/2010 11:29, Lanky Doodle wrote: I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want to commit to a file system that is 'broken' and may not be fully fixed, if at all. Hmnnn... My home server is still running snv_82, and my iMac is running Apple's last public beta release for Leopard. The way I see it, the on-disk format is sound, and the basic always consistent on disk promise seems to be worth something. My files are read-mostly, and performance isn't an issue for me. ZFS has protected my data for several years now in the face of various hardware issues. I'll upgrade my NAS appliance to OpenSolaris snv_134b sometime soon, but as far as I can tell, I can't use Oracle Solaris 11 Express for licensing reasons (I have backups of business data). I'll be watching Illumos with interest, but snv_82 has served me well for 3 years, so I figure snv_134b probably has quite a lot of useful life left in it, and maybe then btrfs will be ready for prime time?
Re: [zfs-discuss] A few questions
Phil Harman phil.har...@gmail.com wrote: Changes to the resilvering implementation don't necessarily require changes to the on disk format (although they could). Of course, there might be an issue moving a pool mid-resilver from one implementation to another. We seem to come to a similar problem as with UFS 20 years ago. At that time, Sun did enhance the UFS on-disk format but the *BSDs did not follow this change even though the format change was documented in the related include files. For a future ZFS development, there may be a need to allow an implementation to implement on-disk version 1..21 + 24 and another implementation to support on-disk version 1..23 + 25. These thoughts of course are void in case Oracle continues the OSS decisions for Solaris and other Solaris variants can import the code related to recent enhancements. Jörg -- EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de(uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Re: [zfs-discuss] A few questions
On Dec 20, 2010, at 2:42 AM, Phil Harman phil.har...@gmail.com wrote: Why does resilvering take so long in raidz anyway? Because it's broken. There were some changes a while back that made it more broken. broken is the wrong term here. It functions as designed and correctly resilvers devices. Disagreeing with the design is quite different than proving a defect. There has been a lot of discussion, anecdotes and some data on this list. slow because I use devices with poor random write(!) performance is very different than broken. The resilver doesn't do a single pass of the drives, but uses a smarter temporal algorithm based on metadata. A design that only does a single pass does not handle the temporal changes. Many RAID implementations use a mix of spatial and temporal resilvering and suffer with that design decision. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. Please define current. There are many releases of ZFS, and many improvements have been made over time. What has not improved is the random write performance of consumer-grade HDDs. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. I know of no RAID implementation that bounds resilver times for HDDs. I believe it is not possible. OTOH, whether a resilver takes 10 seconds or 10 hours makes little difference in data availability. Indeed, this is why we often throttle resilvering activity. See previous discussions on this forum regarding the dueling RFEs. The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler. Resilver time is bounded by the random write performance of the resilvering device. Mirroring or raidz make no difference. I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. 
Some improvements were made post-b134 and pre-b148. -- richard
Re: [zfs-discuss] A few questions
Thanks relling. I suppose at the end of the day any file system/volume manager has its flaws so perhaps it's better to look at the positives of each and decide based on them. So, back to my question above, is there a deciding argument [i]against[/i] putting data on the install volume (rpool). Forget about mirroring for a sec; 1) Select 3 disks during install creating raidz1. Create a further 4x 3 drive raidz1's, giving me a 10TB rpool with no spare disks 2) Select 5 disks during install creating raidz1. Create a further 2x 5 drive raidz1's giving me a 12TB rpool with no spare disks 3) Select 7 disks during install creating raidz1. Create a further 7 drive raidz1 giving me 12TB rpool with 1 spare disk As there is no space gain between 2) and 3) there is no point going for 3), other than having a spare disk, but resilver times would be slower. So it becomes between 1) and 2). Neither offer spare disks but 1) would offer faster resilver times with up to 5 simultaneous disk failures and 2) would offer 2TB extra space with up to 3 simultaneous disk failures. FYI, I am using Samsung SpinPoint F2's, which have the variable RPM speeds (http://www.scan.co.uk/products/1tb-samsung-hd103si-ecogreen-f2-sata-3gb-s-32mb-cache-89-ms-ncq) I may wait at least until I get the next 4 drives in (I actually have 6 at the mo, not 5) taking me to 10, before migrating to ZFS so plenty of time to think about it and hopefully time for them to fix resilvering! ;-) Thanks again...
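The trade-off being weighed here — capacity versus failure tolerance versus spares — for 15x 1 TB drives can be tabulated directly. Note that the failure counts quoted are best-case (one loss per vdev); any vdev losing two disks loses the pool. The helper name is illustrative:

```python
def raidz1_layout(disks_per_vdev, n_vdevs, total_disks=15, disk_tb=1):
    """Pool of identical raidz1 vdevs: (n-1) data disks each; survives
    at most one failure per vdev, so n_vdevs losses in the best case."""
    return {"usable_tb": (disks_per_vdev - 1) * n_vdevs * disk_tb,
            "best_case_failures": n_vdevs,
            "spares": total_disks - disks_per_vdev * n_vdevs}

assert raidz1_layout(3, 5) == {"usable_tb": 10, "best_case_failures": 5, "spares": 0}
assert raidz1_layout(5, 3) == {"usable_tb": 12, "best_case_failures": 3, "spares": 0}
assert raidz1_layout(7, 2) == {"usable_tb": 12, "best_case_failures": 2, "spares": 1}
```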
Re: [zfs-discuss] A few questions
On 20/12/2010 13:59, Richard Elling wrote: On Dec 20, 2010, at 2:42 AM, Phil Harman phil.har...@gmail.com mailto:phil.har...@gmail.com wrote: Why does resilvering take so long in raidz anyway? Because it's broken. There were some changes a while back that made it more broken. broken is the wrong term here. It functions as designed and correctly resilvers devices. Disagreeing with the design is quite different than proving a defect. It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread. I think we can agree that ZFS currently doesn't play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions. For a long time at Sun, the rule was correctness is a constraint, performance is a goal. However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so a slow but correct answer can also be wrong). Then one brave soul at Sun once ventured that if Linux is faster, it's a Solaris bug! and to his surprise, the idea caught on. I later went on to tell people that ZFS delivered RAID where I = inexpensive, so I'm just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it's SATA (and I'm not so sure). There has been a lot of discussion, anecdotes and some data on this list. slow because I use devices with poor random write(!) performance is very different than broken. Again, context is everything. For example, if someone was building a business critical NAS appliance from consumer grade parts, I'd be the first to say are you nuts?! The resilver doesn't do a single pass of the drives, but uses a smarter temporal algorithm based on metadata. A design that only does a single pass does not handle the temporal changes. 
Many RAID implementations use a mix of spatial and temporal resilvering and suffer with that design decision. Actually, it's easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. Please define current. There are many releases of ZFS, and many improvements have been made over time. What has not improved is the random write performance of consumer-grade HDDs. I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. I know of no RAID implementation that bounds resilver times for HDDs. I believe it is not possible. OTOH, whether a resilver takes 10 seconds or 10 hours makes little difference in data availability. Indeed, this is why we often throttle resilvering activity. See previous discussions on this forum regarding the dueling RFEs. I don't share your disbelief or little difference analysis. If it is true that no current implementation succeeds, isn't that a great opportunity to change the rules? Wasn't resilver time vs availability a major factor in Adam Leventhal's paper introducing the need for RAIDZ3? The appropriateness or otherwise of resilver throttling depends on the context. If I can tolerate further failures without data loss (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if I can recover business critical data in a timely manner, then great. But there may come a point where I would rather take a short term performance hit to close the window on total data loss. The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler. 
Resilver time is bounded by the random write performance of the resilvering device. Mirroring or raidz make no difference. This only holds in a quiesced system. I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. Some improvements were made post-b134 and pre-b148. That is, indeed, good news. -- richard
Re: [zfs-discuss] A few questions
On Dec 18, 2010, at 12:23 PM, Lanky Doodle wrote: Now this is getting really complex, but can you have server failover in ZFS, much like DFS-R in Windows - you point clients to a clustered ZFS namespace so if a complete server failed nothing is interrupted. This is the purpose of an Amber Road dual-head cluster (7310C, 7410C, etc.) -- not only the storage pool fails over, but also the server IP address fails over, so that NFS, etc. shares remain active, when one storage head goes down. Amber Road uses ZFS, but the clustering and failover are not related to the filesystem type. Mark
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Lanky Doodle I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want to commit to a file system that is 'broken' and may not be fully fixed, if at all. ZFS is not broken. There is, however, a weak spot: resilver is very inefficient. For example: On my server, which is made up of 10krpm SATA drives, 1TB each... My drives can each sustain 1Gbit/sec sequential read/write. This means, if I needed to resilver the entire drive (in a mirror) sequentially, it would take ... 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, and disks are around 70% full, and resilver takes 12-14 hours. So although resilver is broken by some standards, it is bounded, and you can limit it to something which is survivable, by using mirrors instead of raidz. For most people, even using 5-disk, or 7-disk raidzN will still be fine. But you start getting unsustainable if you get up to 21-disk raidz3 for example.
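Edward's back-of-envelope numbers check out — rewriting a whole 1 TB drive at a sustained 1 Gbit/s is the sequential lower bound he describes (function name is illustrative):

```python
def sequential_resilver_seconds(capacity_bytes, bits_per_second):
    """Best case: one full sequential pass over the disk at sustained rate."""
    return capacity_bytes * 8 / bits_per_second

t = sequential_resilver_seconds(10**12, 10**9)  # 1 TB drive, 1 Gbit/s
assert t == 8000                                 # seconds
assert round(t / 60) == 133                      # ~2 hours, as stated
```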
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Lanky Doodle Is there any argument against using the rpool for all data storage as well as being the install volume? Generally speaking, you can't do it. The rpool is only supported on mirrors, not raidz. I believe this is because you need rpool in order to load the kernel, and until the kernel is loaded, there's just no reasonable way to have a fully zfs-aware, supports-every-feature bootloader able to read rpool in order to fetch the kernel. Normally, you'll dedicate 2 disks to the OS, and then you build additional separate data pools. If you absolutely need all the disk space of the OS disks, then you partition the OS into a smaller section of the OS disks and assign the remaining space to some pool. But doing that partitioning scheme can be complex, and if you're not careful, risky. I don't advise it unless you truly have your back against a wall for more disk space. Why does resilvering take so long in raidz anyway? There are some really long and sometimes complex threads in this mailing list discussing that. Fundamentally ... First of all, it's not always true. It depends on your usage behavior and the type of disks you're using. But the typical usage includes reading and writing a lot of files, essentially randomly over time, creating and deleting snapshots, using spindle disks, so the typical usage behavior does have a resilver performance problem. The root cause of the problem is that ZFS does not resilver the whole disk... It only resilvers the used portions of the disk. Sounds like a performance enhancer, right? It would be, if the disks were mostly empty ... or if ZFS were resilvering a partial disk, in order according to disk layout. 
Unfortunately, it's resilvering according to the temporal order blocks were written, and usually a disk is significantly full (say, 50% or more) and as such, the disks have to thrash all around, performing all sorts of random reads, until eventually it can read all the used parts in random order. It's worse on raidzN than on mirrors, because the number of items which must be read is higher in raidzN, assuming you're using larger vdevs and therefore more items exist scattered about inside that vdev. You therefore have a higher number of things which must be randomly read before you reach completion. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
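[Editor's note: a toy model shows why traversing blocks in the order they were written (temporal order) seeks constantly, while traversing them in disk-offset order would stream. The uniform shuffle below is an assumed, crude stand-in for real allocator churn, not a model of the actual ZFS allocator.]

```python
# Reading blocks in birth (temporal) order incurs a seek whenever the
# next-born block is not physically adjacent; reading the same blocks
# in disk-offset order never does.
import random

random.seed(1)
n_blocks = 1000
# physical[i] = disk offset of the i-th block written, scrambled by
# allocate/free cycles (assumption: uniform shuffle).
physical = list(range(n_blocks))
random.shuffle(physical)

def seeks(order):
    """Count non-contiguous transitions in a visit order of offsets."""
    return sum(1 for a, b in zip(order, order[1:]) if b != a + 1)

temporal_seeks = seeks(physical)        # resilver in birth order
disk_seeks = seeks(sorted(physical))    # resilver in offset order
print(temporal_seeks, disk_seeks)       # typically ~999 vs 0
```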
Re: [zfs-discuss] A few questions
-Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey Sent: Monday, December 20, 2010 11:46 AM To: 'Lanky Doodle'; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Lanky Doodle I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I want to committ to a file system that is 'broken' and may not be fully fixed, if at all. ZFS is not broken. It is, however, a weak spot, that resilver is very inefficient. For example: On my server, which is made up of 10krpm SATA drives, 1TB each... My drives can each sustain 1Gbit/sec sequential read/write. This means, if I needed to resilver the entire drive (in a mirror) sequentially, it would take ... 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, and disks are around 70% full, and resilver takes 12-14 hours. So although resilver is broken by some standards, it is bounded, and you can limit it to something which is survivable, by using mirrors instead of raidz. For most people, even using 5-disk, or 7-disk raidzN will still be fine. But you start getting unsustainable if you get up to 21-disk radiz3 for example. This argument keeps coming up on the list, but I don't see where anyone has made a good suggestion about whether this can even be 'fixed' or how it would be done. As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that's easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. 
This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever and thus know a piece of them lives on all drives in the set. I don't think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right? Can someone speculate as to how you could rebuild a variable stripe width array without replaying all the available transactions? I am no filesystem engineer but I can't wrap my head around how this could be handled any better than it already is. I've read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)? Do we know if resilvers on a mirror are actually handled differently from those on a raidz? Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not 'wrong') is clearly NOT the same as with more conventional arrays. -Will ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On 12/20/2010 9:20 AM, Saxon, Will wrote: -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey Sent: Monday, December 20, 2010 11:46 AM To: 'Lanky Doodle'; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Lanky Doodle I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I want to committ to a file system that is 'broken' and may not be fully fixed, if at all. ZFS is not broken. It is, however, a weak spot, that resilver is very inefficient. For example: On my server, which is made up of 10krpm SATA drives, 1TB each... My drives can each sustain 1Gbit/sec sequential read/write. This means, if I needed to resilver the entire drive (in a mirror) sequentially, it would take ... 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, and disks are around 70% full, and resilver takes 12-14 hours. So although resilver is broken by some standards, it is bounded, and you can limit it to something which is survivable, by using mirrors instead of raidz. For most people, even using 5-disk, or 7-disk raidzN will still be fine. But you start getting unsustainable if you get up to 21-disk radiz3 for example. This argument keeps coming up on the list, but I don't see where anyone has made a good suggestion about whether this can even be 'fixed' or how it would be done. As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that's easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. 
This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever and thus know a piece of them lives on all drives in the set. I don't think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right? Can someone speculate as to how you could rebuild a variable stripe width array without replaying all the available transactions? I am no filesystem engineer but I can't wrap my head around how this could be handled any better than it already is. I've read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)? Do we know if resilvers on a mirror are actually handled differently from those on a raidz? Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not 'wrong') is clearly NOT the same as with more conventional arrays. -Will the problem is NOT the checksum/error correction overhead. that's relatively trivial. The problem isn't really even variable width (i.e. variable number of disks one crosses) slabs. The problem boils down to this: When ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things from. That means, it resilvers the very first slab ever written, then the next oldest, etc. The problem here is that slab age has nothing to do with where that data physically resides on the actual disks. If you've used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. 
This means that the probability is high that a slab B, written immediately after slab A, WON'T be physically near slab A. In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver (as you only resilver live data, not every bit on the drive), increases the physical inefficiency for each slab. That is, seek time between cylinders begins to dominate your slab reconstruction time. In RAIDZ, this problem is magnified by both the much larger average vdev size vs mirrors, and the necessity that all drives containing a slab's information return that data before the corrected data can be written to the resilvering drive. Thus, current ZFS resilvering tends to be seek-time limited, NOT throughput limited. This is really the fault of the underlying media, not ZFS. For instance, if you have a raidZ of SSDs (where seek time is negligible, but throughput isn't), they resilver really, really fast. In fact, they resilver at the maximum write throughput rate. However, HDs
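[Editor's note: Erik's seek-limited vs throughput-limited distinction can be made concrete with rough, assumed device numbers (roughly 150 random IOPS for a spindle, tens of thousands for an SSD, 125 MB/s streaming for both). The block size and IOPS figures are illustrative, not measured.]

```python
# A resilver finishes at whichever limit is slower: the random-I/O
# (seek) bound or the sequential-throughput bound.

def resilver_hours(used_bytes, avg_io_bytes, device_iops, seq_mb_per_s):
    ios = used_bytes / avg_io_bytes
    seek_bound = ios / device_iops / 3600              # random-I/O limit
    stream_bound = used_bytes / (seq_mb_per_s * 1e6) / 3600
    return max(seek_bound, stream_bound)               # slower limit wins

used = 700e9  # 70% of a 1 TB disk, as in the earlier example
hd  = resilver_hours(used, avg_io_bytes=64 * 1024, device_iops=150,   seq_mb_per_s=125)
ssd = resilver_hours(used, avg_io_bytes=64 * 1024, device_iops=50000, seq_mb_per_s=125)
print(round(hd, 1), round(ssd, 1))  # 19.8 1.6
```

The spindle lands in the tens of hours (same ballpark as the 12-14 hours reported earlier), while the SSD is pinned at its streaming rate, matching Erik's observation that SSDs resilver at maximum write throughput.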
Re: [zfs-discuss] A few questions
On 12/20/2010 9:20 AM, Saxon, Will wrote: -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey Sent: Monday, December 20, 2010 11:46 AM To: 'Lanky Doodle'; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Lanky Doodle I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I want to committ to a file system that is 'broken' and may not be fully fixed, if at all. ZFS is not broken. It is, however, a weak spot, that resilver is very inefficient. For example: On my server, which is made up of 10krpm SATA drives, 1TB each... My drives can each sustain 1Gbit/sec sequential read/write. This means, if I needed to resilver the entire drive (in a mirror) sequentially, it would take ... 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, and disks are around 70% full, and resilver takes 12-14 hours. So although resilver is broken by some standards, it is bounded, and you can limit it to something which is survivable, by using mirrors instead of raidz. For most people, even using 5-disk, or 7-disk raidzN will still be fine. But you start getting unsustainable if you get up to 21-disk radiz3 for example. This argument keeps coming up on the list, but I don't see where anyone has made a good suggestion about whether this can even be 'fixed' or how it would be done. As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that's easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. 
This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever and thus know a piece of them lives on all drives in the set. I don't think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right? Can someone speculate as to how you could rebuild a variable stripe width array without replaying all the available transactions? I am no filesystem engineer but I can't wrap my head around how this could be handled any better than it already is. I've read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)? Do we know if resilvers on a mirror are actually handled differently from those on a raidz? Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not 'wrong') is clearly NOT the same as with more conventional arrays. -Will As far as a possible fix, here's what I can see: [Note: I'm not a kernel or FS-level developer. I would love to be able to fix this myself, but I have neither the aptitude nor the [extensive] time to learn such skill] We can either (a) change how ZFS does resilvering or (b) repack the zpool layouts to avoid the problem in the first place. In case (a), my vote would be to seriously increase the number of in-flight resilver slabs, AND allow for out-of-time-order slab resilvering. By that, I mean that ZFS would read several disk-sequential slabs, and then mark them as done. This would mean a *lot* of scanning the metadata tree (since leaves all over the place could be done). 
Frankly, I can't say how bad that would be; the problem is that for ANY resilver, ZFS would have to scan the entire metadata tree to see if it had work to do, rather than simply look for the latest completed leaf, then assume everything after that needs to be done. There'd also be the matter of determining *if* one should read a disk sector... In case (b), we need the ability to move slabs around on the physical disk (via the mythical Block Pointer Re-write method). If there is that underlying mechanism, then a defrag utility can be run to repack the zpool to the point where chronological creation time = physical layout, which then substantially mitigates the seek time problem. I can't fix (a) - I don't understand the codebase well enough. Neither can I do the BP-rewrite implementation. However, if I can get BP-rewrite, I've got a prototype defragger that seems to work well (under simulation). I'm sure it could use some performance improvement, but it works reasonably well on a simulated fragmented pool. Please, Santa, can a good little boy get
Re: [zfs-discuss] A few questions
Erik, just a hypothetical what-if ... In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, so long as the original data was unmodified; and random access in dealing with the modified blocks only, right. After the original snapshot had been replicated, a second pass would be done, in order to update the clone to 100% live data. Not knowing enough about the inner workings of ZFS snapshots, I don't know why this would not be doable. (I'm biased towards mirrors for busy filesystems.) I'm supposing that a block-level snapshot is not doable -- or is it? Mark On Dec 20, 2010, at 1:27 PM, Erik Trimble wrote: On 12/20/2010 9:20 AM, Saxon, Will wrote: -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey Sent: Monday, December 20, 2010 11:46 AM To: 'Lanky Doodle'; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] A few questions From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Lanky Doodle I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us. OK, that is a bit concerning then. As good as ZFS may be, i'm not sure I want to committ to a file system that is 'broken' and may not be fully fixed, if at all. ZFS is not broken. It is, however, a weak spot, that resilver is very inefficient. For example: On my server, which is made up of 10krpm SATA drives, 1TB each... My drives can each sustain 1Gbit/sec sequential read/write. This means, if I needed to resilver the entire drive (in a mirror) sequentially, it would take ... 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, and disks are around 70% full, and resilver takes 12-14 hours. 
So although resilver is broken by some standards, it is bounded, and you can limit it to something which is survivable, by using mirrors instead of raidz. For most people, even using 5-disk, or 7-disk raidzN will still be fine. But you start getting unsustainable if you get up to 21-disk radiz3 for example. This argument keeps coming up on the list, but I don't see where anyone has made a good suggestion about whether this can even be 'fixed' or how it would be done. As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that's easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever and thus know a piece of them lives on all drives in the set. I don't think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right? Can someone speculate as to how you could rebuild a variable stripe width array without replaying all the available transactions? I am no filesystem engineer but I can't wrap my head around how this could be handled any better than it already is. I've read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)? Do we know if resilvers on a mirror are actually handled differently from those on a raidz? Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not 'wrong') is clearly NOT the same as with more conventional arrays. 
-Will the problem is NOT the checksum/error correction overhead. that's relatively trivial. The problem isn't really even variable width (i.e. variable number of disks one crosses) slabs. The problem boils down to this: When ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things from. That means, it resilvers the very first slab ever written, then the next oldest, etc. The problem here is that slab age has nothing to do with where that data physically resides on the actual disks. If you've used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. This means that the probability that for a slab B, written immediately after slab A, it WON'T be physically near slab A. In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver
Re: [zfs-discuss] A few questions
On 12/20/2010 11:56 AM, Mark Sandrock wrote: Erik, just a hypothetical what-if ... In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, so long as the original data was unmodified; and random access in dealing with the modified blocks only, right. After the original snapshot had been replicated, a second pass would be done, in order to update the clone to 100% live data. Not knowing enough about the inner workings of ZFS snapshots, I don't know why this would not be doable. (I'm biased towards mirrors for busy filesystems.) I'm supposing that a block-level snapshot is not doable -- or is it? Mark Snapshots on ZFS are true snapshots - they take a picture of the current state of the system. They DON'T copy any data around when created. So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time. The problem is this: Let's say I write block A, B, C, and D on a clean zpool (what kind, it doesn't matter). I now delete block C. Later on, I write block E. There is a probability (increasing dramatically as times goes on), that the on-disk layout will now look like: A, B, E, D rather than A, B, [space], D, E So, in the first case, I can do a sequential read to get A B, but then must do a seek to get D, and a seek to get E. The fragmentation problem is mainly due to file deletion, NOT to file re-writing. (though, in ZFS, being a C-O-W filesystem, re-writing generally looks like a delete-then-write process, rather than a modify process). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
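[Editor's note: Erik's A/B/C/D example can be sketched with a trivial first-fit allocator. This is a deliberate simplification; the real ZFS allocator is far more sophisticated, but the effect of deletion on layout is the same.]

```python
# First-fit allocation illustrating Erik's example: write A, B, C, D;
# delete C; write E. E reuses C's freed slot, so on-disk order no
# longer matches write order.

disk = [None, None, None, None, None]

def write(name):
    slot = disk.index(None)   # first-fit: earliest free slot
    disk[slot] = name
    return slot

for block in "ABCD":
    write(block)
disk[2] = None                # delete C
write("E")                    # E lands where C was
print(disk)                   # ['A', 'B', 'E', 'D', None]
```

A resilver walking blocks in write order (A, B, D, E) now has to seek past E to reach D, then seek back for E, exactly the pattern described above.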
Re: [zfs-discuss] A few questions
On Mon, 20 Dec 2010 11:27:41 PST Erik Trimble erik.trim...@oracle.com wrote: The problem boils down to this: When ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things from. That means, it resilvers the very first slab ever written, then the next oldest, etc. The problem here is that slab age has nothing to do with where that data physically resides on the actual disks. If you've used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. This means that the probability that for a slab B, written immediately after slab A, it WON'T be physically near slab A. In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver (as you only resilver live data, not every bit on the drive), increases the physical inefficiency for each slab. That is, seek time between cyclinders begins to dominate your slab reconstruction time. In RAIDZ, this problem is magnified by both the much larger average vdev size vs mirrors, and the necessity that all drives containing a slab information return that data before the corrected data can be written to the resilvering drive. Thus, current ZFS resilvering tends to be seek-time limited, NOT throughput limited. This is really the fault of the underlying media, not ZFS. For instance, if you have a raidZ of SSDs (where seek time is negligible, but throughput isn't), they resilver really, really fast. In fact, they resilver at the maximum write throughput rate. However, HDs are severely seek-limited, so that dominates HD resilver time. You guys may be interested in a solution I used in a totally different situation. There an identical tree data structure had to be maintained on every node of a distributed system. 
When a new node was added, it needed to be initialized with an identical copy before it could be put in operation. But this had to be done while the rest of the system was operational and there may even be updates from a central node during the `mirroring' operation. Some of these updates could completely change the tree! Starting at the root was not going to work since a subtree that was being copied may stop existing in the middle and its space reused! In a way this is a similar problem (but worse!). I needed something foolproof and simple. My algorithm started copying sequentially from the start. If N blocks were already copied when an update comes along, updates of any block with block# > N are ignored (since the sequential copy would get to them eventually). Updates of any block# <= N were queued up (further update of the same block would overwrite the old update, to reduce work). Periodically they would be flushed out to the new node. This was paced so as to not affect the normal operation much. I should think a variation would work for active filesystems. You sequentially read some amount of data from all the disks from which data for the new disk is to be prepared and write it out sequentially. Each time read enough data so that reading time dominates any seek time. Handle concurrent updates as above. If you dedicate N% of time to resilvering, the total time to complete resilver will be 100/N times the sequential read time of the whole disk. (For example, 1TB disk, 100MBps io speed, 25% for resilver gives under 12 hours.) How much worse this gets depends on the amount of updates during resilvering. At the time of resilvering your FS is more likely to be near full than near empty so I wouldn't worry about optimizing the mostly empty FS case. Bakul ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
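[Editor's note: Bakul's scheme might be sketched like this, using in-memory lists as a stand-in for disks. Updates beyond the copy frontier are ignored because the sequential pass will reach them; updates behind the frontier are queued, with newer updates overwriting older ones, and flushed at the end (the original paces periodic flushes instead).]

```python
# Sequential block copy that stays correct under concurrent updates.

def mirror(source, updates_by_step):
    """source: list of blocks. updates_by_step[i]: dict of
    {block#: new value} arriving while block i is being copied."""
    n = len(source)
    copy = [None] * n
    pending = {}                          # queued redo work
    for i in range(n):
        for blk, val in updates_by_step.get(i, {}).items():
            source[blk] = val             # live update hits the source
            if blk <= i:                  # already copied: queue a redo
                pending[blk] = val        # newer overwrites older
            # blk > i: ignore, the sequential pass will get it
        copy[i] = source[i]               # the sequential copy step
    for blk, val in pending.items():      # final flush
        copy[blk] = val
    return copy

src = ["a0", "b0", "c0", "d0"]
# While copying block 2, blocks 0 (behind) and 3 (ahead) are updated.
result = mirror(src, {2: {0: "a1", 3: "d1"}})
print(result)  # ['a1', 'b0', 'c0', 'd1'] - both updates land correctly
```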
Re: [zfs-discuss] A few questions
On Dec 20, 2010, at 2:05 PM, Erik Trimble wrote: On 12/20/2010 11:56 AM, Mark Sandrock wrote: Erik, just a hypothetical what-if ... In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, so long as the original data was unmodified; and random access in dealing with the modified blocks only, right. After the original snapshot had been replicated, a second pass would be done, in order to update the clone to 100% live data. Not knowing enough about the inner workings of ZFS snapshots, I don't know why this would not be doable. (I'm biased towards mirrors for busy filesystems.) I'm supposing that a block-level snapshot is not doable -- or is it? Mark Snapshots on ZFS are true snapshots - they take a picture of the current state of the system. They DON'T copy any data around when created. So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time. But if one does a raw (block) copy, there isn't any fragmentation -- except for the COW updates. If there were no updates to the snapshot, then it becomes a 100% sequential block copy operation. But even with COW updates, presumably the large majority of the copy would still be sequential i/o. Maybe for the 2nd pass, the filesystem would have to be locked so that the operation could ever complete, but if this is fairly short in relation to the overall resilvering time, then it could still be a win in many cases. I'm probably not explaining it well, and may be way off, but it seemed an interesting notion. Mark The problem is this: Let's say I write block A, B, C, and D on a clean zpool (what kind, it doesn't matter). I now delete block C. Later on, I write block E. 
There is a probability (increasing dramatically as times goes on), that the on-disk layout will now look like: A, B, E, D rather than A, B, [space], D, E So, in the first case, I can do a sequential read to get A B, but then must do a seek to get D, and a seek to get E. The fragmentation problem is mainly due to file deletion, NOT to file re-writing. (though, in ZFS, being a C-O-W filesystem, re-writing generally looks like a delete-then-write process, rather than a modify process). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
From: Erik Trimble [mailto:erik.trim...@oracle.com] We can either (a) change how ZFS does resilvering or (b) repack the zpool layouts to avoid the problem in the first place. In case (a), my vote would be to seriously increase the number of in-flight resilver slabs, AND allow for out-of-time-order slab resilvering. Question for any clueful person: Suppose you have a mirror to resilver, made of disk1 and disk2, where disk2 failed and is resilvering. If you have an algorithm to create a list of all the used blocks of disk1 in disk order, then you're able to resilver the mirror extremely fast, because all the reads will be sequential in nature, plus you get to skip past all the unused space. Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3 is resilvering). You find some way of ordering all the used blocks of disk1... Which means disk1 will be able to read in optimal order and speed. Does that necessarily imply disk2 will also work well? Does the on-disk order of blocks of disk1 necessarily match the order of blocks on disk2? If there is no correlation between on-disk order of blocks for different disks within the same vdev, then all hope is lost; it's essentially impossible to optimize the resilver/scrub order unless the on-disk order of multiple disks is highly correlated or equal by definition. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Erik Trimble In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time. I think Mark was suggesting something like dd copy device 1 onto device 2, in order to guarantee a first-pass sequential resilver. And my response would be: Creative thinking and suggestions are always a good thing. In fact, the above suggestion is already faster than the present-day solution for what I'm calling typical usage, but there are an awful lot of use cases where the dd solution would be worse... Such as a pool which is largely sequential already, or largely empty, or made of high IOPS devices such as SSD. However, there is a desire to avoid resilvering unused blocks, so I hope a better solution is possible... The fundamental requirement for a better optimized solution would be a way to resilver according to disk ordering... And it's just a question for somebody that actually knows the answer ... How terrible is the idea of figuring out the on-disk order? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote: If there is no correlation between on-disk order of blocks for different disks within the same vdev, then all hope is lost; it's essentially impossible to optimize the resilver/scrub order unless the on-disk order of multiple disks is highly correlated or equal by definition. Very little is impossible. Drives have been optimally ordering seeks for 35+ years. I'm guessing that the trick (difficult, but not impossible) is how to solve a travelling salesman route pathing problem where you have billions or trillions of transactions, and do it fast enough that it is worth doing any extra computation besides just giving the device 32+ queued commands at a time that align with the elements of each ordered transaction ID. Add to that all the complexity of unwinding the error recovery in the event that you fail checksum validation on transaction N-1 after moving past transaction N, which would be a required capability if you wanted to queue more than a single transaction for verification at a time. Oh, and do all of the above without noticeably affecting the throughput of the applications already running on the system. --eric -- Eric D. Mudama edmud...@mail.bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
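[Editor's note: Eric's point that drives have reordered seeks for decades is easy to illustrate. Even a one-pass elevator (SCAN) ordering of a small queue beats FIFO servicing on total head travel. Toy LBAs, not a real scheduler.]

```python
# Total head travel for FIFO vs elevator (SCAN) servicing of a queue.

def total_travel(start, lbas):
    """Sum of absolute head movements visiting lbas in the given order."""
    pos, travel = start, 0
    for lba in lbas:
        travel += abs(lba - pos)
        pos = lba
    return travel

queue = [95, 3, 60, 12, 88, 40]
fifo = total_travel(0, queue)            # service in arrival order
scan = total_travel(0, sorted(queue))    # one sweep across the platter
print(fifo, scan)                        # 416 95
```

Scaling this from a 32-deep command queue to billions of resilver transactions, while keeping error recovery unwindable, is exactly the hard part Eric describes.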
Re: [zfs-discuss] A few questions
It may well be that different methods are optimal for different use cases. Mechanical disk vs. SSD; mirrored vs. raidz[123]; sparse vs. populated; etc. It would be interesting to read more in this area, if papers are available. I'll have to take a look. ... Or does someone have pointers? Mark On Dec 20, 2010, at 6:28 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Erik Trimble In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential. So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time. I think Mark was suggesting something like dd copy device 1 onto device 2, in order to guarantee a first-pass sequential resilver. And my response would be: creative thinking and suggestions are always a good thing. In fact, the above suggestion is already faster than the present-day solution for what I'm calling typical usage, but there are an awful lot of use cases where the dd solution would be worse... such as a pool which is largely sequential already, or largely empty, or made of high-IOPS devices such as SSDs. However, there is a desire to avoid resilvering unused blocks, so I hope a better solution is possible... The fundamental requirement for a better-optimized solution would be a way to resilver according to disk ordering... And it's just a question for somebody that actually knows the answer... How terrible is the idea of figuring out the on-disk order?
Re: [zfs-discuss] A few questions
On Dec 20, 2010, at 7:31 AM, Phil Harman phil.har...@gmail.com wrote: On 20/12/2010 13:59, Richard Elling wrote: On Dec 20, 2010, at 2:42 AM, Phil Harman phil.har...@gmail.com wrote: Why does resilvering take so long in raidz anyway? Because it's broken. There were some changes a while back that made it more broken. broken is the wrong term here. It functions as designed and correctly resilvers devices. Disagreeing with the design is quite different from proving a defect. It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread. If you only have a few slow drives, you don't have performance. Like trying to win the Indianapolis 500 with a tricycle... I think we can agree that ZFS currently doesn't play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions. ... and those conditions are also a strength. For example, most file systems are nowhere near full. With ZFS you only resilver data. For those who recall the resilver throttles in SVM or VXVM, you will appreciate not having to resilver non-data. For a long time at Sun, the rule was correctness is a constraint, performance is a goal. However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so also, a slow but correct answer can also be wrong). Then one brave soul at Sun once ventured that if Linux is faster, it's a Solaris bug! and to his surprise, the idea caught on. I later went on to tell people that ZFS delivered RAID where I = inexpensive, so I'm just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it's SATA (and I'm not so sure). slow doesn't begin with an i :-) There has been a lot of discussion, anecdotes and some data on this list. slow because I use devices with poor random write(!) 
performance is very different than broken. Again, context is everything. For example, if someone was building a business critical NAS appliance from consumer grade parts, I'd be the first to say are you nuts?! Unfortunately, the math does not support your position... The resilver doesn't do a single pass of the drives, but uses a smarter temporal algorithm based on metadata. A design that only does a single pass does not handle the temporal changes. Many RAID implementations use a mix of spatial and temporal resilvering and suffer with that design decision. Actually, it's easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. Please define current. There are many releases of ZFS, and many improvements have been made over time. What has not improved is the random write performance of consumer-grade HDDs. I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. I know of no RAID implementation that bounds resilver times for HDDs. I believe it is not possible. OTOH, whether a resilver takes 10 seconds or 10 hours makes little difference in data availability. Indeed, this is why we often throttle resilvering activity. See previous discussions on this forum regarding the dueling RFEs. I don't share your disbelief or little difference analysis. If it is true that no current implementation succeeds, isn't that a great opportunity to change the rules? Wasn't resilver time vs. availability a major factor in Adam Leventhal's paper introducing the need for RAIDZ3? No, it wasn't. There are two failure modes we can model given the data provided by disk vendors: 1. failures by time (MTBF) 2. 
failures by bits read (UER). Over time, the MTBF has improved, but the failure rate by bits read has not. Just a few years ago enterprise class HDDs had an MTBF of around 1 million hours. Today, they are in the range of 1.6 million hours. Just looking at the size of the numbers, the probability that a drive will fail in one hour is on the order of 1e-6. By contrast, the failure rate by bits read has not improved much. Consumer class HDDs are usually spec'ed at 1 error per 1e14 bits read. To put this in perspective, a 2TB disk has around 1.6e13 bits. Or, the probability of an unrecoverable read if you read every bit on a 2TB disk is growing to well above 10%. Some of the better enterprise class HDDs are rated two orders of magnitude better, but the only way to get much better is to use more bits for ECC... hence the move towards 4KB sectors. In other words, the probability of losing data
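The unrecoverable-read arithmetic above can be checked with a quick calculation (same figures as the post; treating bit errors as independent is my assumption):

```shell
# P(at least one unrecoverable read error over a full-disk read) for a 2TB
# consumer drive spec'ed at 1 error per 1e14 bits, assuming independent errors.
awk 'BEGIN {
    bits = 2e12 * 8          # 2 TB in bits, about 1.6e13
    uer  = 1e-14             # one unrecoverable error per 1e14 bits read
    printf "%.3f\n", 1 - exp(-bits * uer)
}'
```

This prints roughly 0.148, i.e. close to a 15% chance per full read of a 2TB consumer drive, consistent with the "well above 10%" figure.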
Re: [zfs-discuss] A few questions
On Dec 20, 2010, at 4:19 PM, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: Erik Trimble [mailto:erik.trim...@oracle.com] We can either (a) change how ZFS does resilvering or (b) repack the zpool layouts to avoid the problem in the first place. In case (a), my vote would be to seriously increase the number of in-flight resilver slabs, AND allow for out-of-time-order slab resilvering. Question for any clueful person: Suppose you have a mirror to resilver, made of disk1 and disk2, where disk2 failed and is resilvering. If you have an algorithm to create a list of all the used blocks of disk1 in disk order, then you're able to resilver the mirror extremely fast, because all the reads will be sequential in nature, plus you get to skip past all the unused space. Sounds like the definition of random access :-) Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3 is resilvering). You find some way of ordering all the used blocks of disk1... Which means disk1 will be able to read in optimal order and speed. Sounds like prefetching :-) Does that necessarily imply disk2 will also work well? Does the on-disk order of blocks of disk1 necessarily match the order of blocks on disk2? This is an interesting question, that will become more interesting as the physical sector size gets bigger... -- richard
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Alexander Lesle On December 17, 2010, 17:48, Lanky Doodle wrote in [1]: By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2 disk mirrors - I am thinking of traditional RAID1 here. Or do you mean 1 massive mirror with all 14 disks? Edward means a set of two-way mirrors. Correct. mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 ... You would normally call this a stripe of mirrors. Even though the ZFS concept of striping is more advanced than traditional raid striping... we still call this a ZFS stripe for lack of any other term. A ZFS stripe has all the good characteristics of raid concatenation and striping, without any of the bad characteristics. It can utilize bandwidth on multiple disks when it wants to, or use a single device when it wants to for small blocks. It can dynamically add randomly sized devices, and it can be done one at a time. With Solaris 11 Express, Oracle announced that you can set the root pool to a mirror during installation. At the moment I'm trying it out in a VM, but I didn't find this option. :-( Actually, even in Solaris 10, I habitually install the root filesystem onto a ZFS mirror. You just select 2 disks, and it's automatically a mirror. zpool create lankyserver mirror vdev1 vdev2 mirror vdev3 vdev4 When you need more space you can add a bundle of two disks to your lankyserver; each pair should have matching capacity. zpool add lankyserver mirror vdev5 vdev6 mirror vdev7 vdev8 ... Correct.
Re: [zfs-discuss] A few questions
From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us] Sent: Friday, December 17, 2010 9:16 PM While I agree that smaller vdevs are more reliable, I find your statement - that the failure is more likely to be in the same vdev if you have only 2 vdevs - rather useless. The probability of vdev failure does not have anything to do with the number of vdevs. However, the probability of vdev failure increases tremendously if there is only one vdev and there is a second disk failure. I'm not sure you got what I meant. I'll rephrase and see if it's clearer: Correct, the number of vdevs doesn't affect the probability of a failure in a specific vdev, but the number of disks in a vdev does. Lanky said he was considering 2x 7-disk raidz versus 3x 5-disk raidz. So when I said he's more likely to have a 2nd disk fail in the same vdev if he only has 2 vdevs... that was meant to be taken in context, not as a generalization about pools in general. Consider a single disk. Let P be the probability of the disk failing within 1 day. If you have 5 disks in a raidz vdev, and one fails, there are 4 remaining. If the resilver will last 8 days, then the probability of a 2nd disk failing is 4*8*P = 32P. If you have 7 disks in a raidz vdev, and one fails, there are 6 remaining. If a resilver will last 12 days, then the probability of a 2nd disk failing is 6*12*P = 72P.
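The back-of-the-envelope numbers above can be reproduced directly (exposure = surviving disks x resilver days x P, using the figures from the post):

```shell
# Second-failure exposure during resilver, per the post's model:
# (number of surviving disks) * (resilver duration in days) * P.
awk 'BEGIN {
    printf "5-disk raidz: %dP\n", 4 * 8    # 4 survivors, ~8-day resilver
    printf "7-disk raidz: %dP\n", 6 * 12   # 6 survivors, ~12-day resilver
}'
```

Both layouts scale with the same per-disk daily failure probability P, so only the 32-vs-72 ratio matters when comparing them.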
Re: [zfs-discuss] A few questions
On the subject of where to install ZFS, I was planning to use either Compact Flash or USB drive (both of which would be mounted internally); using up 2 of the drive bays for a mirrored install is possibly a waste of physical space, considering it's a) a home media server and b) the config can be backed up to a protected ZFS pool - if the CF or USB drive failed I would just replace and restore the config. Can you have an equivalent of a global hot spare in ZFS? If I did go down the mirror route (mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 etc) all the way up to 14 disks that would leave the 15th disk spare. Now this is getting really complex, but can you have server failover in ZFS, much like DFS-R in Windows - you point clients to a clustered ZFS namespace so if a complete server failed nothing is interrupted? I am still undecided as to mirror vs RAID-Z. I am going to be ripping uncompressed Blu-Rays so space is vital. I use RAID-DP in NetApp kit at work and I'm guessing RAID-Z2 is the equivalent? I have 5TB space at the moment so going to the expense of mirroring for only 2TB extra doesn't seem much of a payoff. Maybe a compromise of 2x 7-disk RAID-Z1 with a global hot spare is the way to go? Put it this way, I currently use Windows Home Server, which has no true disk failure protection, so any of ZFS's redundancy schemes is going to be a step up; is there an equivalent system in ZFS where if 1 disk fails you only lose that disk's data, like unRAID? Thanks everyone for your input so far :)
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Lanky Doodle On the subject of where to install ZFS, I was planning to use either Compact Flash or USB drive (both of which would be mounted internally); using up 2 of the drive bays for a mirrored install is possibly a waste of physical space, considering it's a) a home media server and b) the config can be backed up to a protected ZFS pool - if the CF or USB drive failed I would just replace and restore the config. All of the above is correct. One thing you should keep in mind however: if your unmirrored rpool (USB fob) fails... although yes, you can restore, assuming you have been sufficiently backing it up... you will suffer an ungraceful halt. Maybe you can live with that. Can you have an equivalent of a global hot spare in ZFS? If I did go down the mirror route (mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 etc) all the way up to 14 disks that would leave the 15th disk spare. Check the zpool man page for spare, but I know you can have spares assigned to a vdev, and I'm pretty sure you can assign any given spare to multiple vdevs, effectively making it a global hot spare. So the answer is yes. Now this is getting really complex, but can you have server failover in ZFS, much like DFS-R in Windows - you point clients to a clustered ZFS namespace so if a complete server failed nothing is interrupted? If that's somehow possible, it's something I don't know. I don't believe you can do that with ZFS. I am still undecided as to mirror vs RAID-Z. I am going to be ripping uncompressed Blu-Rays so space is vital. For both read and write, raidz works extremely well for sequential operations. It sounds like you're probably going to be doing mostly sequential operations, so raidz should perform very well for you. A lot of people will avoid raidzN because it doesn't perform very well for random reads, so they opt for mirrors instead. But in your case, not so much. 
In your case, the only reason I can think to avoid raidz would be if you're worrying about resilver times. That's a valid concern, but you can choose any number of disks per vdev... you could make raidz vdevs of 3 disks each... it's just a compromise between a mirror and a larger raidz vdev. I use RAID-DP in NetApp kit at work and I'm guessing RAID-Z2 is the equivalent? Yup, RAID-DP and raidz2 are conceptually pretty much the same. Put it this way, I currently use Windows Home Server, which has no true disk failure protection, so any of ZFS's redundancy schemes is going to be a step up; is there an equivalent system in ZFS where if 1 disk fails you only lose that disk's data, like unRAID? No. Not unless you make that many separate volumes.
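The hot-spare answer above can be sketched like this (hypothetical pool and device names; the zpool man page on the target release is authoritative):

```shell
# Add one hot spare to the pool; ZFS pulls it in for whichever vdev
# degrades, which is what makes it effectively "global" within the pool.
zpool add mypool spare c0t15d0
# Verify the spare shows up under the pool's "spares" section:
zpool status mypool
```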
Re: [zfs-discuss] A few questions
Thanks for all the replies. The bit about combining zpools came from this command on the southbrain tutorial; zpool create mail \ mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \ mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0 I admit I was getting confused between zpools and vdevs, thinking in the above command that each mirror was a zpool and not a vdev. Just so I'm correct, a normal command would look like zpool create mypool raidz disk1 disk2 disk3 disk4 disk5 which would result in a zpool called mypool, which is made up of a 5-disk raidz vdev? This means that zpools don't actually 'contain' physical devices, which is what I originally thought.
Re: [zfs-discuss] A few questions
On 12/17/2010 2:12 AM, Lanky Doodle wrote: Thanks for all the replies. The bit about combining zpools came from this command on the southbrain tutorial; zpool create mail \ mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \ mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0 I admit I was getting confused between zpools and vdevs, thinking in the above command that each mirror was a zpool and not a vdev. Just so I'm correct, a normal command would look like zpool create mypool raidz disk1 disk2 disk3 disk4 disk5 which would result in a zpool called mypool, which is made up of a 5-disk raidz vdev? This means that zpools don't actually 'contain' physical devices, which is what I originally thought. You are correct that the above will have a single vdev of 5 disks. Here's a shorthand note: a zpool is made of 1 or more vdevs. Each vdev can be a raidz, mirror, or single device (either a file or disk). So, you *can* have a zpool which has solely physical drives: e.g. zpool create tank disk1 disk2 disk3 will create a pool with 3 disks, with data being striped across the devices as desired. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] A few questions
OK cool. One last question. Reading the Admin Guide for ZFS, it says: [i]A more complex conceptual RAID-Z configuration would look similar to the following: raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0 c7t0d0 raidz c8t0d0 c9t0d0 c10t0d0 c11t0d0 c12t0d0 c13t0d0 c14t0d0 If you are creating a RAID-Z configuration with many disks, as in this example, a RAID-Z configuration with 14 disks is better split into two 7-disk groupings. RAID-Z configurations with single-digit groupings of disks should perform better[/i] This is relevant as my final setup was planned to be 15 disks, so only one more than the example. So, do I drop one disk and go with two 7-drive vdevs, or stick with three 5-drive vdevs? Also, does anyone have anything to add re the security of CIFS when used with Windows clients? Thanks again guys, and gals...
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Lanky Doodle This is relevant as my final setup was planned to be 15 disks, so only one more than the example. So, do I drop one disk and go with two 7-drive vdevs, or stick with three 5-drive vdevs? Both ways are fine. Consider the balance between redundancy and drive space. Also, in the event of a resilver, the 3x5 raidz will be faster. In rough numbers, suppose you have 1TB drives, 70% full. Then your resilver might be 8 days instead of 12 days. That's important when you consider the fact that during that window, you have degraded redundancy. Another failed disk in the same vdev would destroy the entire pool. Also, if a 2nd disk fails during resilver, it's more likely to be in the same vdev if you have only 2 vdevs. Your odds are better with smaller vdevs, both because the resilver completes faster, and because the probability of a 2nd failure in the same vdev is smaller. For both performance and reliability reasons, I recommend nothing except single-drive mirrors, except in extreme data-is-not-important situations. At least, that's my recommendation until someday, when the resilver efficiency is improved, or fixed.
Re: [zfs-discuss] A few questions
Thanks! By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2 disk mirrors - I am thinking of traditional RAID1 here. Or do you mean 1 massive mirror with all 14 disks? This is always a tough one for me. I too prefer RAID1 where redundancy is king, but the trade-off for me would be 5TB of 'wasted' space - a total of 7TB in mirrors versus 12TB in 3x RAIDZ. Decisions, decisions.
Re: [zfs-discuss] A few questions
You should take a look at the ZFS best practices guide for RAIDZ and mirrored configuration recommendations: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide It's easy for me to say because I don't have to buy storage, but mirrored storage pools are currently more flexible, provide good performance, and replacing/resilvering data on disks is faster. Thanks, Cindy On 12/17/10 09:48, Lanky Doodle wrote: Thanks! By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2 disk mirrors - I am thinking of traditional RAID1 here. Or do you mean 1 massive mirror with all 14 disks? This is always a tough one for me. I too prefer RAID1 where redundancy is king, but the trade-off for me would be 5TB of 'wasted' space - a total of 7TB in mirrors versus 12TB in 3x RAIDZ. Decisions, decisions.
Re: [zfs-discuss] A few questions
On December 17, 2010, 17:48, Lanky Doodle wrote in [1]: By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2 disk mirrors - I am thinking of traditional RAID1 here. Or do you mean 1 massive mirror with all 14 disks? Edward means a set of two-way mirrors. Do you remember what he wrote: Also, in the event of a resilver, the 3x5 raidz will be faster. In rough numbers, suppose you have 1TB drives, 70% full. Then your resilver might be 8 days instead of 12 days. That's important when you consider the fact that during that window, you have degraded redundancy. Another failed disk in the same vdev would destroy the entire pool. Also if a 2nd disk fails during resilver, it's more likely to be in the same vdev, if you have only 2 vdevs. Your odds are better with smaller vdevs, both because the resilver completes faster, and the probability of a 2nd failure in the same vdev is smaller. And that scenario is a horrible prospect: while the resilver is running you have to hope that nothing else fails. In his example that's between 192 and 288 hours - a very, very long time. And be aware that a disk will break at some point. This is always a tough one for me. I too prefer RAID1 where redundancy is king, but the trade-off for me would be 5GB of 'wasted' space - total of 7GB in mirror and 12GB in 3x RAIDZ. You lose the most space when you make a pool of mirrors, BUT the I/O is much faster, it's more secure, and you still have all the features of ZFS too. http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance Decisions, decisions. My suggestion: make a two-way mirror of small disks or SSDs for the OS. This is not easy to do after installation; you have to look for a howto. Sorry, I can't find the link at the moment. With Solaris 11 Express, Oracle announced that you can set the root pool to a mirror during installation. At the moment I'm trying it out in a VM, but I didn't find this option. 
:-( zpool create lankyserver mirror vdev1 vdev2 mirror vdev3 vdev4 When you need more space you can add a bundle of two disks to your lankyserver; each pair should have matching capacity. zpool add lankyserver mirror vdev5 vdev6 mirror vdev7 vdev8 ... Consider that it's a good decision to plan for one spare disk. You can use the zpool add command to add a spare disk at a later time. http://docs.sun.com/app/docs/doc/819-2240/zpool-1m?a=view When you build a raidz pool, every disk in the pool must have at least as much space as the smallest disk. A raidz vdev only uses as much space per disk as the smallest disk has; the rest of any bigger disk is wasted. In a mirrored pool only each pair must match, so you can use one pair of 1 TB disks and one pair of 2 TB disks in the same pool. In that case your spare disk _must have_ the biggest capacity. Read this for your decision: http://constantin.glez.de/blog/2010/01/home-server-raid-greed-and-why-mirroring-still-best -- Best Regards Alexander Dezember, 17 2010 [1] mid:382802084.111292604519623.javamail.tweb...@sf-app1
Re: [zfs-discuss] A few questions
On Fri, 17 Dec 2010, Edward Ned Harvey wrote: Also if a 2nd disk fails during resilver, it's more likely to be in the same vdev, if you have only 2 vdev's. Your odds are better with smaller vdev's, both because the resilver completes faster, and the probability of a 2nd failure in the same vdev is smaller. While I agree that smaller vdevs are more reliable, I find your statement - that the failure is more likely to be in the same vdev if you have only 2 vdevs - rather useless. The probability of vdev failure does not have anything to do with the number of vdevs. However, the probability of vdev failure increases tremendously if there is only one vdev and there is a second disk failure. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] A few questions
Also, at present I have 5x 1TB drives to use in my home server so I plan to create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in Jan and another 5 in Feb/March so that I can avoid all disks being from the same batch). I did plan on creating separate zpools with each set of 5 drives; drives 1-5 volume0 zpool drives 6-10 volume1 zpool drives 11-15 volume2 zpool Although this seems a good idea to start with, there are issues with it performance-wise. If you fill up VDEV0 (drives 1-5) and then attach VDEV1 (drives 6-10), new writes will still be initially striped across the two VDEVs, leading to a performance impact on writes. There is currently no way of balancing VDEV fills without manually backing up/restoring, or copying the data from one place to another within the pool and then removing the original data. so that I can sustain 3 simultaneous drive failures, as long as it's one drive from each set. However I think this will mean each zpool will have independent shares which I don't want. I have used this guide - http://southbrain.com/south/tutorials/zpools.html - which says you can combine zpools into a 'parent' zpool, but can this be done in my scenario (staggered) as it looks like the child zpools have to be created before the parent is done. So basically I'd need to be able to; For the scheme to work as above, start with something like # zpool create mypool raidz1 c0t1d0 c0t2d0 c0t3d0 c2t4d0 c2t5d0 Later, you'll add the new vdev # zpool add mypool raidz1 c0t6d0 c0t7d0 c0t8d0 c2t9d0 c2t10d0 This will work as described above. However, I would do this somewhat differently. Start off with, say, 6 1TB drives in RAIDz2 and set autoexpand=on on the pool (remember compression=on on the zfs pool fs too). 
# zpool create mypool raidz2 c0t1d0 c0t2d0 c0t3d0 c2t4d0 c2t5d0 c2t6d0 # zpool set autoexpand=on mypool # zfs set compression=on mypool Compression is lzjb, and it won't compress much for audio or video, but then, it won't hurt much either. When this starts to get somewhat close to full, get new, larger drives and replace the older 1TB drives one by one. Once all are replaced by larger, say 1.5TB drives, whoops, your array is larger. This will scale better performance-wise and you won't need that many controllers. Also, with RAIDz2, you can lose any two drives. Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
Re: [zfs-discuss] A few questions
On Thu, Dec 16, 2010 at 12:59 AM, Lanky Doodle lanky_doo...@hotmail.com wrote: I have been playing with ZFS for a few days now on a test PC, and I plan to use it for my home media server after being very impressed! Works great for that. Have a similar setup at home, using FreeBSD. Also, at present I have 5x 1TB drives to use in my home server so I plan to create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in Jan and another 5 in Feb/March so that I can avoid all disks being from the same batch). I did plan on creating separate zpools with each set of 5 drives; No no no. Create 1 pool. Create the pool initially with a single 5-drive raidz vdev. Later, add the next five drives to the system, and create a new raidz vdev *in the same pool*. Voila. You now have the equivalent of a RAID50, as ZFS will stripe writes to both vdevs, increasing the overall size *and* speed of the pool. Later, add the next five drives to the system, and create a new raidz vdev in the same pool. Voila. You now have a pool with 3 vdevs, with reads/writes being striped across all three. You can still lose 3 drives (1 per vdev) before losing the pool. The commands to do this are along the lines of: # zpool create mypool raidz disk1 disk2 disk3 disk4 disk5 # zpool add mypool raidz disk6 disk7 disk8 disk9 disk10 # zpool add mypool raidz disk11 disk12 disk13 disk14 disk15 Creating 1 pool gives you the best performance and the most flexibility. Use separate filesystems on top of that pool if you want to tweak all the different properties. Going with 1 pool also increases your chances for dedupe, as dedupe is done at the pool level. -- Freddie Cash fjwc...@gmail.com
Re: [zfs-discuss] A few questions
Hi Lanky, Other follow-up posters have given you good advice. I don't see where you are getting the idea that you can combine pools with pools. You can't do this, and I don't see that the southbrain tutorial illustrates this either. All of his examples for creating redundant pools are reasonable. As others have said, you can create a RAIDZ pool with one vdev of say 5 disks, and then later add another 5 disks, and so on. Thanks, Cindy On 12/16/10 01:59, Lanky Doodle wrote: Hiya, I have been playing with ZFS for a few days now on a test PC, and I plan to use it for my home media server after being very impressed! I've got the basics of creating zpools and zfs filesystems with compression and dedup etc, but I'm wondering if there's a better way to handle security. I'm using Windows 7 clients by the way. I have used this 'guide' to do the permissions - http://www.slepicka.net/?p=37 Also, at present I have 5x 1TB drives to use in my home server so I plan to create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in Jan and another 5 in Feb/March so that I can avoid all disks being from the same batch). I did plan on creating separate zpools with each set of 5 drives; drives 1-5 volume0 zpool drives 6-10 volume1 zpool drives 11-15 volume2 zpool so that I can sustain 3 simultaneous drive failures, as long as it's one drive from each set. However I think this will mean each zpool will have independent shares which I don't want. I have used this guide - http://southbrain.com/south/tutorials/zpools.html - which says you can combine zpools into a 'parent' zpool, but can this be done in my scenario (staggered) as it looks like the child zpools have to be created before the parent is done. 
> So basically I'd need to be able to:
> Create volume0 zpool now
> Create volume1 zpool in Jan, then combine volume0 and volume1 into a parent zpool
> Create volume2 in Feb/March and add to the parent zpool
> I know I could just add each disk to the volume0 zpool, but I've read it's a bugger to do and that creating separate zpools with new disks is a much better way to go.
> I think that's it for now. Sorry for the mammoth first post! Thanks
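The staggered timeline above maps directly onto a single pool, with no "parent pool" needed — a sketch using the poster's own volume naming, with hypothetical device names:

```shell
# Now: one pool, one 5-drive raidz vdev (instead of a "volume0" pool)
zpool create tank raidz c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0

# January: grow the *same* pool with a second raidz vdev
# (instead of a separate "volume1" pool)
zpool add tank raidz c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0

# Feb/March: third vdev; the pool now stripes across all three,
# and existing shares simply see more free space
zpool add tank raidz c0t11d0 c0t12d0 c0t13d0 c0t14d0 c0t15d0
```

The pool can survive one failed drive per raidz1 vdev, matching the three-pool plan's failure tolerance, while keeping a single namespace for the shares.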
Re: [zfs-discuss] A few questions
Thanks for the reply.

In that case, wouldn't it be better to, as you say, start with a 6-drive Z2, then just keep adding drives until the case is full, for a single Z2 zpool? Or even Z3, if that's available now?

I have an 11x 5.25" bay case, with 3x 5-in-3 hot-swap caddies giving me 15 drive bays. Hence the plan to start with 5, then 10, then all the way to 15. This seems a more logical (and cheaper) solution than replacing with bigger drives as they come to market.

-- This message posted from opensolaris.org
Re: [zfs-discuss] A few questions
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Lanky Doodle

> In that case, wouldn't it be better to, as you say, start with a 6-drive Z2, then just keep adding drives until the case is full, for a single Z2 zpool?

It doesn't work that way. You can create a vdev, and later you can add more vdevs. So you can create a raidz now, and later you can add another raidz. But you cannot create a raidz now and later just add disks onesy-twosy to increase its size incrementally.

> Or even Z3, if that's available now?

Raidz3 is available now. There is only one thing to be aware of: ZFS resilvering is very inefficient for typical usage scenarios. The time to resilver divides by the number of vdevs in the pool (meaning 10 mirrors will resilver 10x faster than an equivalently sized raidzN), and it grows with the number of disks within each vdev. Due to this inefficiency, we're talking about 12 hours (on my server) to resilver a 1TB disk which is around 70% used. This would have been ~3 weeks if I had one big raidz3. So it matters. Your multiple raidz vdevs of 5-6 disks each are a reasonable compromise.

> I have an 11x 5.25" bay case, with 3x 5-in-3 hot-swap caddies giving me 15 drive bays. Hence the plan to start with 5, then 10, then all the way to 15. This seems a more logical (and cheaper) solution than replacing with bigger drives as they come to market.

'Course, you can also replace with bigger drives as they come to market, too. ;-) If you've got 5 disks in a raidz: first scrub it. Then replace one disk with a larger disk and wait for the resilver. Replace each disk, one by one, with larger disks, and when you finish the last one, your pool becomes larger. (Depending on your defaults, manual intervention may be required to make the pool autoexpand once all the devices have been upgraded.)
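The disk-by-disk upgrade path just described can be sketched as follows. Pool and device names are hypothetical, and the autoexpand property assumes a reasonably recent zpool version (on older releases, an export/import of the pool after the last replacement serves the same purpose):

```shell
# Check pool health before touching anything
zpool scrub mypool
zpool status mypool      # confirm the scrub finished clean

# Let the pool grow automatically once every device is larger
zpool set autoexpand=on mypool

# Swap one 1TB disk for a larger one; wait for resilver to complete
zpool replace mypool c0t1d0 c1t1d0
zpool status mypool      # watch resilver progress before the next swap
# ...repeat replace-and-wait for each remaining disk in the vdev...

# If autoexpand was off during the replacements, expand a device
# explicitly after the fact:
zpool online -e mypool c1t1d0
```

The key point is the one-at-a-time sequencing: replacing a second disk before the first resilver completes leaves a raidz1 vdev with no redundancy.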