Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-14 Thread can you guess?
 On Dec 14, 2007 1:12 AM, can you guess? [EMAIL PROTECTED] wrote:

   yes.  far rarer and yet home users still see them.

  I'd need to see evidence of that for current hardware.

 What would constitute evidence?  Do anecdotal tales from home users
 qualify?  I have two disks (and one controller!) that generate several
 checksum errors per day each.

I assume that you're referring to ZFS checksum errors rather than to transfer 
errors caught by the CRC resulting in retries.

If so, then the next obvious question is, what is causing the ZFS checksum 
errors?  And (possibly of some help in answering that question) is the disk 
seeing CRC transfer errors (which show up in its SMART data)?

If the disk is not seeing CRC errors, then the likelihood that data is being 
'silently' corrupted as it crosses the wire is negligible:  a corrupted 
transfer has roughly a 1-in-65,536 chance of escaping a 16-bit ATA CRC (given 
your correction below), or about 1 in 4.3 billion with SATA's 32-bit CRC.  
Controller or disk firmware bugs have been known to cause otherwise undetected 
errors, though I'm not familiar with any recent examples in normal desktop 
environments - e.g., the CERN study discussed earlier found a disk firmware 
bug that seemed to be triggered only by the unusual demands that a RAID 
controller placed on the disk, and exacerbated by that controller's propensity 
to simply ignore disk time-outs.  So, for that matter, have buggy file 
systems.  Flaky RAM can also produce ZFS checksum errors (the CERN study found 
correlations there when it used its own checksum mechanisms).

  I've also seen intermittent checksum fails that go away once all the
  cables are wiggled.

Once again, a significant question is whether the checksum errors are 
accompanied by a lot of CRC transfer errors.  If not, that would strongly 
suggest that they're not coming from bad transfers (and while they could 
conceivably be the result of commands corrupted on the wire, so much more data 
is transferred compared to command bandwidth that you'd really expect to see 
data CRC errors too if commands were getting mangled).  When you wiggle the 
cables, other things wiggle as well (I assume you've checked that your RAM is 
solidly seated).

On the other hand, if you're getting a whole bunch of CRC errors, then with 
only a 16-bit CRC it's entirely conceivable that a few are sneaking by 
unnoticed.
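
To put a rough number on that intuition, here's some purely illustrative 
arithmetic (not a measurement of anyone's hardware):  if a corrupted frame is 
modeled as a random error pattern, it escapes an n-bit CRC about once in 2^n 
tries.  The error count below is made up for illustration.

def expected_undetected(observed_crc_errors: int, crc_bits: int) -> float:
    """Expected number of corrupt transfers that slip past the CRC."""
    return observed_crc_errors / float(2 ** crc_bits)

if __name__ == "__main__":
    for bits in (16, 32):
        # e.g. a flaky cable that has generated 10,000 detected CRC errors
        print(bits, "bit CRC:", expected_undetected(10_000, bits))

At 10,000 detected errors the expected number of escapes past a 16-bit CRC is 
already around 0.15 - i.e., a few can plausibly sneak through over time - 
while with a 32-bit CRC it's about 2 in a million.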

 
  Unlikely, since transfers over those connections have been protected
  by 32-bit CRCs since ATA busses went to 33 or 66 MB/sec. (SATA has
  even stronger protection)

 The ATA/7 spec specifies a 32-bit CRC (older ones used a 16-bit CRC) [1].

Yup - my error:  the CRC was indeed introduced in ATA-4 (33 MB/sec. version), 
but was only 16 bits wide back then.

  The serial ata protocol also specifies 32-bit CRCs beneath 8/10b
  coding (1.0a p. 159)[2].  That's not much stronger at all.

The extra strength comes more from its additional coverage (commands as well as 
data).

- bill
 
 


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-14 Thread can you guess?
 
 ...
 though I'm not familiar with any recent examples in normal desktop
 environments
 
 One example found during early use of zfs in Solaris engineering was
 a system with a flaky power supply.
 
 It seemed to work just fine with ufs, but when zfs was installed the
 sata drives started to show many ZFS checksum errors.
 
 After replacing the power supply, the system did not detect any more
 errors.
 
 Flaky power supplies are an important contributor to PC unreliability;
 they also tend to fail a lot in various ways.

Thanks - now that you mention it, I think I remember reading about that here 
somewhere.

But did anyone delve into these errors far enough to know that they were 
specifically due to controller or disk firmware bugs (as the construction of 
your response above seems to suggest)?  They might instead have come from, 
say, RAM errors occurring between checksum generation and disk access on 
either reads or writes, at least if the system in question didn't have ECC 
RAM (and the CERN study found a correlation between detected RAM errors and 
silent data corruption even on ECC machines).

Not that the generation of such otherwise undetected errors due to a flaky PSU 
isn't interesting in its own right, but this specific sub-thread was about 
whether poor connections were a significant source of such errors (my comment 
about controller and disk firmware bugs having been a suggested potential 
alternative source) - so identifying the underlying mechanisms is of interest 
as well.

- bill
 
 


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-14 Thread can you guess?
  the next obvious question is, what is causing the ZFS checksum
  errors?  And (possibly of some help in answering that question) is
  the disk seeing CRC transfer errors (which show up in its SMART data)?
 
 The memory is ECC in this machine, and Memtest passed it for five
 days.  The disk was indeed getting some pretty lousy SMART scores,

Seagate ATA disks (if that's what you were using) are notorious for this in a 
couple of specific metrics:  they ship from the factory that way.  This does 
not appear to indicate any actual problem but rather error tabulation that 
they perform differently than other vendors do (e.g., I could imagine that 
they do something unusual in their burn-in exercising that generates nominal 
errors, but that's not even speculation, just a random guess).

 but that doesn't explain the controller issue.  This particular
 controller is a SIIG-branded Silicon Image 0680 chipset (which is,
 apparently, a piece of junk - if I'd done my homework I would've
 bought something else)... but the premise stands.  I bought a piece
 of consumer-level hardware off the shelf, it had corruption issues,
 and ZFS told me about it when XFS had been silent.

Then we've been talking at cross-purposes.  Your original response was to my 
request for evidence that *platter errors that escape detection by the disk's 
ECC mechanisms* occurred sufficiently frequently to be a cause for concern - 
and that's why I asked specifically what was causing the errors you saw (to see 
whether they were in fact the kind for which I had requested evidence).

Not that detecting silent errors due to buggy firmware is useless:  it clearly 
saved you from continuing corruption in this case.  My impression is that in 
conventional consumer installations (typical consumers never crack open their 
case at all, let alone to add a RAID card) controller and disk firmware is 
sufficiently stable (especially for the limited set of functions demanded of 
it) that ZFS's added integrity checks may not count for a great deal (save 
perhaps peace of mind, but typical consumers aren't sufficiently aware of 
potential dangers to suffer from deficits in that area) - but your experience 
indicates that when you stray from that mold ZFS's added protection may 
sometimes be as significant as it was for Robert's mid-range array firmware 
bugs.

And since there indeed was a RAID card involved in the original hypothetical 
situation under discussion, the fact that I was specifically referring to 
undetectable *disk* errors was only implied by my subsequent discussion of disk 
error rates, rather than explicit.

The bottom line appears to be that introducing non-standard components into the 
path between RAM and disk has, at least for some specific subset of those 
components, the potential to introduce silent errors of the form that ZFS can 
catch - quite possibly in considerably greater numbers than the kinds of 
undetected disk errors that I was talking about ever would (that RAID card you 
were using has a relatively popular low-end chipset, and Robert's mid-range 
arrays were hardly fly-by-night).  So while I'm still not convinced that ZFS 
offers significant features in the reliability area compared with other 
open-source *software* solutions, the evidence that it may do so in more 
sophisticated (but not quite high-end) hardware environments is becoming more 
persuasive.

- bill
 
 


Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread can you guess?
 Hello can,
 
 Thursday, December 13, 2007, 12:02:56 AM, you wrote:
 
 cyg On the other hand, there's always the possibility that someone
 cyg else learned something useful out of this.  And my question about
 
 To be honest - there's basically nothing useful in the thread,
 perhaps except one thing - doesn't make any sense to listen to you.

I'm afraid you don't qualify to have an opinion on that, Robert - because you 
so obviously *haven't* really listened.  Until it became obvious that you never 
would, I was willing to continue to attempt to carry on a technical discussion 
with you, while ignoring the morons here who had nothing whatsoever in the way 
of technical comments to offer (but continued to babble on anyway).

- bill
 
 


Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread can you guess?
 Would you two please SHUT THE F$%K UP.

Just for future reference, if you're attempting to squelch a public 
conversation it's often more effective to use private email to do it rather 
than contribute to the continuance of that public conversation yourself.

Have a nice day!

- bill
 
 


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
 Are there benchmarks somewhere showing a RAID10 implemented on an LSI
 card with, say, 128MB of cache being beaten in terms of performance
 by a similar zraid configuration with no cache on the drive controller?
 
 Somehow I don't think they exist. I'm all for data scrubbing, but this
 anti-raid-card movement is puzzling.

Oh, for joy - a chance for me to say something *good* about ZFS, rather than 
just try to balance out excessive enthusiasm.

Save for speeding up synchronous writes (if it has enough on-board NVRAM to 
hold them until it's convenient to destage them to disk), a RAID-10 card should 
not enjoy any noticeable performance advantage over ZFS mirroring.

By contrast, consider what happens when corruption does occur - whether the 
extremely rare kind that is undetectable except via ZFS checksums, or the 
considerably more common kind that the disk's ECC would catch *if* the data 
were ever accessed.  If the RAID card is doing the mirroring, there's a good 
chance that even ZFS's validation scans won't see the problem, because the 
card may happen to serve the scan from the good copy rather than the bad one - 
in which case you'll lose that data if the disk holding the good copy later 
fails.  And in the case of (extremely rare) otherwise-undetectable corruption, 
if the card *does* return the bad copy then IIRC ZFS (not knowing that a good 
copy also exists) will simply report the data as gone (though I don't know 
whether it then flags the block such that you'll never have an opportunity to 
find the good copy).
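
To illustrate the scrubbing point, here's a toy model resting on the 
assumption that the card serves each read from an arbitrarily chosen mirror 
side (real firmware behavior will of course vary):

import random

def scrub_passes_until_detection(p_bad_side=0.5, trials=100_000):
    """Average number of full scrub passes before the corrupt copy is read."""
    total = 0
    for _ in range(trials):
        passes = 1
        while random.random() >= p_bad_side:   # good copy returned again
            passes += 1
        total += passes
    return total / trials

if __name__ == "__main__":
    print("mean passes to notice latent corruption:",
          scrub_passes_until_detection())      # ~2 with a 50/50 choice

Until the scan happens to be handed the bad copy, the window during which 
losing the disk holding the good copy loses the data stays open.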

If the RAID card scrubs its disks the difference (now limited to the extremely 
rare undetectable-via-disk-ECC corruption) becomes pretty negligible - but I'm 
not sure how many RAIDs below the near-enterprise category perform such scrubs.

In other words, if you *don't* otherwise scrub your disks then ZFS's 
checksums-plus-internal-scrubbing mechanisms assume greater importance:  it's 
only the contention that other solutions that *do* offer scrubbing can't 
compete with ZFS in effectively protecting your data that's somewhat over the 
top.

- bill
 
 


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
...

 when the difference between an unrecoverable single bit error is not
 just 1 bit but the entire file, or corruption of an entire database
 row (etc), those small and infrequent errors are an extremely big deal.

You are confusing unrecoverable disk errors (which are rare but orders of 
magnitude more common) with otherwise *undetectable* errors (the occurrence of 
which is at most once in petabytes by the studies I've seen, rather than once 
in terabytes), despite my attempt to delineate the difference clearly.  
Conventional approaches using scrubbing provide as complete protection against 
unrecoverable disk errors as ZFS does:  it's only the far rarer otherwise 
*undetectable* errors that ZFS catches and they don't.

- bill
 
 


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
...

  If the RAID card scrubs its disks
 
 A scrub without checksum puts a huge burden on disk firmware and
 error reporting paths :-)

Actually, a scrub without checksum places far less burden on the disks and 
their firmware than ZFS-style scrubbing does, because it merely has to scan 
the disk sectors sequentially rather than follow a tree path to each 
relatively small leaf block.  It therefore also compromises runtime operation 
a lot less (though in both cases doing the scrub infrequently in the 
background should usually reduce any impact to acceptable levels).
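
Some rough arithmetic on that difference, using assumed (illustrative) drive 
parameters rather than measurements of any particular disk:

DISK_BYTES   = 500e9       # 500 GB of allocated data to verify
SEQ_BW       = 100e6       # ~100 MB/sec. sustained sequential bandwidth
RANDOM_IO_MS = 8.0         # ~8 ms average seek + rotational latency
LEAF_BYTES   = 128 * 1024  # ZFS-style 128 KB leaf blocks

# Plain surface scan: read every sector in LBA order.
sequential_scan_hours = DISK_BYTES / SEQ_BW / 3600

# Tree-walk scrub, worst case: one random access per leaf block.
leaves = DISK_BYTES / LEAF_BYTES
tree_walk_hours = leaves * (RANDOM_IO_MS / 1000 + LEAF_BYTES / SEQ_BW) / 3600

print("sequential scan : %5.1f hours" % sequential_scan_hours)
print("tree-walk scrub : %5.1f hours (worst case, fully scattered)" % tree_walk_hours)

With those numbers the sequential scan finishes in under an hour and a half, 
while a fully scattered block-by-block verify takes the better part of ten 
hours - which is where the extra burden and runtime interference come from.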

- bill
 
 


Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread can you guess?
Great questions.

 1) First issue relates to the überblock.  Updates to it are assumed
 to be atomic, but if the replication block size is smaller than the
 überblock then we can't guarantee that the whole überblock is
 replicated as an entity.  That could in theory result in a corrupt
 überblock at the secondary.
 
 Will this be caught and handled by the normal ZFS checksumming?  If
 so, does ZFS just use an alternate überblock and rewrite the damaged
 one transparently?

ZFS already has to deal with potential uberblock partial writes if the 
uberblock occupies multiple disk sectors (and it might be prudent to do so 
even if it doesn't, as Richard's response seems to suggest).  Common ways of 
dealing with this problem include dumping the uberblock into the log (in which 
case the log, with its own internal recovery procedure, becomes the real root 
of all evil) or cycling around at least two locations per mirror copy 
(Richard's response suggests that there are considerably more, and that 
perhaps each one is written in quadruplicate) such that the previous uberblock 
would still be available if the new write tanked.  ZFS-style snapshots 
complicate both approaches unless special provisions are taken - e.g., copying 
the current uberblock on each snapshot and hanging a list of these snapshot 
uberblock addresses off the current uberblock, though even that might run into 
interesting complications under the scenario which you describe below.  Just 
using the 'queue' that Richard describes to accumulate snapshot uberblocks 
would limit the number of concurrent snapshots to less than the size of that 
queue.
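
For anyone who hasn't seen this kind of mechanism, here's a minimal sketch of 
the rotating-location idea.  It is emphatically not the actual ZFS on-disk 
layout (ZFS keeps larger per-label uberblock arrays); the point is only that a 
torn write of the newest copy is harmless because recovery falls back to the 
previous slot.

import hashlib, struct

SLOTS = 4

def _digest(txg: int, payload: bytes) -> bytes:
    return hashlib.sha256(struct.pack("<Q", txg) + payload).digest()

def write_uberblock(ring: list, txg: int, payload: bytes) -> None:
    """Round-robin write: never overwrite the most recent valid copy."""
    ring[txg % SLOTS] = (txg, payload, _digest(txg, payload))

def newest_valid(ring: list):
    """On import, take the highest-txg slot whose checksum verifies."""
    best = None
    for slot in ring:
        if slot is None:
            continue
        txg, payload, digest = slot
        if digest == _digest(txg, payload) and (best is None or txg > best[0]):
            best = (txg, payload)
    return best

if __name__ == "__main__":
    ring = [None] * SLOTS
    for txg in range(1, 7):
        write_uberblock(ring, txg, b"root pointer for txg %d" % txg)
    ring[6 % SLOTS] = (6, b"torn", b"garbage")   # simulate a partial write
    print(newest_valid(ring))                    # falls back to txg 5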

In any event, as long as writes to the secondary copy don't continue after a 
write failure of the kind that you describe has occurred (save for the kind of 
catch-up procedure that you mention later), ZFS's internal facilities should 
not be confused by encountering a partial uberblock update at the secondary, 
any more than they'd be confused by encountering it on an unreplicated system 
after restart.

 
 2) Assuming that the replication maintains write-ordering, the
 secondary site will always have valid and self-consistent data,
 although it may be out-of-date compared to the primary if the
 replication is asynchronous, depending on link latency, buffering,
 etc.
 
 Normally most replication systems do maintain write ordering,
 [i]except[/i] for one specific scenario.  If the replication is
 interrupted, for example secondary site down or unreachable due to a
 comms problem, the primary site will keep a list of changed blocks.
 When contact between the sites is re-established there will be a
 period of 'catch-up' resynchronization.  In most, if not all, cases
 this is done on a simple block-order basis.  Write-ordering is lost
 until the two sites are once again in sync and routine replication
 restarts.
 
 I can see this as having major ZFS impact.  It would be possible for
 intermediate blocks to be replicated before the data blocks they
 point to, and in the worst case an updated überblock could be
 replicated before the block chains that it references have been
 copied.  This breaks the assumption that the on-disk format is
 always self-consistent.
 
 If a disaster happened during the 'catch-up', and the
 partially-resynchronized LUNs were imported into a zpool at the
 secondary site, what would/could happen?  Refusal to accept the whole
 zpool?  Rejection just of the files affected?  System panic?  How
 could recovery from this situation be achieved?

My inclination is to say "By repopulating your environment from backups":  it 
is not reasonable to expect *any* file system to operate correctly, or to 
attempt any kind of comprehensive recovery (other than via something like 
fsck, with no guarantee of how much you'll get back), when the underlying 
hardware transparently reorders updates which the file system has explicitly 
ordered when it presented them.

But you may well be correct in suspecting that there's more potential for 
data loss should this occur in a ZFS environment than in update-in-place 
environments where only portions of the tree structure that were explicitly 
changed during the connection hiatus would likely be affected by such a 
recovery interruption (though even there if a directory changed enough to 
change its block structure on disk you could be in more trouble).

 
 Obviously all filesystems can suffer with this scenario, but ones
 that expect less from their underlying storage (like UFS) can be
 fscked, and although data that was being updated is potentially
 corrupt, existing data should still be OK and usable.  My concern is
 that ZFS will handle this scenario less well.
 
 There are ways to mitigate this, of course, the most obvious being to
 take a snapshot of the (valid) secondary before starting resync, as a
 fallback.

You're talking about an HDS- or EMC-level snapshot, right?

 This isn't always easy to do, especially since the resync is usually
 automatic; there is no clear

Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
...

  Now it seems to me that without parity/replication, there's not much
  point in doing the scrubbing, because you could just wait for the
  error to be detected when someone tries to read the data for real.
  It's only if you can repair such an error (before the data is
  needed) that such scrubbing is useful.
 
 Pretty much

I think I've read (possibly in the 'MAID' descriptions) the contention that at 
least some unreadable sectors get there in stages, such that if you catch them 
early they will be only difficult to read rather than completely unreadable.  
In such a case, scrubbing is worthwhile even without replication, because it 
finds the problem early enough that the disk itself (or higher-level mechanisms 
if the disk gives up but the higher level is more persistent) will revector the 
sector when it finds it difficult (but not impossible) to read.
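
Here's a sketch of that idea; the device interface (a read that reports how 
many retries it needed, a rewrite that lets the drive remap the sector) is 
hypothetical, just to make the logic concrete:

import random

class FakeDisk:
    """Stand-in for a block device; not any real driver API."""
    def __init__(self, blocks=1000, marginal_fraction=0.01):
        self.data = {lba: b"x" * 512 for lba in range(blocks)}
        self.marginal = {lba for lba in self.data
                         if random.random() < marginal_fraction}

    def read_block(self, lba):
        retries = 3 if lba in self.marginal else 0   # marginal sectors need retries
        return self.data[lba], retries

    def write_block(self, lba, payload):
        self.data[lba] = payload
        self.marginal.discard(lba)                   # rewrite lets the drive remap it

def scrub_pass(disk, block_count):
    repaired = []
    for lba in range(block_count):
        payload, retries = disk.read_block(lba)
        if retries > 0:                 # difficult, but not yet impossible, to read
            disk.write_block(lba, payload)
            repaired.append(lba)
    return repaired

if __name__ == "__main__":
    disk = FakeDisk()
    print("proactively rewrote", len(scrub_pass(disk, 1000)), "marginal sectors")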

- bill
 
 


Re: [zfs-discuss] Yager on ZFS

2007-12-12 Thread can you guess?
 Hello can,
 
 Tuesday, December 11, 2007, 6:57:43 PM, you wrote:
 
 Monday, December 10, 2007, 3:35:27 AM, you wrote:

 cyg  and it made them slower
 
 cyg That's the second time you've claimed that, so you'll really at
 cyg least have to describe *how* you measured this even if the
 cyg detailed results of those measurements may be lost in the mists of time.
 
 cyg So far you don't really have much of a position to defend at
 cyg all:  rather, you sound like a lot of the disgruntled TOPS users
 cyg of that era.  Not that they didn't have good reasons to feel
 cyg disgruntled - but they frequently weren't very careful about aiming
 cyg their ire accurately.

 cyg Given that RMS really was *capable* of coming very close to the
 cyg performance capabilities of the underlying hardware, your
 cyg allegations just don't ring true.  Not being able to jump into

 And where is your proof that it was capable of coming very close to
 the...?
 
 cyg It's simple:  I *know* it, because I worked *with*, and *on*, it
 cyg - for many years.  So when some bozo who worked with people with
 cyg a major known chip on their shoulder over two decades ago comes
 cyg along and knocks its capabilities, asking for specifics (not even
 cyg hard evidence, just specific allegations which could be evaluated
 cyg and if appropriate confronted) is hardly unreasonable.
 
 Bill, you openly criticize people (their work) who have worked on ZFS
 for years... not that there's anything wrong with that, just please
 realize that because you were working on it it doesn't mean it is/was
 perfect - just the same as with ZFS.

Of course it doesn't - and I never claimed that RMS was anything close to 
'perfect' (I even gave specific examples of areas in which it was *far* from 
perfect).

Just as I've given specific examples of where ZFS is far from perfect.

What I challenged was David's assertion that RMS was severely deficient in its 
*capabilities* - and demanded not 'proof' of any kind but only specific 
examples (comparable in specificity to the examples of ZFS's deficiencies that 
*I* have provided) that could actually be discussed.

 I know, everyone loves their baby...

No, you don't know:  you just assume that everyone is as biased as you and 
others here seem to be.

 
 Nevertheless just because you were working on and with it, it's not a
 proof. The person you were replaying to was also working with it (but
 not on it I guess). Not that I'm interested in such a proof. Just
 noticed that you're demanding some proof, while you are also just
 write some statements on its performance without any actual proof.

You really ought to spend a lot more time understanding what you've read before 
responding to it, Robert.

I *never* asked for anything like 'proof':  I asked for *examples* specific 
enough to address - and repeated that explicitly in responding to your previous 
demand for 'proof'.  Perhaps I should at that time have observed that your 
demand for 'proof' (your use of quotes suggesting that it was something that 
*I* had demanded) was ridiculous, but I thought my response made that obvious.

 
 
 
 Let me use your own words:

 In other words, you've got nothing, but you'd like people to believe it's 
 something.

 The phrase Put up or shut up comes to mind.

 Where are your proofs on some of your claims about ZFS?
 
 cyg Well, aside from the fact that anyone with even half a clue
 cyg knows what the effects of uncontrolled file fragmentation are on
 cyg sequential access performance (and can even estimate those
 cyg effects within moderately small error bounds if they know what
 cyg the disk characteristics are and how bad the fragmentation is),
 cyg if you're looking for additional evidence that even someone
 cyg otherwise totally ignorant could appreciate there's the fact that
 
 I've never said there are not fragmentation problems with ZFS.

Not having made a study of your collected ZFS contributions here I didn't know 
that.  But some of ZFS's developers are on record stating that they believe 
there is no need to defragment (unless they've changed their views since and 
not bothered to make us aware of it), and in the entire discussion in the 
recent 'ZFS + DB + fragments' thread there were only three contributors 
(Roch, Anton, and I) who seemed willing to admit that any problem existed.

So since one of my 'claims' for which you requested substantiation involved 
fragmentation problems, it seemed appropriate to address them.

 Well, actually I've been hit by the issue in one environment.

But didn't feel any impulse to mention that during all the preceding 
discussion, I guess.

 Also you haven't done your homework properly, as one of the ZFS
 developers actually stated they are going to work on ZFS
 de-fragmentation and disk removal (pool shrinking).
 See http://www.opensolaris.org/jive/thread.jspa?messageID=139680

Hmmm - there were at least two Sun ZFS personnel participating in the database 
thread, and they never mentioned 

Re: [zfs-discuss] Yager on ZFS

2007-12-12 Thread can you guess?
...

 Bill - I don't think there's a point in continuing that discussion.

I think you've finally found something upon which we can agree.  I still 
haven't figured out exactly where on the stupid/intellectually dishonest 
spectrum you fall (lazy is probably out:  you have put some effort into 
responding), but it is clear that you're hopeless.

On the other hand, there's always the possibility that someone else learned 
something useful out of this.  And my question about just how committed you 
were to your ignorance has been answered.  It's difficult to imagine how 
someone so incompetent in the specific area that he's debating can be so 
self-assured - I suspect that just not listening has a lot to do with it - but 
also kind of interesting to see that in action.

- bill
 
 


Re: [zfs-discuss] Yager on ZFS

2007-12-11 Thread can you guess?
 Monday, December 10, 2007, 3:35:27 AM, you wrote:
 
 cyg  and it made them slower
 
 cyg That's the second time you've claimed that, so you'll really at
 cyg least have to describe *how* you measured this even if the
 cyg detailed results of those measurements may be lost in the mists of time.
 
 cyg So far you don't really have much of a position to defend at
 cyg all:  rather, you sound like a lot of the disgruntled TOPS users
 cyg of that era.  Not that they didn't have good reasons to feel
 cyg disgruntled - but they frequently weren't very careful about aiming
 cyg their ire accurately.
 
 cyg Given that RMS really was *capable* of coming very close to the
 cyg performance capabilities of the underlying hardware, your
 cyg allegations just don't ring true.  Not being able to jump into
 
 And where is your proof that it was capable of coming very close to
 the...?

It's simple:  I *know* it, because I worked *with*, and *on*, it - for many 
years.  So when some bozo who worked with people with a major known chip on 
their shoulder over two decades ago comes along and knocks its capabilities, 
asking for specifics (not even hard evidence, just specific allegations which 
could be evaluated and if appropriate confronted) is hardly unreasonable.

Hell, *I* gave more specific reasons why someone might dislike RMS in 
particular and VMS in general (complex and therefore user-unfriendly low-level 
interfaces and sometimes poor *default* performance) than David did:  they 
just didn't happen to match those that he pulled out of (wherever) and that I 
challenged.

 Let me use your own words:
 
 In other words, you've got nothing, but you'd like people to believe it's 
 something.
 
 The phrase Put up or shut up comes to mind.
 
 Where are your proofs on some of your claims about ZFS?

Well, anyone with even half a clue knows what the effects of uncontrolled file 
fragmentation are on sequential access performance (and can even estimate 
those effects within moderately small error bounds, given the disk 
characteristics and the degree of fragmentation).  If you're looking for 
additional evidence that even someone otherwise totally ignorant could 
appreciate, consider that Unix has been moving steadily toward less on-disk 
file fragmentation for over two decades:  first the efforts that FFS made to 
at least increase proximity and begin to remedy the complete disregard for 
contiguity that the early Unix file system displayed (and to which ZFS has 
apparently regressed), then the modifications that Kleiman and McVoy 
introduced in the early '90s to group 56 KB of blocks adjacently when 
possible, then the extent-based architectures of VxFS, XFS, JFS, and the 
upcoming ext4 (I'm probably missing others here).  Given how disk access times 
and bandwidth have changed relative to each other over the past decade and a 
half, ZFS with its maximum 128 KB blocks in splendid isolation offers 
significantly worse sequential performance, relative to what's attainable, 
than the systems that used 56 KB aggregates back then did (and they weren't 
all that great in that respect).
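
To make the magnitude concrete, here's some illustrative arithmetic with 
assumed (not measured) drive parameters - the point being how little of the 
attainable streaming bandwidth is left once every isolated 128 KB chunk costs 
a random access:

SEQ_BW   = 75e6    # ~75 MB/sec. sustained media bandwidth (assumed)
ACCESS_S = 0.010   # ~10 ms average seek + rotational latency (assumed)

def effective_mb_per_s(chunk_bytes):
    """Throughput when every chunk of this size costs one random access."""
    time_per_chunk = ACCESS_S + chunk_bytes / SEQ_BW
    return chunk_bytes / time_per_chunk / 1e6

for chunk in (56 * 1024, 128 * 1024, 1 << 20, 16 << 20):
    print("%6d KB chunks: %5.1f MB/sec." % (chunk >> 10, effective_mb_per_s(chunk)))

With those numbers, fully isolated 128 KB chunks deliver roughly 15% of the 
drive's streaming bandwidth, and you don't get most of it back until the 
chunks reach the multi-megabyte range.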

Given how slow Unix was to understand and start to deal with this issue, 
perhaps it's not surprising how ignorant some Unix people still are - despite 
the fact that other platforms fully understood the problem over three decades 
ago.

Last I knew, ZFS was still claiming that it needed nothing like 
defragmentation, while describing write-allocation mechanisms that could allow 
disastrous degrees of fragmentation under conditions that I've described quite 
clearly.  If ZFS made no efforts whatsoever in this respect the potential for 
unacceptable performance would probably already have been obvious even to its 
blindest supporters.  So I suspect that, when given the opportunity by a 
sequentially-writing application that doesn't force every write (or by use of 
the ZIL in some cases), ZFS aggregates the blocks of a file together in cache 
and destages them to disk in one contiguous chunk rather than just mixing 
blocks willy-nilly in its batch disk writes - and a lot of the time there's 
probably not enough other system write activity to make this infeasible.  That 
would explain why people haven't found sequential streaming performance to be 
all that bad most of the time (especially on the read end, if their systems 
are lightly loaded and it doesn't matter that their disks may be working a lot 
harder than they ought to have to).

But the potential remains for severe fragmentation under heavily parallel 
access conditions, or when a file is updated at fine grain but then read 
sequentially (the whole basis of the recent database thread), and with that 
fragmentation comes commensurate performance degradation.  And even if you're 
not capable of understanding why yourself, you should consider it significant 
that no one on the ZFS development team has piped up to say 

Re: [zfs-discuss] Yager on ZFS

2007-12-09 Thread can you guess?
...

   I remember trying to help customers move their applications from
   TOPS-20 to VMS, back in the early 1980s, and finding that the VMS
   I/O capabilities were really badly lacking.
 
  Funny how that works:  when you're not familiar with something, you
  often mistake your own ignorance for actual deficiencies.  Of course,
  the TOPS-20 crowd was extremely unhappy at being forced to migrate at
  all, and this hardly improved their perception of the situation.
 
  If you'd like to provide specifics about exactly what was supposedly
  lacking, it would be possible to evaluate the accuracy of your
  recollection.
 
 I've played this game before, and it's off-topic and too much work to
 be worth it.

In other words, you've got nothing, but you'd like people to believe it's 
something.

The phrase Put up or shut up comes to mind.

  Researching exactly when specific features
 were released into 
 VMS RMS from this distance would be a total pain,

I wasn't asking for anything like that:  I was simply asking for specific 
examples of the VMS I/O capabilities that you allegedly 'found' were really 
badly lacking in the early 1980s.  Even if the porting efforts you were 
involved in predated the pivotal cancellation of Jupiter in 1983, that was 
still close enough to the VMS cluster release that most VMS development effort 
had turned in that direction (i.e., the single-system VMS I/O subsystem had 
pretty well reached maturity), so there won't be any need to quibble about what 
shipped when.

Surely if you had a sufficiently strong recollection to be willing to make such 
a definitive assertion you can remember *something* specific.

 and then we'd argue about which ones were beneficial for which
 situations, which people didn't much agree about then or since.

No, no, no:  you're reading far more generality into this than I ever 
suggested.  I'm not asking you to judge what was useful, and I couldn't care 
less whether you thought the features that VMS had and TOPS lacked were 
valuable:  I'm just asking you to be specific about what VMS I/O capabilities 
you claim were seriously deficient.

  My experience at the time was that RMS was another layer of
  abstraction and performance loss between the application and the OS,

Ah - your 'experience'.  So you actually measured RMS's effect on performance, 
rather than just SWAGged that adding a layer that you found unappealing in a 
product that your customers were angry about having to move to Must Be A Bad 
Idea?  What was the quantitative result of that measurement, and how was RMS 
configured for the relevant workload?  After all, the extra layer wasn't 
introduced just to give you something to complain about:  it was there to 
provide additional features and configuration flexibility (much of it 
performance-related), as described above.  If you didn't take advantage of 
those facilities, that could be a legitimate *complexity* knock against the 
environment but it's not a legitimate *capability* or *performance* knock 
(rather the opposite, in fact).

 and it made it harder to
 do things

If you were using the RMS API itself rather than accessing RMS through a 
higher-level language that provided simple I/O handling for simple I/O needs, 
that was undoubtedly the case:  as I observed above, that's a price that VMS 
was happy to pay for providing complete control to applications that wanted it. 
 RMS was designed from the start to provide that alternative with the 
understanding that access via higher-level language mechanisms would usually be 
used by those people who didn't need the low-level control that the native RMS 
API provided.

 and it 
 made them slower

That's the second time you've claimed that, so you'll really at least have to 
describe *how* you measured this even if the detailed results of those 
measurements may be lost in the mists of time.

 and it made files less interchangeable between applications;

That would have been some trick, given that RMS supported pure byte-stream 
files as well as its many more structured types (and I'm pretty sure that the C 
run-time system took this approach, using RMS direct I/O and doing its own 
deblocking to ensure that some of the more idiomatic C activities like 
single-character reads and writes would not inadvertently perform poorly).  So 
at worst you could have used precisely the same in-file formats that were being 
used in the TOPS-20 environment and achieved the same degree of portability 
(unless you were actually encountering peculiarities in language access rather 
than in RMS itself:  I'm considerably less familiar with that end of the 
environment).

 but I'm not interested in trying to defend this position for weeks
 based on 25-year-old memories.

So far you don't really have much of a position to defend at all:  rather, you 
sound like a lot of the disgruntled TOPS users of that era.  Not that they 
didn't have good reasons to feel disgruntled - but 

Re: [zfs-discuss] Yager on ZFS

2007-12-09 Thread can you guess?
 why don't you put your immense experience and knowledge to contribute
 to what is going to be the next and only filesystems in modern
 operating systems,

Ah - the pungent aroma of teenage fanboy wafts across the Net.

ZFS is not nearly good enough to become what you suggest above, nor is it 
amenable to some of the changes necessary to make it good enough.  So while I'm 
happy to give people who have some personal reason to care about it pointers on 
how it could be improved, I have no interest in working on it myself.

 instead of
 spending your time asking for specifics

You'll really need to learn to pay a lot more attention to specifics yourself 
if you have any desire to become technically competent when you grow up.

 and
  treating everyone of
 ignorant

I make some effort only to treat the ignorant as ignorant.  It's hardly my 
fault that they are so common around here, but I'd like to think that there's a 
silent majority of more competent individuals in the forum who just look on 
quietly (and perhaps somewhat askance).

It used to be that the ignorant felt motivated to improve themselves, but now 
they seem more inclined to engage in aggressive denial (which may be easier on 
the intellect but seems a less productive use of energy).

- bill
 
 


Re: [zfs-discuss] OT: NTFS Single Instance Storage (Re: Yager on ZFS

2007-12-08 Thread can you guess?
 [EMAIL PROTECTED] wrote:
  Darren,
 
   Do you happen to have any links for this?  I have not seen anything
   about NTFS and CAS/dedupe besides some of the third party
   apps/services that just use NTFS as their backing store.
 
 Single Instance Storage is what Microsoft uses to refer to this:
 
 http://research.microsoft.com/sn/Farsite/WSS2000.pdf

While SIS is likely useful in certain environments, it is actually layered on 
top of NTFS rather than part of it - and in fact could in principle be layered 
on top of just about any underlying file system in any OS that supported 
layered 'filter' drivers.  File access to a shared file via SIS runs through 
an additional phase of directory look-up, similar to that involved in 
following a symbolic link.  Its described copy-on-close semantics require 
divided data access within the updater's version of the file (fetching 
unchanged data from the shared copy and changed data from the 
to-be-fleshed-out-after-close copy), with apparently no mechanism to avoid 
copying the entire file after close even if only a single byte within it has 
been changed - which could compromise its applicability in some environments.

Nonetheless, unlike most dedupe products it does apply to on-line rather than 
backup storage, and Microsoft deserves credit for fielding it well in advance 
of the dedupe startups:  once in a while they actually do produce something 
that qualifies as at least moderately innovative.  NTFS was at least 
respectable if not ground-breaking as well when it first appeared, and it's too 
bad that it has largely stagnated since while MS pursued its 'structured 
storage' and similar dreams (one might suspect in part to try to create a de 
facto storage standard that competitors couldn't easily duplicate, limiting the 
portability of applications built to take advantage of its features without 
attracting undue attention from trust-busters, such as they are these days - 
but perhaps I'm just too cynical).

- bill
 
 


Re: [zfs-discuss] Yager on ZFS

2007-12-08 Thread can you guess?
 from the description here
 
 http://www.djesys.com/vms/freevms/mentor/rms.html
 
 so who cares here ?
 
 RMS is not a filesystem, but more a CAS type of data repository

Since David begins his description with the statement "RMS stands for Record 
Management Services.  It is the underlying file system of OpenVMS", I'll 
suggest that your citation fails a priori to support your allegation above.

Perhaps you're confused by the fact that RMS/Files-11 is a great deal *more* of 
a file system than most Unix examples (though ReiserFS was at least heading in 
somewhat similar directions).  You might also be confused by the fact that VMS 
separates its file system facilities into an underlying block storage and 
directory layer specific to disk storage and the upper RMS 
deblocking/interpretation/pan-device layer, whereas Unix combines the two.

Better acquainting yourself with what CAS means in the context of contemporary 
disk storage solutions might be a good idea as well, since it bears no relation 
to RMS (nor to virtually any Unix file system).

- bill
 
 


Re: [zfs-discuss] Mail system errors (On Topic).

2007-12-08 Thread can you guess?
 Yet another prime example.

Ah - yet another brave denizen (and top-poster) who's more than happy to dish 
it out but squeals for administrative protection when receiving a response in 
kind.

The fact that your pleas seem to be going unanswered actually reflects rather 
well on whoever is managing this forum:  even if they don't particularly care 
for my attitude, they appear to recognize that there's a good reason why I deal 
with some of you as I have.

Do have a nice day.

- bill
 
 


Re: [zfs-discuss] Yager on ZFS

2007-12-08 Thread can you guess?
 can you run a database on RMS?

As well as you could on most Unix file systems.  And you've been able to do so 
for almost three decades now (whereas features like asynchronous and direct 
I/O are relative newcomers in the Unix environment).

 I guess its not suited

And you guess wrong:  that's what happens when you speak from ignorance rather 
than from something more substantial.

 we are already trying to get rid of a 15 years old filesystem called
 wafl,

Whatever for?  Please be specific about exactly what you expect will work 
better with whatever you're planning to replace it with - and why you expect it 
to be anywhere nearly as solid.

 and a 10 years old file system called
 Centera,

My, you must have been one of the *very* early adopters, since EMC launched it 
only 5 1/2 years ago.

 so do you think we are going to consider a 35 years old filesystem
 now... computer science made a lot of improvement since

Well yes, and no.  For example, most Unix platforms are still struggling to 
match the features which VMS clusters had over two decades ago:  when you start 
as far behind as Unix did, even continual advances may still not be enough to 
match such 'old' technology.

Not that anyone was suggesting that you replace your current environment with 
RMS:  if it's your data, knock yourself out using whatever you feel like using. 
 On the other hand, if someone else is entrusting you with *their* data, they 
might be better off looking for someone with more experience and sense.

- bill
 
 


Re: [zfs-discuss] Yager on ZFS

2007-12-08 Thread can you guess?
 can you guess? wrote:
  can you run a database on RMS?
 
  As well as you could on most Unix file systems.  And you've been
  able to do so for almost three decades now (whereas features like
  asynchronous and direct I/O are relative newcomers in the Unix
  environment).
 
 Funny, I remember trying to help customers move their applications
 from TOPS-20 to VMS, back in the early 1980s, and finding that the
 VMS I/O capabilities were really badly lacking.

Funny how that works:  when you're not familiar with something, you often 
mistake your own ignorance for actual deficiencies.  Of course, the TOPS-20 
crowd was extremely unhappy at being forced to migrate at all, and this hardly 
improved their perception of the situation.

If you'd like to provide specifics about exactly what was supposedly lacking, 
it would be possible to evaluate the accuracy of your recollection.

  RMS was an abomination -- nothing but trouble,

Again, specifics would allow an assessment of that opinion.

 and another layer to keep you away from your data.

Real men use raw disks, of course.  And with RMS (unlike Unix systems of that 
era) you could get very close to that point if you wanted to without abandoning 
the file level of abstraction - or work at a considerably more civilized level 
if you wanted that with minimal sacrifice in performance (again, unlike the 
Unix systems of that era, where storage performance was a joke until FFS began 
to improve things - slowly).

VMS and RMS represented a very different philosophy than Unix:  you could do 
anything, and therefore were exposed to the complexity that this flexibility 
entailed.  Unix let you do things one simple way - whether it actually met your 
needs or not.

Back then, efficient use of processing cycles (even in storage applications) 
could be important - and VMS and RMS gave you that option.  Nowadays, trading 
off cycles to obtain simplicity is a lot more feasible, and the reasons for the 
complex interfaces of yesteryear can be difficult to remember.

- bill
 
 


Re: [zfs-discuss] Yager on ZFS

2007-12-07 Thread can you guess?
  You have me at a disadvantage here, because I'm not even a Unix (let
  alone Solaris and Linux) aficionado.  But don't Linux snapshots in
  conjunction with rsync (leaving aside other possibilities that I've
  never heard of) provide rather similar capabilities (e.g.,
  incremental backup or re-synching), especially when used in
  conjunction with scripts and cron?
 
 Which explains why you keep ranting without knowing what you're
 talking about.

Au contraire, cookie:  I present things in detail to make it possible for 
anyone capable of understanding the discussion to respond substantively if 
there's something that requires clarification or further debate.

You, by contrast, babble on without saying anything substantive at all - which 
makes you kind of amusing, but otherwise useless.  You could at least have 
tried to answer my question above, since you took the trouble to quote it - but 
of course you didn't, just babbled some more.

  Copy-on-write.  Even a
 bookworm with 0 real-life-experience should be able
 to apply this one to a working situation.  

As I may well have been designing and implementing file systems since before 
you were born (or not:  you just have a conspicuously callow air about you), my 
'real-life' experience with things like COW is rather extensive.  And while I 
don't have experience with Linux adjuncts like rsync, unlike some people I'm 
readily able to learn from the experience of others (who seem far more credible 
when describing their successful use of rsync and snapshots on Linux than 
anything I've seen you offer up here).

 
 There's a reason ZFS (and netapp) can take snapshots
 galore without destroying their filesystem
 performance.

Indeed:  it's because ZFS already sacrificed a significant portion of that 
performance by disregarding on-disk contiguity, so there's relatively little 
left to lose.  By contrast, systems that respect the effects of contiguity on 
performance (and WAFL does to a greater degree than ZFS) reap its benefits all 
the time (whether snapshots exist or not) while only paying a penalty when data 
is changed (and they don't have to change as much data as ZFS does because they 
don't have to propagate changes right back to the root superblock on every 
update).

It is possible to have nearly all of the best of both worlds, but unfortunately 
not with any current implementations that I know of.  ZFS could at least come 
considerably closer, though, if it reorganized opportunistically as discussed 
in the database thread.

(By the way, since we're talking about snapshots here rather than about clones 
it doesn't matter at all how many there are, so your 'snapshots galore' bluster 
above is just more evidence of your technical incompetence:  with any 
reasonable implementation the only run-time overhead occurs in keeping the most 
recent snapshot up to date, regardless of how many older snapshots may also be 
present.)

But let's see if you can, for once, actually step up to the plate and discuss 
something technically, rather than spout buzzwords that you apparently don't 
come even close to understanding:

Are you claiming that writing snapshot before-images of modified data (as, 
e.g., Linux LVM snapshots do) for the relatively brief period that it takes to 
transfer incremental updates to another system 'destroys' performance?  First 
of all, that's clearly dependent upon the update rate during that interval, so 
if it happens at a quiet time (which presumably would be arranged if its 
performance impact actually *was* a significant issue) your assertion is 
flat-out-wrong.  Even if the snapshot must be processed during normal 
operation, maintaining it still won't be any problem if the run-time workload 
is read-dominated.

And I suppose Sun must be lying in its documentation for fssnap (which Sun has 
offered since Solaris 8 with good old update-in-place UFS) where it says 
"While the snapshot is active, users of the file system might notice a slight 
performance impact [as contrasted with your contention that performance is 
'destroyed'] when the file system is written to, but they see no impact when 
the file system is read" 
(http://docsun.cites.uiuc.edu/sun_docs/C/solaris_9/SUNWaadm/SYSADV1/p185.html). 
 You'd really better contact them right away and set them straight.

Normal system cache mechanisms should typically keep about-to-be-modified data 
around long enough to avoid the need to read it back in from disk to create the 
before-image for modified data used in a snapshot, and using a log-structured 
approach to storing these BIs in the snapshot file or volume (though I don't 
know what specific approaches are used in fssnap and LVM:  do you?) would be 
extremely efficient - resulting in minimal impact on normal system operation 
regardless of write activity.
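
For concreteness, here's a minimal sketch of that before-image style of 
snapshot.  This is emphatically not the actual fssnap or LVM implementation - 
it just assumes the log-structured save area suggested above, and shows why 
only the first overwrite of a given block after the snapshot costs anything 
extra:

class BeforeImageSnapshot:
    def __init__(self, volume: dict):
        self.volume = volume        # live block -> data mapping
        self.saved = {}             # block -> contents as of snapshot time
        self.log = []               # append-only "save area" (log-structured)

    def write(self, block: int, data: bytes) -> None:
        # First overwrite of a block since the snapshot: save the old
        # contents once, sequentially, before letting the write proceed.
        if block not in self.saved:
            old = self.volume.get(block, b"")
            self.saved[block] = old
            self.log.append((block, old))
        self.volume[block] = data   # reads of the live volume are untouched

    def read_snapshot(self, block: int) -> bytes:
        # Snapshot view: saved before-image if the block changed, else live data.
        return self.saved.get(block, self.volume.get(block, b""))

if __name__ == "__main__":
    vol = {0: b"old-0", 1: b"old-1"}
    snap = BeforeImageSnapshot(vol)
    snap.write(0, b"new-0")
    print(vol[0], snap.read_snapshot(0))   # b'new-0' b'old-0'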

C'mon, cookie:  surprise us for once - say something intelligent.  With 
guidance and practice, you might even be able to make a habit of it.

- bill
 
 

Re: [zfs-discuss] Mail system errors (On Topic).

2007-12-07 Thread can you guess?
 
 I keep getting ETOOMUCHTROLL errors thrown while
 reading this list,  is
 there a list admin that can clean up the mess?   I
 would hope that repeated
 personal attacks could be considered grounds for
 removal/blocking.

Actually, most of your more unpleasant associates here seem to suffer primarily 
from blind and misguided loyalty and/or an excess of testosterone - so there's 
always hope that they'll grow up over time and become productive contributors.  
And if I'm not complaining about their attacks but just dealing with them in 
kind while carrying on more substantive conversations, it's not clear that they 
should pose a serious problem for others.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-07 Thread can you guess?
Once again, profuse apologies for having taken so long (well over 24 hours by 
now - though I'm not sure it actually appeared in the forum until a few hours 
after its timestamp) to respond to this.

 can you guess? wrote:
 
  Primarily its checksumming features, since other
 open source solutions support simple disk scrubbing
 (which given its ability to catch most deteriorating
 disk sectors before they become unreadable probably
 has a greater effect on reliability than checksums in
 any environment where the hardware hasn't been
 slapped together so sloppily that connections are
 flaky).

 From what I've read on the subject, that premise seems bad from the
 start.

Then you need to read more or understand it better.

  I don't believe that scrubbing will catch all
 the types of 
 errors that checksumming will.

That's absolutely correct, but it in no way contradicts what I said (and you 
quoted) above.  Perhaps you should read that again, more carefully:  it merely 
states that disk scrubbing probably has a *greater* effect on reliability than 
checksums do, not that it completely subsumes their features.

 There is a category
 of errors that are 
 not caused by firmware, or any type of software. The
 hardware just 
 doesn't write or read the correct bit value this time
 around. Without a 
 checksum there's no way for the firmware to know, and
 next time it very 
 well may write or read the correct bit value from the
 exact same spot on 
 the disk, so scrubbing is not going to flag this
 sector as 'bad'.

It doesn't have to, because that's a *correctable* error that the disk's 
extensive correction codes (which correct *all* single-bit errors as well as 
most considerably longer error bursts) resolve automatically.

 
 Now you may claim that this type of error happens so
 infrequently

No, it's actually one of the most common forms, due to the desire to pack data 
on the platter as tightly as possible:  that's why those long correction codes 
were created.

Rather than comment on the rest of your confused presentation about disk error 
rates, I'll just present a capsule review of the various kinds:

1.  Correctable errors (which I just described above).  If a disk notices that 
a sector *consistently* requires correction it may deal with it as described in 
the next paragraph.

2.  Errors that can be corrected only with retries (i.e., the sector is not 
*consistently* readable even after the ECC codes have been applied, but can be 
successfully read after multiple attempts which can do things like fiddle 
slightly with the head position over the track and signal amplification to try 
to get a better response).  A disk may try to rewrite such a sector in place to 
see if its readability improves as a result, and if it doesn't will then 
transparently revector the data to a spare sector if one exists and mark the 
original sector as 'bad'.  Background scrubbing gives the disk an opportunity 
to discover such sectors *before* they become completely unreadable, thus 
significantly improving reliability even in non-redundant environments.

3.  Uncorrectable errors (bursts too long for the ECC codes to handle even 
after the kinds of retries described above, but which the ECC codes can still 
detect):  scrubbing catches these as well, and if suitable redundancy exists it 
can correct them by rewriting the offending sector (the disk may transparently 
revector it if necessary, or the LVM or file system can if the disk can't).  
Disk vendor specs nominally state that one such error may occur for every 10^14 
bits transferred for a contemporary commodity (ATA or SATA) drive (i.e., about 
once in every 12.5 TB), but studies suggest that in practice they're much rarer.

4.  Undetectable errors (errors which the ECC codes don't detect but which 
ZFS's checksums presumably would).  Disk vendors no longer provide specs for 
this reliability metric.  My recollection from a decade or more ago is that 
back when they used to it was three orders of magnitude lower than the 
uncorrectable error rate:  if that still obtained it would mean about once in 
every 12.5 petabytes transferred, but given that the real-world incidence of 
uncorrectable errors is so much lower than spec'd and that ECC codes keep 
increasing in length it might be far lower than that now.
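
For reference, the arithmetic behind those figures (the 10^17 number is only
the hypothetical "three orders of magnitude lower" recollection above, not a
published vendor spec):

    # Back-of-the-envelope conversion of the quoted error rates.
    BITS_PER_TB = 8e12                      # 10^12 bytes per (decimal) TB

    uncorrectable_tb = 1e14 / BITS_PER_TB   # ~12.5 TB per uncorrectable error
    undetectable_pb = 1e17 / BITS_PER_TB / 1e3   # ~12.5 PB (hypothetical)

    print(f"~1 uncorrectable error per {uncorrectable_tb:.1f} TB transferred")
    print(f"~1 undetectable error per {undetectable_pb:.1f} PB (if the old "
          f"1000x ratio still held)")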

...

  Aside from the problems that scrubbing handles (and
 you need scrubbing even if you have checksums,
 because scrubbing is what helps you *avoid* data loss
 rather than just discover it after it's too late to
 do anything about it), and aside from problems 
 Again I think you're wrong on the basis for your
 point.

No:  you're just confused again.

 The checksumming 
 in ZFS (if I understand it correctly) isn't used for
 only detecting the 
 problem. If the ZFS pool has any redundancy at all,
 those same checksums 
 can be used to repair that same data, thus *avoiding*
 the data loss.

1.  Unlike things like disk ECC codes, ZFS's checksums can't

Re: [zfs-discuss] zfs rollback without unmounting a file system

2007-12-07 Thread can you guess?
Allowing a filesystem to be rolled back without unmounting it sounds unwise, 
given the potentially confusing effect on any application with a file currently 
open there.

And if a user can't roll back their home directory filesystem, is that so bad?  
Presumably they can still access snapshot versions of individual files or even 
entire directory sub-trees and copy them to their current state if they want to 
- or whistle up someone else to perform a rollback of their home directory if 
they really need to.

I'm not normally one to advocate protecting users from themselves, but I do 
think that applications have some right to expect certain guarantees about 
stability as long as they have a file open (and that the system should 
terminate that access if it can't sustain those guarantees).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-07 Thread can you guess?
 So name these mystery alternatives that come anywhere
 close to the protection,

If you ever progress beyond counting on your fingers you might (with a lot of 
coaching from someone who actually cares about your intellectual development) 
be able to follow Anton's recent explanation of this (given that the 
higher-level overviews which I've provided apparently flew completely over your 
head).

 functionality,

I discussed that in detail elsewhere here yesterday (in more detail than 
previously in an effort to help the slower members of the class keep up).

 and ease of
 use

That actually may be a legitimate (though hardly decisive) ZFS advantage:  it's 
too bad its developers didn't extend it farther (e.g., by eliminating the 
vestiges of LVM redundancy management and supporting seamless expansion to 
multi-node server systems).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-06 Thread can you guess?
 apologies in advance for prolonging this thread ..

Why do you feel any need to?  If you were contributing posts as completely 
devoid of technical content as some of the morons here have recently been 
submitting I could understand it, but my impression is that the purpose of this 
forum is to explore the kind of questions that you're interested in discussing.

 i
 had considered  
 taking this completely offline, but thought of a few
 people at least  
 who might find this discussion somewhat interesting

And any who don't are free to ignore it, so no harm done there either.

 .. at the least i  
 haven't seen any mention of Merkle trees yet as the
 nerd in me yearns  
 for

I'd never heard of them myself until recently, despite having come up with the 
idea independently to use a checksumming mechanism very similar to ZFS's.  
Merkle seems to be an interesting guy - his home page is worth a visit.

 
 On Dec 5, 2007, at 19:42, bill todd - aka can you
 guess? wrote:
 
  what are you terming as ZFS' incremental risk
 reduction? ..  
  (seems like a leading statement toward a
 particular assumption)
 
  Primarily its checksumming features, since other
 open source  
  solutions support simple disk scrubbing (which
 given its ability to  
  catch most deteriorating disk sectors before they
 become unreadable  
  probably has a greater effect on reliability than
 checksums in any  
  environment where the hardware hasn't been slapped
 together so  
  sloppily that connections are flaky).
 
 ah .. okay - at first reading incremental risk
 reduction seems to  
 imply an incomplete approach to risk

The intent was to suggest a step-wise approach to risk, where some steps are 
far more significant than others (though of course some degree of overlap 
between steps is also possible).

*All* approaches to risk are incomplete.

 ...

 i do  
 believe that an interesting use of the merkle tree
 with a sha256 hash  
 is somewhat of an improvement over conventional
 volume based data  
 scrubbing techniques

Of course it is:  that's why I described it as 'incremental' rather than as 
'redundant'.  The question is just how *significant* an improvement it offers.

 since there can be a unique
 integration between  
 the hash tree for the filesystem block layout and a
 hierarchical data  
 validation method.  In addition to the finding
 unknown areas with the  
 scrub, you're also doing relatively inexpensive data
 validation  
 checks on every read.

Yup.

...
 
 sure - we've seen many transport errors,

I'm curious what you mean by that, since CRCs on the transports usually 
virtually eliminate them as problems.  Unless you mean that you've seen many 
*corrected* transport errors (indicating that the CRC and retry mechanisms are 
doing their job and that additional ZFS protection in this area is probably 
redundant).

 as well as
 firmware  
 implementation errors

Quantitative and specific examples are always good for this kind of thing; the 
specific hardware involved is especially significant to discussions of the sort 
that we're having (given ZFS's emphasis on eliminating the need for much 
special-purpose hardware).

 .. in fact with many arrays
 we've seen data  
 corruption issues with the scrub

I'm not sure exactly what you're saying here:  is it that the scrub has 
*uncovered* many apparent instances of data corruption (as distinct from, e.g., 
merely unreadable disk sectors)?

 (particularly if the
 checksum is  
 singly stored along with the data block)

Since (with the possible exception of the superblock) ZFS never stores a 
checksum 'along with the data block', I'm not sure what you're saying there 
either.
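
For anyone following along, here is a minimal Python sketch of the hash-tree
(Merkle) arrangement being discussed, in which each block's checksum is kept
in its *parent* rather than next to the block itself.  It illustrates the
concept only and is not ZFS's actual on-disk format:

    import hashlib

    def checksum(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    class Node:
        def __init__(self, data=None, children=None):
            self.data = data                 # leaf: user data
            self.children = children or []   # interior: [(checksum, child)]

        def serialize(self) -> bytes:
            if self.data is not None:
                return self.data
            # An interior node's contents are its children's checksums, so
            # the root checksum covers the entire tree.
            return b"".join(cksum for cksum, _ in self.children)

    def build_parent(children):
        return Node(children=[(checksum(c.serialize()), c) for c in children])

    def verify(node, expected):
        if checksum(node.serialize()) != expected:
            return False
        return all(verify(child, cksum) for cksum, child in node.children)

    leaves = [Node(data=b"block-0"), Node(data=b"block-1")]
    root = build_parent(leaves)
    root_cksum = checksum(root.serialize())
    assert verify(root, root_cksum)

    leaves[0].data = b"silently corrupted"   # corrupt one leaf in place...
    assert not verify(root, root_cksum)      # ...and verification catches it

Because the checksum and the data it covers live in different blocks, a single
misdirected or corrupted write can't take both out together.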

 -  just like
 spam you really  
 want to eliminate false positives that could indicate
 corruption  
 where there isn't any.

The only risk that ZFS's checksums run is the infinitesimal possibility that 
corruption won't be detected, not that they'll return a false positive.
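
(For scale: with a 256-bit checksum the chance that a random corruption
happens to produce a matching value is about 2^-256 - effectively zero:)

    p_miss = 2.0 ** -256     # probability a random corruption still matches
    print(p_miss)            # ~8.6e-78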

  if you take some time to read
 the on disk  
 format for ZFS you'll see that there's a tradeoff
 that's done in  
 favor of storing more checksums in many different
 areas instead of  
 making more room for direct block pointers.

While I haven't read that yet, I'm familiar with the trade-off between using 
extremely wide checksums (as ZFS does - I'm not really sure why, since 
cryptographic-level security doesn't seem necessary in this application) and 
limiting the depth of the indirect block tree.  But (yet again) I'm not sure 
what you're trying to get at here.

...

 on this list we've seen a number of consumer
 level products  
 including sata controllers, and raid cards (which are
 also becoming  
 more commonplace in the consumer realm) that can be
 confirmed to  
 throw data errors.

Your phrasing here is a bit unusual ('throwing errors' - or exceptions - is not 
commonly related to corrupting data).  If you're referring to some kind of 
silent data corruption, once again specifics are important:  to put

Re: [zfs-discuss] Yager on ZFS

2007-12-06 Thread can you guess?
 can you guess? wrote:

  There aren't free alternatives in linux or freebsd
  that do what zfs does, period.
  
 
  No one said that there were:  the real issue is
 that there's not much reason to care, since the
 available solutions don't need to be *identical* to
 offer *comparable* value (i.e., they each have
 different strengths and weaknesses and the net result
 yields no clear winner - much as some of you would
 like to believe otherwise).
 

 I see you carefully snipped "You would think the fact
 zfs was ported to freebsd so quickly would've been a good first
 indicator that the functionality wasn't already there."  A point you
 appear keen to avoid discussing.

Hmmm - do I detect yet another psychic-in-training here?  Simply ignoring 
something that one considers irrelevant does not necessarily imply any active 
desire to *avoid* discussing it.

I suspect that whoever ported ZFS to FreeBSD was a fairly uncritical enthusiast 
just as so many here appear to be (and I'll observe once again that it's very 
easy to be one, because ZFS does sound impressive until you really begin 
looking at it closely).  Not to mention the fact that open-source operating 
systems often gather optional features more just because they can than because 
they necessarily offer significant value:  all it takes is one individual who 
(for whatever reason) feels like doing the work.

Linux, for example, is up to its ears in file systems, all of which someone 
presumably felt it worthwhile to introduce there.  Perhaps FreeBSD proponents 
saw an opportunity to narrow the gap in this area (especially since 
incorporating ZFS into Linux appears to have licensing obstacles).

In any event, the subject under discussion here is not popularity but utility - 
*quantifiable* utility - and hence the porting of ZFS to FreeBSD is not 
directly relevant.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-06 Thread can you guess?
(Can we
 declare this thread
 dead already?)

Many have already tried, but it seems to have a great deal of staying power.  
You, for example, have just contributed to its continued vitality.

 
 Others seem to care.
 
  *identical* to offer *comparable* value (i.e., they
 each have
  different strengths and weaknesses and the net
 result yields no clear
  winner - much as some of you would like to believe
 otherwise).
 
 Interoperability counts for a lot for some people.

Then you'd better work harder on resolving the licensing issues with Linux.

  Fewer filesystems to
 learn about can count too.

And since ZFS differs significantly from its more conventional competitors, 
that's something of an impediment to acceptance.

 
 ZFS provides peace of mind that you tell us doesn't
 matter.

Sure it matters, if it gives that to you:  just don't pretend that it's of any 
*objective* significance, because *that* requires actual quantification.

  And it's
 actively developed and you and everyone else can see
 that this is so,

Sort of like ext2/3/4, and XFS/JFS (though the latter have the advantage of 
already being very mature, hence need somewhat less 'active' development).

 and that recent ZFS improvements and others that are
 in the pipe (and
 discussed publicly) are very good improvements, which
 all portends an
 even better future for ZFS down the line.

Hey, it could even become a leadership product someday.  Or not - time will 
tell.

 
 Whatever you do not like about ZFS today may be fixed
 tomorrow,

There'd be more hope for that if its developers and users seemed less obtuse.

 except
 for the parts about it being ZFS, opensource,
 Sun-developed, ..., the
 parts that really seem to bother you.

Specific citations of material that I've posted that gave you that impression 
would be useful:  otherwise, you just look like another self-professed psychic 
(is this a general characteristic of Sun worshipers, or just of ZFS fanboys?).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-06 Thread can you guess?
 can you guess? wrote:
  There aren't free alternatives in linux or freebsd
  that do what zfs does, period.
  
 
  No one said that there were:  the real issue is
 that there's not much reason to care, since the
 available solutions don't need to be *identical* to
 offer *comparable* value (i.e., they each have
 different strengths and weaknesses and the net result
 yields no clear winner - much as some of you would
 like to believe otherwise).
 Ok. So according to you, most of what ZFS does is
 available elsewhere, 
  and the features it has that nothing else has aren't
  really a value add, 
  at least not enough to produce a 'clear winner'. Ok,
 assume for a second 
 that I believe that.

Unlike so many here I don't assume things lightly - and this one seems 
particularly shaky.

 can you list one other software
 raid/filesystem 
 that as any feature (small or large) that ZFS lacks?

Well, duh.

 
 Because if all else is really equal, and ZFS is the
 only one with any 
 advantages then, whether those advantages are small
 or not (and I don't 
 agree with how small you think they are - see my
 other post that you've 
 ignored so far.)

Sorry - I do need to sleep sometimes.  But I'll get right to it, I assure you 
(or at worst soon:  time has gotten away from me again and I've got an 
appointment to keep this afternoon).

 I think there is a 'clear winner' -
 at least at the 
 moment - Things can change at any time.

You don't get out much, do you?

How does ZFS fall short of other open-source competitors (I'll limit myself to 
them, because once you get into proprietary systems - and away from the quaint 
limitations of Unix file systems - the list becomes utterly unmanageable)?  Let 
us count the ways (well, at least the ones that someone as uninformed as I am 
about open-source features can come up with off the top of his head), starting 
in the volume-management arena:

1.  RAID-Z, as I've explained elsewhere, is brain-damaged when it comes to 
effective disk utilization for small accesses (especially reads):  RAID-5 
offers the same space efficiency with N times the throughput for such workloads 
(used to be provided by mdadm on Linux unless the Linux LVM now supports it 
too); see the rough arithmetic sketch after this list.

2.  DRBD on Linux supports remote replication (IIRC there was an earlier, 
simpler mechanism that also did).

3.  Can you yet shuffle data off a disk such that it can be removed from a 
zpool?  LVM on Linux supports this.

4.  Last I knew, you couldn't change the number of disks in a RAID-Z stripe at 
all, let alone reorganize existing stripe layout on the fly.  Typical hardware 
RAIDs can do this and I thought that Linux RAID support could as well, but 
can't find verification now - so I may have been remembering a proposed 
enhancement.

And in the file system arena:

5.  No user/group quotas?  What *were* they thinking?  The discussions about 
quotas here make it clear that per-filesystem quotas are not an adequate 
alternative:  leaving aside the difficulty of simulating both user *and* group 
quotas using that mechanism, using it raises mount problems when very large 
numbers of users are involved, plus hard-link and NFS issues crossing mount 
points.

6.  ZFS's total disregard of on-disk file contiguity can torpedo 
sequential-access performance by well over a decimal order of magnitude in 
situations where files either start out severely fragmented (due to heavily 
parallel write activity during their population) or become so due to 
fine-grained random updates.

7.  ZFS's full-path COW approach increases the space overhead of snapshots 
compared with conventional file systems.

8.  Not available on Linux.

Damn - I've got to run.  Perhaps others more familiar with open-source 
alternatives will add to this list while I'm out.
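
Regarding item 1, here is the rough arithmetic behind the "N times the
throughput" claim for small random reads, under illustrative assumptions of a
4+1 group and about 100 random IOPS per drive:

    disks_per_group = 5       # 4 data + 1 parity (assumed)
    iops_per_disk = 100       # assumed random-read capability per drive

    # RAID-5: each small read touches one data disk, so the data disks can
    # service independent reads in parallel.
    raid5_reads = (disks_per_group - 1) * iops_per_disk      # 400/sec

    # RAID-Z: each block is spread across the whole group, so every small
    # read occupies all the data disks at once.
    raidz_reads = iops_per_disk                              # 100/sec

    print(raid5_reads, raidz_reads)   # ~N-to-1 in RAID-5's favor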

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread can you guess?
I
  suspect ZFS will change that game in the future.
  In
  particular for someone doing lots of editing,
  snapshots can help recover from user error.
 
  Ah - so now the rationalization has changed to
 snapshot support.   
  Unfortunately for ZFS, snapshot support is pretty
 commonly available
 
 We can cherry pick features all day. People choose
 ZFS for the  
 combination (as well as its unique features).

Actually, based on the self-selected and decidedly unscientific sample of ZFS 
proponents that I've encountered around the Web lately, it appears that people 
choose ZFS in large part because a) they've swallowed the "Last Word In File 
Systems" viral marketing mantra hook, line, and sinker (that's in itself not 
all that surprising, because the really nitty-gritty details of file system 
implementation aren't exactly prime topics of household conversation - even 
among the technically inclined), b) they've incorporated this mantra into their 
own self-image (the 'fanboy' phenomenon - but at least in the case of existing 
Sun customers this is also not very surprising, because dependency on a vendor 
always tends to engender loyalty - especially if that vendor is not doing all 
that well and its remaining customers have become increasingly desperate for 
good news that will reassure them), and/or c) they're open-source zealots 
who've been sucked in by Jonathan's recent attempt to turn the patent dispute 
with NetApp into something more profound than the mundane inter-corporation 
spat which it so clearly is.

All of which certainly helps explain why so many of those proponents are so 
resistant to rational argument:  their zeal is not technically based, just 
technically rationalized (as I was pointing out in the post to which you 
responded) - much more like the approach of a (volunteer) marketeer with an 
agenda than like that of an objective analyst (not to suggest that *no one* 
uses ZFS based on an objective appreciation of the trade-offs involved in doing 
so, of course - just that a lot of its more vociferous supporters apparently 
don't).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread can you guess?
 my personal-professional data are important (this is
 my valuation, and it's an assumption you can't
 dispute).

Nor was I attempting to:  I was trying to get you to evaluate ZFS's incremental 
risk reduction *quantitatively* (and if you actually did so you'd likely be 
surprised at how little difference it makes - at least if you're at all 
rational about assessing it).

...

 I think for every fully digital person their own data are
 vital, and almost everyone would reply NONE to your
 question of what level of risk a user is willing to
 tolerate.

The fact that appears to escape people like you is that there is *always* some 
risk, and you *have* to tolerate it (or not save anything at all).  Therefore 
the issue changes to just how *much* risk you're willing to tolerate for a 
given amount of effort.

(There's also always the possibility of silent data corruption, even if you use 
ZFS - because it only eliminates *some* of the causes of such corruption.  If 
your data is corrupted in RAM during the period when ZFS is not watching over 
it, for example, you're SOL.)

How to *really* protect valuable data has already been thoroughly discussed in 
this thread, though you don't appear to have understood it.  It takes multiple 
copies (most of them off-line), in multiple locations, with verification of 
every copy operation and occasional re-verification of the stored content - and 
ZFS helps with only part of one of these strategies (reverifying the integrity 
of your on-line copy).  If you don't take the rest of the steps, ZFS's 
incremental protection is virtually useless, because the risk of data loss from 
causes that ZFS doesn't protect against is so much higher than the incremental 
protection that it provides (i.e., you may *feel* noticeably better protected 
but you're just kidding yourself).  If you *do* take the rest of the steps, 
then it takes little additional effort to revalidate your on-line content as 
well as the off-line copies, so all ZFS provides is a small reduction in effort 
to achieve the same (very respectable) level of protection that other 
solutions can achieve when manual steps are taken to reverify 
the on-line copy as well as the off-line copies.

Try to step out of your "my data is valuable" rut and wrap your mind around the 
fact that ZFS's marginal contribution to its protection, real though it may be, 
just isn't very significant in most environments compared to the rest of the 
protection solution that it *doesn't* help with.  That's why I encouraged you 
to *quantify* the effect that ZFS's protection features have in *your* 
environment (along with its other risks that ZFS can't ameliorate):  until you 
do that, you're just another fanboy (not that there's anything wrong with that, 
as long as you don't try to present your personal beliefs as something of more 
objective validity).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread can you guess?
  I was trying to get you
 to evaluate ZFS's  
  incremental risk reduction *quantitatively* (and if
 you actually  
  did so you'd likely be surprised at how little
 difference it makes  
  - at least if you're at all rational about
 assessing it).
 
 ok .. i'll bite since there's no ignore feature on
 the list yet:
 
 what are you terming as ZFS' incremental risk
 reduction? .. (seems  
 like a leading statement toward a particular
 assumption)

Primarily its checksumming features, since other open source solutions support 
simple disk scrubbing (which given its ability to catch most deteriorating disk 
sectors before they become unreadable probably has a greater effect on 
reliability than checksums in any environment where the hardware hasn't been 
slapped together so sloppily that connections are flaky).

Aside from the problems that scrubbing handles (and you need scrubbing even if 
you have checksums, because scrubbing is what helps you *avoid* data loss 
rather than just discover it after it's too late to do anything about it), and 
aside from problems deriving from sloppy assembly (which tend to become obvious 
fairly quickly, though it's certainly possible for some to be more subtle), 
checksums primarily catch things like bugs in storage firmware and otherwise 
undetected disk read errors (which occur orders of magnitude less frequently 
than uncorrectable read errors).
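
As a concrete illustration of that point, here is a minimal Python sketch of a
scrub pass over a simple mirror (illustrative only - not how any particular
implementation does it):

    # Read every allocated block on both sides of a mirror; when one side is
    # unreadable, rewrite it from the good copy so the drive can revector the
    # failing sector *before* a later failure leaves no good copy at all.
    class UnreadableSector(Exception):
        pass

    class MemDisk:
        def __init__(self, blocks, bad=()):
            self.blocks, self.bad = dict(blocks), set(bad)
        def read(self, blkno):
            if blkno in self.bad:
                raise UnreadableSector(blkno)
            return self.blocks[blkno]
        def write(self, blkno, data):
            self.bad.discard(blkno)       # a rewrite revectors the sector
            self.blocks[blkno] = data

    def scrub_mirror(side_a, side_b, block_numbers):
        repaired = []
        for blkno in block_numbers:
            for bad, good in ((side_a, side_b), (side_b, side_a)):
                try:
                    bad.read(blkno)
                except UnreadableSector:
                    bad.write(blkno, good.read(blkno))
                    repaired.append(blkno)
        return repaired

    a = MemDisk({0: b"x", 1: b"y"}, bad={1})
    b = MemDisk({0: b"x", 1: b"y"})
    assert scrub_mirror(a, b, [0, 1]) == [1] and a.read(1) == b"y"

Checksums add detection of blocks that read "successfully" but wrongly; the
scrub pass above is what keeps a marginal sector from quietly becoming the
only unreadable copy.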

Robert Milkowski cited some sobering evidence that mid-range arrays may have 
non-negligible firmware problems that ZFS could often catch, but a) those are 
hardly 'consumer' products (to address that sub-thread, which I think is what 
applies in Stefano's case) and b) ZFS's claimed attraction for higher-end 
(corporate) use is its ability to *eliminate* the need for such products (hence 
its ability to catch their bugs would not apply - though I can understand why 
people who needed to use them anyway might like to have ZFS's integrity checks 
along for the ride, especially when using less-than-fully-mature firmware).

And otherwise undetected disk errors occur with negligible frequency compared 
with software errors that can silently trash your data in ZFS cache or in 
application buffers (especially in PC environments:  enterprise software at 
least tends to be more stable and more carefully controlled - not to mention 
their typical use of ECC RAM).

So depending upon ZFS's checksums to protect your data in most PC environments 
is sort of like leaving on a vacation and locking and bolting the back door of 
your house while leaving the front door wide open:  yes, a burglar is less 
likely to enter by the back door, but thinking that the extra bolt there made 
you much safer is likely foolish.

 .. are you  
 just trying to say that without multiple copies of
 data in multiple  
 physical locations you're not really accomplishing a
 more complete  
 risk reduction

What I'm saying is that if you *really* care about your data, then you need to 
be willing to make the effort to lock and bolt the front door as well as the 
back door and install an alarm system:  if you do that, *then* ZFS's additional 
protection mechanisms may start to become significant (because you've 
eliminated the higher-probability risks and ZFS's extra protection then 
actually reduces the *remaining* risk by a significant percentage).

Conversely, if you don't care enough about your data to take those extra steps, 
then adding ZFS's incremental protection won't reduce your net risk by a 
significant percentage (because the other risks that still remain are so much 
larger).

Was my point really that unclear before?  It seems as if this must be at least 
the third or fourth time that I've explained it.

 
 yes i have read this thread, as well as many of your
 other posts  
 around usenet and such .. in general i find your tone
 to be somewhat  
 demeaning (slightly rude too - but - eh, who's
 counting?  i'm none to  
 judge)

As I've said multiple times before, I respond to people in the manner they seem 
to deserve.  This thread has gone on long enough that there's little excuse for 
continued obtuseness at this point, but I still attempt to be pleasant as long 
as I'm not responding to something verging on being hostile.

 - now, you do know that we are currently in an
 era of  
 collaboration instead of deconstruction right?

Can't tell it from the political climate, and corporations seem to be following 
that lead (I guess they've finally stopped just gazing in slack-jawed disbelief 
at what this administration is getting away with and decided to cash in on the 
opportunity themselves).

Or were you referring to something else?

 .. so
 i'd love to see  
 the improvements on the many shortcomings you're
 pointing to and  
 passionate about written up, proposed, and freely
 implemented :)

Then ask the ZFS developers to get on the stick:  fixing the fragmentation 
problem discussed elsewhere should be easy, and RAID-Z is at least amenable to 
a 

Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread can you guess?
he isn't being
 paid by NetApp.. think bigger

O frabjous day!  Yet *another* self-professed psychic, but one whose internal 
voices offer different counsel.

While I don't have to be psychic myself to know that they're *all* wrong 
(that's an advantage of fact-based rather than faith-based opinions), a 
battle-of-the-incompetents would be amusing to watch (unless it took place in a 
realm which no mere mortals could visit).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread can you guess?
...

  Hi bill, only a question:
  I'm an ex linux user migrated to solaris for zfs
 and
  its checksumming;
 
  So the question is:  do you really need that
 feature (please  
  quantify that need if you think you do), or do you
 just like it  
  because it makes you feel all warm and safe?
 
  Warm and safe is definitely a nice feeling, of
 course, but out in  
  the real world of corporate purchasing it's just
 one feature out of  
  many 'nice to haves' - and not necessarily the most
 important.  In  
  particular, if the *actual* risk reduction turns
 out to be  
  relatively minor, that nice 'feeling' doesn't carry
 all that much  
  weight.
 
 On the other hand, it's hard to argue for risk
 *increase* (using  
 something else)...

And no one that I'm aware of was doing anything like that:  what part of the 
"All things being equal" paragraph (I've left it in below in case you missed it 
the first time around) did you find difficult to understand?

- bill

...

  All things being equal, of course users would opt
 for even  
  marginally higher reliability - but all things are
 never equal.  If  
  using ZFS would require changing platforms or
 changing code, that's  
  almost certainly a show-stopper for enterprise
 users.  If using ZFS  
  would compromise performance or require changes in
 management  
  practices (e.g., to accommodate file-system-level
 quotas), those  
  are at least significant impediments.  In other
 words, ZFS has its  
  pluses and minuses just as other open-source file
 systems do, and  
  they *all* have the potential to start edging out
 expensive  
  proprietary solutions in *some* applications (and
 in fact have  
  already started to do so).
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread can you guess?
 On Tue, 4 Dec 2007, Stefano Spinucci wrote:
 
  On 11/7/07, can you guess?
  [EMAIL PROTECTED]
  wrote:
  However, ZFS is not the *only* open-source
 approach
  which may allow that to happen, so the real
 question
  becomes just how it compares with equally
 inexpensive
  current and potential alternatives (and that would
  make for an interesting discussion that I'm not
 sure
  I have time to initiate tonight).
 
  - bill
 
  Hi bill, only a question:
  I'm an ex linux user migrated to solaris for zfs
 and its checksumming; you say there are other
 open-source alternatives but, for a linux end user,
 I'm aware only of Oracle btrfs
 (http://oss.oracle.com/projects/btrfs/), who is a
 Checksumming Copy on Write Filesystem not in a final
 state.
 
  what *real* alternatives are you referring to???
 
  if I missed something tell me, and I'll happily
 stay with linux with my data checksummed and
 snapshotted.
 
  bye
 
  ---
  Stefano Spinucci
 
 
 Hi Stefano,
 
 Did you get a *real* answer to your question?
 Do you think that this (quoted) message is a *real*
 answer?

Hi, Al - I see that you're still having difficulty understanding basic English, 
and your other recent technical-content-free drivel here suggests that you 
might be better off considering a career in janitorial work than in anything 
requiring even basic analytical competence.  But I remain willing to help you 
out with English until you can find the time to take a remedial course (though 
for help with finding a vocation more consonant with your abilities you'll have 
to look elsewhere).

Let's begin by repeating the question at issue, since failing to understand 
that may be at the core of your problem:

what *real* alternatives are you referring to???

Despite a similar misunderstanding by your equally-illiterate associate Mr. 
Cook, that was not a question about what alternatives provided the specific 
support in which Stefano was particularly interested (though in another part of 
my response to him I did attempt to help him understand why that interest might 
be misplaced).  Rather, it was a question about what *I* had referred to in an 
earlier post of mine, as you might also have gleaned from the first sentence of 
my response to that question (As I said in the post to which you 
responded...) had what passes for your brain been even minimally engaged when 
you read it.

My response to that question continued by listing some specific features 
(snapshots, disk scrubbing, software RAID) available in Linux and FreeBSD that 
made them viable alternatives to ZFS for enterprise use (the context of that 
earlier post that I was being questioned about).  Whether Linux and FreeBSD 
also offer management aids I admitted I didn't know - though given ZFS's own 
limitations in this area such as the need to define mirror pairs and parity 
groups explicitly and the inability to expand parity groups it's not clear that 
lack thereof would constitute a significant drawback (especially since the 
management activities that their file systems require are comparable to what 
such enterprise installations are already used to dealing with).  And, in an 
attempt to forestall yet another round of babble, I then addressed the relative 
importance (or lack thereof) of several predictable "Yes, but ZFS also offers 
wonderful feature X ..." responses.

Now, not being a psychic myself, I can't state with authority that Stefano 
really meant to ask the question that he posed rather than something else.  In 
retrospect, I suppose that some of his surrounding phrasing *might* suggest 
that he was attempting (however unskillfully) to twist my comment about other 
open source solutions being similarly enterprise-capable into a provably-false 
assertion that those other solutions offered the *same* features that he 
apparently considers so critical in ZFS rather than just comparably-useful 
ones.  But that didn't cross my mind at the time:  I simply answered the 
question that he asked, and in passing also pointed out that those features 
which he apparently considered so critical might well not be.

Once again, though, I've reached the limit of my ability to dumb down the 
discussion in an attempt to reach your level:  if you still can't grasp it, 
perhaps a friend will lend a hand.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread can you guess?
 Literacy has nothing to do with the glaringly obvious
 BS you keep spewing.

Actually, it's central to the issue:  if you were capable of understanding what 
I've been talking about (or at least sufficiently humble to recognize the 
depths of your ignorance), you'd stop polluting this forum with posts lacking 
any technical content whatsoever.

  Rather than answer a question,
 which couldn't be answered,

The question that was asked was answered - it's hardly my problem if you could 
not competently parse the question, or the answer, or the subsequent 
explanation (though your continuing drivel after those three strikes suggests 
that you may simply be ineducable).

 because you were full of
 it, you tried to convince us all he really didn't
 know what he wanted.  

No:  I answered his question and *also* observed that he probably really didn't 
know what he wanted (at least insofar as being able to *justify* the intensity 
of his desire for it).

...
 
 There aren't free alternatives in linux or freebsd
 that do what zfs does, period.

No one said that there were:  the real issue is that there's not much reason to 
care, since the available solutions don't need to be *identical* to offer 
*comparable* value (i.e., they each have different strengths and weaknesses and 
the net result yields no clear winner - much as some of you would like to 
believe otherwise).

  You can keep talking
 in circles till you're blue in the face, or I suppose
 your fingers go numb in this case, but the fact isn't
 going to change.  Yes, people do want zfs for any
 number of reasons, that's why they're here.

Indeed, but it has become obvious that most of the reasons are non-technical in 
nature.  This place is fanboy heaven, where never is heard a discouraging word 
(and you're hip-deep in buffalo sh!t).

Hell, I came here myself 18 months ago because ZFS seemed interesting, but 
found out that the closer I looked, the less interesting it got.  Perhaps it's 
not surprising that so many of you never took that second step:  it does 
require actual technical insight, which seems to be in extremely short supply 
here.

So short that it's not worth spending time here from any technical standpoint:  
at this point I'm mostly here for the entertainment, and even that is starting 
to get a little tedious.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread can you guess?
  Now, not being a psychic myself, I can't state
 with
  authority that Stefano really meant to ask the
  question that he posed rather than something else.
  In retrospect, I suppose that some of his
  surrounding phrasing *might* suggest that he was
  attempting (however unskillfully) to twist my
  comment about other open source solutions being
  similarly enterprise-capable into a provably-false
  assertion that those other solutions offered the
  *same* features that he apparently considers so
  critical in ZFS rather than just comparably-useful
  ones.  But that didn't cross my mind at the time:
  I
  simply answered the question that he asked, and in
  passing also pointed out that those features which
  he apparently considered so critical might well not
   be.
 dear bill,
 my question was honest

That's how I originally accepted it, and I wouldn't have revisited the issue 
looking for other interpretations if two people hadn't obviously thought it 
meant something else.

For that matter, even if you actually intended it to mean something else that 
doesn't imply that there was any devious intent.  In any event, what you 
actually asked was what I had referred to, and I told you:  it may not have met 
your personal goals for your own storage, but that wasn't relevant to the 
question that you asked (and that I answered).

Your English is so good that the possibility that it might be a second language 
had not occurred to me - but if so it would help explain any subtle 
miscommunication.

...

 if there are no alternatives to zfs,

As I explained, there are eminently acceptable alternatives to ZFS from any 
objective standpoint.

 I'd gladly
 stick with it,

And you're welcome to, without any argument from me - unless you try to 
convince other people that there are strong technical reasons to do so, in 
which case I'll challenge you to justify them in detail so that any hidden 
assumptions can be brought out into the open.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread can you guess?
 I suppose we're all just wrong.

By George, you've got it!

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-04 Thread can you guess?
Your response here appears to refer to a different post in this thread.

 I never said I was a typical consumer.

Then it's unclear how your comment related to the material which you quoted 
(and hence to which it was apparently responding).

 If you look around photo forums, you'll see an
 interest the digital workflow which includes long
 term storage and archiving.  A chunk of these users
 will opt for an external RAID box (10%? 20%?).  I
 suspect ZFS will change that game in the future.  In
 particular for someone doing lots of editing,
 snapshots can help recover from user error.

Ah - so now the rationalization has changed to snapshot support.  Unfortunately 
for ZFS, snapshot support is pretty commonly available (e.g., in Linux's LVM - 
and IIRC BSD's as well - if you're looking at open-source solutions) so anyone 
who actually found this feature important has had access to it for quite a 
while already.

And my original comment which you quoted still obtains as far as typical 
consumers are concerned.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write time performance question

2007-12-04 Thread can you guess?
 And some results (for OLTP workload):
 
 http://przemol.blogspot.com/2007/08/zfs-vs-vxfs-vs-ufs
 -on-scsi-array.html

While I was initially hardly surprised that ZFS offered only 11% - 15% of the 
throughput of UFS or VxFS, a quick glance at Filebench's OLTP workload seems to 
indicate that it's completely random-access in nature without any of the 
sequential-scan activity that can *really* give ZFS fits.  The fact that you 
were using an underlying hardware RAID really shouldn't have affected these 
relationships, given that it was configured as RAID-10.

It would be interesting to see your test results reconciled with a detailed 
description of the tests generated by the Kernel Performance Engineering group 
which are touted as indicating that ZFS performs comparably with other file 
systems in database use:  I actually don't find that too hard to believe 
(without having put all that much thought into it) when it comes to straight 
OLTP without queries that might result in sequential scans, but your 
observations seem to suggest otherwise (and the little that I have been able to 
infer about the methodology used to generate some of the rosy-looking ZFS 
performance numbers does not inspire confidence in the real-world applicability 
of those internally-generated results).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-04 Thread can you guess?
   On 11/7/07, can you guess?
  [EMAIL PROTECTED]
   wrote:
  However, ZFS is not the *only* open-source
 approach
  which may allow that to happen, so the real
 question
  becomes just how it compares with equally
 inexpensive
  current and potential alternatives (and that would
  make for an interesting discussion that I'm not
 sure
  I have time to initiate tonight).
  
  - bill
 
 Hi bill, only a question:
 I'm an ex linux user migrated to solaris for zfs and
 its checksumming;

So the question is:  do you really need that feature (please quantify that need 
if you think you do), or do you just like it because it makes you feel all warm 
and safe?

Warm and safe is definitely a nice feeling, of course, but out in the real 
world of corporate purchasing it's just one feature out of many 'nice to haves' 
- and not necessarily the most important.  In particular, if the *actual* risk 
reduction turns out to be relatively minor, that nice 'feeling' doesn't carry 
all that much weight.

 you say there are other open-source
 alternatives but, for a linux end user, I'm aware
 only of Oracle btrfs
 (http://oss.oracle.com/projects/btrfs/), who is a
 Checksumming Copy on Write Filesystem not in a final
 state.
 
 what *real* alternatives are you referring to???

As I said in the post to which you responded, I consider ZFS's ease of 
management to be more important (given that even in high-end installations 
storage management costs dwarf storage equipment costs) than its real but 
relatively marginal reliability edge, and that's the context in which I made my 
comment about alternatives (though even there if ZFS continues to require 
definition of mirror pairs and parity groups for redundancy that reduces its 
ease-of-management edge, as does its limitation to a single host system in 
terms of ease-of-scaling).

Specifically, features like snapshots, disk scrubbing (to improve reliability 
by dramatically reducing the likelihood of encountering an unreadable sector 
during a RAID rebuild), and software RAID (to reduce hardware costs) have been 
available for some time in Linux and FreeBSD, and canned management aids would 
not be difficult to develop if they don't exist already.  The dreaded 'write 
hole' in software RAID is a relatively minor exposure (since it only 
compromises data if a system crash or UPS failure - both rare events in an 
enterprise setting - sneaks in between a data write and the corresponding 
parity update and then, before the array has restored parity consistency in the 
background, a disk dies) - and that exposure can be reduced to seconds by a 
minuscule amount of NVRAM that remembers which writes were active (or to zero 
with somewhat more NVRAM to remember the updates themselves in an inexpensive 
hardware solution).
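
A tiny illustration of the write-hole window itself (single-byte "sectors" and
XOR parity, purely for demonstration):

    def xor(*vals):
        out = 0
        for v in vals:
            out ^= v
        return out

    # A 2+1 stripe: data on d0 and d1, parity = d0 ^ d1.
    d0, d1 = 0x11, 0x22
    parity = xor(d0, d1)

    d0 = 0x33    # the new data block reaches the disk...
    # ...and the system crashes before the matching parity update (the hole).

    # If d1 then dies before a background resync repairs parity,
    # reconstructing it from the stale parity yields the wrong data:
    assert xor(d0, parity) != 0x22   # silent corruption of d1's contents

The NVRAM approaches mentioned above close exactly this window, either by
remembering which stripes had writes in flight or by holding the updates
themselves.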

The real question is usually what level of risk an enterprise storage user is 
willing to tolerate.  At the paranoid end of the scale reside the users who 
will accept nothing less than z-series or Tandem-/Stratus-style end-to-end 
hardware checking from the processor traces on out - which rules out most 
environments that ZFS runs in (unless Sun's N-series telco products might fill 
the bill:  I'm not very familiar with them).  And once you get down into users 
of commodity processors, the risk level of using stable and robust file systems 
that lack ZFS's additional integrity checks is comparable to the risk inherent 
in the rest of the system (at least if the systems are carefully constructed, 
which should be a given in an enterprise setting) - so other open-source 
solutions are definitely in play there.

All things being equal, of course users would opt for even marginally higher 
reliability - but all things are never equal.  If using ZFS would require 
changing platforms or changing code, that's almost certainly a show-stopper for 
enterprise users.  If using ZFS would compromise performance or require changes 
in management practices (e.g., to accommodate file-system-level quotas), those 
are at least significant impediments.  In other words, ZFS has its pluses and 
minuses just as other open-source file systems do, and they *all* have the 
potential to start edging out expensive proprietary solutions in *some* 
applications (and in fact have already started to do so).

When we move from 'current' to 'potential' alternatives, the scope for 
competition widens.  Because it's certainly possible to create a file system 
that has all of ZFS's added reliability but runs faster, scales better, 
incorporates additional useful features, and is easier to manage.  That 
discussion is the one that would take a lot of time to delve into adequately 
(and might be considered off topic for this forum - which is why I've tried to 
concentrate here on improvements that ZFS could actually incorporate without 
turning it upside down).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
 We will be using Cyrus to store mail on 2540 arrays.
 
 We have chosen to build 5-disk RAID-5 LUNs in 2
 arrays which are both connected to same host, and
 mirror and stripe the LUNs.  So a ZFS RAID-10 set
 composed of 4 LUNs.  Multi-pathing also in use for
 redundancy.

Sounds good so far:  lots of small files in a largish system with presumably 
significant access parallelism makes RAID-Z a non-starter, but RAID-5 should be 
OK, especially if the workload is read-dominated.  ZFS might aggregate small 
writes such that their performance would be good as well if Cyrus doesn't force 
them to be performed synchronously (and ZFS doesn't force them to disk 
synchronously on file close); even synchronous small writes could perform well 
if you mirror the ZFS small-update log:  flash - at least the kind with decent 
write performance - might be ideal for this, but if you want to steer clear of 
a specialized configuration just carving one small LUN for mirroring out of 
each array (you could use a RAID-0 stripe on each array if you were compulsive 
about keeping usage balanced; it would be nice to be able to 'center' it on the 
disks, but probably not worth the management overhead unless the array makes it 
easy to do so) should still offer a noticeable improvement over just placing 
the ZIL on the RAID-5 LUNs.

 
 My question is any guidance on best choice in CAM for
 stripe size in the LUNs?
 
 Default is 128K right now, can go up to 512K, should
 we go higher?

By 'stripe size' do you mean the size of the entire stripe (i.e., your default 
above reflects 32 KB on each data disk, plus a 32 KB parity segment) or the 
amount of contiguous data on each disk (i.e., your default above reflects 128 
KB on each data disk for a total of 512 KB in the entire stripe, exclusive of 
the 128 KB parity segment)?

If the former, by all means increase it to 512 KB:  this will keep the largest 
ZFS block on a single disk (assuming that ZFS aligns them on 'natural' 
boundaries) and help read-access parallelism significantly in large-block cases 
(I'm guessing that ZFS would use small blocks for small files but still quite 
possibly use large blocks for its metadata).  Given ZFS's attitude toward 
multi-block on-disk contiguity there might not be much benefit in going to even 
larger stripe sizes, though it probably wouldn't hurt noticeably either as long 
as the entire stripe (ignoring parity) didn't exceed 4 - 16 MB in size (all the 
above numbers assume the 4 + 1 stripe configuration that you described).

In general, having less than 1 MB per-disk stripe segments doesn't make sense 
for *any* workload:  it only takes 10 - 20 milliseconds to transfer 1 MB from a 
contemporary SATA drive (the analysis for high-performance SCSI/FC/SAS drives 
is similar, since both bandwidth and latency performance improve), which is 
comparable to the 12 - 13 ms. that it takes on average just to position to it - 
and you can still stream data at high bandwidths in parallel from the disks in 
an array as long as you have a client buffer as large in MB as the number of 
disks you need to stream from to reach the required bandwidth (you want 1 
GB/sec?  no problem:  just use a 10 - 20 MB buffer and stream from 10 - 20 
disks in parallel).  Of course, this assumes that higher software layers 
organize data storage to provide that level of contiguity to leverage...
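
The arithmetic behind those numbers, under assumed drive characteristics of
roughly 50-100 MB/s sustained transfer and ~12.5 ms average positioning time:

    positioning_ms = 12.5
    for mb_per_s in (50, 100):
        transfer_ms = 1.0 / mb_per_s * 1000        # time to move 1 MB
        print(f"{mb_per_s} MB/s: 1 MB transfer = {transfer_ms:.0f} ms "
              f"vs ~{positioning_ms} ms to position")   # 20/10 ms vs 12.5 ms

    # Streaming 1 GB/s with 1 MB per-disk segments needs roughly a client
    # buffer of one MB per disk being streamed from:
    disks_at_100 = 1000 // 100        # ~10 disks (and ~10 MB of buffer)
    disks_at_50 = 1000 // 50          # ~20 disks (and ~20 MB of buffer)
    print(f"{disks_at_100}-{disks_at_50} disks -> "
          f"{disks_at_100}-{disks_at_50} MB buffer")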

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
 Any reason why you are using a mirror of raid-5
 lun's?

Some people aren't willing to run the risk of a double failure - especially 
when recovery from a single failure may take a long time.  E.g., if you've 
created a disaster-tolerant configuration that separates your two arrays and a 
fire completely destroys one of them, you'd really like to be able to run the 
survivor without worrying too much until you can replace its twin (hence each 
must be robust in its own right).

The above situation is probably one reason why 'RAID-6' and similar approaches 
(like 'RAID-Z2') haven't generated more interest:  if continuous on-line access 
to your data is sufficiently critical to consider them, then it's also probably 
sufficiently critical to require such a disaster-tolerant approach (which 
dual-parity RAIDs can't address).

It would still be nice to be able to recover from a bad sector on the single 
surviving site, of course, but you don't necessarily need full-blown RAID-6 for 
that:  you can quite probably get by with using large blocks and appending a 
private parity sector to them (maybe two private sectors just to accommodate a 
situation where a defect hits both the last sector in the block and the parity 
sector that immediately follows it; it would also be nice to know that the 
block size is significantly smaller than a disk track size, for similar 
reasons).  This would, however, tend to require file-system involvement such 
that all data was organized into such large blocks:  otherwise, all writes for 
smaller blocks would turn into read/modify/writes.
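
A minimal sketch of that private-parity idea (one XOR parity sector appended
per large block; illustrative only, and ignoring the double-defect case
mentioned above):

    SECTOR = 512

    def xor_sectors(sectors):
        out = bytearray(SECTOR)
        for s in sectors:
            for i, byte in enumerate(s):
                out[i] ^= byte
        return bytes(out)

    def write_block(sectors):
        # Store the data sectors plus one trailing parity sector.
        return list(sectors) + [xor_sectors(sectors)]

    def recover(stored, bad_index):
        # XOR of all surviving sectors (data + parity) rebuilds the lost one.
        return xor_sectors([s for i, s in enumerate(stored) if i != bad_index])

    data = [bytes([i]) * SECTOR for i in range(8)]      # one 4 KB block
    stored = write_block(data)
    assert recover(stored, 3) == data[3]    # a single bad sector is rebuilt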

Panasas (I always tend to put an extra 's' into that name, and to judge from 
Google so do a hell of a lot of other people:  is it because of the resemblance 
to 'parnassas'?) has been crowing about something that it calls 'tiered parity' 
recently, and it may be something like the above.

...

 How about running a ZFS mirror over RAID-0 luns? Then
 again, the
 downside is that you need intervention to fix a LUN
 after a disk goes
 boom! But you don't waste all that space :)

'Wasting' 20% of your disk space (in the current example) doesn't seem all that 
alarming - especially since you're getting more for that expense than just 
faster and more automated recovery if a disk (or even just a sector) fails.

- bill
 
 


Re: [zfs-discuss] x4500 w/ small random encrypted text files

2007-12-01 Thread can you guess?
 If it's just performance you're after for small
 writes, I wonder if you've considered putting the ZIL
 on an NVRAM card?  It looks like this can give
 something like a 20x performance increase in some
 situations:
 
 http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

That's certainly interesting reading, but it may be just a tad optimistic.  For 
example, it lists a throughput of 211 MB/sec with only *one* disk in the main 
pool - which unless that's also a solid-state disk is clearly unsustainable 
(i.e., you're just seeing the performance while the solid-state log is filling 
up, rather than what the performance will eventually stabilize at:  my guess is 
that the solid-state log may be larger than the file being updated, in which 
case updates just keep accumulating there without *ever* being forced to disk, 
which is unlikely to occur in most normal environments).

The numbers are a bit strange in other areas as well.  In the case of a single 
pool disk and no slog, 11 MB/sec represents about 1400 synchronous 8 KB updates 
per second on a disk with only about 1/10th that IOPS capacity even with 
queuing enabled (and when you take into account the need to propagate each such 
synchronous update all the way back to the superblock it begins to look 
somewhat questionable even from the bandwidth point of view).  One might 
suspect that what's happening is that once the first synchronous write has been 
submitted a whole bunch of additional ones accumulate while waiting for the 
disk to finish the first, and that ZFS is smart enough not to queue them up to 
the disk (which would require full-path updates for every one of them) but 
instead to gather them in its own cache and write them all back at once in one 
fell swoop (including a single update for the ancestor path) when the disk is 
free again.  This would explain not only the otherwise suspicious performance 
but also why adding the slog provides so little improvement; it's
also a tribute to the care that the ZFS developers put into this aspect of 
their implementation.
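
For what it's worth, the arithmetic behind that suspicion looks roughly like 
this (the 150-IOPS figure is just a ballpark assumption for a single drive, 
not a measurement):

    # Sanity check of the single-pool-disk, no-slog number discussed above.
    reported_mb_per_s = 11.0
    update_kb = 8.0
    assumed_disk_iops = 150.0       # rough random-I/O capacity of one drive

    implied_updates_per_s = reported_mb_per_s * 1024.0 / update_kb
    print("implied synchronous 8 KB updates/s: %.0f" % implied_updates_per_s)
    print("ratio to assumed raw disk IOPS:     %.1fx" %
          (implied_updates_per_s / assumed_disk_iops))

    # A ratio around 9x only makes sense if many synchronous updates are
    # being gathered in cache while one disk write is outstanding and then
    # written back together, with a single ancestor-path update covering
    # the whole batch - the conjecture described above.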

On the other hand, when an slog is introduced performance actually *declines* 
in systems with more than one pool disk, suggesting that the developers paid 
somewhat less attention to this aspect of the implementation (where if the 
updates are held and batched similarly to my conjecture above they ought to be 
able to reach something close to the disk's streaming-sequential bandwidth, 
unless there's some pathological interaction with the pool-disk updates that 
should have been avoidable).

Unless I'm missing something, the bottom line appears to be that unless the 
slog is NVRAM-based you might be just as well (and sometimes better) off not 
using a separate log device at all.

- bill
 
 


Re: [zfs-discuss] Yager on ZFS

2007-12-01 Thread can you guess?
[Zombie thread returns from the grave...]

  Getting back to 'consumer' use for a moment,
 though,
  given that something like 90% of consumers entrust
  their PC data to the tender mercies of Windows, and
 a
  large percentage of those neither back up their
 data,
  nor use RAID to guard against media failures, nor
  protect it effectively from the perils of Internet
  infection, it would seem difficult to assert that
  whatever additional protection ZFS may provide
 would
  make any noticeable difference in the consumer
 space
  - and that was the kind of reasoning behind my
  comment that began this sub-discussion.
 
 As a consumer at home, IT guy at work and amateur
 photographer, I think ZFS will help change that.

Let's see, now:

Consumer at home?  OK so far.

IT guy at work?  Nope, nothing like a mainstream consumer, who doesn't want to 
know about anything like the level of detail under discussion here.

Amateur photographer?  Well, sort of - except that you seem to be claiming to 
have reached the *final* stage of evolution that you lay out below, which - 
again - tends to place you *well* out of the mainstream.

Try reading my paragraph above again and seeing just how closely it applies to 
people like you.

Here's what I think photogs evolve through:
 1) What are negatives? - Mom/dad taking holiday
 photos
 2) Keep negatives in the envelope - average snapshot
 photog
 3) Keep them filed in boxes - started snapping with a
 SLR? Might be doing darkroom work
 4) Get acid free boxes - pro/am.  
 5) Store slides in archival environment (humidity,
 temp, etc). - obsessive
 
 In the digital world:
 1) keeps them on the card until printed.  Only keeps
 the print
 2) copies them to disk & erases them off the card.
  Gets burned when system disk dies
 2a) puts them on CD/DVD. Gets burned a little when the
 disk dies and some photos not on CD/DVDs yet.

OK so far.  My wife is an amateur photographer and that's the stage where she's 
at.  Her parents, however, are both retired *professional* photographers - and 
that's where they're at as well.

 3a) gets an external USB drive to store things.  Gets
 burned when that disk dies.

That sounds as if it should have been called '2b' rather than '3a', since 
there's still only one copy of the data.

 3b) run raid in the box.
 3c) gets an external RAID disk (buffalo/ReadyNAS,
 etc).

While these (finally) give you some redundancy, they don't protect against loss 
due to user errors, system errors, or virii (well, an external NAS might help 
some with the last two, but not a simple external RAID).  They also cost 
significantly more (and are considerably less accessible to the average 
consumer) than simply keeping a live copy on your system plus an archive copy 
(better yet, *two* archive copies) on DVDs (the latter is what my wife and her 
folks do for any photos they care about).

 4) archives to multiple places.
 etc...

At which point you find out that you didn't need RAID after all (see above):  
you just leave the photos on a flash card (which are dirt-cheap these days) and 
your system disk until they've been copied to the archive media.

 
 5) gets ZFS and does transfer direct to local disk
 from flash card.

Which doesn't give you any data redundancy at all unless you're using multiple 
drives (guess how many typical consumers do) and doesn't protect you from user 
errors, system errors, or virii (unless you use an external NAS to help with 
the last two - again, guess how many typical consumers do) - and you'd *still* 
arguably be better off using the approach I described in my previous paragraph 
(since there's nothing like off-site storage if you want *real* protection).

In other words, you can't even make the ZFS case for the final-stage 
semi-professional photographer above, let alone anything remotely resembling a 
'consumer':  you'd just really, really like to justify something that you've 
become convinced is hot.

There's obviously been some highly-effective viral marketing at work here.

 
 Today I can build a Solaris file server for a
 reasonable price with off the shelf parts ($300 +
 disks).

*Build* a file server?  You must be joking:  if a typical consumer wants to 
*buy* a file server they can do so (though I'm not sure that a large percentage 
of 'typical' consumers actually *have* done so) - but expecting them to go out 
and shop for one running ZFS is - well, 'hopelessly naive' doesn't begin to do 
the idea justice.

  I can't get near that for a WAFL based
 system.

Please don't try to reintroduce WAFL into the consumer part of this discussion: 
 I thought we'd finally succeeded in separating the sub-threads.

...
 
 I can see ZFS coming to ready made networked RAID box
 that a pro-am photographer could purchase.

*If* s/he had any interest in ZFS per se - see above.

  I don't
 ever see that with WAFL.  And either FS on a network
 RAID box will be less error prone then a box running
 ext3/xfs as is typical now.

'Less error 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
 Hi Bill,

...

  lots of small files in a
 largish system with presumably significant access
 parallelism makes RAID-Z a non-starter,
 Why does lots of small files in a largish system
 with presumably 
 significant access parallelism makes RAID-Z a
 non-starter?
 thanks,
 max

Every ZFS block in a RAID-Z system is split across the N + 1 disks in a stripe 
- so not only do N + 1 disks get written for every block update, but N disks 
get *read* on every block *read*.

Normally, small files can be read in a single I/O request to one disk (even in 
conventional parity-RAID implementations).  RAID-Z requires N I/O requests 
spread across N disks, so for parallel-access reads to small files RAID-Z 
provides only about 1/Nth the throughput of conventional implementations unless 
the disks are sufficiently lightly loaded that they can absorb the additional 
load that RAID-Z places on them without reducing throughput commensurately.
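
Putting rough numbers on that (a sketch under the same small-file assumptions 
as above - each file fits in one disk request, the drives are limited by 
random IOPS rather than bandwidth, and there are enough concurrent readers to 
keep every spindle busy):

    # Parallel small-file read throughput:  conventional parity RAID reads a
    # small file with one request to one disk, RAID-Z reads it with N
    # requests spread across the N data disks of the stripe.
    def small_reads_per_second(disks, iops_per_disk, requests_per_file):
        return disks * iops_per_disk / requests_per_file

    disks, iops = 5, 150            # a 4+1 group of ~150-IOPS drives (assumed)
    print("conventional parity RAID: %4.0f reads/s" %
          small_reads_per_second(disks, iops, 1))
    print("RAID-Z (4 + 1):           %4.0f reads/s" %
          small_reads_per_second(disks, iops, disks - 1))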

- bill
 
 


Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
 We are running Solaris 10u4 is the log option in
 there?

Someone more familiar with the specifics of the ZFS releases will have to 
answer that.

 
 If this ZIL disk also goes dead, what is the failure
 mode and recovery option then?

The ZIL should at a minimum be mirrored.  But since that won't give you as much 
redundancy as your main pool has, perhaps you should create a small 5-disk 
RAID-0 LUN sharing the disks of each RAID-5 LUN and mirror the log to all four 
of them:  even if one entire array box is lost, the other will still have a 
mirrored ZIL and all the RAID-5 LUNs will be the same size (not that I'd expect 
a small variation in size between the two pairs of LUNs to be a problem that 
ZFS couldn't handle:  can't it handle multiple disk sizes in a mirrored pool as 
long as each individual *pair* of disks matches?).

Having 4 copies of the ZIL on disks shared with the RAID-5 activity will 
compromise the log's performance, since each log write won't complete until the 
slowest copy finishes (i.e., congestion in either of the RAID-5 pairs could 
delay it).  It still should usually be faster than just throwing the log in 
with the rest of the RAID-5 data, though.
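
A trivial simulation of that 'slowest copy wins' effect (the latencies are 
purely synthetic and only meant to show the shape of the problem):

    # A mirrored log write isn't acknowledged until every copy is persistent,
    # so its latency is the maximum over the mirror legs.
    import random
    random.seed(1)

    def trial(legs=4):
        # each log copy shares a spindle with RAID-5 traffic; per-copy
        # service time drawn uniformly from 5 - 40 ms (synthetic numbers)
        return [random.uniform(5, 40) for _ in range(legs)]

    trials = [trial() for _ in range(20000)]
    print("single copy:        %.1f ms avg" %
          (sum(t[0] for t in trials) / len(trials)))
    print("4-way mirror (max): %.1f ms avg" %
          (sum(max(t) for t in trials) / len(trials)))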

Then again, I see from your later comment that you have the same questions that 
I had about whether the results reported in 
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on suggest that having 
a ZIL may not help much anyway (at least for your specific workload:  I can 
imagine circumstances in which performance of small, synchronous writes might 
be more critical than other performance, in which case separating them out 
could be useful).

 
 We did get the 2540 fully populated with 15K 146-gig
 drives.  With 12 disks, and wanting to have at least
 ONE hot global spare in each array, and needing to
 keep LUNs the same size, you end up doing 2 5-disk
 RAID-5 LUNs and 2 hot spares in each array.  Not that
 I really need 2 spares I just didn't see any way to
 make good use of an extra disk in each array.  If we
 wanted to dedicate them instead to this ZIL need,
 what is best way to go about that?

As I noted above, you might not want to have less redundancy in the ZIL than 
you have in the main pool:  while the data in the ZIL is only temporary (until 
it gets written back to the main pool), there's a good chance that there will 
*always* be *some* data in it, so if you lost one array box entirely at least 
that small amount of data would be at the mercy of any failure on the log disk 
that made any portion of the log unreadable.

Now, if you could dedicate all four spare disks to the log (mirroring it 4 
ways) and make each box understand that it was OK to steal one of them to use 
as a hot spare should the need arise, that might give you reasonable protection 
(since then any increased exposure would only exist until the failed disk was 
manually replaced - and normally the other box would still hold two copies as 
well).  But I have no idea whether the box provides anything like that level of 
configurability.

...

 Hundreds of POP and IMAP user processes coming and
 going from users reading their mail.  Hundreds more
 LMTP processes from mail being delivered to the Cyrus
 mail-store.

And with 10K or more users a *lot* of parallelism in the workload - which is 
what I assumed given that you had over 1 TB of net email storage space (but I 
probably should have made that assumption more explicit, just in case it was 
incorrect).

  Sometimes writes predominate over reads,
 depends on time of day whether backups are running,
 etc.  The servers are T2000 with 16 gigs RAM so no
 shortage of room for ARC cache. I have turned off
 cache flush also pursuing performance.

From Neil's comment in the blog entry that you referenced, that sounds *very* 
dicey (at least by comparison with the level of redundancy that you've built 
into the rest of your system) - even if you have rock-solid UPSs (which have 
still been known to fail).  Allowing a disk to lie to higher levels of the 
system (if indeed that's what you did by 'turning off cache flush') by saying 
that it's completed a write when it really hasn't is usually a very bad idea, 
because those higher levels really *do* make important assumptions based on 
that information.

- bill
 
 


Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
 I think the point of dual battery-backed controllers
 is
 that data should never be lost.  Am I wrong?

That depends upon exactly what effect turning off the ZFS cache-flush mechanism 
has.  If all data is still sent to the controllers as 'normal' disk writes and 
they have no concept of, say, using *volatile* RAM to store stuff when higher 
levels enable the disk's write-back cache nor any inclination to pass along 
such requests blithely to their underlying disks (which of course would subvert 
any controller-level guarantees, since they can evict data from their own 
write-back caches as soon as the disk write request completes), then presumably 
as long as they get the data they guarantee that it will eventually get to the 
platters and the ZFS cache-flush mechanism is a no-op.

Of course, if that's true then disabling cache-flush should have no noticeable 
effect on performance (the controller just answers "Done" as soon as it 
receives a cache-flush request, because there's no applicable cache to flush), 
so you might as well just leave it enabled.  Conversely, if you found that 
disabling it *did* improve performance, then it probably opened up a 
significant reliability hole.

- bill
 
 


Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
 Bill, you have a long-winded way of saying I don't
 know.  But thanks for elucidating the possibilities.

Hmmm - I didn't mean to be *quite* as noncommittal as that suggests:  I was 
trying to say (without intending to offend) "FOR GOD'S SAKE, MAN:  TURN IT 
BACK ON!", and explaining why (i.e., that either disabling it made no 
difference and 
thus it might as well be enabled, or that if indeed it made a difference that 
indicated that it was very likely dangerous).

- bill
 
 


Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
  That depends upon exactly what effect turning off
 the
  ZFS cache-flush mechanism has.
 
 The only difference is that ZFS won't send a
 SYNCHRONIZE CACHE command at the end of a transaction
 group (or ZIL write). It doesn't change the actual
 read or write commands (which are always sent as
 ordinary writes -- for the ZIL, I suspect that
 setting the FUA bit on writes rather than flushing
 the whole cache might provide better performance in
 some cases, but I'm not sure, since it probably
 depends what other I/O might be outstanding.)

It's a bit difficult to imagine a situation where flushing the entire cache 
unnecessarily just to force the ZIL would be preferable - especially if ZFS 
makes any attempt to cluster small transaction groups together into larger 
aggregates (in which case you'd like to let them continue to accumulate until 
the aggregate is large enough to be worth forcing to disk in a single I/O).

 
  Of course, if that's true then disabling
 cache-flush
  should have no noticeable effect on performance
 (the
  controller just answers Done as soon as it
 receives
  a cache-flush request, because there's no
 applicable
  cache to flush), so you might as well just leave
 it
  enabled.
 
 The problem with SYNCHRONIZE CACHE is that its
 semantics aren't quite defined as precisely as one
 would want (until a fairly recent update). Some
 controllers interpret it as "push all data to disk"
 even if they have battery-backed NVRAM.

That seems silly, given that for most other situations they consider that data 
in NVRAM is equivalent to data on the platter.  But silly or not, if that's the 
way some arrays interpret the command, then it does have performance 
implications (and the other reply I just wrote would be unduly alarmist in such 
cases).

Thanks for adding some actual experience with the hardware to what had been a 
purely theoretical discussion.

- bill
 
 


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-21 Thread can you guess?
In order to be reasonably representative of a real-world situation, I'd suggest 
the following additions:

 1) create a large file (bigger than main memory) on
 an empty ZFS pool.

1a.  The pool should include entire disks, not small partitions (else seeks 
will be artificially short).

1b.  The file needs to be a *lot* bigger than the cache available to it, else 
caching effects on the reads will be non-negligible.

1c.  Unless the file fills up a large percentage of the pool the rest of the 
pool needs to be fairly full (else the seeks that updating the file generates 
will, again, be artificially short ones).

 2) time a sequential scan of the file
 3) random write i/o over say, 50% of the file (either
 with or without
 matching blocksize)

3a.  Unless the file itself fills up a large percentage of the pool, do this 
while other significant updating activity is also occurring in the pool 
so that the local holes in the original file layout created by some of its 
updates don't get favored for use by subsequent updates to the same file 
(again, artificially shortening seeks).
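
A bare-bones skeleton of such a test might look like the following (the path 
is hypothetical and the sizes shown are token values so it runs anywhere - for 
a real measurement the file must be several times RAM, the pool fairly full, 
and the cache defeated between scans, per the points above):

    import os, random, time

    PATH = "/tank/testfile"         # hypothetical dataset on the ZFS pool
    FILE_MB = 256                   # token size; use >> RAM for real runs
    BLOCK = 128 * 1024              # match (or deliberately mismatch) recordsize

    def create_file():
        with open(PATH, "wb") as f:
            for _ in range(FILE_MB * 1024 * 1024 // BLOCK):
                f.write(os.urandom(BLOCK))
            f.flush()
            os.fsync(f.fileno())

    def sequential_scan():
        t0 = time.time()
        with open(PATH, "rb") as f:
            while f.read(BLOCK):
                pass
        return time.time() - t0

    def random_rewrite(fraction=0.5):
        nblocks = FILE_MB * 1024 * 1024 // BLOCK
        with open(PATH, "r+b") as f:
            for b in random.sample(range(nblocks), int(nblocks * fraction)):
                f.seek(b * BLOCK)
                f.write(os.urandom(BLOCK))
            f.flush()
            os.fsync(f.fileno())

    create_file()
    print("scan before random rewrites: %.1f s" % sequential_scan())
    random_rewrite()
    print("scan after random rewrites:  %.1f s" % sequential_scan())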

- bill
 
 


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-21 Thread can you guess?
...

 This needs to be proven with a reproducible,
 real-world workload before it
 makes sense to try to solve it.  After all, if we
 cannot measure where 
 we are,
 how can we prove that we've improved?

Ah - Tests & Measurements types:  you've just gotta love 'em.

Wife:  Darling, is there really supposed to be that much water in the bottom 
of our boat?

TM:  There's almost always a little water in the bottom of a boat, Love.

Wife:  But I think it's getting deeper!

TM:  I suppose you *could* be right:  I'll just put this mark where the water 
is now, and then after a few minutes we can see if it really has gotten deeper 
and, if so, just how much we really may need to worry about it.

Wife:  I think I'll use this bucket to get rid of some of it, just in case.

TM:  No, don't do that:  then we won't be able to see how bad the problem is!

Wife:  But -

TM:  And try not to rock the boat:  it changes the level of the water at the 
mark that I just made.

Wife:  I'm really not a very good swimmer, dear:  let's just head for shore.

TM:  That would be silly if there turns out not to be any problem, wouldn't 
it?

(Wife hits TM over head with bucket, grabs oars, and starts rowing.)

- bill
 
 


Re: [zfs-discuss] User-visible non-blocking / atomic ops in ZFS

2007-11-21 Thread can you guess?
I'm going to combine three posts here because they all involve jcone:

First, as to my message heading:

The 'search forum' mechanism can't find his posts under the 'jcone' name (I was 
curious, because they're interesting/strange, depending on how one looks at 
them).  I've also noticed (once in his case, once in Louwtjie's) that the 'last 
post' column of one thread may reflect a post made to a different thread.

Second, in response to your Indexing other than hash tables post:

The only way you could get a file system like ZFS to perform indexed look-ups 
for you would be to make each of your 'records' an entire file with the 
appropriate look-up name, and ReiserFS may be the only current file system that 
could handle this reasonably well.

This is an outgrowth of the Unix mindset that files must only be byte-streams 
rather than anything more powerful (such as the single- and multi-key indexed 
files of traditional minicomputer and mainframe systems) - and that's 
especially unfortunate in ZFS's case, because system-managed COW mechanisms 
just happen to be a dynamite way to handle b-trees (you could do so at the 
application level on top of ZFS via use of a sparse file plus a facility to 
deallocate space in it explicitly, but you'd still need an entire separate 
level of in-file space-allocation/deallocation mechanism).  B-trees are the 
obvious solution to the kind of partial-key and/or key-range queries that you 
described.

Finally, in response to your current post (which sounds more as if it had come 
from a hardware engineer than from a database type):

All the facilities that you describe are traditionally handled by transactions 
of one form or another, and only read-only transactions can normally be 
non-blocking (because they simply capture a consistent point-in-time database 
state and operate upon that, ignoring any subsequent changes that may occur 
during their lifetimes).  Other less-popular but more general non-blocking 
approaches exist which simply abort upon detecting conflict rather than attempt 
to wait for the conflict to evaporate, which tends not to scale very well 
because (unlike the case with non-blocking low-level hardware synchronization) 
restarting a transaction when you don't have to can very often result in a 
*lot* of redundant work being performed; they include some multi-version 
approaches that implement more general 'time domain addressing' than that just 
described for read-only transactions and the rare implementations based upon 
'optimistic' concurrency control that let conflicts occur and then decide
  whether to abort someone when they attempt to commit.

ZFS supports transactions only for its internal use, and cannot feasibly 
support arbitrarily complex transactions because its atomicity approach depends 
upon gathering all transaction updates in RAM before writing them back 
atomically to disk (yes, it could perhaps do so in stages, since the entire new 
tree structure doesn't become visible until its root has been made persistent, 
but that could arbitrarily delay other write activity in the system).  While I 
think that supporting user-level transactions is a useful file-system feature 
and a few file systems such as Transarc's Structured File System have actually 
done so, ZFS would have to change significantly to do so for anything other 
than *very* limited user-level transactions - hence I wouldn't hold my breath 
waiting for such support in ZFS.

- bill
 
 


Re: [zfs-discuss] User-visible non-blocking / atomic ops in ZFS

2007-11-21 Thread can you guess?
 The B-trees I'm used to tree divide in arbitrary
 places across the whole 
 key, so doing partial-key queries is painful.

While the b-trees in DEC's Record Management Services (RMS) allowed 
multi-segment keys, they treated the entire key as a byte-string as far as 
prefix searches went (i.e., the segmentation wasn't significant to that, and 
there's no obvious reason why it should have been in other implementations).
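
To make the byte-string point concrete, here's a toy example of a 
leading-partial-key probe over composite keys stored as plain byte strings 
(standard-library only; no claim that this is how RMS actually laid keys out 
on disk):

    # Multi-segment keys concatenated into one byte string (fixed-width
    # segments here for simplicity); an ordinary ordered search then handles
    # both full-key look-ups and leading-partial-key (prefix) probes.
    import bisect

    def make_key(last, first):
        return last.ljust(12).encode() + first.ljust(12).encode()

    rows = sorted(make_key(l, f) for l, f in [
        ("smith", "alice"), ("smith", "bob"), ("smyth", "carol"),
        ("jones", "dave"), ("smith", "eve")])

    def prefix_probe(prefix):
        lo = bisect.bisect_left(rows, prefix)
        hi = bisect.bisect_right(rows, prefix + b'\xff' * 24)
        return rows[lo:hi]

    print([k.decode() for k in prefix_probe(b'smith')])   # the three smiths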

 
 I can't find Structured File System Transarc
 usefully in Google.  Do 
 you have a link handy?  If not, never mind.

Well, transarc.com now leads to a porn site, so that's not much help.  And 
Wikipedia's entry for Transarc is regrettably sparse.

Transarc was a Pittsburgh R&D company formed by some *very* bright CMU people.  
It's probably best known for its 'Encina' distributed transaction environment 
(SFS was actually part of Encina, but IIRC a separable one), for having 
developed the distributed file system (DFS) component of the Open Group's 
Distributed Computing Environment (DCE), and for AFS, the productized (and now 
open source) version of CMU's distributed Andrew file system; my own 
acquaintance with Transarc became closer when I was helping develop a 
distributed transactional object system in the mid''90s and we were using their 
book Camelot and Avalon for high-level design inspiration.  They were always 
closely associated with IBM, which absorbed them as a wholly-owned subsidiary 
in 1994 (and I've heard relatively little about them since).

SFS was one of their lesser-known achievements:  a record-oriented 
transactional file system.  I've always felt that system-managed 
record-oriented files were useful, in part because a lot of the nitty-gritty 
space management that's required (e.g., to handle the structured pages that 
tend to be necessary to accommodate data that's allowed to change its size or 
is required to remain in some key order under insertion/update/deletion 
activity) duplicates similar space-management required of the system to manage 
conventional byte-stream files and in part because any kind of system-wide 
lock- and deadlock-management facilities tend to want to tie into such data at 
a higher-than-byte-stream level (e.g., because the locked entities may have to 
move around) - so SFS was interesting to me.  Unfortunately, it's been long 
enough that I can't remember too many details about it - e.g., it may or may 
not have supported interlocked access at the record field level - and at least 
after a quick search I can't find any papers about it that I may have 
downloaded (that 
era was before I really recognized how evanescent Web material often may be).

I actually did get a Google hit at position 19 with the search terms you used 
(after a plethora of hits on log structured file system, of course), but it 
wasn't very enlightening.  Nor were several later ones, until hit 42 at the 
University of Waterloo - a .pdf that contains at least a brief description 
starting on page 21 (including a thinly-disguised rip-off of a figure in 
Gray & Reuter's classic Transaction Processing - but it's not quite 
*identical*...).

Aha - good old reliable IBM *does* still have some SFS documentation on line 
that hit 75 noticed; munging that URL a bit led to 
http://publib.boulder.ibm.com/infocenter/txformp/v5r1/index.jsp?noscript=1 
(expand Encina Books in the left-hand frame and start digging...).

- bill
 
 


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread can you guess?
...

 My understanding of ZFS (in short: an upside down
 tree) is that each block is referenced by it's
 parent. So regardless of how many snapshots you take,
 each block is only ever referenced by one other, and
 I'm guessing that the pointer and checksum are both
 stored there.
 
 If that's the case, to move a block it's just a case
 of:
 - read the data
 - write to the new location
 - update the pointer in the parent block

Which changes the contents of the parent block (the change in the data checksum 
changed it as well), and thus requires that this parent also be rewritten 
(using COW), which changes the pointer to it (and of course its checksum as 
well) in *its* parent block, which thus also must be re-written... and finally 
a new copy of the superblock is written to reflect the new underlying tree 
structure - all this in a single batch-written 'transaction'.

The old version of each of these blocks need only be *saved* if a snapshot 
exists and it hasn't previously been updated since that snapshot was created.  
But all the blocks need to be COWed even if no snapshot exists (in which case 
the old versions are simply discarded).
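
A toy model of that write-path behaviour (not ZFS code, just the shape of the 
argument):

    # Updating one leaf forces new copies of every ancestor up to the root;
    # the old copies are retained only if a snapshot still references them.
    class Block:
        def __init__(self, data=b"", children=None):
            self.data, self.children = data, list(children or [])

    def cow_update(path, new_data, snapshot_live):
        # path is [root, ..., parent, leaf]
        written = retained = 0
        child_old, child_new = path[-1], Block(new_data)
        written += 1
        for parent in reversed(path[:-1]):
            new_parent = Block(parent.data, parent.children)
            new_parent.children[parent.children.index(child_old)] = child_new
            written += 1
            if snapshot_live:
                retained += 1       # old parent copy lives on in the snapshot
            child_old, child_new = parent, new_parent
        if snapshot_live:
            retained += 1           # the old leaf itself
        return child_new, written, retained

    leaf = Block(b"old data")
    mid = Block(children=[leaf])
    root = Block(children=[mid])
    new_root, written, retained = cow_update([root, mid, leaf], b"new", True)
    print(written, retained)        # 3 blocks rewritten, 3 old ones retained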

...
 
 PS.
 
 1. You'd still need an initial defragmentation pass
 to ensure that the file was reasonably piece-wise
 contiguous to begin with.
 
 No, not necessarily.  If you were using a zpool
 configured like this I'd hope you were planning on
 creating the file as a contiguous block in the first
 place :)

I'm not certain that you could ensure this if other updates in the system were 
occurring concurrently.  Furthermore, the file may be extended dynamically as 
new data is inserted, and you'd like to have some mechanism that could restore 
reasonable contiguity to the result (which can be difficult to accomplish in 
the foreground if, for example, free space doesn't happen to exist on the disk 
right after the existing portion of the file).

...
 
 Any zpool with this option would probably be
 dedicated to the database file and nothing else.  In
 fact, even with multiple databases I think I'd have a
 single pool per database.

It's nice if you can afford such dedicated resources, but it seems a bit 
cavalier to ignore users who just want decent performance from a database that 
has to share its resources with other activity.

Your prompt response is probably what prevented me from editing my previous 
post after I re-read it and realized I had overlooked the fact that 
over-writing the old data complicates things.  So I'll just post the revised 
portion here:


3.  Now you must make the above transaction persistent, and then randomly 
over-write the old data block with the new data (since that data must be in 
place before you update the path to it below, and unfortunately since its 
location is not arbitrary you can't combine this update with either the 
transaction above or the transaction below).

4.  You can't just slide in the new version of the block using the old 
version's existing set of ancestors because a) you just deallocated that path 
above (introducing additional mechanism to preserve it temporarily almost 
certainly would not be wise), b) the data block checksum changed, and c) in any 
event this new path should be *newer* than the path to the old version's new 
location that you just had to establish (if a snapshot exists, that's the path 
that should be propagated to it by the COW mechanism).  However, this is just 
the normal situation whenever you update a data block (save for the fact that 
the block itself was already written above):  all the *additional* overhead 
occurred in the previous steps.

So instead of a single full-path update that fragments the file, you have two 
full-path updates, a random write, and possibly a random read initially to 
fetch the old data.  And you still need an initial defrag pass to establish 
initial contiguity.  Furthermore, these additional resources are consumed at 
normal rather than the reduced priority at which a background reorg can 
operate.  On the plus side, though, the file would be kept contiguous all the 
time rather than just returned to contiguity whenever there was time to do so.

...

 Taking it a stage further, I wonder if this would
 work well with the prioritized write feature request
 (caching writes to a solid state disk)?
  http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List
 
 That could potentially mean there's very little
 slowdown:
  - Read the original block
 - Save that to solid state disk
  - Write the new block in the original location
 - Periodically stream writes from the solid state
 disk to the main storage

I'm not sure this would confer much benefit if things in fact need to be 
handled as I described above.  In particular, if a snapshot exists you almost 
certainly must establish the old version in its new location in the snapshot 
rather than just capture it in the log; if no snapshot exists you could capture 
the old version in the log and 

Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread can you guess?
...

 With regards sharing the disk resources with other
 programs, obviously it's down to the individual
 admins how they would configure this,

Only if they have an unconstrained budget.

 but I would
 suggest that if you have a database with heavy enough
 requirements to be suffering noticable read
 performance issues due to fragmentation, then that
 database really should have it's own dedicated drives
 and shouldn't be competing with other programs.

You're not looking at it from a whole-system viewpoint (which if you're 
accustomed to having your own dedicated storage devices is understandable).

Even if your database performance is acceptable, if it's performing 50x as many 
disk seeks as it would otherwise need to when scanning a table that's affecting 
the performance of *other* applications.

 
 Also, I'm not saying defrag is bad (it may be the
 better solution here), just that if you're looking at
 performance in this kind of depth, you're probably
 experienced enough to have created the database in a
 contiguous chunk in the first place :-)

As I noted, ZFS may not allow you to ensure that and in any event if the 
database grows that contiguity may need to be reestablished.  You could grow 
the db in separate files, each of which was preallocated in full (though again 
ZFS may not allow you to ensure that each is created contiguously on disk), but 
while databases may include such facilities as a matter of course it would 
still (all other things being equal) be easier to manage everything if it could 
just extend a single existing file (or one file per table, if they needed to be 
kept separate) as it needed additional space.

 
 I do agree that doing these writes now sounds like a
 lot of work.  I'm guessing that needing two full-path
 updates to achieve this means you're talking about a
 much greater write penalty.

Not all that much.  Each full-path update is still only a single write request 
to the disk, since all the path blocks (again, possibly excepting the 
superblock) are batch-written together, thus mostly increasing only streaming 
bandwidth consumption.

...

 It may be that ZFS is not a good fit for this kind of
 use, and that if you're really concerned about this
 kind of performance you should be looking at other
 file systems.

I suspect that while it may not be a great fit now with relatively minor 
changes it could be at least an acceptable one.

- bill
 
 


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread can you guess?
Rats - I was right the first time:  there's a messy problem with snapshots.

The problem is that the parent of the child that you're about to update in 
place may *already* be in one or more snapshots because one or more of its 
*other* children was updated since each snapshot was created.  If so, then each 
snapshot copy of the parent is pointing to the location of the existing copy of 
the child you now want to update in place, and unless you change the snapshot 
copy of the parent (as well as the current copy of the parent) the snapshot 
will point to the *new* copy of the child you are now about to update (with an 
incorrect checksum to boot).

With enough snapshots, enough children, and bad enough luck, you might have to 
change the parent (and of course all its ancestors...) in every snapshot.

In other words, Nathan's approach is pretty much infeasible in the presence of 
snapshots.  Background defragmentation works as long as you move the entire 
region (which often has a single common parent) to a new location, which if the 
source region isn't excessively fragmented may not be all that expensive; it's 
probably not something you'd want to try at normal priority *during* an update 
to make Nathan's approach work, though, especially since you'd then wind up 
moving the entire region on every such update rather than in one batch in the 
background.
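
A small model of why that gets messy (toy structures, nothing like ZFS's 
actual on-disk format):

    # Every snapshot that captured its own copy of the parent (because some
    # *other* child changed after that snapshot was taken) still points at
    # the current location of the child we'd now like to move in place.
    live_parent = {"A": "loc_A_v1", "B": "loc_B_v1"}
    snapshots = []

    def take_snapshot():
        # each snapshot here gets a genuinely distinct parent copy because
        # B is updated between snapshots (unchanged parents would be shared)
        snapshots.append(dict(live_parent))

    def cow_update(child, new_loc):
        live_parent[child] = new_loc    # COW: only the live tree changes

    take_snapshot()                 # snap 1: A->v1, B->v1
    cow_update("B", "loc_B_v2")
    take_snapshot()                 # snap 2: A->v1, B->v2
    cow_update("B", "loc_B_v3")
    take_snapshot()                 # snap 3: A->v1, B->v3

    # Now try to reuse loc_A_v1 in place for a new version of child A:
    stale = [i + 1 for i, s in enumerate(snapshots) if s["A"] == "loc_A_v1"]
    print("snapshot parent copies pointing at the overwritten location:", stale)
    # -> [1, 2, 3]:  each would need its own parent (and ancestor) fix-up.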

- bill
 
 


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread can you guess?
 But the whole point of snapshots is that they don't
 take up extra space on the disk.  If a file (and
 hence a block) is in every snapshot it doesn't mean
 you've got multiple copies of it.  You only have one
 copy of that block, it's just referenced by many
 snapshots.

I used the wording "copies of a parent" loosely to mean previous states of the 
parent that also contain pointers to the current state of the child about to be 
updated in place.

 
 The thing is, the location of that block isn't saved
 separately in every snapshot either - the location is
 just stored in it's parent.

And in every earlier version of the parent that was updated for some *other* 
reason and still contains a pointer to the current child that someone using 
that snapshot must be able to follow correctly.

  So moving a block is
 just a case of updating one parent.

No:  every version of the parent that points to the current version of the 
child must be updated.

...

 If you think about it, that has to work for the old
 data since as I said before, ZFS already has this
 functionality.  If ZFS detects a bad block, it moves
 it to a new location on disk.  If it can already do
 that without affecting any of the existing snapshots,
 so there's no reason to think we couldn't use the
 same code for a different purpose.

Only if it works the way you think it works, rather than, say, by using a 
look-aside list of moved blocks (there shouldn't be that many of them), or by 
just leaving the bad block in the snapshot (if it's mirrored or 
parity-protected, it'll still be usable there unless a second failure occurs; 
if not, then it was lost anyway).

- bill
 
 


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread can you guess?
...

 just rearrange your blocks sensibly -
 and to at least some degree you could do that while
 they're still cache-resident

Lots of discussion has passed under the bridge since that observation above, 
but it may have contained the core of a virtually free solution:  let your 
table become fragmented, but each time that a sequential scan is performed on 
it determine whether the region that you're currently scanning is 
*sufficiently* fragmented that you should retain the sequential blocks that 
you've just had to access anyway in cache until you've built up around 1 MB of 
them and then (in a background thread) flush the result contiguously back to a 
new location in a single bulk 'update' that changes only their location rather 
than their contents.

1.  You don't incur any extra reads, since you were reading sequentially anyway 
and already have the relevant blocks in cache.  Yes, if you had reorganized 
earlier in the background the current scan would have gone faster, but if scans 
occur sufficiently frequently for their performance to be a significant issue 
then the *previous* scan will probably not have left things *all* that 
fragmented.  This is why you choose a fragmentation threshold to trigger reorg 
rather than just do it whenever there's any fragmentation at all, since the 
latter would probably not be cost-effective in some circumstances; conversely, 
if you only perform sequential scans once in a blue moon, every one may be 
completely fragmented but it probably wouldn't have been worth defragmenting 
constantly in the background to avoid this, and the occasional reorg triggered 
by the rare scan won't constitute enough additional overhead to justify heroic 
efforts to avoid it.  Such a 'threshold' is a crude but possibly adequate 
metric; a better but more complex one would perhaps nudge up the 
threshold value every time a sequential scan took place without an intervening 
update, such that rarely-updated but frequently-scanned files would eventually 
approach full contiguity, and an even finer-grained metric would maintain such 
information about each individual *region* in a file, but absent evidence that 
the single, crude, unchanging threshold (probably set to defragment moderately 
aggressively - e.g., whenever it takes more than 3 or 5 disk seeks to inhale a 
1 MB region) is inadequate these sound a bit like over-kill.

2.  You don't defragment data that's never sequentially scanned, avoiding 
unnecessary system activity and snapshot space consumption.

3.  You still incur additional snapshot overhead for data that you do decide to 
defragment for each block that hadn't already been modified since the most 
recent snapshot, but performing the local reorg as a batch operation means that 
only a single copy of all affected ancestor blocks will wind up in the snapshot 
due to the reorg (rather than potentially multiple copies in multiple snapshots 
if snapshots were frequent and movement was performed one block at a time).
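
In rough code, the scan-time decision above might look something like this 
(the names, the ~1 MB region size, and the seek threshold are all just the 
illustrative values from the text, not tuned ones):

    # While a sequential scan is pulling a region through the cache anyway,
    # count how many seeks it cost; past a threshold, hand the already-cached
    # blocks to a background thread that rewrites the region contiguously.
    REGION_BYTES = 1 << 20          # ~1 MB regions
    SEEK_THRESHOLD = 4              # "more than 3 or 5 seeks" per region

    def seeks_needed(extents):
        # extents: (disk_offset, length) runs making up the region, any order
        seeks, prev_end = 0, None
        for off, length in sorted(extents):
            if off != prev_end:
                seeks += 1
            prev_end = off + length
        return seeks

    def scan_region(extents, cached_blocks, relocate_queue):
        # deciding costs no extra I/O - the scan already read these blocks
        if seeks_needed(extents) > SEEK_THRESHOLD:
            relocate_queue.append(cached_blocks)   # background rewrite target

    queue = []
    fragmented = [(i * 40960, 8192) for i in range(128)]   # 128 scattered runs
    contiguous = [(0, REGION_BYTES)]
    scan_region(fragmented, b"...cached region...", queue)
    scan_region(contiguous, b"...cached region...", queue)
    print("regions queued for background rewrite:", len(queue))   # -> 1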

- bill
 
 


Re: [zfs-discuss] pls discontinue troll bait was: Yager on ZFS and

2007-11-19 Thread can you guess?
 OTOH, when someone whom I don't know comes across as
 a pushover, he loses 
 credibility.

It may come as a shock to you, but some people couldn't care less about those 
who assess 'credibility' on the basis of form rather than on the basis of 
content - which means that you can either lose out on potentially useful 
knowledge by ignoring them due to their form, or change your own attitude.

 I'd expect a senior engineer to show not only
 technical expertise but also 
 the ability to handle difficult situations, *not*
 adding to the 
 difficulties by his comments.

Another surprise for you, I'm afraid:  some people just don't meet your 
expectations in this area.  In particular, I only 'show my ability to handle 
difficult situations' in the manner that you suggest when I have some actual 
interest in the outcome - otherwise, I simply do what I consider appropriate 
and let the chips fall where they may.

Deal with that in whatever manner you see fit (just as I do).

- bill
 
 


Re: [zfs-discuss] pls discontinue troll bait was: Yager on ZFS and

2007-11-19 Thread can you guess?
 :  Big talk from someone who seems so intent on hiding
 :  their credentials.
 
 : Say, what?  Not that credentials mean much to me since I evaluate people
 : on their actual merit, but I've not been shy about who I am (when I
 : responded 'can you guess?' in registering after giving billtodd as my
 : member name I was being facetious).
 
 You're using a web-based interface to a mailing list and the 'billtodd'
 bit doesn't appear to any users (such as me) subscribed via that
 mechanism.

Then perhaps Sun should make more of a point of this in their Web-based 
registration procedure.

  So yes, 'can you guess?' is unhelpful and makes you look as if
 you're being deliberately unhelpful.

Appearances can be deceiving, in large part because they're so subjective.  
That's why sensible people dig beneath them before forming any significant 
impressions.

...

 : If you're still referring to your incompetent alleged research, [...]
 : [...] right out of the
 : same orifice from which you've pulled the rest of your crap.
 
 It's language like that that is causing the problem.

No, it's ignorant loudmouths like cook and al who are causing the problem:  I'm 
simply responding to them as I see fit.

  IMHO you're being a
 tad rude.

I'm being rude as hell to people who truly earned it, and intend to continue 
until they shape up or shut up.  So if you feel that there's a problem here 
that you'd like to help fix, I suggest that you try tackling it at its source.

- bill
 
 


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-19 Thread can you guess?
Regardless of the merit of the rest of your proposal, I think you have put your 
finger on the core of the problem:  aside from some apparent reluctance on the 
part of some of the ZFS developers to believe that any problem exists here at 
all (and leaving aside the additional monkey wrench that using RAID-Z here 
would introduce, because one could argue that files used in this manner are 
poor candidates for RAID-Z anyway hence that there's no need to consider 
reorganizing RAID-Z files), the *only* down-side (other than a small matter of 
coding) to defragmenting files in the background in ZFS is the impact that 
would have on run-time performance (which should be minimal if the 
defragmentation is performed at lower priority) and the impact it would have on 
the space consumed by a snapshot that existed while the defragmentation was 
being done.

One way to eliminate the latter would be simply not to reorganize while any 
snapshot (or clone) existed:  no worse than the situation today, and better 
whenever no snapshot or clone is present.  That would change the perceived 
'expense' of a snapshot, though, since you'd know you were potentially giving 
up some run-time performance whenever one existed - and it's easy to imagine 
installations which might otherwise like to run things such that a snapshot was 
*always* present.

Another approach would be just to accept any increased snapshot space overhead. 
 So many sequentially-accessed files are just written once and read-only 
thereafter that a lot of installations might not see any increased snapshot 
overhead at all.  Some files are never accessed sequentially (or done so only 
in situations where performance is unimportant), and if they could be marked 
"Don't reorganize" then they wouldn't contribute any increased snapshot 
overhead either.

One could introduce controls to limit the times when reorganization was done, 
though my inclination is to suspect that such additional knobs ought to be 
unnecessary.

One way to eliminate almost completely the overhead of the additional disk 
accesses consumed by background defragmentation would be to do it as part of 
the existing background scrubbing activity, but for actively-updated files one 
might want to defragment more often than one needed to scrub.

In any event, background defragmentation should be a relatively easy feature to 
introduce and try out if suitable multi-block contiguous allocation mechanisms 
already exist to support ZFS's existing batch writes.  Use of ZIL to perform 
opportunistic defragmentation while updated data was still present in the cache 
might be a bit more complex, but could still be worth investigating.

- bill
 
 


[zfs-discuss] I was going to send you an email

2007-11-18 Thread can you guess?
until I remembered that you said that you were speaking for others as well and 
decided that I'd like to speak to them too.

As I said in a different thread, I really do try to respond to people in the 
manner that they deserve (and believe that in most cases here I have done so):  
even though I recognize that this may be off-putting it's sometimes the only 
way to break through bias and complacency, and since I came to zfs-discuss in 
search of technical interaction rather than a warm, fuzzy feeling of belonging 
I don't see too much of a down-side (unless I've managed to scare off anyone 
who might otherwise have contributed some technical insight, which would be 
unfortunate).

But I do apologize if I've managed to offend any less-rabid bystanders (I was 
beginning to wonder whether there *were* any less-rabid bystanders) in the 
process, since that was not my intent.

- bill
 
 


Re: [zfs-discuss] pls discontinue troll bait was: Yager on ZFS and

2007-11-18 Thread can you guess?
 You've been trolling from the get-go and continue to
 do so.

Y'know, cookie, before letting the drool onto your keyboard you really ought to 
learn to research it.

I said a good deal of what I've said recently well over a year ago here (and in 
fact had forgotten how much detail I went into back then, else I might just 
have given a pointer to it).

  First it's "I have the magical fix", which
 wasn't a fix at all.

Just because you can't understand something doesn't mean it isn't feasible, 
dearie.

  You claim to want to better the
 project,

I'll have to see a reference for that, I'm afraid:  while I have some interest 
in ZFS from a technical standpoint, I've never had any kind of commitment to it.

 ...
 
 You rant and rave about how this is so much like wafl
 from a technical perspective, but then claim to not
 work for netapp or even KNOW anyone from netapp.  Yet
 a quick search of the net has you claiming to have
 worked with netapp hardware for years.

Another reference required, I'm afraid:  I've never touched a NetApp box, nor 
to the best of my knowledge used one (of course, when you're interacting with 
the Internet, you don't know what hardware may be on the other end).

  You must be
 the ONLY person on this planet to have used a vendors
 wares for YEARS, have an intimate technical knowledge
 of the wares, but not know a SINGLE person who works
 for the company selling or supporting such wares.

No, cookie:  you're just as incompetent a researcher as you are technically.

...

 ^^did you see that paragraph, it was a list of names
 of all the people on this list who care what you have
 to say.

Ah - I see that you responded to this post before responding to the poster who 
just proved you wrong (yet again).

- bill
 
 


Re: [zfs-discuss] pls discontinue troll bait was: Yager on ZFS and

2007-11-18 Thread can you guess?
Ah - no references to back up your drivel, I see.

No surprise there, of course - but thanks for playing.

- bill
 
 


Re: [zfs-discuss] pls discontinue troll bait was: Yager on ZFS and

2007-11-18 Thread can you guess?
 Big talk from someone who seems so intent on hiding
 their credentials.

Say, what?  Not that credentials mean much to me since I evaluate people on 
their actual merit, but I've not been shy about who I am (when I responded 'can 
you guess?' in registering after giving billtodd as my member name I was being 
facetious).

If you're still referring to your incompetent alleged research, I'm still 
waiting for something I can look at:  I do happen to know that another Bill 
Todd has long been associated with Interbase/Firebird (which is kind of ironic 
since Jim Starkey is an old friend of mine from my eleven years at DEC, though 
it's been a few years since we managed to get together), but aside from the 
possibility that you confused him with me I can only suspect that you pulled 
your 'discovery' right out of the same orifice from which you've pulled the 
rest of your crap.

- bill
 
 


Re: [zfs-discuss] pls discontinue troll bait was: Yager on ZFS and ZFS

2007-11-17 Thread can you guess?
 
 I've been observing two threads on zfs-discuss with
 the following 
 Subject lines:
 
 Yager on ZFS
 ZFS + DB + fragments
 
 and have reached the rather obvious conclusion that the author 'can 
 you guess?' is a professional spinmeister,
 you guess? is a professional spinmeister,

Ah - I see we have another incompetent psychic chiming in - and judging by his 
drivel below a technical incompetent as well.  While I really can't help him 
with the former area, I can at least try to educate him in the latter.

...

 Excerpt 1:  Is this premium technical BullShit (BS)
 or what?

Since you asked:  no, it's just clearly beyond your grade level, so I'll try to 
dumb it down enough for you to follow.

 
 - BS 301 'grad level technical BS'
 ---
 
 Still, it does drive up snapshot overhead, and if you
 start trying to 
 use snapshots to simulate 'continuous data
 protection' rather than 
 more sparingly the problem becomes more significant
 (because each 
 snapshot will catch any background defragmentation
 activity at a 
 different point, such that common parent blocks may
 appear in more 
 than one snapshot even if no child data has actually
 been updated). 
 Once you introduce CDP into the process (and it's
 tempting to, since 
 the file system is in a better position to handle it
 efficiently than 
 some add-on product), rethinking how one approaches
 snapshots (and COW 
 in general) starts to make more sense.

Do you by any chance not even know what 'continuous data protection' is?  It's 
considered a fairly desirable item these days and was the basis for several hot 
start-ups (some since gobbled up by bigger fish that apparently agreed that 
they were onto something significant), since it allows you to roll back the 
state of individual files or the system as a whole to *any* historical point 
you might want to (unlike snapshots, which require that you anticipate points 
you might want to roll back to and capture them explicitly - or take such 
frequent snapshots that you'll probably be able to get at least somewhere near 
any point you might want to, a second-class simulation of CDP which some 
vendors offer because it's the best they can do and is precisely the activity 
which I outlined above, expecting that anyone sufficiently familiar with file 
systems to be able to follow the discussion would be familiar with it).

But given your obvious limitations I guess I should spell it out in words of 
even fewer syllables:

1.  Simulating CDP without actually implementing it means taking very frequent 
snapshots.

2.  Taking very frequent snapshots means that you're likely to interrupt 
background defragmentation activity such that one child of a parent block is 
moved *before* a snapshot is taken while another is moved *after* it.  When 
that happens, a before-image of the parent must be captured (because at least 
one of its pointers is about to change) *and of all ancestors of the parent* 
(because the pointer change propagates through all the ancestral checksums - 
and pointers, with COW) in every snapshot taken immediately prior to moving 
*any* of its children, rather than just capturing a single before-image of the 
parent and its ancestors after which all the child pointers would likely get 
changed before the next snapshot is taken.

So that's what any competent reader should have been able to glean from the 
comments that stymied you.  The paragraph's concluding comments were 
considerably more general in nature and thus legitimately harder to follow:  
had you asked for clarification rather than just assumed that they were BS 
simply because you couldn't understand them, you would not have looked like 
such an idiot.  But since you did call them into question, I'll now put a bit 
more flesh on them for those who may be able to follow a discussion at that 
level of detail:

3.  The file system is in a better position to handle CDP than some external 
mechanism because

a) the file system knows (right down to the byte level if it wants to) exactly 
what any individual update is changing,

b) the file system knows which updates are significant (e.g., there's probably 
no intrinsic need to capture rollback information for lazy writes because the 
application didn't care whether they were made persistent at that time, but for 
any explicitly-forced writes or syncs a rollback point should be established), 
and

c) the file system is already performing log forces (where a log is involved) 
or batch disk updates (a la ZFS) to honor such application-requested 
persistence, and can piggyback the required CDP before-image persistence on 
them rather than requiring separate synchronous log or disk accesses to do so.

4.  If you've got full-fledged CDP, it's questionable whether you need 
snapshots as well (unless you have really, really inflexible requirements for 
virtually instantaneous rollback and/or for high-performance writable-clone 
access) - and if CDP turns out to be this decade's important new file

Re: [zfs-discuss] ZFS + DB + fragments

2007-11-16 Thread can you guess?
...

 I personally believe that since most people will have
 hardware LUN's
 (with underlying RAID) and cache, it will be
 difficult to notice
 anything. Given that those hardware LUN's might be
 busy with their own
 wizardry ;) You will also have to minimize the effect
 of the database
 cache ...

By definition, once you've got the entire database in cache, none of this 
matters (though filling up the cache itself takes some added time if the table 
is fragmented).

Most real-world databases don't manage to fit all or even mostly in cache, 
because people aren't willing to dedicate that much RAM to running them.  
Instead, they either use a lot less RAM than the database size or share the 
system with other activity that shares use of the RAM.

In other words, they use a cost-effective rather than a money-is-no-object 
configuration, but then would still like to get the best performance they can 
from it.

 
 It will be a tough assignment ... maybe someone has
 already done this?
 
 Thinking about this (very abstract) ... does it
 really matter?
 
 [8KB-a][8KB-b][8KB-c]
 
 So what it 8KB-b gets updated and moved somewhere
 else? If the DB gets
 a request to read 8KB-a, it needs to do an I/O
 (eliminate all
 caching). If it gets a request to read 8KB-b, it
 needs to do an I/O.
 
 Does it matter that b is somewhere else ...

Yes, with any competently-designed database.

 it still
 needs to go get
 it ... only in a very abstract world with read-ahead
 (both hardware or
 db) would 8KB-b be in cache after 8KB-a was read.

1.  If there's no other activity on the disk, then the disk's track cache will 
acquire the data that follows when the first block is read, because it has 
nothing better to do.  But if all the disks are just sitting around waiting for 
this table scan to get to them, then a sufficiently intelligent ZFS read-ahead 
mechanism could help out a lot here as well:  the differences become greater 
when the system is busier.

2.  Even a moderately smart disk will detect a sequential access pattern if one 
exists and may read ahead at least modestly after having detected that pattern 
even if it *does* have other requests pending.

3.  But in any event any competent database will explicitly issue prefetches 
when it knows (and it *does* know) that it is scanning a table sequentially - 
and will also have taken pains to try to ensure that the table data is laid out 
such that it can be scanned efficiently.  If it's using disks that support 
tagged command queuing it may just issue a bunch of single-database-block 
requests at once, and the disk will organize them such that they can all be 
satisfied by a single streaming access; with disks that don't support queuing, 
the database can elect to issue a single large I/O request covering many 
database blocks, accomplishing the same thing as long as the table is in fact 
laid out contiguously on the medium (the database knows this if it's handling 
the layout directly, but when it's using a file system as an intermediary it 
usually can only hope that the file system has minimized file fragmentation).
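
To make point 3 concrete, here's a rough sketch (Python, purely illustrative - 
the file name, block size, and queue depth are all made up) of the two 
approaches:  keep a batch of single-block reads in flight so a queuing disk can 
coalesce them, or issue one large multi-block read when the on-disk layout can 
be trusted:

    import os
    from concurrent.futures import ThreadPoolExecutor

    BLOCK = 16 * 1024          # hypothetical 16 KB database block size
    TABLE = "table.dat"        # hypothetical table file (Unix, for os.pread)

    def read_block(fd, blockno):
        # One single-block request; with several of these outstanding, a
        # disk that supports command queuing can service them in one sweep.
        return os.pread(fd, BLOCK, blockno * BLOCK)

    def scan_with_queuing(fd, start, count, depth=16):
        # Keep 'depth' single-block requests in flight at once.
        with ThreadPoolExecutor(max_workers=depth) as pool:
            return list(pool.map(lambda b: read_block(fd, b),
                                 range(start, start + count)))

    def scan_with_large_io(fd, start, count):
        # One large request covering 'count' blocks - only pays off if the
        # table really is laid out contiguously on the medium.
        return os.pread(fd, count * BLOCK, start * BLOCK)

    fd = os.open(TABLE, os.O_RDONLY)
    first = scan_with_queuing(fd, 0, 64)     # 64 blocks, 16 in flight
    rest = scan_with_large_io(fd, 64, 64)    # one 1 MB request
    os.close(fd)

Either way the database, not the file system, is the component that knows a 
sequential scan is in progress and can act on that knowledge.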

 
 Hmmm... the only way is to get some data :) *hehe*

Data is good, as long as you successfully analyze what it actually means:  it 
either tends to confirm one's understanding or to refine it.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-16 Thread can you guess?
 can you guess? billtodd at metrocast.net writes:
  
  You really ought to read a post before responding
 to it:  the CERN study
  did encounter bad RAM (and my post mentioned that)
 - but ZFS usually can't
  do a damn thing about bad RAM, because errors tend
 to arise either
  before ZFS ever gets the data or after it has
 already returned and checked
  it (and in both cases, ZFS will think that
 everything's just fine).
 
 According to the memtest86 author, corruption most
 often occurs at the moment 
 memory cells are written to, by causing bitflips in
 adjacent cells. So when a 
 disk DMA data to RAM, and corruption occur when the
 DMA operation writes to 
 the memory cells, and then ZFS verifies the checksum,
 then it will detect the 
 corruption.
 
 Therefore ZFS is perfectly capable (and even likely)
 to detect memory 
 corruption during simple read operations from a ZFS
 pool.
 
 Of course there are other cases where neither ZFS nor
 any other checksumming 
 filesystem is capable of detecting anything (e.g. the
 sequence of events: data 
 is corrupted, checksummed, written to disk).

Indeed - the latter was the first of the two scenarios that I sketched out.  
But at least on the read end of things ZFS should have a good chance of 
catching errors due to marginal RAM.
That must mean that most of the worrisome alpha-particle problems of yore have 
finally been put to rest (since they'd be similarly likely to trash data on the 
read side after ZFS had verified it).  I think I remember reading that 
somewhere at some point, but I'd never gotten around to reading that far in the 
admirably-detailed documentation that accompanies memtest:  thanks for 
enlightening me.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-15 Thread can you guess?
 can you guess? wrote:
  For very read intensive and position sensitive
  applications, I guess 
  this sort of capability might make a difference?
  
  No question about it.  And sequential table scans
 in databases 
  are among the most significant examples, because
 (unlike things 
  like streaming video files which just get laid down
 initially 
  and non-synchronously in a manner that at least
 potentially 
  allows ZFS to accumulate them in large, contiguous
 chunks - 
  though ISTR some discussion about just how well ZFS
 managed 
  this when it was accommodating multiple such write
 streams in 
  parallel) the tables are also subject to
 fine-grained, 
  often-random update activity.
  
  Background defragmentation can help, though it
 generates a 
  boatload of additional space overhead in any
 applicable snapshot.
 
 The reason that this is hard to characterize is that
 there are
 really two very different configurations used to
 address different
 performance requirements: cheap and fast.  It seems
 that when most
 people first consider this problem, they do so from
 the cheap
 perspective: single disk view.  Anyone who strives
 for database
 performance will choose the fast perspective:
 stripes.

And anyone who *really* understands the situation will do both.

  Note: data
 redundancy isn't really an issue for this analysis,
 but consider it
 done in real life.  When you have a striped storage
 device under a
 file system, then the database or file system's view
 of contiguous
 data is not contiguous on the media.

The best solution is to make the data piece-wise contiguous on the media at the 
appropriate granularity - which is largely determined by disk access 
characteristics (the following assumes that the database table is large enough 
to be spread across a lot of disks at moderately coarse granularity, since 
otherwise it's often small enough to cache in the generous amounts of RAM that 
are inexpensively available today).

A single chunk on an (S)ATA disk today (the analysis is similar for 
high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to yield 
over 80% of the disk's maximum possible (fully-contiguous layout) sequential 
streaming performance, after the overhead of an 'average' - 1/3 stroke - 
initial seek and partial rotation are figured in (the latter could be avoided 
by using a chunk size that's an integral multiple of the track size, but on 
today's zoned disks that's a bit awkward).  A 1 MB chunk yields around 50% of 
the maximum streaming performance.  ZFS's maximum 128 KB 'chunk size', if 
effectively used as the disk chunk size as you seem to be suggesting, yields 
only about 15% of the disk's maximum streaming performance (leaving aside an 
additional degradation to a small fraction of even that should you use RAID-Z). 
And if you match the ZFS block size to a 16 KB database block size and use 
that as the effective unit of distribution across the set of disks, you'll 
obtain a mighty 2% of the potential streaming performance (again, we'll be 
charitable and ignore the further degradation if RAID-Z is used).
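
For those who want to check the arithmetic, here's the back-of-the-envelope 
model behind those percentages (Python; the 8 ms one-third-stroke seek, 
7,200 rpm rotation, and ~75 MB/s media rate are assumed ballpark figures, not 
any particular drive's spec sheet):

    # Rough fraction of a disk's streaming bandwidth obtained per chunk:
    # time spent transferring vs. (seek + rotational latency + transfer).
    SEEK_MS = 8.0                  # assumed ~1/3-stroke average seek
    ROT_MS = 0.5 * 60000 / 7200    # half a rotation at 7,200 rpm (~4.2 ms)
    MEDIA_MBS = 75.0               # assumed sustained media transfer rate

    def streaming_fraction(chunk_kb):
        xfer_ms = chunk_kb / 1024.0 / MEDIA_MBS * 1000.0
        return xfer_ms / (SEEK_MS + ROT_MS + xfer_ms)

    for kb in (16, 128, 1024, 4096):
        print("%5d KB chunk: %3.0f%% of full streaming rate"
              % (kb, 100 * streaming_fraction(kb)))
    # Prints roughly 2%, 12%, 52%, and 81% - the same ballpark as the
    # figures used above.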

Now, if your system is doing nothing else but sequentially scanning this one 
database table, this may not be so bad:  you get truly awful disk utilization 
(2% of its potential in the last case, ignoring RAID-Z), but you can still read 
ahead through the entire disk set and obtain decent sequential scanning 
performance by reading from all the disks in parallel.  But if your database 
table scan is only one small part of a workload which is (perhaps the worst 
case) performing many other such scans in parallel, your overall system 
throughput will be only around 4% of what it could be had you used 1 MB chunks 
(and the individual scan performances will also suck commensurately, of course).

Using 1 MB chunks still spreads out your database admirably for parallel 
random-access throughput:  even if the table is only 1 GB in size (eminently 
cachable in RAM, should that be preferable), that'll spread it out across 1,000 
disks (2,000, if you mirror it and load-balance to spread out the accesses), 
and for much smaller database tables if they're accessed sufficiently heavily 
for throughput to be an issue they'll be wholly cache-resident.  Or another way 
to look at it is in terms of how many disks you have in your system:  if it's 
less than the number of MB in your table size, then the table will be spread 
across all of them regardless of what chunk size is used, so you might as well 
use one that's large enough to give you decent sequential scanning performance 
(and if your table is too small to spread across all the disks, then it may 
well all wind up in cache anyway).

ZFS's problem (well, the one specific to this issue, anyway) is that it tries 
to use its 'block size' to cover two different needs:  performance for 
moderately fine-grained updates (though its need to propagate those updates 
upward to the root of the applicable tree

Re: [zfs-discuss] Yager on ZFS

2007-11-15 Thread can you guess?
...

 Well, ZFS allows you to put its ZIL on a separate
 device which could
 be NVRAM.

And that's a GOOD thing (especially because it's optional rather than requiring 
that special hardware be present).  But if I understand the ZIL correctly not 
as effective as using NVRAM as a more general kind of log for a wider range of 
data sizes and types, as WAFL does.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS for consumers WAS:Yager on ZFS

2007-11-15 Thread can you guess?
...

 At home the biggest reason I
 went with ZFS for my
 data is ease of management. I split my data up based
 on what it is ...
 media (photos, movies, etc.), vendor stuff (software,
 datasheets,
 etc.), home directories, and other misc. data. This
 gives me a good
 way to control backups based on the data type.

It's not immediately clear why simply segregating the different data types into 
different directory sub-trees wouldn't allow you to do pretty much the same 
thing.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-15 Thread can you guess?
Richard Elling wrote:

...

 there are
 really two very different configurations used to
 address different
 performance requirements: cheap and fast.  It seems
 that when most
 people first consider this problem, they do so from
 the cheap
 perspective: single disk view.  Anyone who strives
 for database
 performance will choose the fast perspective:
 stripes.
 

 And anyone who *really* understands the situation will do both.
   
 
 I'm not sure I follow.  Many people who do high performance
 databases use hardware RAID arrays which often do not
 expose single disks.

They don't have to expose single disks:  they just have to use reasonable chunk 
sizes on each disk, as I explained later.

Only very early (or very low-end) RAID used very small per-disk chunks (up to 
64 KB max).  Before the mid-'90s chunk sizes had grown to 128 - 256 KB per disk 
on mid-range arrays in order to improve disk utilization in the array.  From 
talking with one of its architects years ago my impression is that HP's (now 
somewhat aging) EVA series uses 1 MB as its chunk size (the same size I used as 
an example, though today one could argue for as much as 4 MB and soon perhaps 
even more).

The array chunk size is not the unit of update, just the unit of distribution 
across the array:  RAID-5 will happily update a single 4 KB file block within a 
given array chunk and the associated 4 KB of parity within the parity chunk.  
But the larger chunk size does allow files to retain the option of using 
logical contiguity to attain better streaming sequential performance, rather 
than splintering that logical contiguity at fine grain across multiple disks.
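
For anyone unclear on why the chunk size isn't the update size:  a RAID-5 small 
write is just the classic read-modify-write on one data block and its parity 
block, roughly as sketched below (a toy Python illustration with made-up 4 KB 
blocks; real implementations obviously operate on raw devices, not Python 
bytes):

    BLOCK = 4096   # toy 4 KB file-system block inside a much larger chunk

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def raid5_small_write(old_data, old_parity, new_data):
        # Classic read-modify-write:  new_parity = old_parity XOR
        # old_data XOR new_data.  Two reads and two writes touch only the
        # data chunk and the parity chunk, however large the chunks are.
        return xor(xor(old_parity, old_data), new_data)

    old_data = bytes([0x11]) * BLOCK
    old_parity = bytes([0xFF]) * BLOCK
    new_data = bytes([0xAA]) * BLOCK
    new_parity = raid5_small_write(old_data, old_parity, new_data)
    assert xor(new_parity, new_data) == xor(old_parity, old_data)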

...

 A single chunk on an (S)ATA disk today (the analysis is similar for 
 high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size 
 to yield over 80% of the disk's maximum possible (fully-contiguous 
 layout) sequential streaming performance (after the overhead of an 
 'average' - 1/3 stroke - initial seek and partial rotation are figured 
 in:  the latter could be avoided by using a chunk size that's an 
 integral multiple of the track size, but on today's zoned disks that's 
 a bit awkward).  A 1 MB chunk yields around 50% of the maximum 
 streaming performance.  ZFS's maximum 128 KB 'chunk size' if 
 effectively used as the disk chunk size as you seem to be suggesting 
 yields only about 15% of the disk's maximum streaming performance 
 (leaving aside an additional degradation to a small fraction of even 
 that should you use RAID-Z).  And if you match the ZFS block size to a 
 16 KB database block size and use that as the effective unit of 
 distribution across the set of disks, you'll 
 obtain a mighty 2% of the potential streaming performance (again, we'll 
 be charitable and ignore the further degradation if RAID-Z is used).

   
 
 You do not seem to be considering the track cache, which for
 modern disks is 16-32 MBytes.  If those disks are in a RAID array,
 then there is often larger read caches as well.

Are you talking about hardware RAID in that last comment?  I thought ZFS was 
supposed to eliminate the need for that.

  Expecting a seek and
 read for each iop is a bad assumption.

The bad assumption is that the disks are otherwise idle and therefore have the 
luxury of filling up their track caches - especially when I explicitly assumed 
otherwise in the following paragraph in that post.  If the system is heavily 
loaded the disks will usually have other requests queued up (even if the next 
request comes in immediately rather than being queued at the disk itself, an 
even half-smart disk will abort any current read-ahead activity so that it can 
satisfy the new request).

Not that it would necessarily do much good for the case currently under 
discussion even if the disks weren't otherwise busy and they did fill up the 
track caches:  ZFS's COW policies tend to encourage data that's updated 
randomly at fine grain (as a database table often is) to be splattered across 
the storage rather than neatly arranged such that the next data requested from 
a given disk will just happen to reside right after the previous data requested 
from that disk.

 
 Now, if your system is doing nothing else but sequentially scanning 
 this one database table, this may not be so bad:  you get truly awful 
 disk utilization (2% of its potential in the last case, ignoring 
 RAID-Z), but you can still read ahead through the entire disk set and 
 obtain decent sequential scanning performance by reading from all the 
 disks in parallel.  But if your database table scan is only one small 
 part of a workload which is (perhaps the worst case) performing many 
 other such scans in parallel, your overall system throughput will be 
 only around 4% of what it could be had you used 1 MB chunks (and the 
 individual scan performances will also suck commensurately, of course).

...

 Real data would be greatly appreciated.  In my tests, I see
 reasonable media bandwidth speeds 

Re: [zfs-discuss] Yager on ZFS

2007-11-15 Thread can you guess?
Adam Leventhal wrote:
 On Thu, Nov 08, 2007 at 07:28:47PM -0800, can you guess? wrote:
 How so? In my opinion, it seems like a cure for the brain damage of RAID-5.
 Nope.

 A decent RAID-5 hardware implementation has no 'write hole' to worry about, 
 and one can make a software implementation similarly robust with some effort 
 (e.g., by using a transaction log to protect the data-plus-parity 
 double-update or by using COW mechanisms like ZFS's in a more intelligent 
 manner).
 
 Can you reference a software RAID implementation which implements a solution
 to the write hole and performs well.

No, but I described how to use a transaction log to do so and later on in the 
post how ZFS could implement a different solution more consistent with its 
current behavior.  In the case of the transaction log, the key is to use the 
log not only to protect the RAID update but to protect the associated 
higher-level file operation as well, such that a single log force satisfies 
both (otherwise, logging the RAID update separately would indeed slow things 
down - unless you had NVRAM to use for it, in which case you've effectively 
just reimplemented a low-end RAID controller - which is probably why no one has 
implemented that kind of solution in a stand-alone software RAID product).

...
 
 The part of RAID-Z that's brain-damaged is its 
 concurrent-small-to-medium-sized-access performance (at least up to request 
 sizes equal to the largest block size that ZFS supports, and arguably 
 somewhat beyond that):  while conventional RAID-5 can satisfy N+1 
 small-to-medium read accesses or (N+1)/2 small-to-medium write accesses in 
 parallel (though the latter also take an extra rev to complete), RAID-Z can 
 satisfy only one small-to-medium access request at a time (well, plus a 
 smidge for read accesses if it doesn't verify the parity) - effectively 
 providing RAID-3-style performance.
 
 Brain damage seems a bit of an alarmist label.

I consider 'brain damage' to be if anything a charitable characterization.

 While you're certainly right
 that for a given block we do need to access all disks in the given stripe,
 it seems like a rather quaint argument: aren't most environments that matter
 trying to avoid waiting for the disk at all?

Everyone tries to avoid waiting for the disk at all.  Remarkably few succeed 
very well.

 Intelligent prefetch and large
 caches -- I'd argue -- are far more important for performance these days.

Intelligent prefetch doesn't do squat if your problem is disk throughput (which 
in server environments it frequently is).  And all caching does (if you're 
lucky and your workload benefits much at all from caching) is improve your 
system throughput at the point where you hit the disk throughput wall.

Improving your disk utilization, by contrast, pushes back that wall.  And as I 
just observed in another thread, not by 20% or 50% but potentially by around 
two decimal orders of magnitude if you compare the sequential-scan performance 
of multiple randomly-updated database tables on a moderately coarsely-chunked 
conventional RAID against the same tables on a fine-grained ZFS block size 
(e.g., the 16 KB used by the example database) with each block sprayed across 
several disks.

Sure, that's a worst-case scenario.  But two orders of magnitude is a hell of a 
lot, even if it doesn't happen often - and suggests that in more typical cases 
you're still likely leaving a considerable amount of performance on the table 
even if that amount is a lot less than a factor of 100.

 
 The easiest way to fix ZFS's deficiency in this area would probably be to 
 map each group of N blocks in a file as a stripe with its own parity - which 
 would have the added benefit of removing any need to handle parity groups at 
 the disk level (this would, incidentally, not be a bad idea to use for 
 mirroring as well, if my impression is correct that there's a remnant of 
 LVM-style internal management there).  While this wouldn't allow use of 
 parity RAID for very small files, in most installations they really don't 
 occupy much space compared to that used by large files so this should not 
 constitute a significant drawback.
 
 I don't really think this would be feasible given how ZFS is stratified
 today, but go ahead and prove me wrong: here are the instructions for
 bringing over a copy of the source code:
 
   http://www.opensolaris.org/os/community/tools/scm

Now you want me not only to design the fix but code it for you?  I'm afraid 
that you vastly overestimate my commitment to ZFS:  while I'm somewhat 
interested in discussing it and happy to provide what insights I can, I really 
don't personally care whether it succeeds or fails.

But I sort of assumed that you might.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-15 Thread can you guess?
...

 For modern disks, media bandwidths are now getting to
 be  100 MBytes/s.
 If you need 500 MBytes/s of sequential read, you'll
 never get it from 
 one disk.

And no one here even came remotely close to suggesting that you should try to.

 You can get it from multiple disks, so the questions
 are:
 1. How to avoid other bottlenecks, such as a
  shared fibre channel 
 ath?  Diversity.
 2. How to predict the data layout such that you
 can guarantee a wide 
 spread?

You've missed at least one more significant question:

3.  How to lay out the data such that this 500 MB/s drain doesn't cripple 
*other* concurrent activity going on in the system.  That's what increasing the 
amount laid down on each drive to around 1 MB accomplishes.  Otherwise, you can 
easily wind up using all the system's disk resources to satisfy that one 
application, or even fall short if you have fewer than 50 disks available:  if 
you spread the data out relatively randomly in 128 KB chunks on a system whose 
disks are reasonably well-filled with data, you'll only be obtaining around 
10 MB/s from each disk, whereas with 1 MB chunks similarly spread about each 
disk can contribute more like 35 MB/s and you'll need only 14 - 15 disks to 
meet your requirement.

Use smaller ZFS block sizes and/or RAID-Z and things get rapidly worse.
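
Putting rough numbers on that (same assumed disk as in the earlier post:  about 
12 ms of positioning overhead per random chunk and ~75 MB/s media rate - 
ballpark figures, not a spec sheet):

    SEEK_PLUS_ROT_MS = 12.2   # assumed ~8 ms seek + ~4.2 ms half rotation
    MEDIA_MBS = 75.0          # assumed sustained media transfer rate
    TARGET_MBS = 500.0        # the sequential-read requirement above

    def per_disk_mbs(chunk_mb):
        # Throughput per disk when each random access fetches one chunk.
        xfer_ms = chunk_mb / MEDIA_MBS * 1000.0
        return chunk_mb / ((SEEK_PLUS_ROT_MS + xfer_ms) / 1000.0)

    for chunk_mb in (0.125, 1.0):
        mbs = per_disk_mbs(chunk_mb)
        print("%.3f MB chunks: ~%.0f MB/s per disk, ~%.0f disks needed"
              % (chunk_mb, mbs, TARGET_MBS / mbs))
    # Prints roughly 9 MB/s (about 55 disks) for 128 KB chunks versus
    # roughly 39 MB/s (about 13 disks) for 1 MB chunks - in line with
    # the 10 MB/s / 35 MB/s figures above.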

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-14 Thread can you guess?
 This question triggered some silly questions in my
 mind:

Actually, they're not silly at all.

 
 Lots of folks are determined that the whole COW to
 different locations 
 are a Bad Thing(tm), and in some cases, I guess it
 might actually be...
 
 What if ZFS had a pool / filesystem property that
 caused zfs to do a 
 journaled, but non-COW update so the data's relative
 location for 
 databases is always the same?

That's just what a conventional file system (no need even for a journal, when 
you're updating in place) does when it's not guaranteeing write atomicity (you 
address the latter below).

 
 Or - What if it did a double update: One to a staged
 area, and another 
 immediately after that to the 'old' data blocks.
 Still always have 
 on-disk consistency etc, at a cost of double the
 I/O's...

It only requires an extra disk access if the new data is too large to dump 
right into the journal itself (which is what guarantees that the subsequent 
in-place update can complete).  Whether the new data is dumped into the log or 
into a temporary staging location whose pointer is logged, the subsequent 
in-place update can be deferred until it's convenient:  e.g., until any 
additional updates to the same data have been accumulated, activity has cooled 
off, and the modified blocks are getting ready to be evicted from the system 
cache - and, optionally, until the target disks are idle or have their heads 
positioned conveniently near the target location.
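
A minimal sketch of that scheme (Python; the record format and class are 
invented purely for illustration):  force the new data to a journal first, then 
apply the in-place updates whenever convenient and retire the journal entries:

    import os, struct, zlib

    class InPlaceJournal:
        # Toy write-ahead journal:  force the new data to the log, then
        # defer and batch the in-place updates of the real files.
        def __init__(self, logpath):
            self.log = open(logpath, "ab", buffering=0)
            self.pending = []                    # (path, offset, data)

        def write(self, path, offset, data):
            rec = struct.pack(">IQI", len(path), offset, len(data))
            rec += path.encode() + data
            self.log.write(struct.pack(">I", zlib.crc32(rec)) + rec)
            os.fsync(self.log.fileno())          # the only synchronous step
            self.pending.append((path, offset, data))

        def checkpoint(self):
            # Apply deferred in-place updates (the target files are
            # assumed to exist already), ideally when the disks are idle.
            for path, offset, data in self.pending:
                with open(path, "r+b") as f:
                    f.seek(offset)
                    f.write(data)
                    os.fsync(f.fileno())
            self.pending.clear()
            self.log.truncate(0)                 # journal entries retired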

ZFS's small-synchronous-write log can do something similar as long as the 
writes aren't too large to place in it.  However, data that's only persisted in 
the journal isn't accessible via the normal snapshot mechanisms (well, if an 
entire file block was dumped into the journal I guess it could be, at the cost 
of some additional complexity in journal space reuse), so I'm guessing that ZFS 
writes back any dirty data that's in the small-update journal whenever a 
snapshot is created.

And if you start actually updating in place as described above, then you can't 
use ZFS-style snapshotting at all:  instead of capturing the current state as 
the snapshot with the knowledge that any subsequent updates will not disturb 
it, you have to capture the old state that you're about to over-write and stuff 
it somewhere else - and then figure out how to maintain appropriate access to 
it while the rest of the system moves on.

Snapshots make life a lot more complex for file systems than it used to be, and 
COW techniques make snapshotting easy at the expense of normal run-time 
performance - not just because they make update-in-place infeasible for 
preserving on-disk contiguity but because of the significant increase in disk 
bandwidth (and snapshot storage space) required to write back changes all the 
way up to whatever root structure is applicable:  I suspect that ZFS does this 
on every synchronous update save for those that it can leave temporarily in its 
small-update journal, and it *has* to do it whenever a snapshot is created.

 
 Of course, both of these would require non-sparse
 file creation for the 
 DB etc, but would it be plausible?

Update-in-place files can still be sparse:  it's only data that already exists 
that must be present (and updated in place to preserve sequential access 
performance to it).

 
 For very read intensive and position sensitive
 applications, I guess 
 this sort of capability might make a difference?

No question about it.  And sequential table scans in databases are among the 
most significant examples, because (unlike things like streaming video files 
which just get laid down initially and non-synchronously in a manner that at 
least potentially allows ZFS to accumulate them in large, contiguous chunks - 
though ISTR some discussion about just how well ZFS managed this when it was 
accommodating multiple such write streams in parallel) the tables are also 
subject to fine-grained, often-random update activity.

Background defragmentation can help, though it generates a boatload of 
additional space overhead in any applicable snapshot.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-14 Thread can you guess?
 some business do not accept any kind of risk

Businesses *always* accept risk:  they just try to minimize it within the 
constraints of being cost-effective.  Which is a good thing for ZFS, because it 
can't eliminate risk either, just help to minimize it cost-effectively.

However, the subject here is not business use but 'consumer' use.

...

 at the moment only ZFS can give this assurance, plus
 the ability to
 self correct detected
 errors.

You clearly aren't very familiar with WAFL (which can do the same).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-14 Thread can you guess?
...

  And how about FAULTS?
  hw/firmware/cable/controller/ram/...
 
  If you had read either the CERN study or what I
 already said about  
  it, you would have realized that it included the
 effects of such  
  faults.
 
 
 ...and ZFS is the only prophylactic available.

You don't *need* a prophylactic if you're not having sex:  the CERN study found 
*no* clear instances of faults that would occur in consumer systems and that 
could be attributed to the kinds of errors that ZFS can catch and more 
conventional file systems can't.  It found faults in the interaction of its 
add-on RAID controller (not a normal 'consumer' component) with its WD disks; 
it found single-bit errors that appeared to correlate with ECC RAM errors 
(i.e., that likely occurred in RAM rather than at any point where ZFS would be 
involved); and it found block-sized errors that appeared to correlate with 
misplaced virtual memory allocation (again, outside ZFS's sphere of influence).

 
 
 
  ...
 
   but I had a box that was randomly
  corrupting blocks during
  DMA.  The errors showed up when doing a ZFS
 scrub
  and
  I caught the
  problem in time.
 
  Yup - that's exactly the kind of error that ZFS
 and
  WAFL do a
  perhaps uniquely good job of catching.
 
  WAFL can't catch all: It's distantly isolated from
  the CPU end.
 
  WAFL will catch everything that ZFS catches,
 including the kind of  
  DMA error described above:  it contains validating
 information  
  outside the data blocks just as ZFS does.
 
 Explain how it can do that, when it is isolated from
 the application  
 by several layers including the network?

Darrell covered one aspect of this (i.e., that ZFS couldn't either if it were 
being used in a server), but there's another as well:  as long as the NFS 
messages between client RAM and server RAM are checksummed in RAM on both ends, 
then that extends the checking all the way to client RAM (the same place where 
local ZFS checks end) save for any problems occurring *in* RAM at one end or 
the other (and ZFS can't deal with in-RAM problems either:  all it can do is 
protect the data until it gets to RAM).
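
The principle is just end-to-end checking:  compute the check over the payload 
while it's still in the sender's RAM and verify it only after it lands in the 
receiver's RAM, so anything the wire, adapters, or intermediate buffers did to 
it in between gets caught.  A bare-bones sketch (Python; the framing is 
invented for illustration and is not the actual NFS/RPC format):

    import struct, zlib

    def frame(payload):
        # Checksum computed over the payload while it is still in the
        # sender's RAM, then shipped along with it.
        return struct.pack(">I", zlib.crc32(payload)) + payload

    def unframe(message):
        # Verified only after the payload has landed in the receiver's
        # RAM, so corruption anywhere along the path is detected.
        (expected,) = struct.unpack(">I", message[:4])
        payload = message[4:]
        if zlib.crc32(payload) != expected:
            raise IOError("payload corrupted in transit")
        return payload

    data = b"some file block"
    assert unframe(frame(data)) == data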

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-14 Thread can you guess?
 can you guess? wrote:
 
  at the moment only ZFS can give this assurance,
 plus
  the ability to
  self correct detected
  errors.
  
 
  You clearly aren't very familiar with WAFL (which
 can do the same).
 


...

 so far as I can tell it's quite
 irrelevant to me at home; I 
 can't afford it.

Neither can I - but the poster above was (however irrelevantly) talking about 
ZFS's supposedly unique features for *businesses*, so I answered in that 
context.

(By the way, something has gone West with my email and I'm temporarily unable 
to send the response I wrote to your message night before last.  If you meant 
to copy it here as well, just do so and I'll respond to it here.)

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-14 Thread can you guess?
 Nathan Kroenert wrote:

...

 What if it did a double update: One to a
 staged area, and another 
  immediately after that to the 'old' data blocks.
 Still always have 
  on-disk consistency etc, at a cost of double the
 I/O's...
 
 This is a non-starter.  Two I/Os is worse than one.

Well, that attitude may be supportable for a write-only workload, but then so 
is the position that you really don't even need *one* I/O (since no one will 
ever need to read the data and you might as well just drop it on the floor).

In the real world, data (especially database data) does usually get read after 
being written, and the entire reason the original poster raised the question is 
that sometimes it's well worth taking on some additional write overhead to 
reduce read overhead.  In such a situation, if you need to protect the database 
from partial-block updates as well as keep it reasonably laid out for 
sequential table access, then performing the two writes described is about as 
good a solution as one can get.  That's especially true if the first of them 
can be logged - better yet, logged in NVRAM - so that its overhead can be 
amortized across multiple such updates by otherwise independent processes.  And 
it's even more true when, as is often the case, the same data gets updated 
multiple times in sufficiently close succession that instead of 2N writes you 
wind up needing only N+1, the last being the only one that updates the data in 
place after the activity has cooled down.
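
The N+1-rather-than-2N behavior falls out of simple write coalescing:  once the 
logged copy has made an update persistent, a later update to the same block 
just replaces the pending in-place write instead of adding another one.  A toy 
sketch (Python, illustration only - the class and counts are invented):

    class CoalescingWriter:
        # N logged updates to the same hot block collapse into a single
        # deferred in-place write:  N+1 device writes instead of 2N.
        def __init__(self):
            self.log_writes = 0
            self.pending = {}                # block number -> latest data

        def update(self, blockno, data):
            self.log_writes += 1             # persisted (e.g. to NVRAM/log)
            self.pending[blockno] = data     # overwrites any earlier version

        def flush(self):
            in_place_writes = len(self.pending)
            self.pending.clear()
            return self.log_writes + in_place_writes

    w = CoalescingWriter()
    for i in range(5):                       # five updates to one block
        w.update(7, b"version %d" % i)
    print(w.flush())                         # 6 total writes, not 10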

 
  Of course, both of these would require non-sparse
 file creation for the 
  DB etc, but would it be plausible?
  
  For very read intensive and position sensitive
 applications, I guess 
  this sort of capability might make a difference?
 
 We are all anxiously awaiting data...

Then you might find it instructive to learn more about the evolution of file 
systems on Unix:

In The Beginning there was the block, and the block was small, and it was 
isolated from its brethren, and darkness was upon the face of the deep because 
any kind of sequential performance well and truly sucked.

Then (after an inexcusably lengthy period of such abject suckage lasting into 
the '80s) there came into the world FFS, and while there was still only the 
block the block was at least a bit larger, and it was at least somewhat less 
isolated from its brethren, and once in a while it actually lived right next to 
them, and while sequential performance still usually sucked at least it sucked 
somewhat less.

And then the disciples Kleiman and McVoy looked upon FFS and decided that mere 
proximity was still insufficient, and they arranged that blocks should (at 
least when convenient) be aggregated into small groups (56 KB actually not 
being all that small at the time, given the disk characteristics back then), 
and the Great Sucking Sound of Unix sequential-access performance was finally 
reduced to something at least somewhat quieter than a dull roar.

But other disciples had (finally) taken a look at commercial file systems that 
had been out in the real world for decades and that had had sequential 
performance down pretty well pat for nearly that long.  And so it came to pass 
that corporations like Veritas (VxFS), and SGI (EFS and XFS), and IBM (JFS) 
imported the concept of extents into the Unix pantheon, and the Gods of 
Throughput looked upon it, and it was good, and (at least in those systems) 
Unix sequential performance no longer sucked at all, and even non-corporate 
developers whose faith was strong nearly to the point of being blind could not 
help but see the virtues revealed there, and began incorporating extents into 
their own work, yea, even unto ext4.

And the disciple Hitz (for it was he, with a few others) took a somewhat 
different tack, and came up with a 'write anywhere file layout' but had the 
foresight to recognize that it needed some mechanism to address sequential 
performance (not to mention parity-RAID performance).  So he abandoned 
general-purpose approaches in favor of the Appliance, and gave it most 
uncommodity-like but yet virtuous NVRAM to allow many consecutive updates to be 
aggregated into not only stripes but adjacent stripes before being dumped to 
disk, and the Gods of Throughput smiled upon his efforts, and they became known 
throughout the land.

Now comes back Sun with ZFS, apparently ignorant of the last decade-plus of 
Unix file system development (let alone development in other systems dating 
back to the '60s).  Blocks, while larger (though not necessarily proportionally 
larger, due to dramatic increases in disk bandwidth), are once again often 
isolated from their brethren.  True, this makes the COW approach a lot easier 
to implement, but (leaving aside the debate about whether COW as implemented in 
ZFS is a good idea at all) there is *no question whatsoever* that it returns a 
significant degree of suckage to sequential performance - especially for data 
subject to small, random 

Re: [zfs-discuss] Yager on ZFS

2007-11-14 Thread can you guess?
 
 On 14-Nov-07, at 7:06 AM, can you guess? wrote:
 
  ...
 
  And how about FAULTS?
  hw/firmware/cable/controller/ram/...
 
  If you had read either the CERN study or what I
  already said about
  it, you would have realized that it included the
  effects of such
  faults.
 
 
  ...and ZFS is the only prophylactic available.
 
  You don't *need* a prophylactic if you're not
 having sex:  the CERN  
  study found *no* clear instances of faults that
 would occur in  
  consumer systems and that could be attributed to
 the kinds of  
  errors that ZFS can catch and more conventional
 file systems can't.
 
 Hmm, that's odd, because I've certainly had such
 faults myself. (Bad  
 RAM is a very common one,

You really ought to read a post before responding to it:  the CERN study did 
encounter bad RAM (and my post mentioned that) - but ZFS usually can't do a 
damn thing about bad RAM, because errors tend to arise either before ZFS ever 
gets the data or after it has already returned and checked it (and in both 
cases, ZFS will think that everything's just fine).

 that nobody even thinks to
 check.)

Speak for yourself:  I've run memtest86+ on all our home systems, and I run it 
again whenever encountering any problem that might be RAM-related.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-14 Thread can you guess?
...

   Well single bit error rates may be rare in
 normal
   operation hard
   drives, but from a systems perspective, data can
 be
   corrupted anywhere
   between disk and CPU.
  
   The CERN study found that such errors (if they
 found any at all,
   which they couldn't really be sure of) were far
 less common than
 
 I will note from multiple personal experiences these
 issues _do_ happen
 with netapp and emc (symm and clariion)

And Robert already noted that they've occurred in his mid-range arrays.  In 
both cases, however, you're talking about decidedly non-consumer hardware, and 
had you looked more carefully at the material to which you were responding you 
would have found that its comments were in the context of experiences with 
consumer hardware (and in particular what *quantitative* level of additional 
protection ZFS's 'special sauce' can be considered to add to its reliability).

Errors introduced by mid-range and high-end arrays don't enter into that 
discussion (though they're interesting for other reasons).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Response to phantom dd-b post

2007-11-12 Thread can you guess?
 
 In the previous and current responses, you seem quite
 determined of 
 others misconceptions.

I'm afraid that your sentence above cannot be parsed grammatically.  If you 
meant that I *have* determined that some people here are suffering from various 
misconceptions, that's correct.

 Given that fact and the first
 paragraph of your 
 response below, I think you can figure out why nobody
 on this list will 
 reply to you again.

Predicting the future (especially the actions of others) is usually a feat 
reserved for psychics:  are you claiming to be one (perhaps like the poster who 
found it 'clear' that I was a paid NetApp troll - one of the aforementioned 
misconceptions)?

Oh, well - what can one expect from someone who not only top-posts but 
completely fails to trim quotations?  I see that you appear to be posting from 
a .edu domain, so perhaps next year you will at least mature to the point of 
becoming sophomoric.

Whether people here find it sufficiently uncomfortable to have their beliefs 
(I'm almost tempted to say 'faith', in some cases) challenged that they'll 
indeed just shut up I really wouldn't presume to guess.  As for my own 
attitude, if you actually examine my responses rather than just go with your 
gut (which doesn't seem to be a very reliable guide in your case) you'll find 
that I tend to treat people pretty much as they deserve.  If they don't pay 
attention to what they're purportedly responding to or misrepresent what I've 
said, I do chide them a bit (since I invariably *do* pay attention to what 
*they* say and make sincere efforts to respond to exactly that), and if they're 
confrontational and/or derogatory then they'll find me very much right back in 
their face.

Perhaps it's some kind of territorial thing - that people bridle when they find 
a seriously divergent viewpoint popping up in a cozy little community where 
most of them have congregated because they already share the beliefs of 
the group.  Such in-bred communities do provide a kind of sanctuary and feeling 
of belonging:  perhaps it's unrealistic to expect most people to be able to 
rise above that and deal rationally with the wider world's entry into their 
little one.

Or not:  we'll see.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-12 Thread can you guess?
Thanks for taking the time to flesh these points out.  Comments below:

...

 The compression I see varies from something like 30%
 to 50%, very 
 roughly (files reduced *by* 30%, not files reduced
 *to* 30%).   This is 
 with the Nikon D200, compressed NEF option.  On some
 of the lower-level 
 bodies, I believe the compression can't be turned
 off.  Smaller files 
 will of course get hit less often -- or it'll take
 longer to accumulate 
 the terrabyte, is how I'd prefer to think of it.

Either viewpoint works.  And since the compression is not that great, you still 
wind up consuming a lot of space.  Effectively, you're trading (at least if 
compression is an option rather than something that you're stuck with) the 
possibility that a picture will become completely useless should a bit get 
flipped for a storage space reduction of 30% - 50% - and that's a good trade, 
since it effectively allows you to maintain a complete backup copy on disk (for 
archiving, preferably off line) almost for free compared with the uncompressed 
option.

 
 Damage that's fixable is still damage; I think of
 this in archivist 
 mindset, with the disadvantage of not having an
 external budget to be my 
 own archivist. 

There will *always* be the potential for damage, so the key is to make sure 
that any damage is easily fixable.  The best way to do this is to a) keep 
multiple copies, b) keep them isolated from each other (that's why RAID is not 
a suitable approach to archiving), and c) check (scrub) them periodically to 
ensure that if you lose a piece (whether a bit or a sector) you can restore the 
affected data from another copy and thus return your redundancy to full 
strength.

For serious archiving, you probably want to maintain at least 3 such copies 
(possibly more if some are on media of questionable longevity).  For normal 
use, there's probably negligible risk of losing any data if you maintain only 
two on reasonably reliable media:  'MAID' experience suggests that scrubbing as 
little as every few months reduces the likelihood of encountering detectable 
errors while restoring redundancy by several orders of magnitude (i.e., down to 
something like once in a PB at worst for disks - becoming comparable to the 
levels of bit-flip errors that the disk fails to detect at all).

Which is what I've been getting at w.r.t. ZFS in this particular application 
(leaving aside whether it can reasonably be termed a 'consumer' application - 
because bulk video storage is becoming one and it not only uses a similar 
amount of storage space but should probably be protected using similar 
strategies):  unless you're seriously worried about errors in the once-per-PB 
range, ZFS primarily just gives you automated (rather than manually-scheduled) 
scrubbing (and only for your on-line copy).  Yes, it will help detect hardware 
faults as well if they happen to occur between RAM and the disk (and aren't 
otherwise detected - I'd still like to know whether the 'bad cable' experiences 
reported here occurred before ATA started CRCing its transfers), but while 
there's anecdotal evidence of such problems presented here it doesn't seem to 
be corroborated by the few actual studies that I'm familiar with, so that risk 
is difficult to quantify.

Getting back to 'consumer' use for a moment, though, given that something like 
90% of consumers entrust their PC data to the tender mercies of Windows, and a 
large percentage of those neither back up their data, nor use RAID to guard 
against media failures, nor protect it effectively from the perils of Internet 
infection, it would seem difficult to assert that whatever additional 
protection ZFS may provide would make any noticeable difference in the consumer 
space - and that was the kind of reasoning behind my comment that began this 
sub-discussion.

By George, we've managed to get around to having a substantive discussion after 
all:  thanks for persisting until that occurred.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Response to phantom dd-b post

2007-11-12 Thread can you guess?
Well, I guess we're going to remain stuck in this sub-topic for a bit longer:

  The vast majority of what ZFS can detect (save for
 *extremely* rare
  undetectable bit-rot and for real hardware
 (path-related) errors that
  studies like CERN's have found to be very rare -
 and you have yet to
  provide even anecdotal evidence to the contrary) 
 
 You wanted anectodal evidence:

To be accurate, the above was not a solicitation for just any kind of anecdotal 
evidence but for anecdotal evidence that specifically contradicted the notion 
that otherwise undetected path-related hardware errors are 'very rare'.

 During my personal
 experience with only two 
 home machines, ZFS has helped me detect corruption at
 least three times in a 
 period of a few months.
 
 One due to silent corruption due to a controller bug
 (and a driver that did 
 not work around it).

If that experience occurred using what could be considered normal consumer 
hardware and software, that's relevant (and disturbing).  As I noted earlier, 
the only path-related problem that the CERN study unearthed involved their 
(hardly consumer-typical) use of RAID cards, the unusual demands that those 
cards placed on the WD disk firmware (to the point where it produced on-disk 
errors), and the cards' failure to report accompanying disk time-outs.

 
 Another time corruption during hotswapping (though
 this does not necessarily 
 count since I did it on hardware that I did not know
 was supposed to support 
 it, and I would not have attempted it to begin with
 otherwise).

Using ZFS as a test platform to see whether you could get away with using 
hardware in a manner that it may not have been intended to be used may not 
really qualify as 'consumer' use.  As I've noted before, consumer relevance 
remains the point in question here (since that's the point that fired off this 
lengthy sub-discussion).

...
 
 In my professional life I have seen bitflips a few
 times in the middle of real 
 live data running on real servers that are used for
 important data. As a 
 result I have become pretty paranoid about it all,
 making heavy use of par2.

And well you should - but, again, that's hardly 'consumer' use.

...

  can also be detected by 
  scrubbing, and it's arguably a lot easier to apply
 brute-force scrubbing
  (e.g., by scheduling a job that periodically copies
 your data to the null
  device if your system does not otherwise support
 the mechanism) than to
  switch your file system.
 
 How would your magic scrubbing detect arbitrary data
 corruption without 
 checksumming

The assertion is that it would catch the large majority of errors that ZFS 
would catch (i.e., all the otherwise detectable errors, most of them detected 
by the disk when it attempts to read a sector), leaving a residue of no 
noticeable consequence to consumers (especially as one could make a reasonable 
case that most consumers would not experience any noticeable problem even if 
*none* of these errors were noticed).
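
For what it's worth, the 'copy everything to the null device' scrub really is 
that dumb - something like the sketch below (Python; the mount point and chunk 
size are arbitrary), scheduled every month or two, forces the drive to read 
every allocated sector so that its own ECC gets a chance to report the 
unreadable ones while a good copy still exists elsewhere:

    import os

    def scrub(root, chunk=1 << 20):
        # Read every byte of every file and throw it away; any sector the
        # drive can no longer read surfaces as an I/O error here, while a
        # good copy presumably still exists elsewhere.
        bad = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        while f.read(chunk):
                            pass
                except OSError:
                    bad.append(path)
        return bad

    print(scrub("/data"))    # hypothetical mount point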

 or redundancy?

Redundancy is necessary if you want to fix (not just catch) errors, but 
conventional mechanisms provide redundancy just as effective as ZFS's.  (With 
the minor exception of ZFS's added metadata redundancy, but the likelihood that 
an error will happen to hit the relatively minuscule amount of metadata on a 
disk rather than the sea of data on it is, for consumers, certainly negligible, 
especially considering all the far more likely potential risks in the use of 
their PCs.)

 
 A lot of the data people save does not have
 checksumming.

*All* disk data is checksummed, right at the disk - and according to the 
studies I'm familiar with this detects most errors (certainly enough of those 
that ZFS also catches to satisfy most consumers).  If you've got any 
quantitative evidence to the contrary, by all means present it.

...
 
 I think one needs to stop making excuses by observing
 properties of specific 
 file types and simlar.

I'm afraid that's incorrect:  given the statistical incidence of the errors in 
question here, in normal consumer use only humongous files will ever experience 
them with non-negligible probability.  So those are the kinds of files at issue.

When such a file experiences one of these errors, then either it will be one 
that ZFS is uniquely (save for WAFL) capable of detecting, or it will be one 
that more conventional mechanisms can detect.  The latter are, according to the 
studies I keep mentioning, far more frequent (only relatively, of course:  
we're still only talking about one in every 10 TB or so, on average and 
according to manufacturers' specs, which seem to be if anything pessimistic in 
this area), and comprise primarily unreadable disk sectors which (as long as 
they're detected in a timely manner by scrubbing, whether ZFS's or some 
manually-scheduled mechanism) simply require that the bad sector (or file) be 
replaced by a good copy to restore the desired level of redundancy.
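
To put that 'one in every 10 TB or so' in perspective, here's the standard 
calculation (Python; it assumes the common spec-sheet unrecoverable-read-error 
rate of one bit in 10^14, which works out to roughly one bad read per 12 TB 
transferred):

    # Chance of at least one unrecoverable read error when reading a
    # given amount of data, assuming the usual 1-in-1e14-bits spec.
    UER_PER_BIT = 1.0 / 1e14

    def p_at_least_one_error(terabytes):
        bits = terabytes * 1e12 * 8
        return 1.0 - (1.0 - UER_PER_BIT) ** bits

    for tb in (0.5, 1.0, 10.0):
        print("%4.1f TB read: %4.1f%% chance of an unreadable sector"
              % (tb, 100 * p_at_least_one_error(tb)))
    # Roughly 4%, 8%, and 55% - a few percent for one full pass over a
    # consumer drive, but approaching a coin flip by 10 TB.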

When we get into the 

Re: [zfs-discuss] Response to phantom dd-b post

2007-11-11 Thread can you guess?
 
 Chill. It's a filesystem. If you don't like it,
 don't use it.

Hey, I'm cool - it's mid-November, after all.  And it's not about liking or not 
liking ZFS:  it's about actual merits vs. imagined ones, and about legitimate 
praise vs. illegitimate hype.

Some of us have a professional interest in such things.  If you don't, by all 
means feel free to ignore the discussion.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-11 Thread can you guess?
Hallelujah!  I don't know when this post actually appeared in the forum, but it 
wasn't one I'd seen until right now.  If it didn't just appear due to whatever 
kind of fluke made the 'disappeared' post appear right now too, I apologize for 
having missed it earlier.

 In a compressed raw file, it'll affect the rest of
 the file generally; 
 so it essentially renders the whole thing useless,
 unless it happens to 
 hit towards the end and you can crop around it.  If
 it hits in metadata 
 (statistically unlikely, the bulk of the file is
 image data) it's 
 probably at worst annoying, but it *might* hit one of
 the bits software 
 uses to recognize and validate the file, too.
 
 In an uncompressed raw file, if it hits in image data
 it'll affect 
 probably 9 pixels; it's easily fixed.

That's what I figured (and the above is the first time you've mentioned 
*compressed* RAW files, so the obvious next observation is that if they 
compress well - and if not, why bother compressing them? - then the amount of 
room that they occupy is significantly smaller and the likelihood of getting an 
error in one is similarly smaller).

...

  Even assuming that you meant 'MB' rather than 'Mb'
 above, that suggests that it would take you well over
 a decade to amass 1 TB of RAW data (assuming that, as
 you suggest both above and later, you didn't
 accumulate several hundred MB of pictures *every* day
 but just on those days when you were traveling, at a
 sporting event, etc.).

 
 I seem to come up with a DVD full every month or two
 these days, 
 myself.  I mean, it varies; there was this one
 weekend I filled 4 or 
 some such; but it varies both ways, and that average
 isn't too far 
 off.   25GB a year seems to take 40 years to reach
 1TB.  However, my 
 rate has increased so dramatically in the last 7
 years that I'm not at 
 all sure what to expect; is it time for the curve to
 level off yet, for 
 me?  Who knows!

Well, it still looks as if you're taking well over a decade to fill 1 TB at 
present, as I estimated.

 
 Then again, I'm *also* working on scanning in the
 *last* 40 years worth 
 of photos, and those tend to be bigger (scans are
 less good pixels so 
 you need more of them), and *that* runs the numbers
 up, in chunks when I 
 take time to do a big scanning batch.

OK - that's another new input, though not yet a quantitative one.

...

  Even if you've got your original file archived,
 you
  still need
  your working copies available, and Adobe Photoshop
  can turn that
  RAW file into a PSD of nearly 60Mb in some cases.
  
 
  If you really amass all your pictures this way
 (rather than, e.g., use Photoshop on some of them and
 then save the result in a less verbose format), I'll
 suggest that this takes you well beyond the
 'consumer' range of behavior.

 
 It's not snapshot usage, but it's common amateur
 usage.  Amateurs tend 
 to do lots of the same things professionals do (and
 sometimes better, 
 though not usually).  Hobbies are like that. 
 
 The argument for the full Photoshop file is the
 concept of 
 nondestructive editing.  I do retouching on new
 layers instead of 
 erasing what I already have with the new stuff. I use
 adjustment layers 
 with layer masks for curve adjustments.  I can go
 back and improve the 
 mask, or nudge the curves, without having to start
 over from scratch.  
 It's a huge win.  And it may be more valuable for
 amateurs, actually; 
 professionals tend to have the experience to know
 their minds better and 
 know when they have it right, so many of them may do
 less revisiting old 
 stuff and improving it a bit.  Also, when the job is
 done and sent to 
 the client, they tend not to care about it any more.

OK - but at a *maximum* of 60 MB per shot you're still talking about having to 
manually massage at least 20,000 shots in Photoshop before the result consumes 
1 TB of space.  That's a *lot* of manual labor:  do you really perform it on 
anything like that number of shots?

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-11 Thread can you guess?
...

 Having
 my MP3 collection
 gotten fucked up thanks to neither Windows nor NTFS
 being able to
 properly detect and report in-flight data corruption
 (i.e. bad cable),
 after copying it from one drive to another to replace
 one of them, I'm
 really glad that I've ZFS to manage my data these
 days.

Hmmm.  All this talk about bad cables by you and others sounds more like older 
ATA (before transfers over the cable got CRC protection) than like contemporary 
drives.  Was your experience with a recent drive and controller?

...

 As far as all these reliability studies go, my
 practical experience is
 quite the opposite. I'm fixing computers of friends
 and acquaintances
 left and right, bad sectors are rather pretty common.

I certainly haven't found them to be common, unless a drive was on the verge of 
major failure.  Though if a drive is used beyond its service life (usually 3 - 
5 years) they may become more common.

In any case, if a conventional scrub would detect the bad sector then ZFS per 
se wouldn't add unique value (save that the check would be automated rather 
than something that the user, or system assembler, had to set up to be 
scheduled).

I really meant it, though, when I said that I don't completely discount 
anecdotal experience:  I just like to get more particulars before deciding how 
much to weigh it against more formal analyses.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-11 Thread can you guess?
 
 On 9-Nov-07, at 3:23 PM, Scott Laird wrote:
 
  Most video formats are designed to handle
 errors--they'll drop a frame
  or two, but they'll resync quickly.  So, depending
 on the size of the
  error, there may be a visible glitch, but it'll
 keep working.
 
  Interestingly enough, this applies to a lot of
 MPEG-derived formats as
  well, like MP3.  I had a couple bad copies of MP3s
 that I tried to
  listen to on my computer a few weeks ago (podcasts
 copied via
  bluetooth off of my phone, apparently with no error
 checking), and it
  made the story hard to follow when a few seconds
 would disappear out
  of the middle, but it didn't destroy the file.
 
 Well that's nice. How about your database, your
 source code, your ZIP  
 file, your encrypted file, ...

They won't be affected, because they're so much smaller that (at something like 
1 error per 10 TB) the chance of an error hitting them is negligible:  that was 
the whole point of singling out huge video files as the only likely candidates 
to worry about.
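
To put a number on 'negligible' (a crude model, assuming errors land uniformly at random at that ~1-per-10-TB rate):

    # Crude model: if errors land uniformly at random at ~1 per 10 TB,
    # the chance that any given error hits a particular file is roughly
    # file_size / 10 TB.
    def hit_chance(file_bytes, error_every_bytes=10e12):
        return file_bytes / error_every_bytes

    print(hit_chance(5e6))     # 5 MB MP3 or source tree: ~5e-7
    print(hit_chance(100e6))   # 100 MB database: ~1e-5
    print(hit_chance(50e9))    # 50 GB of video: ~5e-3 - hence the focus there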

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-11 Thread can you guess?
 
 On 9-Nov-07, at 2:45 AM, can you guess? wrote:

...

  This suggests that in a ZFS-style installation
 without a hardware  
  RAID controller they would have experienced at
 worst a bit error  
  about every 10^14 bits or 12 TB
 
 
 And how about FAULTS?
 hw/firmware/cable/controller/ram/...

If you had read either the CERN study or what I already said about it, you 
would have realized that it included the effects of such faults.

...

   but I had a box that was randomly
  corrupting blocks during
  DMA.  The errors showed up when doing a ZFS scrub
 and
  I caught the
  problem in time.
 
  Yup - that's exactly the kind of error that ZFS and
 WAFL do a  
  perhaps uniquely good job of catching.
 
 WAFL can't catch all: It's distantly isolated from
 the CPU end.

WAFL will catch everything that ZFS catches, including the kind of DMA error 
described above:  it contains validating information outside the data blocks 
just as ZFS does.

...

  CERN was using relatively cheap disks
 
 Don't forget every other component in the chain.

I didn't, and they didn't:  read the study.

...

  Your position is similar to that of an audiophile
 enthused about a  
  measurable but marginal increase in music quality
 and trying to  
  convince the hoi polloi that no other system will
 do:  while other  
  audiophiles may agree with you, most people just
 won't consider it  
  important - and in fact won't even be able to
 distinguish it at all.
 
 Data integrity *is* important.

You clearly need to spend a lot more time trying to understand what you've read 
before responding to it.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Response to phantom dd-b post

2007-11-11 Thread can you guess?
Just to note here as well as earlier that some of the confusion about what you 
had and had not said was related to my not having seen the post where you 
talked about RAW and compressed RAW errors until this morning.  Since your 
other mysteriously 'disappeared' post also appeared recently, I suspect that 
the RAW/compressed post was not present earlier when we were talking about its 
contents, but it is also possible that I just missed it.  In any case, my 
response to you was based on your claim below (by selective quoting) that 
this content had been in a post that I had responded to.

- bill

   can you guess? wrote:
  
  ...
  
Most of the balance of your post isn't
 addressed
  in
   any detail because it carefully avoids the
   fundamental issues that I raised:
  
   
   Not true; and by selective quoting you have
  removed
   my specific 
   responses to most of these issues.
  
  While I'm naturally reluctant to call you an
 outright
  liar, David, you have hardly so far in this
  discussion impressed me as someone whose
 presentation
  is so well-organized and responsive to specific
  points that I can easily assume that I simply
 missed
  those responses.  If you happen to have a copy of
  that earlier post, I'd like to see it resubmitted
  (without modification).
 
 Oh, dear:  I got one post/response pair out of phase
 with the above - the post which I claimed did not
 address the issues that I raised *is* present here
 (and indeed does not address them).
 
 I still won't call you an outright liar:  you're
 obviously just *very* confused about what qualifies
 as responding to specific points.  And, just for the
 record, if you do have a copy of the post that
 disappeared, I'd still like to see it.
 
  
1.  How much visible damage does a single-bit
  error
   actually do to the kind of large photographic
  (e.g.,
   RAW) file you are describing?  If it trashes the
  rest
   of the file, as you state is the case with jpeg,
  then
   you might have a point (though you'd still have
 to
   address my second issue below), but if it
 results
  in
   a virtually invisible blemish then you most
  certainly
   don't.
  
   
   I addressed this quite specifically, for two
 cases
   (compressed raw vs. 
   uncompressed raw) with different results.
  
  Then please do so where we all can see it.
 
 Especially since there's no evidence of it in the
 post (still right here, up above) where you appear to
 be claiming that you did.
 
 - bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Response to phantom dd-b post

2007-11-11 Thread can you guess?
 No, you aren't cool, and no it isn't about zfs or
 your interest in it.  It was clear from the get-go
 that netapp was paying you to troll any discussion on
 it,

It's (quite literally) amazing how the most incompetent individuals turn out to 
be those who are the most certain of their misconceptions.  In fact, there have 
been studies done that establish this as a statistically-significant trait 
among that portion of the population - so at least you aren't alone in this 
respect.

For the record, I have no connection with NetApp, I have never had any 
connection with NetApp (save for appreciating the elegance of their products), 
they never in any way asked me to take any part in any discussion on any 
subject whatsoever (let alone offered to pay me to do so), I don't even *know* 
anyone at NetApp (at least that I'm aware of) save by professional reputation.  
In other words, you've got your head so far up your ass that you're not only 
ready to make accusations that you do not (and in fact could not) have any 
evidence to support, you're ready to make accusations that are factually flat 
wrong.

Simply because an individual of your caliber apparently cannot conceive of the 
possibility that someone might take sufficient personal and professional 
interest in a topic to devote actual time and effort to attempting to cut 
through the hype that mostly well-meaning but less-than-objective and 
largely-uncritical supporters are shoveling out?  Sheesh.

...
 
 Yes, every point you've made could be refuted.

Rather than drool about it, try taking an actual shot at doing so:  though I'd 
normally consider talking with you to be a waste of my time, I'll make an 
exception in this case.  Call it a grudge match, if you want:  I *really* don't 
like the kind of incompetence that someone who behaves as you just did 
represents and also consider it something in the nature of a civic duty to 
expose it for what it is.

...

 I suggest getting a blog and ranting there, you have
 no audience here.

Another demonstrably incorrect statement, I'm afraid:  the contents of this 
thread make it clear that some people here, despite their preconceptions, do 
consider a detailed analysis of ZFS's relative strengths to be a fit subject 
for discussion.  And since it's only human for them to resist changing those 
preconceptions, it's hardly surprising that the discussion gets slightly heated 
at times.

Education frequently can only occur through confrontation:  existing biases 
make it difficult for milder forms to get through.  I'd like to help people 
here learn something, but I'm at least equally interested in learning things 
myself - and since there are areas in which I consider ZFS's design to be 
significantly sub-optimal, where better to test that opinion than here?

Unfortunately, so far the discussion has largely bogged down in debate over 
just how important ZFS's unique (save for WAFL) checksum protection mechanisms 
may be, and has not been very productive given the reluctance of many here to 
tackle that question quantitatively (though David eventually started to do so) 
- so there's been very little opportunity for learning on my part save for a 
few details about media-file internals.  I'm more interested in discussing 
things like:

- whether my suggested fix for RAID-Z's poor parallel-small-access performance 
overlooked some real obstacle;

- why ZFS was presented as a highly-scalable file system when its largest 
files can require up to 6 levels of indirect blocks (making performance for 
random-access operations suck and causing snapshot data for updated large 
files to balloon - see the back-of-envelope sketch below) and it offers no 
obvious extension path to clustered operation (a single node - especially a 
single *commodity* node of the type that ZFS otherwise favors - runs out of 
steam in the PB range, or even lower for some workloads, and even breaking 
control out into a separate metadata server doesn't get you that much farther);

- whether ZFS's apparently-centralized block-allocation mechanisms can scale 
well (using preallocation to predistribute large chunks that can be managed 
independently helps, but again doesn't get you beyond the PB range at best);

- the blind spot that some of the developers appear to have about the 
importance of on-disk contiguity for streaming access performance (128 KB 
chunks just don't cut it in terms of efficient disk utilization in parallel 
environments unless they're grouped together);

- its trade-off of run-time performance and space use for performance when 
accessing snapshots (I'm guessing that it was more faith in the virtue of 
full-tree-path updating as compared with using a transaction log that actually 
caused that decision, so perhaps that's the real subject for discussion).
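
(Back-of-envelope arithmetic for the indirect-block point, since I keep asking for numbers myself - a sketch only: I'm assuming 128 KB data blocks and indirect blocks holding 1024 of the 128-byte block pointers, which may not match ZFS's actual defaults, so the exact level counts will differ somewhat:)

    # Sketch: how many levels of indirection a tree-of-block-pointers layout
    # needs, assuming 128 KB data blocks and indirect blocks that each hold
    # 1024 block pointers (assumed values - ZFS's defaults may differ, and
    # smaller indirect blocks push the count correspondingly higher).
    import math

    def levels_needed(file_bytes, data_block=128 * 1024, ptrs_per_indirect=1024):
        blocks = math.ceil(file_bytes / data_block)
        levels = 0
        while blocks > 1:
            blocks = math.ceil(blocks / ptrs_per_indirect)
            levels += 1
        return levels

    for size in (1e9, 1e12, 1e15, float(2**64)):
        print("%.0e bytes -> %d levels of indirection" % (size, levels_needed(size)))
    # A random read of a large, uncached file can mean walking that whole
    # chain of indirect blocks before the data block itself is ever touched.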

Of course, given that ZFS is what it is, there's a natural tendency just to plow 
forward and not 'waste time' revisiting already-made decisions - so the people 
best able to discuss them may not want to.  But you 

[zfs-discuss] Response to phantom dd-b post

2007-11-10 Thread can you guess?
This is a bit weird: I just wrote the following response to a dd-b post that 
now seems to have disappeared from the thread. Just in case that's a temporary 
aberration, I'll submit it anyway as a new post.

 can you guess? wrote:
  Ah - thanks to both of you. My own knowledge of
 video format internals is so limited that I assumed
 most people here would be at least equally familiar
 with the notion that a flipped bit or two in a video
 would hardly qualify as any kind of disaster (or
 often even as being noticeable, unless one were
 searching for it, in the case of commercial-quality
 video).
 

 But also, you're thinking like a consumer,

Well, yes - since that's the context of my comment to which you originally 
responded. Did you manage to miss that, even after I repeated it above in the 
post to which you're responding *this* time?

not like
 an archivist. A bit
 lost in an achival video *is* a disaster, or at least
 a serious degradation.

Or not, unless you're really, really obsessive-compulsive about it - certainly 
*far* beyond the point of being reasonably characterized as a 'consumer'.

...

And since the CERN study seems
 to suggest that the vast majority of errors likely to
 be encountered at this level of incidence (and which
 could be caught by ZFS) are *detectable* errors,
 they'll (in the unlikely event that you encounter
 them at all) typically only result in requiring use
 of a RAID (or backup) copy (surely
  one wouldn't be entrusting data of any real value
 to a single disk).
 

 They'll only be detected when the files are *read*;
 ZFS has the scrub
 concept, but most RAID systems don't,

Perhaps you're just not very familiar with other systems, David.

For example, see 
http://gentoo-wiki.com/HOWTO_Gentoo_Install_on_Software_RAID#Data_Scrubbing, 
where it tells you how to run a software RAID scrub manually (or presumably in 
a cron job if it can't be configured to be more automatic). Or a variety of 
Adaptec RAID cards which support two different forms of scanning/fixup (which 
presumably could also be scheduled externally if an internal scheduling 
mechanism is not included). I seriously doubt that these are the only such 
facilities out there: they're just ones I happen to be able to cite with 
minimal effort.
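
(And kicking one of those scrubs off from cron or a script is about a one-liner. For Linux md, something like the following sketch does it - /dev/md0 is just an example array name, and you need root:)

    # Sketch: start a Linux md consistency check ("scrub") by hand.
    # md0 is an example array name; requires root.  Equivalent to
    # 'echo check > /sys/block/md0/md/sync_action'.
    with open("/sys/block/md0/md/sync_action", "w") as f:
        f.write("check\n")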

...

  So I see no reason to change my suggestion that
 consumers just won't notice the level of increased
 reliability that ZFS offers in this area: not only
 would the difference be nearly invisible even if the
 systems they ran on were otherwise perfect, but in
 the real world consumers have other reliability
 issues to worry about that occur multiple orders of
 magnitude more frequently than the kinds that ZFS
 protects against.
 

 And yet I know many people who have lost data in ways
 that ZFS would
 have prevented.

Specifics would be helpful here. How many? Can they reasonably be characterized 
as consumers (I'll remind you once more: *that's* the subject to which your 
comments purport to be responding)? Can the data loss reasonably be 
characterized as significant (to 'consumers')? Were the causes hardware 
problems that could reasonably have been avoided ('bad cables' might translate 
to 'improperly inserted, overly long, or severely kinked cables', for example - 
and such a poorly-constructed system will tend to have other problems that ZFS 
cannot address)?

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-10 Thread can you guess?
 can you guess? wrote:

...

  If you include
 'image files of various
  sorts', as he did (though this also raises the
 question of whether we're
  still talking about 'consumers'), then you also
 have to specify exactly
  how damaging single-bit errors are to those various
 'sorts' (one might
  guess not very for the uncompressed formats that
 might well be taking up
  most of the space).  And since the CERN study seems
 to suggest that the
  vast majority of errors likely to be encountered at
 this level of
  incidence (and which could be caught by ZFS) are
 *detectable* errors,
  they'll (in the unlikely event that you encounter
 them at all) typically
  only result in requiring use of a RAID (or backup)
 copy (surely one
  wouldn't be entrusting data of any real value to a
 single disk).
 
 
 I have to comment here. As a bloke with a bit of a
 photography
 habit - I have a 10Mpx camera and I shoot in RAW mode
 - it is
 very, very easy to acquire 1Tb of image files in
 short order.

So please respond to the question that I raised above (and that you yourself 
quoted):  just how much damage will a single-bit error do to such a RAW file?

 
 Each of the photos I take is between 8 and 11Mb, and
 if I'm
 at a sporting event or I'm travelling for work or
 pleasure,
 it is *incredibly* easy to amass several hundred Mb
 of photos
 every single day.

Even assuming that you meant 'MB' rather than 'Mb' above, that suggests that it 
would take you well over a decade to amass 1 TB of RAW data (assuming that, as 
you suggest both above and later, you didn't accumulate several hundred MB of 
pictures *every* day but just on those days when you were traveling, at a 
sporting event, etc.).

 
 I'm by no means a professional photographer (so I'm
 not out
 taking photos every single day), although a very
 close friend
 of mine is. My photo storage is protected by ZFS with
 mirroring
 and backups to dvd media. My profotog friend has 3
 copies of
 all her data - working set, immediate copy on
 usb-attached disk,
 and second backup also on usb-attached disk but
 disconnected.

Sounds wise on both your parts - and probably makes ZFS's extra protection 
pretty irrelevant (I won't bother repeating why here).

 
 Even if you've got your original file archived, you
 still need
 your working copies available, and Adobe Photoshop
 can turn that
 RAW file into a PSD of nearly 60Mb in some cases.

If you really amass all your pictures this way (rather than, e.g., use 
Photoshop on some of them and then save the result in a less verbose format), 
I'll suggest that this takes you well beyond the 'consumer' range of behavior.

 
 It is very easy for the storage medium to acquire
 some degree
 of corruption - whether it's a CF or SD card, they
 all use
 FAT32. I have been in the position of losing photos
 due to
 this. Not many - perhaps a dozen over the course of
 12 months.

So in those cases you didn't maintain multiple copies.  Bad move, and usually 
nothing that using ZFS could help with.  While I'm not intimately acquainted 
with flash storage, my impression is that data loss usually occurs due to bad 
writes (since once written the data just sits there persistently and AFAIK is 
not subject to the kinds of 'bit rot' that disk and tape data can experience).  
So if the loss occurs to the original image captured on flash before it can be 
copied elsewhere, you're just SOL and nothing ZFS offers could help you.

 
 That flipped bit which you seem to be dismissing as
 hardly...
 a disaster can in fact make your photo file totally
 useless,
 because not only will you probably not be able to get
 the file
 off the media card, but whatever software you're
 using to keep
 track of your catalog will also be unable to show you
 the
 entire contents. That might be the image itself, or
 it might
 be the equally important EXIF information.

Here come those pesky numbers again, I'm afraid.  Because given that the size 
difference between your image data and the metadata (including EXIF 
information, if that's what I suspect it is) is at least several orders of 
magnitude, the chance that the bad bit will be in something other than the 
image data is pretty negligible.

So even if you can format your card to use ZFS (can you?  if not, what possible 
relevance does your comment above have to this discussion?), doing so won't 
help at all:  the affected file will still be inaccessible (unless you use ZFS 
to create a redundant pool across multiple such cards:  is that really what 
you're suggesting should be done?) both to normal extraction (though couldn't 
dd normally get off everything but the bad sector?) and to your cataloging 
software.

 
 I don't depend on FAT32-formatted media cards to make
 my
 living, fortunately, but if I did I imagine I'd
 probably end
 up only using each card for about a month before
 exercising
 caution and purchasing a new one rather than
 depending on the
 card itself to be reliable any more.

The 'wear

Re: [zfs-discuss] Response to phantom dd-b post

2007-11-10 Thread can you guess?
 can you guess? wrote:
  This is a bit weird: I just wrote the following
 response to a dd-b post that now seems to have
 disappeared from the thread. Just in case that's a
 temporary aberration, I'll submit it anyway as a new
 post.

 
 Strange things certainly happen here now and then. 
 
 The post you're replying to is one I definitely did
 send in.  Could I 
 have messed up and sent it just to you, thus causing
 confusion when you 
 read it, deleted it, remembered it as in the group
 rather than direct?

I used the forum's 'quote original' feature in replying and then received a 
screen-full of Java errors saying that the parent post didn't exist when I 
attempted to submit it.

Most of the balance of your post isn't addressed in any detail because it 
carefully avoids the fundamental issues that I raised:

1.  How much visible damage does a single-bit error actually do to the kind of 
large photographic (e.g., RAW) file you are describing?  If it trashes the rest 
of the file, as you state is the case with jpeg, then you might have a point 
(though you'd still have to address my second issue below), but if it results 
in a virtually invisible blemish then you most certainly don't.

2.  If you actually care about your data, you'd have to be a fool to entrust it 
to *any* single copy, regardless of medium.  And once you've got more than one 
copy, then you're protected (at the cost of very minor redundancy restoration 
effort in the unlikely event that any problem occurs) against the loss of any 
one copy due to a minor error - the only loss of non-negligible likelihood that 
ZFS protects against better than other file systems.

If you're relying upon RAID to provide the multiple copies - though this would 
also arguably be foolish, if only due to the potential for trashing all the 
copies simultaneously - you'd probably want to schedule occasional scrubs, just 
in case you lost a disk.  But using RAID as a substitute for off-line 
redundancy is hardly suitable in the kind of archiving situations that you 
describe - and therefore ZFS has absolutely nothing of value to offer there:  
you should be using off-line copies, and occasionally checking all copies for 
readability (e.g., by copying them to the null device - again, something you 
could do for your on-line copy with a cron job and which you should do for your 
off-line copy/copies once in a while as well).
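
(For the cron-job crowd, the 'copy to the null device' check amounts to no more than this sort of thing - a minimal sketch; a real job would add logging and some I/O throttling:)

    # Minimal sketch of a read-only "scrub": read every file under a tree so
    # that any unreadable sector gets reported by the drive/RAID layer.
    import os, sys

    def read_scrub(root):
        failures = []
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        while f.read(1 << 20):   # read and discard, 1 MB at a time
                            pass
                except OSError as err:
                    failures.append((path, err))
        return failures

    if __name__ == "__main__":
        for path, err in read_scrub(sys.argv[1]):
            print("READ ERROR:", path, err)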

In sum, your support of ZFS in this specific area seems very much knee-jerk in 
nature rather than carefully thought out - exactly the kind of 'over-hyping' 
that I pointed out in my first post in this thread.

...

  And yet I know many people who have lost data in
 ways
  that ZFS would
  have prevented.
  
 
  Specifics would be helpful here. How many? Can they
 reasonably be characterized as consumers (I'll remind
 you once more: *that's* the subject to which your
 comments purport to be responding)? Can the data loss
 reasonably be characterized as significant (to
 'consumers')? Were the causes hardware problems that
 could reasonably have been avoided ('bad cables'
 might translate to 'improperly inserted, overly long,
 or severely kinked cables', for example - and such a
 poorly-constructed system will tend to have other
 problems that ZFS cannot address)?

 
 Reasonably avoided is irrelevant; they *weren't*
 avoided.

While that observation has at least some merit, I'll observe that you jumped 
directly to the last of my questions above while carefully ignoring the three 
questions that preceded it.

...

 Nearly everybody I can think of who's used a computer
 for more than a 
 couple of years has stories of stuff they've lost.

Of course they have - and usually in ways that ZFS would have been no help 
whatsoever in mitigating.

  I
 knew a lot of 
 people who lost their entire hard drive at one point
 or other especially 
 in the 1985-1995 timeframe.

Fine example of a situation where only redundancy can save you, and where good 
old vanilla-flavored RAID (with scrubbing - but, as I noted, that's hardly 
something that ZFS has any corner on) provides comparable protection to 
ZFS-with-mirroring.

  The people were quite
 upset by the loss; 
 I'm not going to accept somebody else deciding it's
 not significant. 

I never said such situations were not significant, David:  I simply observed 
(and did so again above) that in virtually all of them ZFS offered no 
particular advantage over more conventional means of protection.

You need to get a grip and try to understand the *specifics* of what's being 
discussed here if you want to carry on a coherent discussion about it.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Response to phantom dd-b post

2007-11-10 Thread can you guess?
 can you guess? wrote:

...

  Most of the balance of your post isn't addressed in
 any detail because it carefully avoids the
 fundamental issues that I raised:

 
 Not true; and by selective quoting you have removed
 my specific 
 responses to most of these issues.

While I'm naturally reluctant to call you an outright liar, David, you have 
hardly so far in this discussion impressed me as someone whose presentation is 
so well-organized and responsive to specific points that I can easily assume 
that I simply missed those responses.  If you happen to have a copy of that 
earlier post, I'd like to see it resubmitted (without modification).

  1.  How much visible damage does a single-bit error
 actually do to the kind of large photographic (e.g.,
 RAW) file you are describing?  If it trashes the rest
 of the file, as you state is the case with jpeg, then
 you might have a point (though you'd still have to
 address my second issue below), but if it results in
 a virtually invisible blemish then you most certainly
 don't.

 
 I addressed this quite specifically, for two cases
 (compressed raw vs. 
 uncompressed raw) with different results.

Then please do so where we all can see it.

 
  2.  If you actually care about your data, you'd
 have to be a fool to entrust it to *any* single copy,
 regardless of medium.  And once you've got more than
 one copy, then you're protected (at the cost of very
 minor redundancy restoration effort in the unlikely
 event that any problem occurs) against the loss of
 any one copy due to a minor error - the only loss of
 non-negligible likelihood that ZFS protects against
 better than other file systems.

 
 You have to detect the problem first.   ZFS is in a
 much better position 
 to detect the problem due to block checksums.

Bulls***, to quote another poster here who has since been strangely quiet.  The 
vast majority of what ZFS can detect (save for *extremely* rare undetectable 
bit-rot and for real hardware (path-related) errors that studies like CERN's 
have found to be very rare - and you have yet to provide even anecdotal 
evidence to the contrary) can also be detected by scrubbing, and it's arguably 
a lot easier to apply brute-force scrubbing (e.g., by scheduling a job that 
periodically copies your data to the null device if your system does not 
otherwise support the mechanism) than to switch your file system.

 
  If you're relying upon RAID to provide the multiple
 copies - though this would also arguably be foolish,
 if only due to the potential for trashing all the
 copies simultaneously - you'd probably want to
 schedule occasional scrubs, just in case you lost a
 disk.  But using RAID as a substitute for off-line
 redundancy is hardly suitable in the kind of
 archiving situations that you describe - and
 therefore ZFS has absolutely nothing of value to
 offer there:  you should be using off-line copies,
 and occasionally checking all copies for readability
 (e.g., by copying them to the null device - again,
 something you could do for your on-line copy with a
 cron job and which you should do for your off-line
 copy/copies once in a while as well).

 
 You have to detect the problem first.

And I just described how to above - in a manner that also handles the off-line 
storage that you *should* be using for archival purposes (where ZFS scrubbing 
is useless).

  ZFS block
 checksums will detect 
 problems that a simple read-only pass through most
 other filesystems 
 will not detect. 

The only problems that ZFS will detect that a simple read-through pass will not 
are those that I just enumerated above:  *extremely* rare undetectable bit-rot 
and real hardware (path-related) errors that studies like CERN's have found to 
be very rare (like, none in the TB-sized installation under discussion here).

 
  In sum, your support of ZFS in this specific area
 seems very much knee-jerk in nature rather than
 carefully thought out - exactly the kind of
 'over-hyping' that I pointed out in my first post in
 this thread.

 
 And your opposition to ZFS appears knee-jerk and
 irrational, from this 
 end.  But telling you that will have no beneficial
 effect, any more than 
 what you just told me about how my opinions appear to
 you.  Couldn't we 
 leave personalities out of this, in future?

When someone appears to be arguing irrationally, it's at least worth trying to 
straighten him out.  But I'll stop - *if* you start addressing the very 
specific and quantitative issues that you've been so assiduously skirting until 
now.

 
  ...
 

  And yet I know many people who have lost data in
  
  ways
  
  that ZFS would
  have prevented.
  
  
  Specifics would be helpful here. How many? Can
 they

  reasonably be characterized as consumers (I'll
 remind
  you once more: *that's* the subject to which your
  comments purport to be responding)? Can the data
 loss
  reasonably be characterized as significant

Re: [zfs-discuss] Response to phantom dd-b post

2007-11-10 Thread can you guess?
  can you guess? wrote:
 
 ...
 
   Most of the balance of your post isn't addressed
 in
  any detail because it carefully avoids the
  fundamental issues that I raised:
 
  
  Not true; and by selective quoting you have
 removed
  my specific 
  responses to most of these issues.
 
 While I'm naturally reluctant to call you an outright
 liar, David, you have hardly so far in this
 discussion impressed me as someone whose presentation
 is so well-organized and responsive to specific
 points that I can easily assume that I simply missed
 those responses.  If you happen to have a copy of
 that earlier post, I'd like to see it resubmitted
 (without modification).

Oh, dear:  I got one post/response pair out of phase with the above - the post 
which I claimed did not address the issues that I raised *is* present here (and 
indeed does not address them).

I still won't call you an outright liar:  you're obviously just *very* confused 
about what qualifies as responding to specific points.  And, just for the 
record, if you do have a copy of the post that disappeared, I'd still like to 
see it.

 
   1.  How much visible damage does a single-bit
 error
  actually do to the kind of large photographic
 (e.g.,
  RAW) file you are describing?  If it trashes the
 rest
  of the file, as you state is the case with jpeg,
 then
  you might have a point (though you'd still have to
  address my second issue below), but if it results
 in
  a virtually invisible blemish then you most
 certainly
  don't.
 
  
  I addressed this quite specifically, for two cases
  (compressed raw vs. 
  uncompressed raw) with different results.
 
 Then please do so where we all can see it.

Especially since there's no evidence of it in the post (still right here, up 
above) where you appear to be claiming that you did.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-09 Thread can you guess?
Thanks for the detailed reply, Robert.  A significant part of it seems to be 
suggesting that high-end array hardware from multiple vendors may be 
*introducing* error sources that studies like CERN's (and Google's, and CMU's) 
never encountered (based, as they were, on low-end hardware).

If so, then at least a major part of your improved experience is not due to 
using ZFS per se but to getting rid of the high-end equipment and using more 
reliable commodity parts:  a remarkable thought - I wonder if anyone has ever 
done that kind of a study.

A quick Google of ext3 fsck did not yield obvious examples of why people needed 
to run fsck on ext3, though it did remind me that by default ext3 runs fsck 
just for the hell of it every N (20?) mounts - could that have been part of 
what you were seeing?

There are two problems with over-hyping a product:  it gives competitors 
something legitimate to refute, and it leaves the impression that the product 
has to be over-sold because it doesn't have enough *real* merits to stand on.  
Well, for people like me there's a third problem:  we just don't like spin.  
When a product has legitimate strengths that set it out out from the pack, it 
seems a shame to over-sell it the same way that a mediocre product is sold and 
waste the opportunity to take the high ground that it actually does own.

I corrected your misunderstanding about WAFL's separate checksums in my October 
26th response to you in 
http://storagemojo.com/2007/10/25/sun-fires-back-at-netapp/ - though in that 
response I made a reference to something that I seem to have said somewhere (I 
have no idea where) other than in that thread.  In any event, one NetApp paper 
detailing their use is 3356.pdf (first hit if you Google Introduction to Data 
ONTAP 7G) - search for 'checksum' and read about block and zone checksums in 
locations separate from the data that they protect.

As just acknowledged above, I occasionally recall something incorrectly.  I now 
believe that the mechanisms described there were put in place more to allow use 
of disks with standard 512-byte sector sizes than specifically to separate the 
checksums from the data, and that while thus separating the checksums may 
achieve a result similar to ZFS's in-parent checksums the quote that you 
provided may indicate the primary mechanism that WAFL uses to validate its 
data:  whether the 'checksums' reside with the data or elsewhere, I now 
remember reading (I found the note that I made years ago, but it didn't provide 
a specific reference and I just spent an hour searching NetApp's Web site for 
it without success) that the in-block (or near-to-block) 'checksums' include 
not only file identity and offset information but a block generation number (I 
think this is what the author meant by the 'identity' of the block) that 
increments each time the block is updated, and that this generation number is 
kept in the metadata block that points to the file block, thus 
allowing the metadata block to verify with a high degree of certainty that the 
target block is indeed not only the right file block, containing the right 
data, but the right *version* of that block.
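
(In other words - and this is only my sketch of the general shape of such a mechanism, with made-up field names, not NetApp's actual on-disk format - the parent gets to verify not just the block's contents but its identity and its version:)

    # Illustrative sketch only: a parent-verified block check of the kind
    # described above.  Field names and layout are invented for the example.
    from dataclasses import dataclass
    import zlib

    @dataclass
    class BlockCheck:
        file_id: int       # which file the block claims to belong to
        offset: int        # where in that file it claims to sit
        generation: int    # bumped on every rewrite of the block
        checksum: int      # checksum over the block's data

    def parent_verify(data, stored, want_file, want_offset, want_generation):
        # The parent metadata block supplies the expected identity and the
        # expected generation number; a stale or misplaced block fails here.
        return (stored.file_id == want_file
                and stored.offset == want_offset
                and stored.generation == want_generation
                and stored.checksum == zlib.crc32(data))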

As I said, thanks (again) for the detailed response,

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-09 Thread can you guess?
 bull*
  -- richard

Hmmm.

Was that bull* as in

Numbers?  We don't need no stinking numbers!  We're so cool that we work for a 
guy who thinks he's Steve Jobs!

or

Silly engineer!  Can't you see that I've got my rakish Marketing hat on?  
Backwards!

or

I jes got back from an early start on my weekend an you better [hic] watch 
what you say, buddy, if you [Hic] don't want to get a gallon of [HIC] 
slightly-used beer and nachos all over your [HIC!] shoes [HIIICCC!!!] oh, 
sh- [BLARRG]

Inquiring minds want to know.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

