Re: [zfs-discuss] Re: Production ZFS Server Death (06/06)

2006-12-02 Thread Casper . Dik

While other file systems, when they become corrupt, allow you to  
salvage data :-)


They allow you to salvage what you *think* is your data.

But in reality, you have no clue what the disks are giving you.

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Production ZFS Server Death (06/06)

2006-12-02 Thread Al Hopper
On Sat, 2 Dec 2006, Chad Leigh -- Shire.Net LLC wrote:


 On Dec 2, 2006, at 12:06 AM, Ian Collins wrote:

  Chad Leigh -- Shire.Net LLC wrote:
 
 
  On Dec 1, 2006, at 10:17 PM, Ian Collins wrote:
 
  Chad Leigh -- Shire.Net LLC wrote:
 
  There is not?  People buy disk drives and expect them to corrupt
  their data?  I expect the drives I buy to work fine (knowing that
  there could be bugs etc in them, the same as with my RAID systems).
 
  So you trust your important data to a single drive?  I doubt
  it.   But I
  bet you do trust your data to a hardware RAID array.
 
 
  Yes, but not because I expect a single drive to be more error prone
  (versus total failure).  Total drive failure on a single disk loses
  all your data.  But we are not talking total failure, we are talking
  errors that corrupt data.  I buy individual drives with the
  expectation that they are designed to be error free and are error
  free for the most part, and I do not expect a RAID array to be more
  robust in this regard (after all, the RAID is made up of a bunch of
  single drives).
 
  But people expect RAID to protect them from the corruption caused by a
  partial failure, say a bad block, which is a common failure mode.

 They do?  I must admit I have no experience with the big standalone raid
 array storage units, just (expensive) HW raid cards, but I have never
 expected an array to protect me against data corruption.  Bad blocks
 can be detected and remapped, and maybe the array can recalculate the
 block from parity etc, but that is a known disk error, and not the
 subtle kinds of errors created by the RAID array that are being
 claimed here.

  The worst system failure I experienced was caused by one half of a mirror
  experiencing bad blocks and the corrupt data being nicely mirrored on
  the other drive.  ZFS would have saved this system from failure.

 None of my comments are meant to denigrate ZFS.  I am implementing it
 myself.

 
  Some people on this list think that the RAID arrays are more likely
  to corrupt your data than JBOD (both with ZFS on top, for example, a
  ZFS mirror of 2 raid arrays or a JBOD mirror or raidz).  There is no
  proof of this or even reasonable hypothetical explanation for this
  that I have seen presented.
 
  I don't think that's the issue here; it's more one of perceived data
  integrity.  People who have been happily using a single RAID 5 are now
  finding that the array has been silently corrupting their data.

 They are?  They are being told that the problems they are having are
 due to that, but there is no proof.  It could be a bad driver, for
 example.

  People
   expect errors from single drives,

 They do?  The tech specs show very low failure rates for single
 drives in terms of bit errors.

  so they put them in a RAID knowing the
  firmware will protect them from drive errors.

 The RAID firmware will not protect them from bit errors on block
 reads unless the disk detects that the whole block is bad.  I admit
 not knowing how much the disk itself can detect bit errors with CRC
 or similar sorts of things.

This is incorrect.  Let's take a simple example of a H/W RAID5 with 4 disk
drives.  If disk 1 returns a bad block when a stripe of data is read (and
does not indicate an error condition), the RAID firmware will calculate
the parity/CRC for the entire stripe (as it *always* does), see that
there is an error present, and transparently correct the error before
returning the corrected data upstream to the application (server).  It
can't correct every possible error - there will be limits depending on
which CRC algorithms are implemented and the extent of the faulty data.
But, in general, those algorithms, if correctly chosen and implemented,
will correct most errors, most of the time.
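
As a back-of-the-envelope illustration of the XOR piece of this (a toy
sketch only, with made-up byte values; a real array works on whole
blocks and layers its own CRC/checksum machinery on top):

  # three data "blocks" plus XOR parity, as on a 4-disk RAID5 stripe
  d1=0xA5; d2=0x3C; d3=0x0F
  p=$(( d1 ^ d2 ^ d3 ))           # parity written alongside the stripe
  d2_rebuilt=$(( d1 ^ d3 ^ p ))   # rebuild d2 from the surviving members
  echo $d2_rebuilt                # prints 60, i.e. 0x3C, the original value

Note this only shows reconstruction once the bad member is identified;
figuring out *which* member returned garbage is where the extra CRC
machinery - and the compromises below - comes in.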

The main reason why not *all* possible errors can be corrected is that
there are compromises to be made in:

- the number of bits of CRC that will be calculated and stored
- the CPU and memory resources required to perform the CRC calculations
- limitations in the architecture of the RAID h/w, for example, how much
bandwidth is available between the CPU, memory, disk I/O controllers and
what level of bus contention can be tolerated
- whether the RAID vendor wishes to make any money (hardware costs must be
minimized)
- whether the RAID vendor wishes to win benchmarking comparisons with
their competition
- how smart the firmware developers are and how much pressure is put on
them to get the product to market
- blah, blah, blah

  They often fail to
  recognise that the RAID firmware may not be perfect.

 ZFS, JBOD disk controllers, drivers for said disk controllers, etc.
 may not be perfect either.

 
  ZFS looks to be the perfect tool for mirroring hardware RAID arrays,
  with the advantage over other schemes of knowing which side of the
  mirror has an error.  Thus ZFS can be used as a tool to complement,
  rather than replace, hardware RAID.

 I agree.  That is what I am doing :-)
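
For reference, the layout being described is just a ZFS mirror whose two
sides are LUNs exported by separate arrays.  A minimal sketch (the device
names below are hypothetical - substitute whatever LUNs your arrays
present to the host):

  # c2t0d0 and c3t0d0 are assumed to be LUNs from two separate H/W arrays
  zpool create tank mirror c2t0d0 c3t0d0
  zpool status tank   # both LUNs appear under a single "mirror" vdev

ZFS then checksums every block, so if one array returns bad data the read
is satisfied (and the bad copy rewritten) from the other side.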

 

Re: [zfs-discuss] Re: Production ZFS Server Death (06/06)

2006-12-02 Thread Al Hopper
On Sat, 2 Dec 2006, Chad Leigh -- Shire.Net LLC wrote:


 On Dec 2, 2006, at 6:01 AM, [EMAIL PROTECTED] wrote:

 
  While other file systems, when they become corrupt, allow you to
  salvage data :-)
 
 
  They allow you to salvage what you *think* is your data.
 
  But in reality, you have no clue what the disks are giving you.

 I stand by what I said.  If you have a massive disk failure, yes.
 You are right.

 When you have subtle corruption, some of the data and metadata is
 bad but not all.  In that case you can recover (and verify the data
 if you have the means to do so) the parts that did not get
 corrupted.  My ZFS experience so far is that it basically said the
 whole 20GB pool was dead and I seriously doubt all 20GB was corrupted.

That was because you built a pool with no redundancy.  In the case where
ZFS does not have a redundant config from which to try to reconstruct the
data (today) it simply says: sorry charlie - your pool is corrupt.
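
The difference in behaviour falls straight out of how the pool is built
(hypothetical device names):

  # no redundancy: ZFS detects corruption but can only report it
  zpool create scratch c1t0d0
  # redundancy: ZFS can reconstruct and repair the damaged blocks
  zpool create tank mirror c2t0d0 c2t1d0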

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
 OpenSolaris Governing Board (OGB) Member - Feb 2006
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Production ZFS Server Death (06/06)

2006-12-02 Thread Toby Thain


On 2-Dec-06, at 12:56 PM, Al Hopper wrote:


On Sat, 2 Dec 2006, Chad Leigh -- Shire.Net LLC wrote:



On Dec 2, 2006, at 6:01 AM, [EMAIL PROTECTED] wrote:




While other file systems, when they become corrupt, allow you to
salvage data :-)



They allow you to salvage what you *think* is your data.

But in reality, you have no clue what the disks are giving you.


I stand by what I said.  If you have a massive disk failure, yes.
You are right.

When you have subtle corruption, some of the data and metadata is
bad but not all.  In that case you can recover (and verify the data
if you have the means to do so) the parts that did not get
corrupted.  My ZFS experience so far is that it basically said the
whole 20GB pool was dead and I seriously doubt all 20GB was
corrupted.


That was because you built a pool with no redundancy.  In the case where
ZFS does not have a redundant config from which to try to reconstruct the
data (today) it simply says: sorry charlie - your pool is corrupt.


Is that the whole story though? Even without redundancy, isn't there  
a lot of resilience against corruption (redundant metadata, etc)?


--Toby



Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
 OpenSolaris Governing Board (OGB) Member - Feb 2006


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replacing a drive in a raidz vdev

2006-12-02 Thread Bill Sommerfeld
On Sat, 2006-12-02 at 00:08 -0500, Theo Schlossnagle wrote:
 I had a disk malfunction in a raidz pool today.  I had an extra one in
 the enclosure and performed a "zpool replace pool old new", and several
 unexpected behaviors have transpired:
 
 the zpool replace command hung for 52 minutes during which no zpool  
 commands could be executed (like status, iostat or list).

So, I've observed that zfs will continue to attempt to do I/O to the
outgoing drive while a replacement is in progress.  (seems
counterintuitive - I'd expect that you'd want to touch the outgoing
drive as little as possible, perhaps only attempting to read from it in
the event that a block wasn't recoverable from the healthy drives).

 When it finally returned, the drive was marked as replacing as I  
 expected from reading the man page.  However, it's progress counter  
 has not been monotonically increasing.  It started at 1% and then  
 went to 5% and then back to 2%, etc. etc.

do you have any cron jobs set up to do periodic snapshots?
If so, I think you're seeing:

6343667 scrub/resilver has to start over when a snapshot is taken

I ran into this myself this week - replaced a drive, and the resilver
made it to 95% before a snapshot cron job fired and set things back to
0%.
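
A quick way to check whether that is what is happening (pool and job
names here are just examples):

  crontab -l | grep 'zfs snapshot'   # any periodic snapshot jobs scheduled?
  zpool status pool                  # resilver % drops back after one fires

Until 6343667 is fixed, the obvious workaround is to suspend such jobs
while a resilver or scrub is in progress.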

- Bill


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replacing a drive in a raidz vdev

2006-12-02 Thread Theo Schlossnagle


On Dec 2, 2006, at 1:32 PM, Bill Sommerfeld wrote:


On Sat, 2006-12-02 at 00:08 -0500, Theo Schlossnagle wrote:

I had a disk malfunction in a raidz pool today.  I had an extra one in
the enclosure and performed a "zpool replace pool old new", and several
unexpected behaviors have transpired:

the zpool replace command hung for 52 minutes during which no zpool
commands could be executed (like status, iostat or list).


So, I've observed that zfs will continue to attempt to do I/O to the
outgoing drive while a replacement is in progress.  (seems
counterintuitive - I'd expect that you'd want to touch the outgoing
drive as little as possible, perhaps only attempting to read from  
it in

the event that a block wasn't recoverable from the healthy drives).


When it finally returned, the drive was marked as replacing as I
expected from reading the man page.  However, its progress counter
has not been monotonically increasing.  It started at 1% and then
went to 5% and then back to 2%, etc. etc.


do you have any cron jobs set up to do periodic snapshots?
If so, I think you're seeing:

6343667 scrub/resilver has to start over when a snapshot is taken

I ran into this myself this week - replaced a drive, and the resilver
made it to 95% before a snapshot cron job fired and set things back to
0%.


Yesterday, a snapshot was taken to assist in backups -- that could
be it.


// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Production ZFS Server Death (06/06)

2006-12-02 Thread Chad Leigh -- Shire.Net LLC


On Dec 2, 2006, at 10:56 AM, Al Hopper wrote:


On Sat, 2 Dec 2006, Chad Leigh -- Shire.Net LLC wrote:



On Dec 2, 2006, at 6:01 AM, [EMAIL PROTECTED] wrote:




While other file systems, when they become corrupt, allow you to
salvage data :-)



They allow you to salvage what you *think* is your data.

But in reality, you have no clue what the disks are giving you.


I stand by what I said.  If you have a massive disk failure, yes.
You are right.

When you have subtle corruption, some of the data and metadata is
bad but not all.  In that case you can recover (and verify the data
if you have the means to do so) the parts that did not get
corrupted.  My ZFS experience so far is that it basically said the
whole 20GB pool was dead and I seriously doubt all 20GB was
corrupted.


That was because you built a pool with no redundancy.  In the case where
ZFS does not have a redundant config from which to try to reconstruct the
data (today) it simply says: sorry charlie - your pool is corrupt.


Where a RAID system would still be salvageable.

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Production ZFS Server Death (06/06)

2006-12-02 Thread Jeff Victor

Chad Leigh -- Shire.Net LLC wrote:


On Dec 2, 2006, at 10:56 AM, Al Hopper wrote:


On Sat, 2 Dec 2006, Chad Leigh -- Shire.Net LLC wrote:



On Dec 2, 2006, at 6:01 AM, [EMAIL PROTECTED] wrote:




While other file systems, when they become corrupt, allow you to
salvage data :-)


They allow you to salvage what you *think* is your data.

But in reality, you have no clue what the disks are giving you.



I stand by what I said.  If you have a massive disk failure, yes.
You are right.

When you have subtle corruption, some of the data and metadata is
bad but not all.  In that case you can recover (and verify the data
if you have the means to do so) the parts that did not get
corrupted.  My ZFS experience so far is that it basically said the
whole 20GB pool was dead and I seriously doubt all 20GB was corrupted.


That was because you built a pool with no redundancy.  In the case where
ZFS does not have a redundant config from which to try to reconstruct the
data (today) it simply says: sorry charlie - your pool is corrupt.


Where a RAID system would still be salvageable.


That is a comparison of apples to oranges.  The RAID system has Redundancy.  If 
the ZFS pool had been configured with redundancy, it would have fared at least as 
well as the RAID system.


Without redundancy, neither of them can magically reconstruct data.  The RAID 
system would simply be an AID system.



--
Jeff VICTOR  Sun Microsystemsjeff.victor @ sun.com
OS AmbassadorSr. Technical Specialist
Solaris 10 Zones FAQ:http://www.opensolaris.org/os/community/zones/faq
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Production ZFS Server Death (06/06)

2006-12-02 Thread Chad Leigh -- Shire.Net LLC


On Dec 2, 2006, at 12:29 PM, Jeff Victor wrote:


Chad Leigh -- Shire.Net LLC wrote:

On Dec 2, 2006, at 10:56 AM, Al Hopper wrote:

On Sat, 2 Dec 2006, Chad Leigh -- Shire.Net LLC wrote:



On Dec 2, 2006, at 6:01 AM, [EMAIL PROTECTED] wrote:




While other file systems, when they become corrupt, allow you to
salvage data :-)


They allow you to salvage what you *think* is your data.

But in reality, you have no clue what the disks are giving you.



I stand by what I said.  If you have a massive disk failure, yes.
You are right.

When you have subtle corruption, some of the data and metadata is
bad but not all.  In that case you can recover (and verify the data
if you have the means to do so) the parts that did not get
corrupted.  My ZFS experience so far is that it basically said the
whole 20GB pool was dead and I seriously doubt all 20GB was
corrupted.


That was because you built a pool with no redundancy.  In the case where
ZFS does not have a redundant config from which to try to reconstruct the
data (today) it simply says: sorry charlie - your pool is corrupt.

Where a RAID system would still be salvageable.


That is a comparison of apples to oranges.  The RAID system has  
Redundancy.  If the ZFS pool had been configured with redundancy,  
it would have fared at least as well as the RAID system.


Without redundancy, neither of them can magically reconstruct  
data.  The RAID system would simply be an AID system.


That is not the question.  Assuming the error came OUT of the RAID  
system (which it did in this case as there was a bug in the driver  
and the cache did not get flushed in a certain shutdown situation),  
another FS would have been salvageable as the whole 20GB of the pool  
was not corrupt.


Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Production ZFS Server Death (06/06)

2006-12-02 Thread Dick Davies

On 02/12/06, Chad Leigh -- Shire.Net LLC [EMAIL PROTECTED] wrote:


On Dec 2, 2006, at 10:56 AM, Al Hopper wrote:

 On Sat, 2 Dec 2006, Chad Leigh -- Shire.Net LLC wrote:



 On Dec 2, 2006, at 6:01 AM, [EMAIL PROTECTED] wrote:



 When you have subtle corruption, some of the data and metadata is
 bad but not all.  In that case you can recover (and verify the data
 if you have the means to do so) the parts that did not get
 corrupted.  My ZFS experience so far is that it basically said the
 whole 20GB pool was dead and I seriously doubt all 20GB was
 corrupted.



 That was because you built a pool with no redundancy.  In the case where
 ZFS does not have a redundant config from which to try to reconstruct the
 data (today) it simply says: sorry charlie - your pool is corrupt.



Where a RAID system would still be salvageable.


RAID level what? How is anything salvageable if you lose your only copy?

ZFS does store multiple copies of metadata in a single vdev, so I
assume we're talking about data here.
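
If it's data blocks people are worried about on a non-redundant pool,
recent builds also have a per-dataset "copies" property (assuming your
release is new enough to have it), which keeps extra copies of data
blocks much as ZFS already does for metadata:

  zfs set copies=2 tank/important   # store two copies of each data block
  zfs get copies tank/important

That guards against localised corruption, not against losing the whole
device.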

--
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Production ZFS Server Death (06/06)

2006-12-02 Thread Rich Teer
On Sat, 2 Dec 2006, Al Hopper wrote:

  Some people on this list think that the RAID arrays are more likely
  to corrupt your data than JBOD (both with ZFS on top, for example, a
  ZFS mirror of 2 raid arrays or a JBOD mirror or raidz).  There is no
 
 Can you present a cut/paste where that assertion was made?

I don't want to put words in Chad's mouth, but I think he might be
misunderstanding representations that people make here about ZFS vs
HW RAID.  I don't think that people have asserted that RAID arrays
are more likely to corrupt data than a JBOD; what I think people ARE
asserting is that corruption is more likely to go undetected in a HW
RAID than in a JBOD with ZFS.  (A subtle, but important, difference.)

The reason for this is understandable: if you write some data to a
HW RAID device, you assume that unless otherwise notified, your data
is safe.  The HW RAID, by its very nature, is a black box that we
assume is OK.  With ZFS+JBOD, ZFS's built-in end-to-end error checking
will catch any silent errors created in the JBOD, when they happen,
and can correct them (or at least notify you) right away.
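
You can also ask ZFS to walk and verify every block on demand, which is
a handy way to surface silent errors before an application trips over
them (pool name is just an example):

  zpool scrub tank        # read and checksum everything in the pool
  zpool status -v tank    # shows scrub progress and any errors found

On a redundant pool (mirror/raidz) the scrub repairs what it can; on a
single-device pool it can only report the damage.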

-- 
Rich Teer, SCSA, SCNA, SCSECA, OpenSolaris CAB member

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss