Re: [zfs-discuss] Mysterious corruption with raidz2 vdev (1 checksum err on disk, 2 on vdev?)

2007-07-29 Thread Tomas Ögren
On 28 July, 2007 - Marc Bevand sent me these 0,7K bytes:

> Matthew Ahrens Matthew.Ahrens at sun.com writes:
> 
> > So the errors on the raidz2 vdev indeed indicate that at least 3 disks
> > below it gave the wrong data for those 2 blocks; we just couldn't tell
> > which 3+ disks they were.
> 
> Something must be seriously wrong with this server. This is the first time I
> have seen an uncorrectable checksum error in a raidz2 vdev. I would suggest
> that Kevin run memtest86 or similar. It is more likely that bad data was
> written to the disks in the first place (due to flaky RAM/CPU/mobo/cables)
> than that 3+ disks corrupted data in the same stripe!

They are all connected to the same controller, which might have had a bad
day, but memory corruption sounds like a plausible problem too. My
workstation suddenly started having trouble compiling hello world; memtest
to the rescue, and the next day I found 340 errors.

/Tomas
-- 
Tomas Ögren, [EMAIL PROTECTED], http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


Re: [zfs-discuss] Mysterious corruption with raidz2 vdev (1 checksum err on disk, 2 on vdev?)

2007-07-27 Thread Matthew Ahrens
Kevin wrote:
> After a scrub of a pool with 3 raidz2 vdevs (each with 5 disks in them) I see
> the following status output. Notice that the raidz2 vdev has 2 checksum
> errors, but only one disk inside the raidz2 vdev has a checksum error. How is
> this possible? I thought that you would have to have 3 errors in the same
> 'stripe' within a raidz2 vdev in order for the error to become unrecoverable.

A checksum error on a disk indicates that we know for sure that this disk 
gave us wrong data.  With raidz[2], if we are unable to reconstruct the 
block successfully but no disk admitted that it failed, then we have no way 
of knowing which disk(s) are actually incorrect.

So the errors on the raidz2 vdev indeed indicate that at least 3 disks below 
it gave the wrong data for those 2 blocks; we just couldn't tell which 3+ 
disks they were.

It's as if I know that A+B should equal 3, but I read back A as 1 and B as 3.  
I can't tell whether A is wrong or B is wrong (or both!).
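
To make that concrete, here is a toy sketch of the blame-assignment logic
(my own illustration, not ZFS source: a single XOR parity column stands in
for the real dual-parity RAID-Z2 math, crc32 stands in for the
fletcher/SHA-256 block checksum, and the names xor_cols/raidz_read are made
up):

from zlib import crc32

# Toy model only: real RAID-Z2 keeps two parity columns and variable-width
# stripes.  One XOR parity column is enough to show how a read decides
# between charging a CKSUM error to a disk and charging it to the vdev.

def xor_cols(cols):
    out = bytes(len(cols[0]))
    for c in cols:
        out = bytes(a ^ b for a, b in zip(out, c))
    return out

def raidz_read(data_cols, parity_col, block_cksum):
    """Return (data, disks_blamed), or (None, None) if unrecoverable."""
    # 1. Believe the disks: if the block checksum verifies, all is well.
    if crc32(b"".join(data_cols)) == block_cksum:
        return b"".join(data_cols), []
    # 2. No disk reported an I/O error, so assume each disk in turn returned
    #    garbage, rebuild its column from parity, and re-verify the checksum.
    for bad in range(len(data_cols)):
        others = [c for i, c in enumerate(data_cols) if i != bad]
        rebuilt = xor_cols(others + [parity_col])
        candidate = data_cols[:bad] + [rebuilt] + data_cols[bad + 1:]
        if crc32(b"".join(candidate)) == block_cksum:
            # Reconstruction succeeded: that one disk gets the CKSUM error.
            return b"".join(candidate), [bad]
    # 3. No choice of suspect disk makes the checksum verify: more disks
    #    returned wrong data than parity can cover, so the CKSUM error is
    #    charged to the vdev and no individual disk can be blamed.
    return None, None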

The checksum errors on the cXtXdX vdevs didn't result in data loss, because 
we reconstructed the data from the other disks in the raidz group.
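
Running the toy model (same caveats as above) shows both outcomes in Kevin's
status output: a single silently-wrong column is rebuilt and the CKSUM error
lands on that one disk, as with c2t0d0 and c2t12d0, while too many wrong
columns (two here; three or more for real raidz2) leave the error on the
vdev with no disk to blame:

stripe = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_cols(stripe)
cksum  = crc32(b"".join(stripe))

# One disk silently returns garbage: repaired, CKSUM error charged to disk 0.
print(raidz_read([b"XXXX", b"bbbb", b"cccc"], parity, cksum))
# -> (b'aaaabbbbcccc', [0])

# Two disks return garbage (the toy's limit is one): unrecoverable, so the
# error shows up on the vdev instead of on any single disk.
print(raidz_read([b"XXXX", b"YYYY", b"cccc"], parity, cksum))
# -> (None, None)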

--matt


Re: [zfs-discuss] Mysterious corruption with raidz2 vdev (1 checksum err on disk, 2 on vdev?)

2007-07-27 Thread Marc Bevand
Matthew Ahrens Matthew.Ahrens at sun.com writes:
 
> So the errors on the raidz2 vdev indeed indicate that at least 3 disks below
> it gave the wrong data for those 2 blocks; we just couldn't tell which 3+
> disks they were.

Something must be seriously wrong with this server. This is the first time I 
have seen an uncorrectable checksum error in a raidz2 vdev. I would suggest 
that Kevin run memtest86 or similar. It is more likely that bad data was 
written to the disks in the first place (due to flaky RAM/CPU/mobo/cables) 
than that 3+ disks corrupted data in the same stripe!

-marc




[zfs-discuss] Mysterious corruption with raidz2 vdev (1 checksum err on disk, 2 on vdev?)

2007-07-25 Thread Kevin
After a scrub of a pool with 3 raidz2 vdevs (each with 5 disks in them) I see 
the following status output. Notice that the raidz2 vdev has 2 checksum errors, 
but only one disk inside the raidz2 vdev has a checksum error. How is this 
possible? I thought that you would have to have 3 errors in the same 'stripe' 
within a raidz2 vdev in order for the error to become unrecoverable.

And I have not reset any errors with zpool clear ...

Comments will be appreciated. Thanks.

$ zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 1 errors on Mon Jul 23 19:59:07 2007
config:

NAME         STATE     READ WRITE CKSUM
tank         ONLINE       0     0     2
  raidz2     ONLINE       0     0     2
    c2t0d0   ONLINE       0     0     1
    c2t1d0   ONLINE       0     0     0
    c2t2d0   ONLINE       0     0     0
    c2t3d0   ONLINE       0     0     0
    c2t4d0   ONLINE       0     0     0
  raidz2     ONLINE       0     0     0
    c2t5d0   ONLINE       0     0     0
    c2t6d0   ONLINE       0     0     0
    c2t7d0   ONLINE       0     0     0
    c2t8d0   ONLINE       0     0     0
    c2t9d0   ONLINE       0     0     0
  raidz2     ONLINE       0     0     0
    c2t10d0  ONLINE       0     0     0
    c2t11d0  ONLINE       0     0     0
    c2t12d0  ONLINE       0     0     1
    c2t13d0  ONLINE       0     0     0
    c2t14d0  ONLINE       0     0     0
spares
  c2t15d0    AVAIL

errors: The following persistent errors have been detected:

  DATASET  OBJECT   RANGE
  55fe9784  lvl=0 blkid=40299
 
 