Re: [zfs-discuss] Cause for data corruption?

2008-02-29 Thread Jeff Bonwick
> I thought RAIDZ would correct data errors automatically with the parity data.

Right.  However, if the data is corrupted while in memory (e.g. on a PC
with non-parity memory), there's nothing ZFS can do to detect that.
I mean, not even theoretically.  The best we could do would be to
narrow the windows of vulnerability by recomputing the checksum
every time we accessed an in-memory object, which would be terribly
expensive.

Jeff
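
To illustrate the point above, here is a minimal Python sketch (not from the thread). It assumes SHA-256 from hashlib as a stand-in for a ZFS block checksum; write_block, verify_block and flip_bit are hypothetical helper names. A checksum computed over a buffer that was already corrupted in memory verifies cleanly later, while corruption that happens after the checksum was computed is caught.

```python
import hashlib


def write_block(data):
    """Model a write: checksum whatever bytes are in the buffer right now."""
    return data, hashlib.sha256(data).digest()


def verify_block(data, checksum):
    """Model a read: recompute the checksum and compare with the stored one."""
    return hashlib.sha256(data).digest() == checksum


def flip_bit(data, bit):
    """Return a copy of data with a single bit inverted."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


original = b"important file contents"

# Case 1: the bit flips in RAM *before* the checksum is computed.
# The checksum faithfully describes the wrong data, so nothing is detected.
stored, checksum = write_block(flip_bit(original, 3))
detected_when_corrupted_before_checksum = not verify_block(stored, checksum)

# Case 2: the bit flips *after* the checksum was computed (e.g. on disk).
# The recomputed checksum no longer matches, so the damage is detected.
stored, checksum = write_block(original)
detected_when_corrupted_after_checksum = not verify_block(flip_bit(stored, 3), checksum)

print(detected_when_corrupted_before_checksum, detected_when_corrupted_after_checksum)
```

Recomputing the checksum on every access, as mentioned above, would only shorten the time between the last verification and the write; it would not close the gap.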
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cause for data corruption?

2008-02-28 Thread MC
> So I scrubbed the whole pool and it found a lot more corrupted files.

My condolences :)  

General questions and comments about ZFS and data corruption:

I thought RAIDZ would correct data errors automatically with the parity data.  
How wrong am I on that?  Perhaps a parity correction was already tried, and 
there was too much corruption to be successful, implying a very significant 
amount of data corruption?

Assuming the errors are being generated by bad hardware somewhere between the 
disk and the CPU (inclusively), how could ZFS be configured to handle these 
errors automatically?  Set data copies to equal 2, I think.  Anything else?
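
As a rough illustration of the parity question, the following Python sketch models a stripe with a single XOR parity column and one checksum over the whole block. This is a simplified model, not the actual RAID-Z layout or code; xor_blocks, block_checksum and read_with_repair are hypothetical names, and SHA-256 stands in for the ZFS checksum. When the checksum fails, each column in turn is assumed bad and rebuilt from the others plus parity, and the checksum decides whether the guess was right. With damage in more than one column, single parity cannot produce a block that passes.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def block_checksum(columns):
    """Checksum over the concatenated data columns (parity excluded)."""
    return hashlib.sha256(b"".join(columns)).digest()


def read_with_repair(columns, parity, expected):
    """Return (data_columns, repaired_index); index is None if no repair was needed."""
    if block_checksum(columns) == expected:
        return list(columns), None
    # Assume each column in turn is the bad one and rebuild it from the rest.
    for i in range(len(columns)):
        others = [c for j, c in enumerate(columns) if j != i]
        candidate = list(columns)
        candidate[i] = xor_blocks(others + [parity])
        if block_checksum(candidate) == expected:
            return candidate, i
    raise RuntimeError("unrecoverable: more damage than single parity can repair")


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
expected = block_checksum(data)

# One damaged column: repairable.
repaired, index = read_with_repair([data[0], b"BxBB", data[2]], parity, expected)

# Two damaged columns: single parity is not enough.
try:
    read_with_repair([b"xAAA", b"BxBB", data[2]], parity, expected)
    two_bad_recovered = True
except RuntimeError:
    two_bad_recovered = False
```

The sketch ignores the case where the parity column itself is damaged, and it only covers damage that happens after the checksum and parity were computed.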
 
 
This message posted from opensolaris.org


Re: [zfs-discuss] Cause for data corruption?

2008-02-28 Thread Sandro
Thanks for your reassuring post, loomy :)

I'm pretty sure the reason for all this is some bad hardware.
But I can't get VTS to work; it looks like it's not supported on this kind of
hardware.

And to run some other stress-test software I would have to connect a monitor,
keyboard and DVD-ROM drive, which I'm just so sick of doing :)

Hopefully I can motivate myself on the weekend. I'll keep you all updated here
when I find something.
 
 


Re: [zfs-discuss] Cause for data corruption?

2008-02-27 Thread Nicolas Szalay

On Tuesday, 26 February 2008 at 05:59 -0800, Sandro wrote:
> Hey
>
> Thanks for your answers guys.
>
> I'll run VTS to stresstest cpu and memory.
>
> And I just checked the block diagram of my motherboard (Gigabyte M61P-S3).
> It doesn't even have 64bit pci slots.. just standard old 33mhz 32bit pci ..
> and a couple of newer pci-e.
> But my two controllers are both the same vendor / version and are both
> connected to the same pci bus.
 
Looks like 32 bits & ZFS definitely hurts :D

-- 
Nicolas Szalay

Systems & network administrator

--                    _
ASCII ribbon campaign ( )
 - against HTML email  X
             & vCards / \




Re: [zfs-discuss] Cause for data corruption?

2008-02-27 Thread Sandro
haha very funny :D

Just the controllers are on a 32bit PCI bus.. solaris itself is running 64bit:

[EMAIL PROTECTED] /var/tmp/
 # isainfo 
amd64 i386

And besides, a lot of our customers are having serious problems with their 
thumpers and zfs and stuff...
 
 


Re: [zfs-discuss] Cause for data corruption?

2008-02-26 Thread Nicolas Szalay
On Monday, 25 February 2008 at 11:05 -0800, Sandro wrote:
> hi folks

Hi,

> Linux said that one disk had gone bad, but I figured the sata cable was
> somehow broken, so I replaced that before installing solaris. And solaris
> didn't and doesn't see any actual hw errors on the disks, does it?

I had the same symptoms recently. I also thought the disks were dying, but
I was wrong. I suspected the RAM, but no. In the end it was because I mixed
RAID cards on different PCI buses: two 64-bit buses (no problem with these)
and one 32-bit PCI bus, which caused *all* the checksum errors.

I kicked out the card on the 32-bit PCI bus and everything worked fine.

Hope it helps,

-- 
Nicolas Szalay

Systems & network administrator



Re: [zfs-discuss] Cause for data corruption?

2008-02-26 Thread Sandro
Hey

Thanks for your answers guys.

I'll run VTS to stresstest cpu and memory.

And I just checked the block diagram of my motherboard (Gigabyte M61P-S3).
It doesn't even have 64bit pci slots.. just standard old 33mhz 32bit pci .. and 
a couple of newer pci-e.
But my two controllers are both the same vendor / version and are both 
connected to the same pci bus.
 
 


[zfs-discuss] Cause for data corruption?

2008-02-25 Thread Sandro
hi folks

I've been running my fileserver at home with linux for a couple of years and 
last week I finally reinstalled it with solaris 10 u4.

I borrowed a bunch of disks from a friend, copied over all the files, 
reinstalled my fileserver and copied the data back.

Everything went fine, but after a few days now, quite a lot of files got 
corrupted.
here's the output:

 # zpool status data
  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 422 errors on Mon Feb 25 00:32:18 2008
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0 5.52K
          raidz1    ONLINE       0     0 5.52K
            c0t0d0  ONLINE       0     0 10.72
            c0t1d0  ONLINE       0     0 4.59K
            c0t2d0  ONLINE       0     0 5.18K
            c0t3d0  ONLINE       0     0 9.10K
            c1t0d0  ONLINE       0     0 7.64K
            c1t1d0  ONLINE       0     0 3.75K
            c1t2d0  ONLINE       0     0 4.39K
            c1t3d0  ONLINE       0     0 6.04K

errors: 388 data errors, use '-v' for a list

Last night I found out about this, it told me there were errors in like 50 
files.
So I scrubbed the whole pool and it found a lot more corrupted files.

The temporary system which I used to hold the data while I'm installing solaris 
on my fileserver is running nv build 80 and no errors on there.

What could be the cause of these errors??
I don't see any hw errors on my disks..

 # iostat -En | grep -i error
c3d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c4d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c0t0d0   Soft Errors: 574 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c1t0d0   Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c0t1d0   Soft Errors: 14 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c0t2d0   Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c0t3d0   Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c1t1d0   Soft Errors: 548 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c1t2d0   Soft Errors: 14 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c1t3d0   Soft Errors: 548 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0

although a lot of soft errors.
Linux said that one disk had gone bad, but I figured the sata cable was somehow 
broken, so I replaced that before installing solaris. And solaris didn't and 
doesn't see any actual hw errors on the disks, does it?
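
For anyone wanting to summarize output like the above, here is a small Python sketch (not part of the original post) that pulls the per-device counters out of text shaped like the `iostat -En | grep -i error` lines shown. The regular expression and the names DEVICE_LINE and error_counts are assumptions based only on the layout in this message; SAMPLE is a shortened copy of the pasted output.

```python
import re

# Shortened copy of the output pasted above.
SAMPLE = """\
c3d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c0t0d0   Soft Errors: 574 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
c0t1d0   Soft Errors: 14 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
"""

# Matches only the first line of each device entry (device name + counters).
DEVICE_LINE = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<transport>\d+)"
)


def error_counts(text):
    """Map device name -> {'soft': n, 'hard': n, 'transport': n}."""
    counts = {}
    for line in text.splitlines():
        m = DEVICE_LINE.match(line.strip())
        if m:
            counts[m["dev"]] = {k: int(m[k]) for k in ("soft", "hard", "transport")}
    return counts


counts = error_counts(SAMPLE)
# Devices with any non-zero counter.
suspects = sorted(dev for dev, c in counts.items() if any(c.values()))
print(suspects)
```

The counters alone do not say which component is at fault; they only show which devices reported something.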
 
 


Re: [zfs-discuss] Cause for data corruption?

2008-02-25 Thread Nathan Kroenert
My guess is that you have some defective hardware in the system that's 
causing bit flips in the checksum or the data payload.

I'd suggest running some sort of system diagnostics for a few hours to 
see if you can locate the bad piece of hardware.

My suspicion would be your memory or CPU, but that's just a wild guess, 
based on the number of errors you have and the number of devices it's 
spread over.

Could it be that you have been corrupting data for some time and not 
known it?

Oh, and I'd also check your disk controller and make sure that there 
are no newer patches for it, just in case it's one for which there was 
a known problem (one that was worked around in the driver).

I *think* there was an issue with at least one or two...

Cheers!

Nathan.
