Howdy,
 I have had issues several times with consumer-grade PC hardware and ZFS not 
getting along.  The problem is not the disks but the fact that I don't have ECC 
or end-to-end checking on the datapath.  What happens is that random memory 
errors and bit flips get written out to disk, and when the data is read back 
ZFS reports a checksum failure:

  pool: myth
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        myth        ONLINE       0     0    48
          raidz1    ONLINE       0     0    48
            c7t1d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /myth/tv/1504_20080216203700.mpg
        /myth/tv/1509_20080217192700.mpg
 
Note that there are no errors against the individual disks, only against the 
raidz vdev and the pool as a whole.  I get the same thing on a mirror pool, 
where both sides of the mirror have identical errors.  All I can assume is 
that the data was corrupted after the checksum was calculated and then flushed 
to disk like that.  In the past the cause was a motherboard capacitor that had 
popped - but it was enough to generate these errors under load.
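
To make that assumption concrete, here is a minimal Python sketch of the 
failure mode.  It is not how ZFS is implemented: the SHA-256 checksum, the 
two-element list standing in for a mirror, and all function names are 
assumptions made up for illustration.  It only shows that if a bit flips in 
memory after the checksum is computed but before the write, every copy on disk 
is identically bad and no side of the mirror can repair the other.

```python
import hashlib

def checksum(block: bytes) -> bytes:
    # Stand-in for the real block checksum; SHA-256 chosen for simplicity.
    return hashlib.sha256(block).digest()

def flip_bit(block: bytes, bit_index: int) -> bytes:
    # Simulate a single-bit memory error in the write buffer.
    buf = bytearray(block)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)

def write_mirrored(block: bytes, corrupt_bit=None):
    # The checksum is computed from the buffer as it looks right now.
    stored_sum = checksum(block)
    # Hypothetical memory error hits the buffer before it is flushed.
    if corrupt_bit is not None:
        block = flip_bit(block, corrupt_bit)
    # The same (possibly corrupted) buffer goes to both sides of the mirror.
    disks = [block, block]
    return stored_sum, disks

def read_mirrored(stored_sum: bytes, disks) -> bytes:
    # Try each side and return the first copy whose checksum matches.
    for copy in disks:
        if checksum(copy) == stored_sum:
            return copy
    raise OSError("checksum mismatch on every copy")
```

With corrupt_bit set, read_mirrored() fails even though neither "disk" did 
anything wrong, which matches zero errors per disk and a non-zero CKSUM count 
higher up.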

At any rate ZFS is doing the right thing by telling me - what I don't like is 
that from that point on I can't convince ZFS to ignore it.  The data in 
question is video files - a bit flip here or there won't matter.  But if ZFS 
reads the affected block it returns an I/O error, and until I restore the file 
I have no option but to try and make the application skip over it.  If it were 
UFS, for example, I would never have known, but ZFS makes a point of stopping 
anything using it - understandably, but annoyingly as well.
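
For what it's worth, the "skip over it" workaround can be sketched in Python 
roughly as below.  This is only a sketch under assumptions: that the bad block 
surfaces to the application as EIO, that a 128 KiB read size lines up with the 
dataset's record size, and that zero-filling the hole is acceptable for the 
data.  The name salvage_copy and its read_block parameter are hypothetical. 
It needs a Unix-like system for os.pread.

```python
import errno
import os

def salvage_copy(src_path, dst_path, block_size=128 * 1024, read_block=os.pread):
    """Copy src to dst, replacing unreadable blocks with zeros.

    Returns the list of byte offsets that could not be read.
    """
    bad_offsets = []
    size = os.path.getsize(src_path)
    fd = os.open(src_path, os.O_RDONLY)
    try:
        with open(dst_path, "wb") as out:
            offset = 0
            while offset < size:
                want = min(block_size, size - offset)
                try:
                    data = read_block(fd, want, offset)
                except OSError as exc:
                    if exc.errno != errno.EIO:
                        raise  # only swallow plain I/O errors
                    # Keep the file length intact by writing zeros instead.
                    data = b"\0" * want
                    bad_offsets.append(offset)
                if not data:  # file got shorter underneath us
                    break
                out.write(data)
                offset += len(data)
    finally:
        os.close(fd)
    return bad_offsets
```

The copy keeps the same length as the original, so offsets inside the video 
stay where the player expects them, and the returned list says which blocks 
were lost.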

What I would like to see is a ZFS option in the style of 'onerror' for UFS, 
i.e. the ability to tell ZFS to join fight club - let what doesn't matter 
truly slide.  For example:

zfs set erroraction=[iofail|log|ignore]

This would default to the current action of "iofail", but in the event you 
wanted to try to recover or repair data, you could set it to "log" to, say, 
generate an FMA event reporting the bad checksums, or to "ignore" to get on 
with your day.
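
To spell out the semantics I have in mind, here is a rough Python sketch of 
the three settings.  None of this exists in ZFS: the erroraction property is 
only my proposal, read_block is a hypothetical name, SHA-256 stands in for the 
real checksum, and a log message stands in for the FMA event.

```python
import errno
import hashlib
import logging

log = logging.getLogger("erroraction-sketch")

VALID_ACTIONS = ("iofail", "log", "ignore")

def read_block(data: bytes, stored_sum: bytes, erroraction: str = "iofail") -> bytes:
    """Return a block's data, applying the proposed erroraction policy."""
    if erroraction not in VALID_ACTIONS:
        raise ValueError("unknown erroraction: %r" % (erroraction,))
    if hashlib.sha256(data).digest() == stored_sum:
        return data  # checksum matches, nothing to decide
    if erroraction == "iofail":
        # Current behavior: the read fails with an I/O error.
        raise OSError(errno.EIO, "checksum mismatch")
    if erroraction == "log":
        # Stand-in for raising an FMA event about the bad checksum.
        log.warning("checksum mismatch; returning data anyway")
    # "log" and "ignore" both hand the suspect data to the application.
    return data
```

The only difference between "log" and "ignore" in this sketch is whether 
anyone is told; in both cases the application gets the data as found on disk.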

As mentioned, I see this mostly as an option to help repair data after the 
issue has been identified or repaired.  Of course it's data specific, but if 
the application can allow it or handle it, why should ZFS get in the way?

Just a thought.

Cheers,
  Adrian

PS: And yes, I am now buying some ECC memory.
 
 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
