An intermediate update to my recent post:
2011-11-30 21:01, Jim Klimov wrote:
I finally upgraded my troublesome oi-148a home storage box to oi-151a about a week ago
(using the pkg update method from the wiki page; I'm not certain whether that repository
is fixed at the release version or is a sliding "current" one).
After the OS upgrade I scrubbed my main pool, a 6-disk raidz2, and some checksum
errors were discovered on individual disks, with one non-correctable error at
the raid level. It named a file which was indeed not readable (I/O errors), so I
deleted it. The dataset pool/media has no snapshots, and dedup was disabled on
it, so I hoped the error was gone.
I cleared the errors (this only zeroed the counters; zpool status still complained
about metadata errors in pool/media:0x4) and reran the scrub. While the scrub
was running, zpool status reported this error and metadata:0x0. The computer
hung and was reset during the scrub, but apparently the scrub resumed from the
same spot. When the operation completed, it showed zero checksum errors at both
the disk and raid levels and the pool/media error was gone, but the metadata:0x0
error is still in place.
Searching the list archive I found a similar post relevant to snv134 and 135,
and at that time Victor Latushkin suggested that the pool must be recreated. I
have some unique data on the pool, so I'm reluctant to recreate it (besides,
it's problematic to back up 10 TB of data at home, and it could take weeks to
upload it to my work, even if there were that much free space there, which
So far I have cleared the errors and started a new scrub. I kind of hope that, if
the box doesn't hang, it might discover that there are no actual errors after all.
I'll see that in about 100 hours. The pool is now imported and automounted, and I
haven't yet tried to export and reimport it.
The scrub is running slower this time: it has been going for
a couple of days and is only nearing 25% completion (the
previous runs took 89 and 101 hours). However, it seems to
have confirmed some raidz-/pool-level checksum errors
(without known individual disk errors). What puzzles me
more is that there are 2 raidz-level errors for the one
pool-level error:
# zpool status -v
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
  scan: scrub in progress since Wed Nov 30 19:38:47 2011
    1.97T scanned out of 8.34T at 13.6M/s, 135h54m to go
    0 repaired, 23.68% done

        NAME          STATE     READ WRITE CKSUM
        pool          ONLINE       0     0     1
          raidz2-0    ONLINE       0     0     2
            c7t0d0    ONLINE       0     0     0
            c7t1d0    ONLINE       0     0     0
            c7t2d0    ONLINE       0     0     0
            c7t3d0    ONLINE       0     0     0
            c7t4d0    ONLINE       0     0     0
            c7t5d0    ONLINE       0     0     0
          c4t1d0p7    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
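
For anyone who wants to cross-check the figures in that output, here is a
minimal Python sketch. It is not part of any ZFS tooling: the function names
are hypothetical, and it assumes the K/M/G/T suffixes are binary multiples and
that device rows have exactly the five columns NAME STATE READ WRITE CKSUM
with plain integer counters. It recomputes the completion percentage and the
remaining time from the "scanned out of" line, and collects the CKSUM column
per device.

```python
import re

# Assumption: zpool prints sizes with binary K/M/G/T suffixes.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert a size such as '1.97T' or '13.6M' to a number of bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


def scrub_estimate(status_text):
    """Return (percent_done, hours_remaining) from the scrub progress line."""
    m = re.search(
        r"([\d.]+[KMGT]) scanned out of ([\d.]+[KMGT]) at ([\d.]+[KMGT])/s",
        status_text,
    )
    if m is None:
        raise ValueError("no scrub progress line found")
    scanned, total, rate = (to_bytes(g) for g in m.groups())
    percent = 100.0 * scanned / total
    hours = (total - scanned) / rate / 3600.0
    return percent, hours


def cksum_counts(status_text):
    """Map device name -> CKSUM count for five-column device rows.

    Rows whose counters are not plain integers (e.g. abbreviated counts)
    are skipped, as is the NAME/STATE header row.
    """
    counts = {}
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
            counts[fields[0]] = int(fields[4])
    return counts


# A shortened copy of the status output quoted above.
SAMPLE = """\
scan: scrub in progress since Wed Nov 30 19:38:47 2011
1.97T scanned out of 8.34T at 13.6M/s, 135h54m to go
0 repaired, 23.68% done
NAME STATE READ WRITE CKSUM
pool ONLINE 0 0 1
raidz2-0 ONLINE 0 0 2
c7t0d0 ONLINE 0 0 0
"""

if __name__ == "__main__":
    percent, hours = scrub_estimate(SAMPLE)
    print(f"{percent:.1f}% done, about {hours:.0f} hours to go")
    # Show only the devices with a non-zero checksum counter.
    print({dev: n for dev, n in cksum_counts(SAMPLE).items() if n})
```

With the numbers above, the recomputed progress and time remaining agree with
what zpool itself reports to within rounding of the displayed sizes, so the
slow rate (13.6M/s) is what drives the long estimate.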
My question still stands: is it possible to recover
from this error or somehow safely ignore it? ;)
I mean, without backing up data and recreating the
If the problem is in metadata but the pool
presumably still works, then this particular
metadata is either not critical or redundant, and
could somehow be forged and replaced by valid metadata.
Is this a reasonable line of thought?
Are there any tools to remake such a metadata
Again, I did not try to export/reimport the pool
yet, except for that time 3 days ago when the
machine hung, was reset and imported the pool
and continued the scrub automatically...
I think it is now too late to do an export and
a rollback import, too...
Still, I'd like to estimate now what my chances are of living on without
recreating the pool or losing data. Are there perhaps some ways to actually check,
fix or forge the needed metadata? Also, a zdb walk previously found some
inconsistencies (allocated != referred); can that be better diagnosed or
repaired? Could this discrepancy of a few sectors' worth of space be a cause of,
or be caused by, the reported metadata error?
// Jim Klimov
sent from a mobile, pardon any typos ,)
zfs-discuss mailing list