Hello ZFS wizards, 

Have an odd ZFS problem I'd like to run by you - 

Root pool on this machine is a 'simple' mirror - just two disks:

# zpool status

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     3
          mirror-0    ONLINE       0     0     6
            c2t0d0s0  ONLINE       0     0     6
            c2t1d0s0  ONLINE       0     0     6

errors: Permanent errors have been detected in the following files: 

        rpool/ROOT/openindiana-userland-154@zfs-auto-snap_monthly-2011-11-22-09h19:/etc/svc/repository-boot-tmpEdaGba

... or similar; the CKSUM counts have varied, but always in that 1x/2x
'symmetrical' pattern - the pool-level count being half of the mirror- and
device-level counts.

After working through the problems above - scrubbing, then zfs destroy-ing the
snapshot with the 'permanent errors' - the CKSUM counts clear up, but vestiges
of the file remain, shown as hex dataset/object IDs:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c2t0d0s0  ONLINE       0     0     0
            c2t1d0s0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files: 

        <0x18e73>:<0x78007>
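For reference, the cleanup sequence was roughly the following - a sketch rather
than a transcript, with the snapshot name taken from the error output above:

  # zpool scrub rpool
  # zfs destroy rpool/ROOT/openindiana-userland-154@zfs-auto-snap_monthly-2011-11-22-09h19
  # zpool scrub rpool
  # zpool status -v rpool

(As I understand it, the persistent error log covers the current and previous
scrub, so it can take a second scrub before the <hex>:<hex> entries disappear
on their own.)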

I have no evidence that ZFS is itself the direct culprit here; it may just be
on the receiving end of one of a couple of problems we've recently worked
through on this machine:
1. a defective CPU, managed by the fault manager, but without a
fully-configured crashdump at the time (now rectified), then
2. the Sandy Bridge 'interrupt storm' problem, which we seem to have now worked
around.

The storage pools are scrubbed pretty regularly, and we generally have no cksum
errors at all. At one point, vmstat reported 7+ million interrupt faults over
5 seconds! I've also attempted to clear stats on the pool (didn't expect this
to work, but worth a try, right?).
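('Clearing stats' here was nothing more exotic than something like:

  # zpool clear rpool
  # zpool status -v rpool

which resets the READ/WRITE/CKSUM counters but, as expected, did not make the
<hex>:<hex> entry go away.)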

It's important to note that Memtest+ has been run, most recently for ~14 hrs,
with no errors reported.

I don't think the storage controller is the culprit either, as _all_ drives are
controlled by the P67A, and no other problems have been seen. No errors are
reported via smartctl, either.
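The SMART checks were along these lines (device names are just the two pool
members; depending on the controller driver, smartctl may need a '-d sat' or
similar device-type option on illumos):

  # smartctl -a /dev/rdsk/c2t0d0s0
  # smartctl -a /dev/rdsk/c2t1d0s0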

Would welcome input from two perspectives: 

1) Before I rebuild the pool/reinstall/whatever, is anyone here interested in 
any diagnostic output which might still be available? Is any of this useful as 
a bug report? 
2) Then, would love to hear ideas on a solution. 

Proposed solutions include: 
1) creating a new BE based on a snapshot of the root pool (a command-level
sketch follows this list):
- Snapshot the root pool
- (zfs send to datapool for safekeeping)
- Split rpool
- zpool create newpool (on Drive 'B')
- beadm create -p newpool NEWboot (being sure to use slice 0 of Drive 'B')

2) Simply deleting _all_ snapshots on the rpool (also sketched below).

3) complete re-install 
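To make option 1 concrete, here's a rough sketch - newpool, NEWboot, the
'pre-rebuild' snapshot name and the datapool/rpool-backup target are all just
placeholders:

  # zfs snapshot -r rpool@pre-rebuild
  # zfs create datapool/rpool-backup
  # zfs send -R rpool@pre-rebuild | zfs receive -u -d datapool/rpool-backup
  # zpool detach rpool c2t1d0s0
  # zpool create -f newpool c2t1d0s0
  # beadm create -p newpool NEWboot
  # beadm activate NEWboot

A couple of caveats: -f is there because the slice was just detached from
rpool, and making Drive 'B' actually bootable (installgrub on its slice 0, plus
the BIOS boot order) is glossed over here. Option 2 would be something along
the lines of:

  # zfs list -H -t snapshot -o name -r rpool | xargs -n1 zfs destroy

(snapshots that existing BE clones depend on will simply refuse to be
destroyed, which is probably what we want anyway).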

Tks for feedback. Lou Picciano 