Today my production server crashed 4 times. THIS IS A NIGHTMARE!
Self-healing file system?! For me, ZFS is a SELF-KILLING file system.
I cannot fsck it; there's no such tool.
I cannot scrub it; it crashes 30-40 minutes after the scrub starts.
I cannot use it; it crashes a number of times every day! And with every crash,
the number of checksum failures keeps growing:
        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     0
...after a few hours...
        box5        ONLINE       0     0     4
...after a few hours...
        box5        ONLINE       0     0    62
...after another few hours...
        box5        ONLINE       0     0   120
...crash! and we start again...
        box5        ONLINE       0     0     0
...etc...
Actually, 120 is the record; sometimes it crashes as soon as it boots.
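For anyone who wants to follow the counter over time, here is a minimal sketch in Python that pulls the CKSUM column out of saved `zpool status` lines like the ones above and counts how often the counter drops back. The helper names (`parse_counter_line`, `cksum_history`, `resets`) are made up for illustration, and it assumes the counters are printed as plain integers; any line that does not fit that shape is simply skipped.

```python
from typing import Iterable, List, NamedTuple, Optional


class PoolCounters(NamedTuple):
    name: str
    state: str
    read: int
    write: int
    cksum: int


def parse_counter_line(line: str, pool: str) -> Optional[PoolCounters]:
    """Return the counters if `line` is the status row for `pool`, else None."""
    fields = line.split()
    if len(fields) < 5 or fields[0] != pool:
        return None
    try:
        # Assumption: counters are plain integers. Anything else
        # (for example an abbreviated count) makes the line be skipped.
        read, write, cksum = (int(f) for f in fields[2:5])
    except ValueError:
        return None
    return PoolCounters(fields[0], fields[1], read, write, cksum)


def cksum_history(lines: Iterable[str], pool: str) -> List[int]:
    """Collect the CKSUM value from every matching status row, in order."""
    rows = (parse_counter_line(line, pool) for line in lines)
    return [row.cksum for row in rows if row is not None]


def resets(history: List[int]) -> int:
    """Count how many times the counter went down between two samples."""
    return sum(1 for prev, cur in zip(history, history[1:]) if cur < prev)


# The status rows quoted in this post, saved one per line.
samples = [
    "NAME STATE READ WRITE CKSUM",
    "box5 ONLINE 0 0 0",
    "...after a few hours...",
    "box5 ONLINE 0 0 4",
    "...after a few hours...",
    "box5 ONLINE 0 0 62",
    "...after another few hours...",
    "box5 ONLINE 0 0 120",
    "...crash! and we start again...",
    "box5 ONLINE 0 0 0",
]

history = cksum_history(samples, "box5")
print(history)          # [0, 4, 62, 120, 0]
print(resets(history))  # 1
```

Each drop in the history lines up with one crash and reboot, so logging the samples with a timestamp would at least show how fast the errors pile up between crashes.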
And there's always a permanent error:
errors: Permanent errors have been detected in the following files:
box5:<0x0>
And the very wise self-healing advice:
http://www.sun.com/msg/ZFS-8000-8A
Restore the file in question if possible. Otherwise restore the entire pool
from backup.
Thanks, but if I restore it from backup it won't be ZFS anymore, that's for
sure.
It's not an I/O problem. AFAIK, the default ZFS I/O error behavior is "wait" to
repair (I have 10U4, where it's non-configurable). Then why does it panic?
Recently there were discussions about the failure of the OpenSolaris community.
Now it's been more than half a month since I reported this error. Nobody even
posted something like "RTFM". Come on guys, I know you are there and busy with
enterprise customers... but at least give me some troubleshooting ideas. I'm
totally lost.
Just a reminder: it's a heavily loaded fs with 3-4 million files and folders.
Link to original post:
http://www.opensolaris.org/jive/thread.jspa?threadID=57425
--
This message was posted from opensolaris.org