System is a V880 running Solaris 8 (SunOS gsbims 5.8 Generic_117350-27 sun4u sparc SUNW,Sun-Fire-880). Patches are current to within a month or so.
Some background. c1t0d0 and c1t1d0 both have slices in disksuite mirror sets: specifically /, swap, /opt, /usr and /var. Additionally, c1t3d0 and c1t4d0 have a slice (s0) in a disksuite concat/stripe (not raid) mounted at /var/log/iplanet.

I found problems when I was on the system trying to newfs an additional mail store from our SAN: newfs, and even df and du, dumped core on me. After looking in messages, I saw the following:

messages: Oct 28 16:51:37 gsbims krtld: [ID 652239 kern.notice] /kernel/sys/sparcv9/kaio bad shndx
messages: Oct 28 16:51:37 gsbims krtld: [ID 408501 kern.notice] kaio error reading symbols

Then I ran metastat and found all the meta-devices in the maintenance state. Looking through the messages, I saw several SCSI messages about c1t1d0 going offline/online, then some B_FAILFAST messages, then the system gave up on that disk and SDS went into 'maintenance'.

So, to make sure the hardware was indeed bad, I did a metareplace on all the md's to get them to resync. The thought was that if it went out again, I'd call support and get that drive replaced. I restored /kernel/sys/sparcv9/kaio to see if the file was corrupt and, sure enough, after it was restored newfs worked for me.

Now, after doing the metareplace, the following appeared over the next couple of days:

Oct 31 02:58:37 gsbims ufs: [ID 879645 kern.notice] NOTICE: /: unexpected free inode 60096, run fsck(1M)
Oct 31 10:40:31 gsbims ufs: [ID 879645 kern.notice] NOTICE: /: unexpected free inode 7570, run fsck(1M)
Oct 31 10:40:31 gsbims ufs: [ID 941273 kern.notice] NOTICE: /: bad dir ino 7607 at offset 0: mangled entry
Oct 31 10:58:40 gsbims ufs: [ID 879645 kern.notice] NOTICE: /: unexpected free inode 277267, run fsck(1M)

I wasn't given the chance to take the system down to fsck the filesystems until Wednesday (Nov 1) at 18:30 CST. However, at 17:23 CST the system threw an error to the console about /var/log/iplanet and panicked.
Can't remember the exact message, but I believe it was "unexpected free space" or something similar.

Shut down the power to the system and brought it back up on a jumpstart kernel in single user mode. Did an fsck on all the local file systems and watched hundreds of error messages fly by. Mounted up the c1t0d0s* slices locally and, from a restore done to another node, replaced all the files in /etc, /kernel, /usr, /lib and /sbin with copies from back in October (before the disk errors appeared in syslog). I then mounted up c1t0d0s0 and commented out the rootdev entry in the system file. Changed all the vfstab entries to point to the slices as opposed to the meta-devices and then rebooted to single user mode.

On its way up, the box kept panicking in the qlc driver, so we took out what we thought to be the failed drive (c1t1d0) and rebooted in single user mode, at which point it came back up. I then cleared out all the meta-devices except for the non-mirror/raid device, cleaned up the meta-database (by removing the replicas on the failed drive) and restarted the system. It came back up fine again, at which point we had support move the call to hardware to replace the failed disk.

Got the disk the next morning, placed it in the system, recreated all the metadevices (1-way mirrors) and rebooted onto them. Established the second side of the mirrors and all was ok.

Now, just before the reboot to finish the second half of the mirrors, I saw messages like these in syslog:

Nov 2 02:14:15 gsbims ufs: [ID 913664 kern.warning] WARNING: /mailstores/facstore: unexpected allocated inode 1087732, run fsck(1M)

So, when the box came up in single user mode, I did an fsck on that filesystem 3 times. The first pass cleared out some stuff, and the next 2 passes came back clean. Brought the system up with no errors.
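
A quick way to see whether warnings like these keep naming the same inodes is to tally them per filesystem and inode. The following is only a minimal Python sketch, not something that was run on this system: the regular expression assumes the message layout of the syslog lines quoted in this post, and the function names (tally_ufs_inode_warnings, format_report) are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, taken from the syslog lines quoted in this post:
#   ... ufs: [ID <n> kern.<level>] WARNING|NOTICE: <mountpoint>: unexpected
#   free|allocated inode <number>, run fsck(1M)
UFS_INODE_RE = re.compile(
    r"ufs: \[ID \d+ kern\.\w+\] (?:WARNING|NOTICE): "
    r"(?P<fs>\S+): unexpected (?P<kind>free|allocated) inode (?P<inode>\d+)"
)


def tally_ufs_inode_warnings(lines):
    """Count matching ufs warnings per (filesystem, kind, inode)."""
    counts = Counter()
    for line in lines:
        match = UFS_INODE_RE.search(line)
        if match:
            key = (match.group("fs"), match.group("kind"), int(match.group("inode")))
            counts[key] += 1
    return counts


def format_report(counts):
    """Return one text line per (filesystem, kind, inode), most frequent first."""
    return [
        f"{n:4d}  {fs}  unexpected {kind} inode {inode}"
        for (fs, kind, inode), n in counts.most_common()
    ]
```

It could be fed the lines of the messages file (for example, an open file object) and the result of format_report printed. Lines that do not match the assumed layout, such as the "bad dir ino" notice, are simply skipped.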

Now, I'm still getting:

Nov 4 09:26:06 gsbims ufs: [ID 913664 kern.warning] WARNING: /mailstores/facstore: unexpected allocated inode 1087732, run fsck(1M)

And if I look for all those:

Nov 2 02:14:15 gsbims ufs: [ID 913664 kern.warning] WARNING: /mailstores/facstore: unexpected allocated inode 1087732, run fsck(1M)
Nov 2 02:50:43 gsbims ufs: [ID 913664 kern.warning] WARNING: /mailstores/facstore: unexpected allocated inode 3388360, run fsck(1M)
Nov 2 17:26:31 gsbims ufs: [ID 913664 kern.warning] WARNING: /mailstores/facstore: unexpected allocated inode 3388360, run fsck(1M)
Nov 4 09:26:06 gsbims ufs: [ID 913664 kern.warning] WARNING: /mailstores/facstore: unexpected allocated inode 1087732, run fsck(1M)

It's always the same 2 inodes. Now, those filesystems aren't even local; they're on the SAN and not under the control of disksuite or anything (they're RAID-4 on the SAN). Somehow the local stuff was corrupted by whatever happened with the bad disk, but I'm still seeing ufs errors on other filesystems.

Any thoughts on how to fix those? Unfortunately, with all the downtime that box has had (it's the main mail server), I am unable to bring it down unless it's deemed 'critical'.

This message posted from opensolaris.org
_______________________________________________
ufs-discuss mailing list
