System is a V880 running Solaris 8 (SunOS gsbims 5.8 Generic_117350-27 sun4u 
sparc SUNW,Sun-Fire-880). Patches are recent within a month or so.

Some background. c1t0d0 and c1t1d0 both have slices in disksuite mirror sets. 
Specifically, /, swap, /opt, /usr and /var. Additionally, c1t3d0 and c1t4d0 
have a slice (s0) in a disksuite concat/stripe (not raid) mounted at 
/var/log/iplanet. I first noticed problems while I was on the system trying to 
newfs an additional mail store from our SAN: newfs, and even df and du, dumped 
core on me. After looking in messages, I saw the following:

messages: Oct 28 16:51:37 gsbims krtld: [ID 652239 kern.notice] 
/kernel/sys/sparcv9/kaio bad shndx 
messages: Oct 28 16:51:37 gsbims krtld: [ID 408501 kern.notice] kaio error 
reading symbols

Then I ran metastat and found all the meta-devices in a state of maintenance. 
Looking through the messages, I saw several SCSI messages about c1t1d0 going 
offline/online then some B_FAILFAST messages then the system gave up on that 
disk and took SDS into 'maintenance'. So, to make sure the hardware was indeed 
bad I did a metareplace on all the md's to get them to resync. Thought was, if 
it goes out again I'll call support and get that drive replaced. I restored 
/kernel/sys/sparcv9/kaio to see if the file was corrupt and sure enough, after 
being restored, newfs worked for me.

Now, after doing the metareplace the following appeared over the next couple 
days:

Oct 31 02:58:37 gsbims ufs: [ID 879645 kern.notice] NOTICE: /: unexpected free 
inode 60096, run fsck(1M)
Oct 31 10:40:31 gsbims ufs: [ID 879645 kern.notice] NOTICE: /: unexpected free 
inode 7570, run fsck(1M)
Oct 31 10:40:31 gsbims ufs: [ID 941273 kern.notice] NOTICE: /: bad dir ino 7607 
at offset 0: mangled entry
Oct 31 10:58:40 gsbims ufs: [ID 879645 kern.notice] NOTICE: /: unexpected free 
inode 277267, run fsck(1M)

I wasn't given the chance to take the system down to fsck the filesystems until 
Wednesday (Nov 1) at 18:30 CST. However, at 17:23 CST the system threw an error 
to the console about /var/log/iplanet and went panic. Can't remember the exact 
message, but I believe it was "unexpected free space" or something similar. 
Shut down the power to the system and brought it back up on a jumpstart kernel 
in single user mode.

Did a fsck on all the local file systems and watched hundreds of error messages 
fly by. Mounted up c1t0d0s* slices locally and from a restore done to another 
node, replaced all the files in /etc, /kernel, /usr, /lib and /sbin with copies 
from way back in October (before the disk errors appeared in syslog). I then 
mounted up c1t0d0s0 and commented out the rootdev entry in the system file. 
Changed all the vfstab entries to point to the slices as opposed to the 
meta-devices and then rebooted to single user mode. On its way up, the box 
kept going panic on the qlc driver so we took out what we thought to be the 
failed drive (c1t1d0) and rebooted in single user at which time it came back 
up. I then cleared out all the meta-devices except for the non-mirror/raid 
device. Cleaned up the meta-database (by removing the ones on the failed drive) 
and restarted the system. It came back up again fine at which point we had 
support move the call to hardware to replace the failed disk. Got the disk the 
next morning, placed it in the system, recreated all the metadevices (1 way 
mirrors) and rebooted over to them. Established the second side of the mirrors 
and all was ok.

Now, just before the reboot to finish the second half of the mirrors, I saw 
messages like these in syslog:

Nov  2 02:14:15 gsbims ufs: [ID 913664 kern.warning] WARNING: 
/mailstores/facstore: unexpected allocated inode 1087732, run fsck(1M)

So, when the box came up in single user mode I did a fsck on that filesystem 3 
times. The first pass cleared out some stuff, then the next 2 came back clean. Brought the 
system up with no errors. Now, I'm still getting:

Nov  4 09:26:06 gsbims ufs: [ID 913664 kern.warning] WARNING: 
/mailstores/facstore: unexpected allocated inode 1087732, run fsck(1M)

And if I look for all those:

Nov  2 02:14:15 gsbims ufs: [ID 913664 kern.warning] WARNING: 
/mailstores/facstore: unexpected allocated inode 1087732, run fsck(1M)
Nov  2 02:50:43 gsbims ufs: [ID 913664 kern.warning] WARNING: 
/mailstores/facstore: unexpected allocated inode 3388360, run fsck(1M)
Nov  2 17:26:31 gsbims ufs: [ID 913664 kern.warning] WARNING: 
/mailstores/facstore: unexpected allocated inode 3388360, run fsck(1M)
Nov  4 09:26:06 gsbims ufs: [ID 913664 kern.warning] WARNING: 
/mailstores/facstore: unexpected allocated inode 1087732, run fsck(1M)

It's always the same 2 inodes.
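
For anyone who wants to check the same thing against a longer messages file, here is a rough sketch of how the warnings could be tallied per filesystem and inode. The function name tally_inode_warnings and the SAMPLE text are just my own illustration (SAMPLE is the four lines pasted above), and the pattern only assumes the "unexpected allocated/free inode" wording shown in this post, so treat it as a starting point rather than a complete parser.

```python
import re
from collections import Counter

# Matches the ufs "unexpected allocated/free inode" messages as they appear
# in this post, after wrapped lines have been joined back together.
PATTERN = re.compile(
    r"(?:NOTICE|WARNING):\s+(?P<fs>\S+):\s+unexpected\s+"
    r"(?P<kind>allocated|free)\s+inode\s+(?P<ino>\d+)"
)


def tally_inode_warnings(text):
    """Count ufs inode warnings per (filesystem, kind, inode).

    Hypothetical helper for illustration; not part of any Solaris tool.
    """
    # The pasted syslog lines are wrapped, so collapse all whitespace first.
    joined = " ".join(text.split())
    counts = Counter()
    for m in PATTERN.finditer(joined):
        counts[(m.group("fs"), m.group("kind"), int(m.group("ino")))] += 1
    return counts


# The four warnings quoted above, wrapped exactly as they were pasted.
SAMPLE = """\
Nov  2 02:14:15 gsbims ufs: [ID 913664 kern.warning] WARNING:
/mailstores/facstore: unexpected allocated inode 1087732, run fsck(1M)
Nov  2 02:50:43 gsbims ufs: [ID 913664 kern.warning] WARNING:
/mailstores/facstore: unexpected allocated inode 3388360, run fsck(1M)
Nov  2 17:26:31 gsbims ufs: [ID 913664 kern.warning] WARNING:
/mailstores/facstore: unexpected allocated inode 3388360, run fsck(1M)
Nov  4 09:26:06 gsbims ufs: [ID 913664 kern.warning] WARNING:
/mailstores/facstore: unexpected allocated inode 1087732, run fsck(1M)
"""

if __name__ == "__main__":
    for (fs, kind, ino), n in sorted(tally_inode_warnings(SAMPLE).items()):
        print(f"{fs}: {kind} inode {ino} seen {n} time(s)")
```

Run against the sample it reports only the two inodes, 1087732 and 3388360, each seen twice. The same pattern also picks up the earlier "unexpected free inode" notices on /.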

Now, those filesystems aren't even local, they're on the SAN and not under the 
control of disksuite or anything (they're RAID-4 on the SAN). Somehow the local 
stuff was corrupted by whatever happened with the bad disk, but I'm still 
seeing ufs errors on other filesystems that never touched that disk.

Any thoughts on how to fix those? Unfortunately, with all the time down that 
box has had (main mail server) I am unable to bring it down unless it's deemed 
'critical'.
This message posted from opensolaris.org
_______________________________________________
ufs-discuss mailing list