Hi,
After a catastrophic data loss on one of my zpools (see
http://opensolaris.org/jive/thread.jspa?threadID=108367), and subsequently
replacing all the pool's disks, I noticed that there was a lot of I/O activity
to my rpool, which was not affected or disturbed during the work.  I also
noticed 12% CPU usage by fmd, which seemed way too high on an idle system.

Running truss on it, I see a constant stream of:

/8:     open64("/var/fm/fmd/ckpt/zfs-diagnosis/zfs-diagnosis+",
O_WRONLY|O_CREAT|O_EXCL, 0400) = 4
/8:     write(4, "7F F C F010101\0\0\0\0\0".., 9774397) = 9774397
/8:     fdsync(4, FSYNC)                                = 0
/8:     close(4)                                        = 0
/8:     rename("/var/fm/fmd/ckpt/zfs-diagnosis/zfs-diagnosis+",
"/var/fm/fmd/ckpt/zfs-diagnosis/zfs-diagnosis") = 0
/8:     stat64("/var/fm/fmd/ckpt/zfs-diagnosis", 0xFD85E284) = 0
/8:     unlink("/var/fm/fmd/ckpt/zfs-diagnosis/zfs-diagnosis+") Err#2 ENOENT

It writes roughly 9.7 MB of checkpoint data to the "zfs-diagnosis+" file,
renames it to "zfs-diagnosis", and then repeats indefinitely.  The loop isn't
consuming any additional disk space, but it is thrashing my rpool disks and I
don't see how to stop it.
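
For reference, this is roughly how I watched the churn and quieted it for the
time being (the SMF FMRI is from memory, so double-check it with 'svcs fmd'
on your build):

    # the checkpoint file's mtime changes on every pass through the loop
    while true; do ls -l /var/fm/fmd/ckpt/zfs-diagnosis/zfs-diagnosis; sleep 1; done

    # temporarily stop fmd while sorting out the stale fault state
    svcadm disable -t svc:/system/fmd:default
    # ...and bring it back once things are cleaned up
    svcadm enable svc:/system/fmd:default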

I had exported the old pool before shutting down for the disk replacement, but I
neglected to clear out all the faults from my disk catastrophe.  Maybe that had
something to do with it.  Here is my fmdump output:

TIME                 UUID                                 SUNW-MSG-ID
May 22 12:21:00.7405 e80f8dd0-27a2-eef7-d198-ce789d103dcb ZFS-8000-D3
Jun 24 12:19:42.3976 e80f8dd0-27a2-eef7-d198-ce789d103dcb FMD-8000-4M Repaired
Jun 24 12:19:42.4141 e80f8dd0-27a2-eef7-d198-ce789d103dcb FMD-8000-6U Resolved
Jul 13 22:00:29.5417 e8c607ef-b766-c88d-89da-b34db6fec848 ZFS-8000-FD
Jul 13 22:36:25.8392 e8c607ef-b766-c88d-89da-b34db6fec848 FMD-8000-4M Repaired
Jul 13 22:36:25.9860 e8c607ef-b766-c88d-89da-b34db6fec848 FMD-8000-6U Resolved
Jul 20 15:52:36.0656 161774c5-ce84-61c0-8863-a030fc6ec618 ZFS-8000-D3
Jul 20 15:58:58.7773 81155050-a910-49c1-9e1f-fd430b3a9132 ZFS-8000-GH
Jul 20 15:59:09.5646 4af8dc97-8289-c4e2-82d2-db88393a0f3c ZFS-8000-GH
Jul 20 15:59:09.6315 cf5de150-c1ca-6b67-de15-8334156c4783 ZFS-8000-GH
Jul 20 15:59:09.6937 bd9f9921-df5d-c4b2-cf15-e44b759f4930 ZFS-8000-GH
Jul 20 15:59:09.7710 8fa209f1-4083-cc93-d50a-900720c02ec4 ZFS-8000-GH
Jul 20 15:59:10.0701 8924f4c5-1706-c2cc-b6a9-f009b7afd92e ZFS-8000-GH
Jul 21 02:33:11.7173 5f882655-2111-43ee-aeb9-87ef58a6bfee ZFS-8000-FD

The events from Jul 20 onward were still unresolved when the old pool was
exported.  There is now a new pool with the same name on different disks,
sitting in the same physical positions as the old ones (internal storage).  I
cannot mark those faults repaired now, since the resources no longer show up
as faulty.
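
To be clear about "no longer show as faulty": with fmd briefly re-enabled,
the faulty list comes back empty for the old pool, so there is nothing left
to point a repair at:

    # nothing is listed for the old pool's devices any more,
    # so there is no fault left to mark repaired
    fmadm faulty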

I've got fmd disabled now to stop the madness, but I could use some advice on
how to get fmd back into a sane state.  Maybe 'fmadm acquit' on each of the
UUIDs?  The man page gives the ominous "should be used only at the direction
of a documented Sun repair procedure" warning for the acquit subcommand.

Thanks in advance,
Eric

-- 
Eric Sproul
Lead Site Reliability Engineer
OmniTI Computer Consulting, Inc.
Web Applications & Internet Architectures
http://omniti.com
P: +1.443.325.1357 x207   F: +1.410.872.4911