On Wed, Mar 20, 2013 at 08:50:40AM -0700, Peter Wood wrote:
> I'm sorry. I should have mentioned it that I can't find any errors in the
> logs. The last entry in /var/adm/messages is that I removed the keyboard
> after the last reboot and then it shows the new boot up messages when I
> up the system after the crash. The BIOS log is empty. I'm not sure how to
> check the IPMI but IPMI is not configured and I'm not using it.
You definitely should! Plug a cable into the dedicated network port
and configure it (the easiest way for you is probably to go into the BIOS
and assign the appropriate IP address etc.). Then, for a quick look,
point your browser to that IP address on port 80 (default login is
ADMIN/ADMIN). You can then also configure some other details.
To track the problem, either write a script that periodically polls the
parameters in question, or just install the latest ipmiViewer and use
it to monitor your sensors ad hoc.
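
A minimal sketch of such a polling script, to be run from a second
machine so the log survives when the server goes down. It assumes that
ipmitool is installed there, that the BMC accepts IPMI-over-LAN
("lanplus") connections, and that 'ipmitool sensor' prints its usual
pipe-separated table; the function names and the "Temp" name filter
are made up for illustration, and sensor names differ between boards.

```python
import subprocess
import time


def parse_sensor_table(text):
    """Parse pipe-separated 'name | value | unit | status | ...' rows.

    Returns {sensor name: (value, unit, status)}. Rows without a numeric
    reading (for example 'na' or discrete sensors) are skipped.
    """
    readings = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue
        name, value, unit, status = fields[:4]
        try:
            readings[name] = (float(value), unit, status)
        except ValueError:
            continue
    return readings


def read_sensors(host, user, password):
    """Ask the BMC for all sensor readings via ipmitool (assumed installed)."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", user, "-P", password, "sensor"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_sensor_table(result.stdout)


def poll(host, user, password, interval=10, match="Temp"):
    """Print every sensor whose name contains `match`, once per interval."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        readings = read_sensors(host, user, password)
        for name, (value, unit, status) in sorted(readings.items()):
            if match in name:
                print(f"{stamp} {name}: {value} {unit} [{status}]", flush=True)
        time.sleep(interval)
```

Calling poll() with the BMC address and the ADMIN/ADMIN default login,
and redirecting the output to a file, leaves a timestamped trail of
the temperatures up to the moment the machine stops answering.
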
> Just another observation - the crashes are more intense the more data the
> system serves (NFS).
> I'm looking into FRMW upgrades for the LSI now.
Latest LSI FW should be P15, for this MB type 217 (2.17), MB-BIOS C28 (1.0b).
However, I doubt that your problem has anything to do with the
SAS controller, OI or ZFS.
My guess is that either your MB is broken (we had an X9DRH-iF that
instantly "disappeared" as soon as it got some real load) or you have
a heat problem (watch your CPU temp, e.g. via ipmiViewer). With 2GHz
that's not very likely, but worth a try (socket placement on this board
is not really smart IMHO).
To test quickly:
- disable all additional, unneeded services in OI that may put some
  load on the machine (like the NFS service, http and so on) and perhaps
  even export unneeded pools (just to be sure)
- fire up your ipmiViewer and look at the sensors (set the update
  interval to 10s) or refresh manually often
- start 'openssl speed -multi 32' and keep watching your CPU temp
  sensors (with 2GHz I guess it takes ~12min)
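
The load step in the list above can be wrapped in a small script that
also leaves a heartbeat on disk, so that after a crash the last
timestamp shows roughly when the box "disappeared". This is only a
sketch: the log file name, the 10s interval and the function names are
arbitrary choices, and it assumes openssl is in the PATH of the
machine under test.

```python
import os
import subprocess
import time
from datetime import datetime, timezone


def build_load_command(workers=32):
    """Return the argv for the CPU load generator mentioned above."""
    return ["openssl", "speed", "-multi", str(workers)]


def write_heartbeat(path, note=""):
    """Append one timestamped line and force it out to disk."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(path, "a") as log:
        log.write(f"{stamp} {note}\n")
        log.flush()
        os.fsync(log.fileno())


def run_load_test(log_path="loadtest.log", workers=32, interval=10):
    """Run the benchmark and log a heartbeat until it exits."""
    proc = subprocess.Popen(build_load_command(workers),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    write_heartbeat(log_path, "load started")
    while proc.poll() is None:
        time.sleep(interval)
        write_heartbeat(log_path, "still alive")
    write_heartbeat(log_path, f"load finished, exit code {proc.returncode}")
```

Start it with run_load_test() on the server and keep watching the
temperature sensors from another machine; if the log simply stops
without a "load finished" line, compare its last timestamp with the
last sensor readings you saw.
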
I guess your machine "disappears" before the CPUs get really hot
(broken MB). If the CPUs switch off (usually CPU2 first and CPU1 a
little bit later), you have a cooling problem. If nothing happens, well,
then it could be an OI or ZFS problem ;-)
Otto-von-Guericke University http://www.cs.uni-magdeburg.de/
Department of Computer Science Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany Tel: +49 391 67 52768
zfs-discuss mailing list