Great write-up, Jens.
The chance of two MBs being broken is probably low, but overheating is a very
good point. Setting up IPMI was on my to-do list and it seems that now is the
best time to do it.
On Wed, Mar 20, 2013 at 1:08 PM, Jens Elkner <jel+...@cs.uni-magdeburg.de> wrote:
> On Wed, Mar 20, 2013 at 08:50:40AM -0700, Peter Wood wrote:
> > I'm sorry. I should have mentioned that I can't find any errors in the
> > logs. The last entry in /var/adm/messages is that I removed the
> > after the last reboot and then it shows the new boot up messages when
> > I boot up the system after the crash. The BIOS log is empty. I'm not
> > sure how to check the IPMI but IPMI is not configured and I'm not
> > using it.
> You definitely should! Plug a cable into the dedicated network port
> and configure it (the easiest way for you is probably to jump into the
> BIOS and assign the appropriate IP address etc.). Then, for a quick look,
> point your browser to the given IP, port 80 (default login is
> ADMIN/ADMIN). You may also configure some other details now.
> To track the problem, either write a script which polls the parameters
> in question periodically, or just install the latest ipmiViewer and use
> it to monitor your sensors ad hoc;
> see ftp://ftp.supermicro.com/utility/IPMIView/
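
A minimal sketch of the polling script suggested above, in Python. It assumes
things the thread does not state: that ipmitool is installed on a second
(monitoring) host, that the BMC is reachable at the hypothetical address
192.0.2.10, and that the default ADMIN/ADMIN login is still in place. Sensor
names differ between boards, so the script simply logs every reading reported
in degrees C. Logging on a separate host means the last reading before a crash
is still there afterwards.

```python
import subprocess
import time
from datetime import datetime

# Assumptions, not taken from the thread: adjust to your own setup.
BMC_HOST = "192.0.2.10"       # hypothetical BMC address
BMC_USER = "ADMIN"            # default login mentioned above
BMC_PASSWORD = "ADMIN"
LOG_FILE = "ipmi-temps.log"   # hypothetical log file on the monitoring host
INTERVAL_S = 10               # same interval as suggested for the viewer


def parse_sensor_output(text):
    """Parse pipe-separated `ipmitool sensor` rows.

    Returns {sensor name: (value, unit, status)}. Rows without a numeric
    reading (for example 'na' or discrete hex values) are skipped.
    """
    readings = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue
        name, value, unit, status = fields[:4]
        try:
            readings[name] = (float(value), unit, status)
        except ValueError:
            continue
    return readings


def temperatures(readings):
    """Keep only the sensors that report in degrees C."""
    return {name: value
            for name, (value, unit, _status) in readings.items()
            if unit == "degrees C"}


def read_bmc():
    """Ask the BMC for all sensor readings over the LAN interface."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
           "-U", BMC_USER, "-P", BMC_PASSWORD, "sensor"]
    result = subprocess.run(cmd, capture_output=True, text=True,
                            timeout=60, check=True)
    return result.stdout


def poll():
    """Append one timestamped line of temperatures every INTERVAL_S seconds."""
    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        try:
            entry = temperatures(parse_sensor_output(read_bmc()))
        except (subprocess.SubprocessError, OSError) as exc:
            entry = f"read failed: {exc}"
        with open(LOG_FILE, "a") as log:
            log.write(f"{stamp} {entry}\n")
        time.sleep(INTERVAL_S)
```

Calling poll() starts the loop; stop it with Ctrl-C once the test is over.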
> > Just another observation - the crashes are more intense the more data
> > the system serves (NFS).
> > I'm looking into firmware upgrades for the LSI now.
> Latest LSI FW should be P15, for this MB type 217 (2.17), MB-BIOS C28
> However, I doubt that your problem has anything to do with the
> SAS-ctrl or OI or ZFS.
> My guess is that either your MB is broken (we had an X9DRH-iF, which
> instantly "disappeared" as soon as it got some real load) or you have
> a heat problem (watch your CPU temp, e.g. via ipmiviewer). With 2GHz
> that's not very likely, but worth a try (socket placement on this board
> is not really smart IMHO).
> To test quickly:
> - disable all additional, unneeded services in OI which may put some
>   load on the machine (like NFS service, http and bla) and perhaps
>   even export unneeded pools (just to be sure)
> - fire up your ipmiviewer and look at the sensors (set update to
>   10s) or refresh manually often
> - start 'openssl speed -multi 32' and keep watching your CPU temp
>   sensors (with 2GHz I guess it takes ~ 12min)
> I guess your machine "disappears" before the CPUs get really hot
> (broken MB). If the CPUs switch off (usually first CPU2 and a little bit
> later CPU1) you have a cooling problem. If nothing happens, well, then
> it could be an OI or ZFS problem ;-)
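
The three outcomes described above can be written down as a small decision
function. This is only a sketch of that reasoning in Python: the function
name and the "hot" threshold are hypothetical, and the thread gives no
temperature limit, so the caller has to supply one (for example the upper
threshold the BMC reports for the CPU sensors).

```python
def diagnose(host_survived, peak_cpu_temp_c, hot_threshold_c):
    """Map the outcome of the load test to the most likely cause.

    host_survived    -- True if the machine was still up when the test ended
    peak_cpu_temp_c  -- highest CPU temperature seen during the test
    hot_threshold_c  -- assumed limit above which the CPUs count as "hot"
    """
    if host_survived:
        # Nothing happened under full CPU load.
        return "hardware held up: look at OI or ZFS next"
    if peak_cpu_temp_c >= hot_threshold_c:
        # The CPUs got hot before the machine went down.
        return "CPUs overheated: suspect a cooling problem"
    # The machine vanished while the CPUs were still cool.
    return "host disappeared while cool: suspect a broken MB"
```

For example, a machine that dropped off the network with a peak reading well
below the chosen threshold would fall into the "broken MB" case.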
> Have fun,
> Otto-von-Guericke University http://www.cs.uni-magdeburg.de/
> Department of Computer Science Geb. 29 R 027, Universitaetsplatz 2
> 39106 Magdeburg, Germany Tel: +49 391 67 52768
> zfs-discuss mailing list