jason
Danny Yee wrote:
I have a Linux server that's been crashing (ir)regularly for more
than a year now, and I'm at my wit's end trying to work out what's causing it.
The symptoms: every so often (uptimes range from half a day to twenty
days) the server crashes, hard - no response to ping, no response to
Ctrl-Alt-Del or Alt-Sysreq keyboard commands, nothing on the console.
The context: it seems to crash when there's a lot of network traffic
(i.e., when I copy a couple of gig to a Mac via netatalk, or to another
Linux box via rsync), or when I do a dump to an SDLT tape drive.
But it's not predictable - sometimes I can copy 6 gigabytes, or dump
100, without anything going wrong.
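One way to test the link to load is a heartbeat logger. This is a minimal sketch that assumes a Python interpreter is available on the server. It appends a timestamp and the cumulative network byte count from /proc/net/dev to a log file every few seconds, and calls fsync after each line so the last entry should survive a hard lock-up. The function names, the interval and the log path are hypothetical. After each crash, the last few entries show roughly when the machine died and whether a large transfer was running at the time.

```python
import os
import time


def read_net_bytes(proc_path="/proc/net/dev"):
    """Return total rx+tx bytes over all interfaces, or None if the file is unavailable."""
    try:
        with open(proc_path) as f:
            lines = f.readlines()[2:]  # the first two lines are column headers
    except OSError:
        return None
    total = 0
    for line in lines:
        if ":" not in line:
            continue
        _, data = line.split(":", 1)
        fields = data.split()
        total += int(fields[0]) + int(fields[8])  # field 0 = rx bytes, field 8 = tx bytes
    return total


def heartbeat(log_path, interval=5.0, iterations=None):
    """Append one timestamped line per interval; iterations=None runs forever."""
    count = 0
    with open(log_path, "a") as log:
        while iterations is None or count < iterations:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write("%s net_bytes=%s\n" % (stamp, read_net_bytes()))
            log.flush()
            os.fsync(log.fileno())  # force the line to disk before sleeping
            count += 1
            if iterations is None or count < iterations:
                time.sleep(interval)


if __name__ == "__main__":
    heartbeat("/var/log/heartbeat.log")  # hypothetical log location
```

Writing to local disk, not over the network, keeps the logger independent of the Ethernet path that seems to be involved in the crashes.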
The machine: it runs RedHat 7.3 with kernel 2.4.18-19.7.xsmp, but
the problem has persisted through every update since RedHat 7.1;
it first appeared when I updated to kernel 2.4. I have very little
installed that isn't stock RedHat; ppr and chronolog are
the most notable. (The machine is a bit odd in that I've upgraded
it incrementally and never actually done an "upgrade" install, but I've
run rpm -Va and looked at pretty much every file that's been modified.)
The hardware is an Intel SDS2 motherboard with an Adaptec DPT I2O
RAID controller, but the problem has persisted through a hardware
migration: it also occurred on an older machine with a Sapphire
motherboard and Symbios and Adaptec SCSI controllers. Both machines
had onboard EE Pro 100 Ethernet controllers, and I've also tried
a 3Com Ethernet card without any improvement.
Like I said, I'm totally stymied by this. Does anyone have any
advice on how to do kernel debugging, or any ideas about what could
be causing this kind of problem? Any advice would be appreciated.
Danny.
-- SLUG - Sydney Linux User's Group - http://slug.org.au/ More Info: http://lists.slug.org.au/listinfo/slug
