I have a Linux server that's been crashing (ir)regularly for more than a year now, and I'm at my wits' end trying to work out what's causing it.

The symptoms: every so often (uptimes range from half a day to twenty days) the server crashes hard: no response to ping, no response to Ctrl-Alt-Del or Alt-SysRq keyboard commands, nothing on the console.

The context: it seems to crash when there's a lot of network traffic (i.e., when I try to copy a couple of gigabytes to a Mac via netatalk, or to another Linux box via rsync), or when I do a dump to an SDLT tape drive. But it's not predictable: sometimes I can copy 6 gigabytes, or dump 100, without anything going wrong.

The machine: it is running Red Hat 7.3 with kernel 2.4.18-19.7.xsmp. But the problem has persisted throughout updates from Red Hat 7.1; it first appeared when I updated to kernel 2.4. I have very little installed that isn't stock Red Hat; ppr and chronolog are the most notable. (The machine is a bit odd in that I've upgraded it incrementally and have never actually done an "upgrade", but I've run rpm -Va and looked at pretty much every file that's been modified.)

The hardware: an Intel SDS2 motherboard with an Adaptec DPT I2O RAID controller... But the problem has persisted through hardware migration; it also occurred on an older machine with a Sapphire motherboard and Symbios and Adaptec SCSI controllers. Both machines had EE Pro 100 onboard ethernet controllers; I've also tried using a 3Com ethernet card, without any improvement.

Like I said, I'm totally stymied by this. Does anyone have any advice on how to do kernel debugging, or any ideas about what could be causing this kind of problem?

Any advice would be appreciated.

Danny.

-- 
SLUG - Sydney Linux User's Group - http://slug.org.au/
More Info: http://lists.slug.org.au/listinfo/slug
