I have two servers hosting vservers. The first is a 1 GHz Duron with 1 GB memory and
an IDE disk, ext3. It's on 60+ days of uptime, and the last time it was down was to
upgrade the memory. It runs 9 vservers and some stuff in the root server also,
without a complaint. Many of the vserver clients are running mostly idle AOLserver
instances, so I have about 500MB of swap in use (2GB swap available) pretty regularly.
Loads are reasonable (about .7 during the day, often less than .2 overnight), the
server is peppy, and everyone is happy. This is a Redhat 7.2 server, with the
pre-built kernel (2.4.18-ctx12). That kernel isn't set up for highmem, so actually
I'm only using about 900MB of my 1GB.
Enter server #2. Server #2 is a P4 1.6GHz, with 1GB memory and RAID1, ext3. I
wanted a highmem kernel, so I compiled this one. This is a Redhat 7.3 server, with
2.4.19ctx-13, patched and compiled by yours truly. It has had 4-6 vservers running on
it, loads in the .1-.5 range, and little if any swap in use.
(Currently:
Mem: 1033596K av, 1019324K used, 14272K free, 0K shrd, 272916K buff
Swap: 2048276K av, 796K used, 2047480K free 194720K cached)
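
Nearly all memory shows as "used" there, but a good part of that is buffers and cache. Here is a back-of-the-envelope check using only the figures above. It assumes the usual reading of top's fields, i.e. that the buff and cached amounts can be reclaimed by the kernel; the function name is just something I made up for illustration.

```python
# Figures copied from the top output above, in kilobytes.
MEM_USED_K = 1019324
BUFFERS_K = 272916
CACHED_K = 194720


def used_by_processes_k(used_k, buffers_k, cached_k):
    """Memory 'used' minus buffers and page cache (assumed reclaimable)."""
    return used_k - buffers_k - cached_k


def report():
    """Format the rough estimate of memory actually held by processes."""
    app_k = used_by_processes_k(MEM_USED_K, BUFFERS_K, CACHED_K)
    return f"{app_k}K (about {app_k / 1024:.0f} MB) held by processes"
```

Calling print(report()) gives the estimate. If that reading is right, the box is nowhere near out of memory, which fits with swap being almost untouched.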
This server is very responsive for a while after a reboot. Days to maybe a week.
Then it will appear to hang. It doesn't respond to SSH or http requests (to either
root server or vserver), although it doesn't actually drop the packets. It remains
pingable. It doesn't run cron jobs. At the point where the problem starts, all
logging stops, but there's no indication of a problem on the horizon prior to the
cessation of logging. The server still responds at a console. Two times I've had the
data center tech run sar -u on it before rebooting. Once showed complete cpu usage,
once showed the cpu almost entirely idle. The vps run by the data center tech also
doesn't show anything unusual, although in both cases the server had been unresponsive
for a while before the sar and vps commands were run.
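
Since sar and vps were only run well after each hang began, something that records state continuously might tell me more about the moments just before logging stops. Below is a minimal sketch of what I have in mind, not something I have running. It assumes a Python 3 interpreter is available on the box and that /proc/loadavg and /proc/meminfo have the usual layout; the log path and interval in the usage note are hypothetical.

```python
import os
import time


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_meminfo(text, keys=("MemFree", "SwapFree")):
    """Pick selected 'Key: value kB' lines out of /proc/meminfo text."""
    found = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in keys and rest.split():
            found[name] = int(rest.split()[0])
    return found


def snapshot_line(loadavg_text, meminfo_text, now=None):
    """Build one timestamped log line from the two /proc texts."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    one, five, fifteen = parse_loadavg(loadavg_text)
    mem = parse_meminfo(meminfo_text)
    mem_part = " ".join(f"{k}={v}kB" for k, v in mem.items())
    return f"{stamp} load={one:.2f},{five:.2f},{fifteen:.2f} {mem_part}"


def append_snapshot(path, line):
    """Append a line and force it to disk so it survives a hard reboot."""
    with open(path, "a") as log:
        log.write(line + "\n")
        log.flush()
        os.fsync(log.fileno())


def run(path, interval_seconds, count):
    """Write `count` snapshots to `path`, sleeping between them."""
    for _ in range(count):
        with open("/proc/loadavg") as f:
            loadavg_text = f.read()
        with open("/proc/meminfo") as f:
            meminfo_text = f.read()
        append_snapshot(path, snapshot_line(loadavg_text, meminfo_text))
        time.sleep(interval_seconds)
```

Something like run("/var/tmp/hang-snapshots.log", 60, 1440) would cover a day at one-minute intervals. Whatever is wedging the server may stop this too, but the last line written should at least show the load and free memory shortly before the hang.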
Further weirdness: when the server is told to shutdown at the console, it becomes
ssh-able again for a few moments during the shutdown process. This suggests to me
that there's some process running that causes the server to be unresponsive, and when
it's killed during the server shutdown, things revert to normal again. (Of course,
then the server reboots.) I *really* wish this server wasn't in a data center
half-way across the country!
The datacenter swapped out the network card, motherboard, and memory last week, but
the server has hung again since.
I'm stumped. I think the next course of action is to try running the precompiled
kernel on this server, but that'll lose me the highmem features.
I realize that there are probably waaaay too many variables different between these
two servers for the source of the problem to be obvious, but I wonder if anyone has seen
anything similar and might suggest a course of action. Do these symptoms sound at all
familiar? Trying to solve this sort of problem by experiment is wretched with a
server in a datacenter and a problem that isn't reliably reproducible!
Thanks in advance for any ideas, suggestions for further investigation, or
encouragement!
Cathy Sarisky