+1 for collectd (lack of familiarity with munin notwithstanding). It
helped me identify a home-grown C++ program going into a busy spin
the first time it happened, after weeks of working flawlessly.
Monit is also excellent to try to keep things cool (prevention).
For this specific scenario, I'd
I would use collectd instead; it has much better resolution and scales up
(which munin doesn't).
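
If setting up collectd is not an option right away, the same idea (keep a record on disk that survives the reboot) can be roughed out in a few lines. This is only a sketch and not how collectd works: the function names, the threshold and the log path passed on the command line are all made up for illustration, and it assumes a Linux box with procps `ps`.

```python
#!/usr/bin/env python3
"""Append one load-average snapshot to a file; meant to be run from cron."""
import os
import subprocess
import sys
import time

THRESHOLD = 8.0  # hypothetical 1-minute load above which we also log processes


def format_snapshot(now, loads, detail=""):
    """Build one record: a timestamped load line plus an optional process list."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    line = "%s load: %.2f %.2f %.2f\n" % (stamp, loads[0], loads[1], loads[2])
    return line + detail


def top_processes(count=10):
    """Return the ps header plus the `count` most CPU-hungry processes."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,pmem,stat,comm", "--sort=-pcpu"],
        capture_output=True, text=True, check=False,
    ).stdout
    return "\n".join(out.splitlines()[: count + 1]) + "\n"


def record_snapshot(log_path, threshold=THRESHOLD):
    """Append the current load (and, under heavy load, the top processes)."""
    loads = os.getloadavg()
    # Only pay for the process listing when the box is actually struggling.
    detail = top_processes() if loads[0] >= threshold else ""
    with open(log_path, "a") as log:
        log.write(format_snapshot(time.time(), loads, detail))
        log.flush()
        os.fsync(log.fileno())  # push it to disk before the machine goes down


if __name__ == "__main__" and len(sys.argv) == 2:
    record_snapshot(sys.argv[1])
```

Run it from cron every minute with the log file as the only argument. After a forced reboot, the last few records show how the load climbed and which processes were on top at the time.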
My 2 cents,
Ohad
On 9/18/09, Shachar Shemesh shac...@shemesh.biz wrote:
Hetz Ben Hamo wrote:
So my question: What do you do in case you have the same scenario?
What steps do you take to prevent
Hi,
I have my own server which is located in US. It's running CentOS 5.3.
I tried to ssh into it today... no go. I could ping it, but none of the
services were accessible: http, ssh, etc.
I tried to connect using serial. I could see the welcome message, but
I couldn't log in (timeout).
The only
On 17/09/2009, at 22:33, Hetz Ben Hamo wrote:
I tried to ssh into it today... no go. I could ping it, but none of the
services were accessible: http, ssh, etc.
Sep 17 12:58:11 hetz sendmail[2707]: rejecting connections on daemon
MTA: load average: 140
So my question: What do you do in case you
Sammy,
Watchdog is good and nice, but its main purpose is to reboot the
server if something goes wrong. That's not what I'm looking for.
Hetz
On Thu, Sep 17, 2009 at 10:40 PM, sammy ominsky s...@avoidant.org wrote:
On 17/09/2009, at 22:33, Hetz Ben Hamo wrote:
I tried to ssh it today.. no
You can make it do other things too, like kill or restart processes.
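
To illustrate the "kill or restart" idea outside of watchdog itself, here is a rough sketch. It is not watchdog's own mechanism: the function names and the CPU limit are invented for the example. One caveat: procps `ps` reports `pcpu` averaged over the process lifetime, so a long-running process that only recently started spinning will be under-reported.

```python
import os
import signal
import subprocess

CPU_LIMIT = 90.0  # hypothetical: percent CPU that counts as "runaway"


def find_runaways(ps_output, cpu_limit=CPU_LIMIT):
    """Return (pid, pcpu, command) for each ps row at or above cpu_limit."""
    hits = []
    for row in ps_output.splitlines()[1:]:  # skip the header row
        fields = row.split(None, 2)
        if len(fields) < 3:
            continue
        pid, pcpu, comm = int(fields[0]), float(fields[1]), fields[2]
        if pcpu >= cpu_limit:
            hits.append((pid, pcpu, comm))
    return hits


def kill_runaways(dry_run=True):
    """List runaway processes; send SIGTERM only when dry_run is False."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"], capture_output=True, text=True
    ).stdout
    hits = find_runaways(out)
    for pid, pcpu, comm in hits:
        print("runaway: pid=%d cpu=%.1f%% %s" % (pid, pcpu, comm))
        if not dry_run and pid != os.getpid():
            os.kill(pid, signal.SIGTERM)
    return hits
```

Calling `kill_runaways()` with the default `dry_run=True` only prints what it would have killed, which is the safer way to try it on a live server.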
--sambo
On 18/09/2009, at 00:29, Hetz Ben Hamo wrote:
Sammy,
Watchdog is good and nice, but its main purpose is to reboot the
server if something goes wrong. That's not what I'm looking for.
Hetz
On Thu, Sep 17, 2009
I can write a simple script which will detect if a process goes crazy
and kill it or restart the service; that's not the issue.
The issue is about investigating a rebooted machine after a huge load.
For example: let's say it's not your well-maintained machine but
your friend's small web server
Hetz Ben Hamo wrote:
So my question: What do you do in case you have the same scenario?
What steps do you take to prevent things like that from happening?
I would focus less on prevention, and more on diagnostics. I usually use
munin (you can see a live example at