Re: hosted machine and load average

2009-09-19 Thread Amos Shapira
+1 for collectd (lack of familiarity with munin notwithstanding). It helped me identify a home-grown C++ program going into a busy spin the first time it happened, after weeks of working flawlessly. Monit is also excellent for trying to keep things cool (prevention). For this specific scenario, I'd

Re: hosted machine and load average

2009-09-18 Thread Ohad Levy
I would use collectd instead; it has much better resolution and scales up (which munin doesn't). My 2 cents, Ohad On 9/18/09, Shachar Shemesh shac...@shemesh.biz wrote: Hetz Ben Hamo wrote: So my question: What do you do in case you have the same scenario? what steps do you take to prevent

hosted machine and load average

2009-09-17 Thread Hetz Ben Hamo
Hi, I have my own server which is located in the US. It's running CentOS 5.3. I tried to ssh it today.. no go. I could ping it, but none of the services were accessible: http, ssh, etc.. I tried to connect using serial. I could see the welcome message, but I couldn't log in (timeout). The only

Re: hosted machine and load average

2009-09-17 Thread sammy ominsky
On 17/09/2009, at 22:33, Hetz Ben Hamo wrote: I tried to ssh it today.. no go. I could ping it, but none of the services were accessible: http, ssh, etc.. Sep 17 12:58:11 hetz sendmail[2707]: rejecting connections on daemon MTA: load average: 140 So my question: What do you do in case you

Re: hosted machine and load average

2009-09-17 Thread Hetz Ben Hamo
Sammy, Watch is good and nice, but Watchdog's main purpose is to reboot the server if something goes wrong. That's not what I'm looking for. Hetz On Thu, Sep 17, 2009 at 10:40 PM, sammy ominsky s...@avoidant.org wrote: On 17/09/2009, at 22:33, Hetz Ben Hamo wrote: I tried to ssh it today.. no

Re: hosted machine and load average

2009-09-17 Thread sammy ominsky
You can make it do other things too, like kill or restart processes. --sambo On 18/09/2009, at 00:29, Hetz Ben Hamo wrote: Sammy, Watch is good and nice, but Watchdog's main purpose is to reboot the server if something goes wrong. That's not what I'm looking for. Hetz On Thu, Sep 17, 2009

Re: hosted machine and load average

2009-09-17 Thread Hetz Ben Hamo
I can write a simple script which will detect if a process goes crazy and kill it/restart the service, that's not the issue. The issue is about investigating a rebooted machine after a huge load, for example: let's say it's not your well-cared-for machine but it's your friend's small web server
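
For the post-mortem side of this (what was running when the load spiked?), something along these lines could be a starting point. It is only a rough sketch, not anything posted in the thread: it assumes Python 3.7+ and a Linux box with /proc/loadavg and procps `ps`, and the function names, the threshold of 8.0 and the log path are all made up for illustration. The idea is to append the load averages plus the top CPU consumers to a file and fsync it, so the record has a chance of surviving a forced reboot.

```python
import os
import subprocess
import time


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def is_overloaded(load1, threshold):
    """True when the 1-minute load average has reached the threshold."""
    return load1 >= threshold


def format_record(timestamp, loads, process_table):
    """Build one log record: a timestamped load line followed by the process table."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    header = "%s load average: %.2f %.2f %.2f" % ((stamp,) + tuple(loads))
    return header + "\n" + process_table.rstrip() + "\n\n"


def snapshot(log_path, threshold=8.0):
    """Append a record to log_path if the load is high; return True if one was written.

    The threshold default is an arbitrary, hypothetical value; tune it per machine.
    """
    with open("/proc/loadavg") as f:
        loads = parse_loadavg(f.read())
    if not is_overloaded(loads[0], threshold):
        return False
    # List processes sorted by CPU usage, highest first (procps ps syntax).
    ps = subprocess.run(
        ["ps", "-eo", "pid,pcpu,pmem,stat,comm", "--sort=-pcpu"],
        capture_output=True,
        text=True,
    )
    # Keep the header line plus the 15 busiest processes.
    top = "\n".join(ps.stdout.splitlines()[:16])
    with open(log_path, "a") as log:
        log.write(format_record(time.time(), loads, top))
        log.flush()
        # Force the record to disk so it is more likely to survive a hard reboot.
        os.fsync(log.fileno())
    return True
```

Calling `snapshot("/var/log/load-snapshots.log")` (a hypothetical path) from cron once a minute would be one way to use it. Keep in mind that at a load average like 140, starting `ps` may itself take a long time or fail, so a long-running collector such as the ones suggested elsewhere in this thread is likely to be more reliable.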

Re: hosted machine and load average

2009-09-17 Thread Shachar Shemesh
Hetz Ben Hamo wrote: So my question: What do you do in case you have the same scenario? what steps do you take to prevent things like that from happening? I would focus less on prevention, and more on diagnostics. I usually use munin (you can see a live example at