Re: [Shinken-devel] Problem with timeouts

David Good Tue, 12 May 2015 16:17:57 -0700

On 5/12/15 2:46 PM, Felipe openglx wrote:
> The devs will be able to give more specifics (maybe even confirm if 
> 2.4 performs better for your case?) but I faced similar timeout 
> issues because of the time it took to "slice and dice" the number of 
> objects.
> If you can enable debug mode on all nodes and provide some captures it 
> would be great.


OK -- I'll see about setting that up.
>
> How interdependent are your hosts? Are there a lot of parent 
> relationships between hosts? And what is the ratio of services to hosts?

There is some parenting, but not that much -- about 1000 of the hosts 
have parents (VM guests on hypervisors).  When I started having trouble, 
there were a total of 3300 hosts and 30000 services.
>
> As far as I know from the internals, the number of objects isn't the 
> only factor: how they relate to each other will increase the number of 
> "broks" that the daemons have to handle.
> In one of my production environments we have nearly 30k hosts with 
> ping-only checks, which makes the number of broks smaller than in an 
> environment with 3k hosts and 30k services (one of my other prod envs).
>
>
> What I want to say is: your environment is considered mid-sized for 
> Shinken and you have some very beefy servers, so it looks like there is 
> some tuning to be done.
> If the thread pool didn't help, we need to check what the daemons are 
> doing that makes them "forget" to reply to pings.
>
> Could you confirm OS, Python version, and anything else you think is 
> relevant?

Scientific Linux 6.5 (pretty much the same as RHEL or CentOS 6.5). 
Python 2.6.6 (installed from system-provided RPM - 
python-2.6.6-51.el6.x86_64).

We do have a *lot* of hostgroups included in host definitions. Probably 
averaging around 10 per host.  This would've probably at least doubled 
with the new host config generation scheme.
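
For reference, here is a quick back-of-the-envelope sketch of the ratios 
asked about above. It uses only the approximate figures quoted in this 
thread; the function name and dictionary keys are made up for 
illustration and have nothing to do with Shinken itself.

```python
# Rough figures quoted in this thread (all approximate).
hosts = 3300                # total hosts when the trouble started
services = 30000            # total services at that point
hosts_with_parents = 1000   # VM guests that have a hypervisor as parent
hostgroups_per_host = 10    # estimated average hostgroups per host


def summarize(hosts, services, hosts_with_parents, hostgroups_per_host):
    """Return a few ratios that describe how dense the configuration is."""
    return {
        "services_per_host": services / float(hosts),
        "parent_fraction": hosts_with_parents / float(hosts),
        "hostgroup_memberships": hosts * hostgroups_per_host,
    }


summary = summarize(hosts, services, hosts_with_parents, hostgroups_per_host)
for key in sorted(summary):
    print("%s: %s" % (key, summary[key]))
```

So that is roughly 9 services per host, about 30% of hosts with a parent, 
and something like 33000 host-to-hostgroup memberships.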

>
> Have you checked that a firewall isn't blocking a return path? I 
> remember one case where a node received the ping request but the pong 
> reply wasn't being received due to more rigorous firewalling...
> You mentioned that it worked with 2k hosts; I just wanted to double 
> check that.
> Maybe bring the system up with 0 hosts just to see if everyone is happy?

No firewalls and the servers are all on the same switch and are even on 
adjacent ports.  None of them are running iptables.
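
For anyone who wants to rule out a blocked path themselves, below is a 
minimal sketch of a plain TCP reachability check that can be run from 
each node toward the others. It only tests that a TCP connection can be 
opened; it does not speak Shinken's own ping protocol. The port numbers 
are an assumption (what I believe are the stock daemon ports), and the 
function names are hypothetical, so adjust both to match your config.

```python
import socket

# ASSUMPTION: stock Shinken daemon ports; change these to match your setup.
DEFAULT_PORTS = {
    "scheduler": 7768,
    "reactionner": 7769,
    "arbiter": 7770,
    "poller": 7771,
    "broker": 7772,
    "receiver": 7773,
}


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, filtered, or timed out.
        return False
    sock.close()
    return True


def check_daemons(host, ports=DEFAULT_PORTS, timeout=3.0):
    """Map each daemon name to whether its port is reachable on host."""
    return dict(
        (name, can_connect(host, port, timeout))
        for name, port in ports.items()
    )
```

Calling check_daemons() with the name of another node returns a dict 
such as {"scheduler": True, ...}. A False for a daemon that is known to 
be running on that node would point at filtering somewhere in between.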

It was happy before I made some major changes in how the host/service 
configuration is generated (host definitions are now generated from an 
internal inventory database) which resulted in adding another 1000 or so 
hosts.  It has since been decided that about 1000 servers don't need to 
be monitored so now I'm down to about 2200 servers.  We'll see if that 
helps any.
>
>
> (Sorry that my e-mail is a bit confusing, it's bedtime but I felt the 
> need to assist with the troubleshooting... Tomorrow I will try to make 
> it clearer).
>

No problem.  I really appreciate the help.




