Re: [Shinken-devel] Problem with timeouts

David Good Thu, 14 May 2015 15:25:31 -0700

Here's the poller.ini file I'm using:

[daemon]

#-- Global Configuration
#user=shinken ; if not set then by default it's the current user.
#group=shinken ; if not set then by default it's the current group.
# Set to 0 if you want to make this daemon NOT run
daemon_enabled=1

# Larger configurations need more threads (default is 8?)
daemon_thread_pool_size=50

#-- Path Configuration
# The daemon will chdir into the directory workdir when launched
# paths variables values, if not absolute paths, are relative to workdir.
# using default values for following config variables value:
workdir = /var/run/shinken
logdir = /var/log/shinken

pidfile=%(workdir)s/pollerd.pid

#-- Network configuration
# host=0.0.0.0
# port=7771
# http_backend=auto
# idontcareaboutsecurity=0

#-- SSL configuration --
use_ssl=0
# WARNING : Put full paths for certs
#ca_cert=/etc/shinken/certs/ca.pem
#server_cert=/etc/shinken/certs/server.cert
#server_key=/etc/shinken/certs/server.key
#hard_ssl_name_check=0

#-- Local log management --
# Enabled by default to ease troubleshooting
use_local_log=1
local_log=%(logdir)s/pollerd.log

# accepted log level values= DEBUG,INFO,WARNING,ERROR,CRITICAL
log_level=INFO
#log_level=DEBUG

And here's the poller.cfg file:

#===============================================================================
# POLLER (S1_Poller)
#===============================================================================
# Description: The poller is responsible for:
# - Active data acquisition
# - Local passive data acquisition
# https://shinken.readthedocs.org/en/latest/08_configobjects/poller.html
#===============================================================================
define poller {
    poller_name         poller-1
    address             shinken1.dc1.example.com
    port                7771

    ## Optional
    spare               0   ; 1 = is a spare, 0 = is not a spare
    manage_sub_realms   0   ; Does it take jobs from schedulers of sub-Realms?
    min_workers         0   ; Starts with N processes (0 = 1 per CPU)
    max_workers         0   ; No more than N processes (0 = 1 per CPU)
    processes_by_worker 256 ; Each worker manages N checks
    polling_interval    1   ; Get jobs from schedulers each N seconds
    timeout             3   ; Ping timeout
    data_timeout        120 ; Data send timeout
    max_check_attempts  3   ; If ping fails N or more, then the node is dead
    check_interval      60  ; Ping node every N seconds

    ## Interesting modules that can be used:
    # - booster-nrpe = Replaces the check_nrpe binary. Therefore it
    #                  enhances performances when there are lot of NRPE
    #                  calls.
    # - named-pipe   = Allow the poller to read a nagios.cmd named pipe.
    #                  This permits the use of distributed check_mk checks
    #                  should you desire it.
    # - SnmpBooster  = Snmp bulk polling module
    modules             named-pipe, booster-nrpe

    ## Advanced Features
    #passive            0   ; For DMZ monitoring, set to 1 so the connections
                            ; will be from scheduler -> poller.

    # Poller tags are the tag that the poller will manage. Use None as tag name to manage
    # untaggued checks
    #poller_tags        None

    # Enable https or not
    use_ssl             0
    # enable certificate/hostname check, will avoid man in the middle attacks
    hard_ssl_name_check 0

    realm               All
}

On 5/14/15 3:13 PM, David Good wrote:

Here's another example of what I'm seeing -- In the arbiter log I'll see something like this:

[1431641122] INFO: [Shinken] [All] Trying to send configuration to poller poller-1
[1431641242] ERROR: [Shinken] Failed sending configuration for poller-1: Connexion error to http://shinken1.dc1.example.com:7771/ : Operation timed out after 120001 milliseconds with 0 bytes received

And then just a few seconds later:

[1431641291] INFO: [Shinken] [All] Trying to send configuration to poller poller-1
[1431641291] INFO: [Shinken] [All] Dispatch OK of configuration 1 to poller poller-1

And this poller is on the same server as the arbiter. I see this happening sporadically for pretty much every daemon, causing the configuration to be constantly in the process of being re-dispatched. This is especially frustrating as I'm trying to test out some new configs adding and removing hosts and services from monitoring. If it can't finish dispatching it makes it hard to test :-/
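
For anyone wanting to measure how long each dispatch attempt takes, here is a rough Python sketch that pairs each "Trying to send configuration" line with the next "Failed sending" or "Dispatch OK" line for the same daemon and reports the elapsed seconds. The function name dispatch_durations and the regular expressions are my own, written against the four log lines quoted above only; other Shinken versions may word these messages differently. On the sample, the failed attempt comes out at exactly 120 seconds, which lines up with the "data_timeout 120" setting in poller.cfg and the "120001 milliseconds" in the error text.

```python
import re

# Matches lines such as "[1431641122] INFO: [Shinken] ..." (format assumed
# from the arbiter log excerpts quoted in this thread).
LINE_RE = re.compile(r"^\[(\d+)\]\s+(\w+):\s+(.*)$")
START_RE = re.compile(r"Trying to send configuration to \S+ (\S+)")
FAIL_RE = re.compile(r"Failed sending configuration for ([^:\s]+):")
OK_RE = re.compile(r"Dispatch OK of configuration \d+ to \S+ (\S+)")


def dispatch_durations(lines):
    """Return a list of (daemon_name, seconds, outcome) per dispatch attempt."""
    pending = {}   # daemon name -> epoch timestamp of the last "Trying to send"
    results = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        timestamp = int(match.group(1))
        message = match.group(3)

        started = START_RE.search(message)
        if started:
            pending[started.group(1)] = timestamp
            continue

        for pattern, outcome in ((FAIL_RE, "failed"), (OK_RE, "ok")):
            finished = pattern.search(message)
            if finished:
                begun_at = pending.pop(finished.group(1), None)
                if begun_at is not None:
                    results.append((finished.group(1), timestamp - begun_at, outcome))
                break
    return results


SAMPLE_LOG = """
[1431641122] INFO: [Shinken] [All] Trying to send configuration to poller poller-1
[1431641242] ERROR: [Shinken] Failed sending configuration for poller-1: Connexion error to http://shinken1.dc1.example.com:7771/ : Operation timed out after 120001 milliseconds with 0 bytes received
[1431641291] INFO: [Shinken] [All] Trying to send configuration to poller poller-1
[1431641291] INFO: [Shinken] [All] Dispatch OK of configuration 1 to poller poller-1
"""

if __name__ == "__main__":
    for name, seconds, outcome in dispatch_durations(SAMPLE_LOG.splitlines()):
        print(f"{name}: {outcome} after {seconds}s")
```

Running it over a whole arbiter log (for example by passing an open file object instead of the sample) should show whether every failure sits at the same 120-second mark or whether the delays vary.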

On 5/14/15 2:49 PM, David Good wrote:
I doubt that was the case -- I was careful to make sure everything was stopped before restarting.

And now my problems have started up again. I may be forced to upgrade to 2.4 to see if it helps any. Very frustrating. If that doesn't fix it, I may be forced to fall back to Nagios and Gearman. I'd hate to do that as we had promised that Shinken would scale better than Nagios.

On 5/13/15 2:50 PM, Felipe openglx wrote:
Play the lotto just in case ;)

My suspicion would be that your previous "restart" to adjust the thread pool (or other testing) didn't kill all threads, hence why you had some very unusual situations going on.

Let us know how it goes, best luck on getting the project delivered!

Regards

On 13 May 2015 at 22:18, David Good <dg...@willingminds.com> wrote:

It was all hosts, but I just reloaded with a new config, so we'll see if my luck holds :-)

On 5/13/15 2:00 PM, Felipe openglx wrote:

I've noticed that Shinken 2 doesn't go easily with kill. I've always done "pkill -9 -f shinken-" when needing to restart them.

Glad to hear you got something working, David. All hosts or just a fraction of them?

Regards

On 13 May 2015 at 21:43, David Good <dg...@willingminds.com> wrote:

OK, things seem to be stable now. I discovered that several of the
schedulers were using massive amounts of memory (over 30GB) causing the
kernel to try to kill them or their children. I restarted them, then
restarted anything that showed up as a problem in the arbiter log and
since then it's been stable.

One odd thing though is that some of the daemons wouldn't die normally
-- I had to use 'kill -KILL' on them.
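
The "ask politely, then force" sequence described here can be scripted. Below is a minimal Python sketch, assuming a POSIX system and that you already know the PID of the stuck daemon (for instance from its pidfile). The names pid_exists and stop_pid, the 10-second grace period and the return strings are all made up for illustration; none of this is part of Shinken.

```python
import os
import signal
import time


def pid_exists(pid):
    """Return True if a process with this PID is still present."""
    try:
        os.kill(pid, 0)          # signal 0 only checks, it delivers nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True              # it exists, but belongs to another user
    return True


def stop_pid(pid, grace=10.0, poll=0.2):
    """Send SIGTERM, wait up to `grace` seconds, then fall back to SIGKILL.

    Returns "GONE" if the process was already absent, "TERM" if it exited
    after SIGTERM, and "KILL" if SIGKILL had to be sent.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "GONE"

    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not pid_exists(pid):
            return "TERM"
        time.sleep(poll)

    try:
        os.kill(pid, signal.SIGKILL)   # same effect as 'kill -KILL <pid>'
    except ProcessLookupError:
        return "TERM"                  # it exited just as the grace period ended
    return "KILL"
```

One caveat: a process that is an unreaped child of the caller still counts as present, so this is meant for stopping daemons started elsewhere, not children of the script itself. Logging which daemons come back as "KILL" would give a record of which ones are ignoring the normal shutdown.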

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
_______________________________________________
Shinken-devel mailing list
Shinken-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/shinken-devel
