Re: [Nagios-users] [Nagios-devel] RFC/RFP Nagios command workers

2011-07-01 Thread Andreas Ericsson
On 06/30/2011 11:58 PM, Adam Augustine wrote:

 There seem to be two issues, I think, that are getting mixed here. I think
 the accusation of Mozilla Firefox/Nagios Core feature stagnation is a
 separate issue from putting things in core as opposed to making them a
 NEB module.
 
 The Linux kernel is a good example of tons of features existing in modules
 and not being included in the core, yet not having the feature stagnation
 problem. The difference between those and FF/Nagios is that the modules are
 included in the /distribution/ of the code and many are active by default.
 

The kernel has far more compelling reasons to turn things into modules
though, since far from every system is directly connected to a
token ring network, or hooked up to an atomic clock via the parallel
port, or similar weird things that Linux supports but that are pretty
unusual.

 This has the major advantage that if someone doesn't like/need a particular
 module, it can be trivially removed. This helps performance tuning, etc. At
 the same time, it allows feature progression, by default.
 
 I think if this approach were taken, more modularization rather than hard
 coding things into the core, but including more widely accepted modules
 in the default Nagios Core distribution, that would keep everyone happy.
 

Possibly, but then we'd have collaboration issues between core coders and
module hackers instead, and new modules would need some way of getting
included that doesn't interfere with other modules.

 That being said, I think placing the networking socket code into the core is
 completely reasonable, since it is such an essential part of the
 architecture.

Since worker processes are intended to be the new default way of running
checks, everything that functionality relies on must of course also
be in the core. The fact that it makes life easier for a ton of modules
is just a happy accident, really.

-- 
Andreas Ericsson   andreas.erics...@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null


Re: [Nagios-users] [Nagios-devel] RFC/RFP Nagios command workers

2011-06-30 Thread Adam Augustine
On Wed, Jun 29, 2011 at 2:50 AM, Andreas Ericsson a...@op5.se wrote:

 On 06/28/2011 05:13 PM, Matthieu Kermagoret wrote:
  Hi list,
 
  First of all, sorry for the delayed response, last month was pretty
  crazy at work :-p
 
  On Mon, May 23, 2011 at 12:38 PM, Andreas Ericsson a...@op5.se wrote:
  On 05/23/2011 11:37 AM, Matthieu Kermagoret wrote:
  Because shipping an official module that does it would mean not only


[snip]


  For years it's been Nagios'
  development team's policy not to include features that could be
  written as modules. I liked it that way.
 

 Everything can be written as modules. The worker process thing will have
 the nice side effect that modules can register sockets that core Nagios
 will listen to events from, with a special callback when there's data
 available on the socket. This reduces complexity of a lot of modules by
 a fair bit. With worker processes instead of multiple threads it's also
 trivial to write modules with regard to thread safety, and potential
 leaks in worker modules (such as embedded perl) can be ignored, since
 we can just kill the worker process and spawn a new one once it's done
 some arbitrary number of checks. This is how Apache handles leaky
 modules and we could do far worse than using the world's most popular
 webserver as an example.

 There's also another thing. Mozilla Firefox has been accused of feature
 stagnation in the core since they let addon writers handle adding new
 features, and far from everybody uses modules. Google Chrome has taken
 a fair share of users from Firefox lately, partly because it implements
 some of the more popular modules directly in-core. Nagios has also been
 accused of feature stagnation, even though broker module development
 has flourished in recent years (nagios with modules is nothing like the
 old nagios without them), so it makes sense to add certain selected
 module capabilities to the core.


There seem to be two issues, I think, that are getting mixed here. I think
the accusation of Mozilla Firefox/Nagios Core feature stagnation is a
separate issue from putting things in core as opposed to making them a
NEB module.

The Linux kernel is a good example of tons of features existing in modules
and not being included in the core, yet not having the feature stagnation
problem. The difference between those and FF/Nagios is that the modules are
included in the /distribution/ of the code and many are active by default.

This has the major advantage that if someone doesn't like/need a particular
module, it can be trivially removed. This helps performance tuning, etc. At
the same time, it allows feature progression, by default.

I think if this approach were taken, more modularization rather than hard
coding things into the core, but including more widely accepted modules
in the default Nagios Core distribution, that would keep everyone happy.

That being said, I think placing the networking socket code into the core is
completely reasonable, since it is such an essential part of the
architecture.
Although theoretically you could remove all networking from the Linux kernel
and still run...

Really, on that part at least, I think either way would work out fine.

Just a thought,
  Adam Augustine






Re: [Nagios-users] [Nagios-devel] RFC/RFP Nagios command workers

2011-06-29 Thread Andreas Ericsson
On 06/28/2011 05:13 PM, Matthieu Kermagoret wrote:
 Hi list,
 
 First of all, sorry for the delayed response, last month was pretty
 crazy at work :-p
 
 On Mon, May 23, 2011 at 12:38 PM, Andreas Ericsson a...@op5.se wrote:
 On 05/23/2011 11:37 AM, Matthieu Kermagoret wrote:
 Because shipping an official module that does it would mean not only
 supporting the old complexity, but also the new one. Having a single
 default system for running checks would definitely be preferable to
 supporting multiple ones.

 
 I agree with you when you say that a single system is better than two.
 However I fear that the worker system would need much more code than a
 simpler system (and less code usually means fewer bugs) and that the
 worker system would destabilize Nagios.

Quite the opposite, really. The amount of backflips we're doing right
now to make sure the core is threadsafe is huge, so it's likely this
patch will even reduce the LoC count in Nagios.

 For years it's been Nagios'
 development team's policy not to include features that could be
 written as modules. I liked it that way.
 

Everything can be written as modules. The worker process thing will have
the nice side effect that modules can register sockets that core Nagios
will listen to events from, with a special callback when there's data
available on the socket. This reduces complexity of a lot of modules by
a fair bit. With worker processes instead of multiple threads it's also
trivial to write modules with regard to thread safety, and potential
leaks in worker modules (such as embedded perl) can be ignored, since
we can just kill the worker process and spawn a new one once it's done
some arbitrary number of checks. This is how Apache handles leaky
modules and we could do far worse than using the world's most popular
webserver as an example.
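
The kill-and-respawn idea can be sketched roughly as follows. This is
not Nagios code: Python is used only for brevity, the names
MAX_CHECKS_PER_WORKER, worker_loop, spawn_worker and run_checks are made
up for this illustration, and the "check" is a stand-in that just echoes
the request together with the worker's PID.

```python
import os
import socket

MAX_CHECKS_PER_WORKER = 3  # hypothetical; the thread only says "some arbitrary number"


def worker_loop(sock, max_checks):
    """Child: answer up to max_checks requests, then exit so the kernel frees everything."""
    try:
        reader = sock.makefile("r")
        for _ in range(max_checks):
            line = reader.readline()
            if not line:  # master closed its end
                break
            # Stand-in for running a check: report which process handled it.
            sock.sendall(f"{os.getpid()} {line.strip()}\n".encode())
    finally:
        os._exit(0)  # never fall back into the master's code


def spawn_worker(max_checks):
    """Fork one worker connected to the master through a socket pair."""
    master_end, worker_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        master_end.close()
        worker_loop(worker_end, max_checks)
    worker_end.close()
    return pid, master_end


def run_checks(checks, max_checks=MAX_CHECKS_PER_WORKER):
    """Master: hand out checks one at a time, replacing the worker once it retires."""
    results = []
    pid, sock = spawn_worker(max_checks)
    reader = sock.makefile("r")
    used = 0
    for check in checks:
        if used == max_checks:
            # The old worker has exited by itself; reap it and start a fresh one.
            reader.close()
            sock.close()
            os.waitpid(pid, 0)
            pid, sock = spawn_worker(max_checks)
            reader = sock.makefile("r")
            used = 0
        sock.sendall((check + "\n").encode())
        worker_pid, name = reader.readline().split()
        results.append((int(worker_pid), name))
        used += 1
    reader.close()
    sock.close()
    os.waitpid(pid, 0)
    return results
```

Calling run_checks() with seven check names and max_checks=3 should
show three different worker PIDs in the results. Whatever a worker
leaked during its three checks is gone when it exits.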

There's also another thing. Mozilla Firefox has been accused of feature
stagnation in the core since they let addon writers handle adding new
features, and far from everybody uses modules. Google Chrome has taken
a fair share of users from Firefox lately, partly because it implements
some of the more popular modules directly in-core. Nagios has also been
accused of feature stagnation, even though broker module development
has flourished in recent years (nagios with modules is nothing like the
old nagios without them), so it makes sense to add certain selected
module capabilities to the core.

 1) Remove the multiple fork system to execute a command. The Nagios
 Core process directly forks the process that will exec the command
 (more or less sh's parsing of command line, don't really know if this
 could/should be integrated in the Core).


 This really can't be done without using multiple threads since the
 core can't wait() for input and children while at the same time
 issuing select() calls to multiplex the new output of currently
 running checks.

 
 What about a signal handler on SIGCHLD that would wait() on terminated
 processes and a select() on pipe FDs connected to child processes, with
 a timeout to kill non-responding checks?
 

Highly impractical for short-lived children and with so many pipes to
listen to. It would mean we'd be iterating over the entire child stack
several hundred times per second just to read new output. We're forced
to do that, since pipes can't contain an infinite amount of data. The
child's write() call will block when the pipe is full and the children
won't exit while waiting to write. Doing so many select() calls means
the scheduler will suffer greatly, along with modules that wish to run
code in the main thread every now and then.

With sockets, we can let each worker handle a smaller number of checks
at a time, and since they have no scheduling responsibilities the
master process is free to just await new input.
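
A rough sketch of that division of labour, under stated assumptions: it
is written in Python rather than C, the JSON-lines wire format and the
names worker_main, start_worker, run_jobs and DEMO_JOBS are invented for
this example, and nothing here is taken from the actual patch. Each
worker runs the commands it is handed; the master does nothing but
select() on the worker sockets.

```python
import json
import os
import select
import socket
import subprocess
import sys


def worker_main(sock):
    """Worker: run one JSON-described job per input line, reply with one JSON line."""
    try:
        with sock.makefile("r") as jobs:
            for line in jobs:
                job = json.loads(line)
                proc = subprocess.run(job["argv"], capture_output=True, text=True)
                reply = {"id": job["id"], "rc": proc.returncode,
                         "out": proc.stdout.strip()}
                sock.sendall((json.dumps(reply) + "\n").encode())
    finally:
        os._exit(0)  # never fall back into the master's code


def start_worker(inherited):
    """Fork a worker; 'inherited' lists master-side sockets the child must drop."""
    master_end, worker_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        master_end.close()
        for other in inherited:
            other.close()
        worker_main(worker_end)
    worker_end.close()
    return pid, master_end


def run_jobs(jobs, n_workers=2, timeout=30.0):
    workers = []
    for _ in range(n_workers):
        workers.append(start_worker([s for _, s in workers]))
    socks = [s for _, s in workers]
    for i, argv in enumerate(jobs):  # deal the jobs out round-robin
        line = json.dumps({"id": i, "argv": argv}) + "\n"
        socks[i % n_workers].sendall(line.encode())
    results, pending = {}, {s: b"" for s in socks}
    while len(results) < len(jobs):
        # The master just sleeps here until some worker has something to report.
        readable, _, _ = select.select(socks, [], [], timeout)
        if not readable:
            raise TimeoutError("no worker reported in time")
        for s in readable:
            chunk = s.recv(4096)
            if not chunk:
                raise RuntimeError("a worker went away")
            *lines, pending[s] = (pending[s] + chunk).split(b"\n")
            for raw in lines:
                reply = json.loads(raw)
                results[reply["id"]] = reply
    for pid, s in workers:
        s.close()  # EOF tells the worker to exit
        os.waitpid(pid, 0)
    return [results[i] for i in range(len(jobs))]


# Stand-ins for plugins: exit code 0 is OK and 2 is CRITICAL by plugin convention.
DEMO_JOBS = [
    [sys.executable, "-c", "print('OK')"],
    [sys.executable, "-c", "import sys; print('CRITICAL'); sys.exit(2)"],
    [sys.executable, "-c", "print('OK')"],
]
```

run_jobs(DEMO_JOBS) returns one result per job, in job order, no matter
which worker finished first.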

 2) The root process and the subprocess are connected with a pipe() so
 that the command output can be fetched by reading the pipe. Nagios
 will maintain a list of currently running commands.


 Pipes are limited in that they only guarantee 512 bytes of atomic
 writes and reads. TCP sockets don't have this problem. There's also
 
 It is my understanding of POSIX that the core standard defines a
 512-byte minimal limit for atomic I/O operations but I cannot find any
 section enforcing atomicity on I/O operations on TCP sockets, so pipes
 would be better indeed. Were you referring to the XSI Streams or could
 you point me to the appropriate section?
 

No. TCP sockets don't enforce atomicity beyond the 512 bytes already
specified, but they do enforce ordering, which pipes don't. This is
actually a real problem (although an unusual one) when several processes
try to write data to Nagios' command pipe and one of them writes
more than the atomic limit on whatever system it's being written on.
The fact that pipes use fixed-size buffers (requiring a full
kernel recompile to change) and the fact that a program can change the
size of its socket buffers with a simple 

Re: [Nagios-users] [Nagios-devel] RFC/RFP Nagios command workers

2011-05-23 Thread Andreas Ericsson
On 05/23/2011 11:37 AM, Matthieu Kermagoret wrote:
 
 The idea to solve all of that is to fork() off a set of worker
 threads at startup that free()'s all possible memory and re-connects
 to the master process via a unix domain socket (or network socket
 that by default only listens to the localhost address) to receive
 requests to run commands and return the results of those commands.

 
 While I agree that distributing check execution among multiple
 processes can be a really good idea, I don't know if this should be
 implemented in the Core. This can add significant complexity to the
 code while not being useful to all Nagios users. The Core already has
 a proper API that allows modules to execute checks themselves, so why
 not rely on it for distribution and improve the existing command
 execution mechanism?
 

Because shipping an official module that does it would mean not only
supporting the old complexity, but also the new one. Having a single
default system for running checks would definitely be preferable to
supporting multiple ones.

 As you say, one of the root problems of the current implementation is
 the use of temporary files, as this consumes much I/O when writing,
 scanning and reading them. Also the Nagios Core process is fork()ed
 multiple times and this might consume unnecessary CPU time. So I
 propose the following:
 
 1) Remove the multiple fork system to execute a command. The Nagios
 Core process directly forks the process that will exec the command
 (more or less sh's parsing of command line, don't really know if this
 could/should be integrated in the Core).
 

This really can't be done without using multiple threads since the
core can't wait() for input and children while at the same time
issuing select() calls to multiplex the new output of currently
running checks.


 2) The root process and the subprocess are connected with a pipe() so
 that the command output can be fetched by reading the pipe. Nagios
 will maintain a list of currently running commands.
 

Pipes are limited in that they only guarantee 512 bytes of atomic
writes and reads. TCP sockets don't have this problem. There's also
the fact that a lot of modules already use sockets, so we can get
rid of a lot of code in those modules and let them re-use Nagios'
main select() loop and get inbound events on their sockets as a
broker callback event. Much neater that way.
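
The 512-byte figure is the minimum that POSIX requires for PIPE_BUF;
the actual limit depends on the system and can be queried at runtime.
A small Python sketch, assuming a POSIX system (the helper name
pipe_atomic_limit is made up for this example):

```python
import os

POSIX_MINIMUM = 512  # _POSIX_PIPE_BUF, the smallest value a conforming system may use


def pipe_atomic_limit():
    """Return PIPE_BUF for a freshly created pipe on this system."""
    read_fd, write_fd = os.pipe()
    try:
        return os.fpathconf(write_fd, "PC_PIPE_BUF")
    finally:
        os.close(read_fd)
        os.close(write_fd)


limit = pipe_atomic_limit()
print(f"pipe writes of up to {limit} bytes are atomic here "
      f"(POSIX minimum: {POSIX_MINIMUM})")
```

A writer that stays at or below this limit per write() will not have
its data interleaved with that of other writers on the same pipe.
Anything larger may be split.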

 3) The event loop will multiplex processes' I/O and process them as necessary.
 

That's what the worker processes will do and then feed the results
back to the nagios core through the sequential socket, which will
guarantee read and write operations large enough to never truncate
any of the data necessary for the master process to do proper
bookkeeping.

 This has several benefits, although they're not immediately user
 visible.
 * I/O load will decrease significantly, leaving more disk throughput
   capacity for performance data graphing or status data database
   solutions.
 
 Still holds but to a smaller extent, as the problem of Nagios using a
 lot more copied memory per fork than it's supposed to is not solved.
 This could be solved with a module however, see below.
 

Not without the module also running external programs, which just means
more complexity inside the nagios core instead of less.

 * Scripting languages can be embedded regardless of memory leaks and
   whatnot, since worker daemons can be killed off and respawned every
   5 checks (or something), thus causing the kernel to clean up
   any and all leaked memory.
 
 There could be modules that override checks and forward them to
 interpreter daemons on a per-language basis for example.
 

Yup. I'd expect this to be a natural progression of how things work,
with Python being the first in queue to be embedded.

 * Nagios core can be single-threaded, which means higher portability,
   less memory usage and more robust code.
 
 Still holds.
 

Nope. It fails for all modules that require constantly poll()'ed
sockets.

 * Eventbroker modules that use a socket to communicate with an external
   daemon can instead register a handler for inbound packets and then
   simply own that connection and get all future packets from it
   forwarded as eventbroker events. This will of course reduce the module
   complexity quite a bit for nearly all much-used modules today (Merlin,
   livestatus, DNX, mod_gearman, NDOUtils, etc...)
 
 Still holds, instead of multiplexing on socket FD, multiplex on pipe FD.
 

Worker processes will multiplex on pipe fd's. Nagios will just poll the
sockets of the workers (and modules) that have connected to it, and
that's basically it.
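
What "register a socket and get called back when there's data" could
look like, as a minimal sketch only. The MainLoop class and its
register_socket/poll_once methods are hypothetical names for this
illustration, not the Nagios broker API, and Python's selectors module
stands in for the core's select()/poll() loop.

```python
import selectors
import socket


class MainLoop:
    """Hypothetical stand-in for the core's single I/O loop."""

    def __init__(self):
        self._selector = selectors.DefaultSelector()

    def register_socket(self, sock, callback):
        """A module hands over its socket; callback(sock) runs when data arrives."""
        sock.setblocking(False)
        self._selector.register(sock, selectors.EVENT_READ, callback)

    def unregister_socket(self, sock):
        self._selector.unregister(sock)

    def poll_once(self, timeout=1.0):
        """Wait for readable sockets, dispatch to their owners, return the event count."""
        events = self._selector.select(timeout)
        for key, _mask in events:
            key.data(key.fileobj)  # key.data is the callback given at registration
        return len(events)


received = []


def on_module_data(sock):
    """Example module callback: just collect whatever arrived."""
    received.append(sock.recv(4096))


loop = MainLoop()
core_side, daemon_side = socket.socketpair()  # daemon_side plays the external daemon
loop.register_socket(core_side, on_module_data)
daemon_side.sendall(b"status update")
handled = loop.poll_once()
```

The module no longer needs its own thread or its own select() loop. It
only supplies the callback, and the core decides when to poll.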

 * It becomes possible to receive responses from Nagios when submitting
   commands (the current FIFO pipe is one-way communication only).

 
 See discussion about the command pipe below.
 
 Drawbacks:
 * It's quite a large and invasive change to the nagios core which
   will require a lot of testing.

 
 This would be a less