Hello guys,

thank you for the many tips. I have carefully read through the e-mails from
Clayton, Risto, John, David, and Dusan, and I am replying to all of them
together in this e-mail, as several reactions were in a similar spirit.

To my primary question: I am satisfied that log messages are not lost under
"heavy load", but just (gradually) delayed. Currently, the delay is not a
problem in our case, as CPU consumption is not at 100% all the time, and SEC
has time "to catch the train at the next station". These are application
logs and our message load is variable, but not of the same order of
magnitude as Dusan's, and I am still not monitoring this flow (discussed
here: https://sourceforge.net/p/simple-evcorr/mailman/message/36910235/, a
medium-priority task in my Jira).

To my secondary questions about finding bottlenecks and optimizations, I'll
summarize the recommendations (with my comments), as some were mentioned by
several of you:

   - placing the most frequently matching ("heavy-load") rules first
   (problem: unpredictability; potential solution: re-ordering rules
   dynamically, according to the latest match statistics)
   - hierarchical / tree-like setup (Jump rules in Risto's examples, or a
   GoTo section ending with Suppress, as in our case; see the sketch after
   this list)
   - more instances of SEC (but there is also the risk of consuming not
   just 1 CPU, but as many CPUs as there are SEC instances running)
   - preprocessing of log messages with some kind of classification /
   branching (outside of SEC)
      - LogZilla (Clayton,
      http://demo.logzilla.net/help/event_correlation/intro_to_event_correlation)
      - rsyslog mmnormalize (David)
      - syslog-ng PatternDB (Dusan)
   - RegExp optimization
   - reducing the number of contexts, with active deletion where suitable
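
For illustration, here is a minimal sketch of the hierarchical setup with a
Jump rule (the file names, the "sshd" ruleset name, and the patterns are
invented for this example):

# main.sec: route sshd events to rule files belonging to the "sshd" ruleset
type=Jump
ptype=RegExp
pattern=sshd\[\d+\]:
desc=route sshd events to the sshd ruleset
cfset=sshd

# sshd.sec: join the "sshd" ruleset; with procallin=no, this file sees
# only events submitted by Jump rules, not all input events
type=Options
joincfset=sshd
procallin=no

# ...application-specific sshd rules would go here...

# final catch-all: suppress everything the earlier rules did not handle
type=Suppress
ptype=TValue
pattern=TRUE
desc=suppress remaining sshd events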

As some of you are interested in the design, I would need to ask my
employer for permission to publish it, as I am not the author of the rule
design; I just created the automation of rule generation (a topic discussed
here: https://sourceforge.net/p/simple-evcorr/mailman/message/36867012/)
and some minor optimizations. Although I consulted a lot on this mailing
list, mainly with Risto, the original pre-consultation designs remained
unchanged. For now, I will at least try to describe it at a high level, but
first it is important to mention how it originated:

We were migrating log monitoring from HPOM to an open-source monitoring
tool, using SEC to reduce the flow of duplicate events before passing them
to the monitoring agent, in the same manner as the HPOM agent with its
built-in correlations was used, so the design of the rules and correlations
follows how it was implemented in HPOM. There were hundreds to thousands of
pattern conditions per host in HPOM, and the structure of their sections
was as follows:

   - HPOM: suppress unmatched conditions -> SEC: Suppress with NRegExp
   - HPOM: suppress matched conditions -> SEC: Suppress with RegExp
   - HPOM: message conditions (with configured time-based correlations) ->
   SEC: Single with RegExp and GoTo -> duplicate-suppress time-based
   correlations, each consisting of 3-4 subsequent rules (Single,
   PairWithWindow, or SingleWithSuppress, depending on the duplicate-suppress
   correlation type); a rough sketch follows this list
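
To make that mapping more concrete, here is a rough, hypothetical sketch of
one converted section (the patterns are invented, and the GoTo branching
into per-message correlation groups is omitted for brevity):

# HPOM "suppress unmatched" -> drop lines not matching the expected format
type=Suppress
ptype=NRegExp
pattern=(INFO|WARN|ERROR)
desc=drop lines that do not look like application messages

# HPOM "suppress matched" -> drop known noise
type=Suppress
ptype=RegExp
pattern=heartbeat|keepalive
desc=drop known noise

# HPOM message condition with duplicate suppression -> pass the first
# ERROR through immediately, then stay silent about repeats for 600 seconds
type=SingleWithSuppress
ptype=RegExp
pattern=ERROR (.+)
desc=application error: $1
action=write - APPLICATION ERROR: $1
window=600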

We decided to automate the conversion of HPOM configurations to SEC rules,
so there was not much room for conceptual improvements over the HPOM
concepts (e.g. by doing a deeper analysis of the configurations and the
actual log traffic); we relied on the premise that those HPOM
configurations were OK, tuned by years of development and operations, so
the automated conversion was 1:1.

The circa 50 log files per host are of several types (according to message
structure), but each file was monitored independently in HPOM; therefore,
after the 1:1 conversion, each file is also monitored independently in SEC.
However, this means some rather ugly "configuration redundancy" for log
files of the same type, just as there was in HPOM. The static order of
conditions in HPOM is also preserved in the generated SEC rules.

I think the branching multi-level cascade structure with GoTo is OK. What
is harder is that the majority of log files have multi-line messages, and
some logs are multi-file (discussed here:
https://sourceforge.net/p/simple-evcorr/mailman/message/36861921/).
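
As a side note, a common SEC idiom for multi-line messages (a minimal
sketch with invented patterns, not our actual rules) is to collect the
lines into a context and report it once no further lines arrive:

# first line of a multi-line message: report and drop any previously
# collected message, then start a fresh context that reports itself
# if no further lines arrive within 5 seconds
type=Single
ptype=RegExp
pattern=^ERROR: (.+)
desc=start of multi-line error $1
action=report ml_msg /bin/cat; delete ml_msg; create ml_msg 5 (report ml_msg /bin/cat); add ml_msg $0

# indented continuation lines are appended while the context exists,
# and each one extends the idle timeout by another 5 seconds
type=Single
ptype=RegExp
pattern=^\s+\S
context=ml_msg
desc=multi-line continuation
action=add ml_msg $0; set ml_msg 5 (report ml_msg /bin/cat)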

Only one instance of SEC is running per host, so at most a single CPU can
be consumed. No other preprocessing/classification is used; SEC is doing
all the work.
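
Should load become a problem again, the multi-process split Risto suggested
could be as simple as this on the command line (the paths are purely
hypothetical):

# one independent SEC process per application group, each using its own CPU
sec --detach --conf='/etc/sec/app1/*.sec' --input='/var/log/app1/*.log' --log=/var/log/sec-app1.log --pid=/var/run/sec-app1.pid
sec --detach --conf='/etc/sec/app2/*.sec' --input='/var/log/app2/*.log' --log=/var/log/sec-app2.log --pid=/var/run/sec-app2.pid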

Anyway, I am happy with Risto's answer that 100% CPU utilization does not
cause losses, just delays. The migration is now over, and we'll see whether
there will be demand for further optimizations in the future, so I don't
want to burden you with rule analysis, review, and concrete optimization
advice.

But I am also interested in Risto's and Clayton's initiatives around rule
sharing. I think a further step towards building a user community with rule
sharing could be some kind of formalization: "best practice rule
templates", a "rulesets / templates catalogue", possibly with the ability
to choose a correlation that performs some specific task, fill in
parameters, and generate SEC rules automatically... I see wide potential
here to let SEC users build on verified setups instead of re-inventing
wheels, as discussed here:
https://sourceforge.net/p/simple-evcorr/mailman/message/36867012/. I may
suggest that my employer join such an initiative with our designs. Maybe we
could have a teleconference call someday, as would-be participants in such
a community, and brainstorm about the possibilities; publishing rulesets on
GitHub or SourceForge is great, but maybe we could do even more and build
an interactive catalogue beyond file sharing.
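
To make the template idea concrete, purely hypothetically, a catalogue
entry could be a parameterized skeleton like the following, with a small
generator substituting PATTERN and WINDOW before deployment:

# template "duplicate-suppress" (parameters: PATTERN, WINDOW)
type=SingleWithSuppress
ptype=RegExp
pattern=PATTERN
desc=duplicate suppress for PATTERN
action=write - $0
window=WINDOW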

Richard

On Wed, 25 Mar 2020 at 17:52, Risto Vaarandi <risto.vaara...@gmail.com> wrote:

> hi Richard,
>
> if CPU utilization has reached 100%, no rules or log file events would be
> skipped, but SEC would simply not be able to process events at their
> arrival rate and fall behind of events. If your events include timestamps,
> you would probably see events with past timestamps in dump file (among
> other data, SEC dump file reports the last processed event for each input
> file). As for debugging the reasons of high CPU utilization, I would
> recommend to have a look into rule match statistics, and make sure that
> rules with most matches appear in the top of their rule files (if
> possible). However, current versions of SEC do not report the CPU time
> spent on matching each pattern against events.
>
> Just out of curiosity -- how many rules are you currently having in your
> rule base, and are all these rules connected to each other? How many events
> are you currently receiving per second? Also, are all 50 input files
> containing the same event types (e.g., httpd events) that need to be
> processed by all rules? If this is not the case, and each input file
> contains different events which are processed by different rules, I would
> strongly recommend to consider a hierarchical setup for your rule files.
> The principles of hierarchical setup have been described in SEC official
> documentation, for example: http://simple-evcorr.github.io/man.html#lbBE.
> Also, there is a recent paper which provides a relevant example:
> https://ristov.github.io/publications/cogsima15-sec-web.pdf. In addition,
> you could also consider running several instances of SEC for your input
> files. For example, if some input files contain messages from a specific
> application which are processed by few specific rule files, a separate SEC
> process could be started for handling these messages with given rule files.
> In that way, it might be possible to divide the rule files and input files
> into several independent groups, and having a separate SEC process for each
> group allows to balance the load across several CPU's.
>
> hope this helps,
> risto
>
> Richard Ostrochovský (<richard.ostrochov...@gmail.com>) wrote on Wed, 25
> March 2020 at 17:07:
>
>> Hello friends,
>>
>> I have SEC monitoring over 50 log files with various correlations, and it
>> is consuming 100% of single CPU (luckily on 10-CPU machine, so not whole
>> system affected, as SEC is single-CPU application).
>>
>> This could mean, that SEC does not prosecute processing of all rules, and
>> I am curious, what are possible effects, if this means increasing delays
>> (first in, processing, first out), or skipping some lines from input files,
>> or anything other (?).
>>
>> And how to troubleshoot, finding bottlenecks. I can see quantities of log
>> messages per contexts or log files in sec.dump, this is some indicator. Are
>> there also other indicators? Is it possible, somehow, see also processing
>> times of patterns (per rules)?
>>
>> Thank you in advance.
>>
>> Richard
>>
>