Hello guys, thank you for the many tips. I have carefully read through the e-mails from Clayton, Risto, John, David, and Dusan, and since many of the reactions were in a similar spirit, I am replying to all of them together in this e-mail.
To my primary question: I am satisfied that log messages are not lost under heavy load, but only (gradually) delayed. Currently, the delay is not a problem in our case, as CPU consumption is not at 100% all the time, and SEC has the time "to catch the train at the next station". These are application logs and our message load is variable, but not of the same order as Dusan's, and I am still not monitoring this flow (discussed here: https://sourceforge.net/p/simple-evcorr/mailman/message/36910235/; a medium-priority task in my Jira).

To my secondary questions about finding bottlenecks and optimizations, I'll summarize the recommendations (with my comments), as some were mentioned by several of you:
- placing the most heavily matched rules first (problem: unpredictability; potential solution: re-ordering rules dynamically according to the latest statistics)
- a hierarchical / tree-like setup (Jump rules in Risto's examples, or a GoTo section ended with Suppress, as in our case)
- more instances of SEC (but with the risk of consuming not just 1 CPU, but as many CPUs as there are SEC instances running)
- preprocessing of log messages with some kind of classification / branching (outside SEC):
  - LogZilla (Clayton, http://demo.logzilla.net/help/event_correlation/intro_to_event_correlation)
  - rsyslog mmnormalize (David)
  - syslog-ng PatternDB (Dusan)
- regexp optimization
- reducing the number of contexts, with active deletion where suitable

As some of you are interested in the design, I would need to ask my employer for permission to publish it, as I am not the author of the rules design. I only created the automation of their generation (a topic discussed here: https://sourceforge.net/p/simple-evcorr/mailman/message/36867012/) and made minor optimizations. Although I consulted a lot on this mailing list, mainly with Risto, the original pre-consultation designs remained unchanged.
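To illustrate the hierarchical setup mentioned above, here is a minimal sketch of a Jump-based layout. The file names, the `app1-rules` set name, and the patterns are hypothetical placeholders; please check the SEC man page for the exact semantics of `cfset`, `joincfset`, and `procallin`:

```
# main.rules -- route events from app1 to a dedicated rule file set
type=Jump
ptype=RegExp
pattern=^app1:
desc=route app1 events
cfset=app1-rules

# app1.rules -- member of the 'app1-rules' set; with procallin=no this
# rule file only receives events passed in by Jump rules, not the whole
# input stream, so its rules are not tried against unrelated events
type=Options
procallin=no
joincfset=app1-rules
```

The point of the layout is that the expensive per-application rules are only evaluated for events that already passed the cheap routing pattern in main.rules.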
But for now, I will at least try to describe it at a high level. First, it is important to mention how it originated: we were migrating log monitoring from HPOM to an open-source monitoring tool, using SEC for duplicate event flow reduction before passing events to the monitoring agent, in the same manner as the HPOM agent with its built-in correlations was used. The design of the rules and correlations is therefore tributary to how it was implemented in HPOM. There were hundreds to thousands of pattern conditions in HPOM per host, and the structure of their sections was as follows:
- HPOM "suppress unmatched" conditions -> SEC: Suppress with NRegExp
- HPOM "suppress matched" conditions -> SEC: Suppress with RegExp
- HPOM message conditions (with configured time-based correlations) -> SEC: Single with RegExp and GoTo -> duplicate-suppress time-based correlations, each consisting of 3-4 subsequent rules (Single, PairWithWindow, SingleWithSuppress, depending on the duplicate-suppress correlation type)

We decided to automate the conversion of HPOM configurations to SEC rules, so there was not much space for conceptual improvements over the HPOM concepts (e.g. by doing deeper analysis of configurations and actual log traffic); we relied on the premise that those HPOM configurations were OK, tuned by years of development and operations, so the automated conversion was 1:1. The roughly 50 log files per host are of several types (according to message structure), but each file was monitored in HPOM independently of the others; therefore, after the 1:1 conversion, each file is also monitored independently in SEC. However, there is some perhaps ugly "configuration redundancy" for log files of the same type, as there was in HPOM. The static order of conditions from HPOM is also preserved in the generated SEC rules. I think the branching multi-level cascade structure with GoTo is OK.
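As a rough illustration of this converted structure (the MYAPP patterns, the 5-minute window, and the agent command below are made-up placeholders, not our actual rules):

```
# HPOM "suppress unmatched" -> drop anything that does not look like a MYAPP event
type=Suppress
ptype=NRegExp
pattern=^MYAPP:

# HPOM "suppress matched" -> drop matched but known-harmless events
type=Suppress
ptype=RegExp
pattern=^MYAPP: heartbeat

# HPOM message condition with duplicate suppression -> report the first
# ERROR occurrence and suppress duplicates of the same $1 for 5 minutes
type=SingleWithSuppress
ptype=RegExp
pattern=^MYAPP: ERROR (.+)
desc=MYAPP error: $1
action=pipe '$0' /opt/monitoring/bin/forward-event
window=300
```

In the real rule base the third step is a cascade of 3-4 rules selected per correlation type, as described above, rather than a single SingleWithSuppress.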
What is harder is that the majority of the log files have multi-line messages, and some logs are multi-file (discussed here: https://sourceforge.net/p/simple-evcorr/mailman/message/36861921/). Only one instance of SEC is running per host, so at most a single CPU can be consumed. No other preprocessing/classification is used; SEC is doing all the work. Anyway, I am happy with Risto's answer that 100% CPU utilization is not causing losses, just delays. The migration is now over, and we'll see whether there is demand for further optimizations in the future, so I don't want to burden you guys with rules analysis and review, and concrete optimization advice.

But I am also interested in Risto's and Clayton's initiatives about rules sharing. I think that a further step towards building a user community around rules sharing could be some kind of formalization: "best practice rule templates", a "rulesets / templates catalogue", possibly with the possibility to choose a correlation doing some specific task, fill in parameters, and generate the SEC rules automatically... I see wide potential here to enable SEC users not to re-invent wheels, but to build on verified setups, as discussed here: https://sourceforge.net/p/simple-evcorr/mailman/message/36867012/. I may suggest that my employer join such an initiative with our designs. Maybe we could have a teleconference call someday, as wanna-be participants of such a community, and brainstorm about the possibilities: publishing rulesets on GitHub or SourceForge is great, but maybe we could do yet more and build an interactive catalogue beyond file sharing.

Richard

On Wed, 25 March 2020 at 17:52, Risto Vaarandi <risto.vaara...@gmail.com> wrote:

> hi Richard,
>
> if CPU utilization has reached 100%, no rules or log file events would be
> skipped, but SEC would simply not be able to process events at their
> arrival rate and fall behind of events.
> If your events include timestamps, you would probably see events with past
> timestamps in dump file (among other data, SEC dump file reports the last
> processed event for each input file). As for debugging the reasons of high
> CPU utilization, I would recommend to have a look into rule match
> statistics, and make sure that rules with most matches appear in the top of
> their rule files (if possible). However, current versions of SEC do not
> report the CPU time spent on matching each pattern against events.
>
> Just out of curiosity -- how many rules are you currently having in your
> rule base, and are all these rules connected to each other? How many events
> are you currently receiving per second? Also, are all 50 input files
> containing the same event types (e.g., httpd events) that need to be
> processed by all rules? If this is not the case, and each input file
> contains different events which are processed by different rules, I would
> strongly recommend to consider a hierarchical setup for your rule files.
> The principles of hierarchical setup have been described in SEC official
> documentation, for example: http://simple-evcorr.github.io/man.html#lbBE.
> Also, there is a recent paper which provides a relevant example:
> https://ristov.github.io/publications/cogsima15-sec-web.pdf. In addition,
> you could also consider running several instances of SEC for your input
> files. For example, if some input files contain messages from a specific
> application which are processed by few specific rule files, a separate SEC
> process could be started for handling these messages with given rule files.
> In that way, it might be possible to divide the rule files and input files
> into several independent groups, and having a separate SEC process for each
> group allows to balance the load across several CPU's.
>
> hope this helps,
> risto
>
> On Wed, 25 March 2020 at 17:07, Richard Ostrochovský
> (<richard.ostrochov...@gmail.com>) wrote:
>
>> Hello friends,
>>
>> I have SEC monitoring over 50 log files with various correlations, and it
>> is consuming 100% of a single CPU (luckily on a 10-CPU machine, so the
>> whole system is not affected, as SEC is a single-CPU application).
>>
>> This could mean that SEC cannot keep up with processing of all rules, and
>> I am curious what the possible effects are: increasing delays
>> (first in, processing, first out), skipping some lines from input files,
>> or anything else (?).
>>
>> And how to troubleshoot, finding bottlenecks? I can see quantities of log
>> messages per context or log file in sec.dump, which is some indicator.
>> Are there also other indicators? Is it possible, somehow, to also see the
>> processing times of patterns (per rule)?
>>
>> Thank you in advance.
>>
>> Richard
>> _______________________________________________
>> Simple-evcorr-users mailing list
>> Simple-evcorr-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/simple-evcorr-users
>>
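Regarding Risto's suggestion to look into rule match statistics: a small helper like the one below could rank rules by match count from a dump file, as a starting point for reordering. Note that the exact format of the per-rule statistics lines in sec.dump varies between SEC versions, so the line format assumed in the regexp (and the synthetic sample input) must be adapted to the actual dump output:

```python
import re
from collections import Counter

# Assumed per-rule statistics line format (adapt to your SEC version's dump):
#   Rule 2 at line 20 matched 9000 events
STAT_RE = re.compile(r"Rule (\d+) at line (\d+) matched (\d+) events")

def top_rules(dump_text, n=5):
    """Return the n most frequently matched rules from a SEC dump excerpt."""
    counts = Counter()
    for rule, line, matched in STAT_RE.findall(dump_text):
        counts[f"rule {rule} (line {line})"] = int(matched)
    return counts.most_common(n)

if __name__ == "__main__":
    sample = (
        "Rule 1 at line 10 matched 42 events\n"
        "Rule 2 at line 20 matched 9000 events\n"
        "Rule 3 at line 30 matched 7 events\n"
    )
    # rules with the most matches come first: candidates for moving up
    # in their rule file (where rule semantics permit reordering)
    for name, count in top_rules(sample, 2):
        print(name, count)
```

Reordering based on such counts has to respect rule dependencies (Suppress ordering, GoTo targets), so the output is only a hint, not something to apply blindly.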
_______________________________________________ Simple-evcorr-users mailing list Simple-evcorr-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/simple-evcorr-users