Re: [Simple-evcorr-users] SEC CPU utilization
> I mentioned as an offhand remark to Risto a profile mode that would
> count not only every rule that led to an action, but every time the
> rule executed its regular expression. Having some sort of profile mode
> (not to be run in production) would help identify these sorts of
> issues.

I have a question about how to potentially implement this feature -- would it be useful to calculate the total CPU time spent on processing the rule in profile mode? That would include not only the CPU time for matching the pattern against the event, but also the CPU time for evaluating the context expression and executing the action list. Total CPU time would be more useful if one is dealing with frequently matching rules (say, ones which match 80-90% of incoming events) and wants to identify the most expensive rule. Also, some pattern types like Cached and TValue are known to have a low computational cost, but they are sometimes used with more complex context expressions (e.g., involving the execution of custom Perl code). Total CPU time would make it possible to identify such bottlenecks.

I've added some alpha-level code into SEC, and here is a small example case for a rule with a badly written regular expression:

type=single
ptype=regexp
pattern=^([0-9]+,?)*[[:punct:]]$
desc=test pattern
action=write - test pattern has matched

The regular expression expects an event consisting of numbers that are optionally separated by commas, where the sequence of numbers must be followed by a single punctuation mark. However, when one provides a very long string consisting of digits only, without a terminating punctuation mark, the regular expression engine will attempt to divide the sequence into all possible sub-sequences during the backtracking phase. Here is a fragment from the SEC dump file that highlights this issue:

Performance statistics:
Run time: 142 seconds
User time: 63.92 seconds
System time: 0.05 seconds
Child user time: 0 seconds
Child system time: 0 seconds
Processed input lines: 1

Rule usage statistics:
Statistics for the rules from test3.sec (loaded at Fri Apr 3 18:23:03 2020)
Rule 1 line 1 matched 0 events (test pattern) CPU time: 63.71

As one can see, SEC has processed just one event which did not match its only rule, but the unsuccessful matching process took more than a minute. Of course, it is possible to disable the unnecessary backtracking by using a possessive quantifier:

^([0-9]+,?)*+[[:punct:]]$

Finding the CPU time spent on processing a rule has one caveat -- processing of Jump rules is a recursive process that might involve further jumps into other rulesets, and the CPU time of a Jump rule would indicate the cost of the entire recursive process. There is no easy way to subtract the CPU time of child rules, since they might be invoked by other Jump rules that have matched different events.

Does the idea outlined in this e-mail sound reasonable? I have also looked into other implementation ideas, but the current one looks most useful and involves the least amount of overhead in the code base.

kind regards,
risto
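For reference, the accounting idea can be sketched in a few lines of Perl. This is illustrative code only, not SEC's actual implementation; the helper name, rule table, and test string are made up:

#!/usr/bin/perl
# Sketch: accumulate per-rule CPU time with Perl's core times() function,
# which returns user and system CPU seconds for the current process.
use strict;
use warnings;

my %rule_cpu;    # rule id => accumulated CPU seconds

sub apply_rule {
    my ($rule_id, $regex, $event) = @_;
    my ($u0, $s0) = times();
    my $matched = ($event =~ $regex);    # pattern matching phase
    # ... evaluating the context expression and executing the action
    # list would be timed here as well ...
    my ($u1, $s1) = times();
    $rule_cpu{$rule_id} += ($u1 - $u0) + ($s1 - $s0);
    return $matched;
}

# The badly written pattern from the example: a digit-only string without
# a terminating punctuation mark forces the engine to try every possible
# split of the digit run, so CPU time grows exponentially with length.
my $event = '1' x 25;
apply_rule('test3.sec rule 1', qr/^([0-9]+,?)*[[:punct:]]$/, $event);
printf "%s: CPU time %.2f\n", $_, $rule_cpu{$_} for keys %rule_cpu;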
Re: [Simple-evcorr-users] SEC CPU utilization
hi Richard,

> We were doing a log monitoring migration from HPOM to an open-source
> monitoring tool, using SEC for duplicate event flow reduction before
> passing events to the monitoring agent, in the manner the HPOM agent
> with built-in correlations was used, so the design of rules and
> correlations is tributary to how it was implemented in HPOM. There were
> hundreds to thousands of pattern conditions in HPOM per host, and the
> structure of their sections was as follows:
>
> - HPOM: suppress unmatched conditions -> SEC: Suppress with NRegExp
> - HPOM: suppress matched conditions -> SEC: Suppress with RegExp
> - HPOM: message conditions (with configured time-based correlations)
>   -> SEC: Single with RegExp and GoTo -> duplicate-suppress time-based
>   correlations, each consisting of 3-4 subsequent rules (Single,
>   PairWithWindow, SingleWithSuppress, depending on the duplicate-suppress
>   correlation type)
>
> We decided to automate the conversion of HPOM configurations to SEC
> rules, so there was not much space for conceptual improvements over the
> HPOM concepts (e.g. by doing deeper analysis of configurations and
> actual log traffic), and we relied on the premise that those HPOM
> configurations are OK and tuned by years of development and operations,
> so the automated conversion was 1:1.
>
> About 50 log files per host are of several types (according to message
> structure), but each file was monitored in HPOM independently of the
> others; therefore, after the 1:1 conversion, each file is monitored
> independently in SEC as well. However, there is some perhaps ugly
> "configuration redundancy" for log files of the same type, as there was
> in HPOM. The static order of conditions in HPOM is preserved in the
> generated SEC rules.

Since I have used HPOM in the past, perhaps I can offer some comments and advice here. As far as I remember, the HPOM agent does not support hierarchical arrangement of log file monitoring policies and rules. You mentioned that the existing HPOM configuration was converted on a 1:1 basis -- does that mean that SEC is configured to use a number of rule files, where each rule file corresponds to some HPOM policy, and all rules are applied against all input messages from 50 input files?

Since such use of rules is not very efficient, perhaps you could introduce a hierarchical rule arrangement in the following way. First, assign each rule file to a configuration file set with an Options rule. For example, if you have an HPOM policy called "sshd", you can use the following Options rule at the beginning of the corresponding rule file:

type=Options
joincfset=sshd
procallin=no

Secondly, run SEC with the --intcontexts flag, which will set an internal context for each file (e.g., the file /var/log/secure will have the context _FILE_EVENT_/var/log/secure by default, but you can also set a custom context name on the command line). Finally, create a special rule file (e.g., main.sec) which will route messages to the relevant rule files. For example, suppose that the HPOM policies "sshd" and "sudo" have been used for /var/log/secure, and there are two SEC rule files that contain Options rules as described above.
For handling messages from /var/log/secure, you can enter the following Jump rule into the main.sec rule file:

type=Jump
context=[ _FILE_EVENT_/var/log/secure ]
ptype=TValue
pattern=True
desc=direct /var/log/secure events to relevant rule files
cfset=sshd sudo

Such a configuration ensures that messages from /var/log/secure are matched against relevant rules only, and rule files not associated with /var/log/secure will not be applied. Also, only the Jump rules from main.sec are applied against the entire event stream, while rules from other files receive their input from the Jump rules in main.sec and are thus applied only to relevant messages. Perhaps you have already applied this technique, but if not, I would definitely recommend using this optimization, since it is very likely to reduce the CPU load.

hope this helps,
risto
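Putting the pieces from this message together, a minimal end-to-end layout might look as follows. The file names, the log file set, and the sshd pattern are illustrative, not taken from an actual configuration:

# Invocation: enable internal file contexts for all inputs
#   sec --conf=/etc/sec/main.sec --conf=/etc/sec/sshd.sec \
#       --conf=/etc/sec/sudo.sec --input=/var/log/secure --intcontexts

# main.sec -- the only file whose rules see the full event stream
type=Jump
context=[ _FILE_EVENT_/var/log/secure ]
ptype=TValue
pattern=True
desc=direct /var/log/secure events to relevant rule files
cfset=sshd sudo

# sshd.sec -- receives events only through the Jump rule above
type=Options
joincfset=sshd
procallin=no

type=Single
ptype=RegExp
pattern=sshd\[\d+\]: Failed password for (?:invalid user )?(\S+)
desc=ssh login failure for $1
action=write - ssh login failure for user $1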
Re: [Simple-evcorr-users] SEC CPU utilization
re: multi-line rules, my thought is to convert the newlines to some escape sequence and then treat the result as a single line.

I tend to not want to run SEC on each system, as it can impact that system (both CPU utilization and disk I/O). Instead I make SEC part of the log processing infrastructure, ideally on a dedicated system (or set of systems), so having multiple instances, each eating a core, is a feature not a bug ;-)

David Lang

On Thu, 2 Apr 2020, Richard Ostrochovský wrote:

> Hello guys, thank you for the many tips.
> [...]
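The newline-escaping idea could be implemented as a small stdin filter placed in front of SEC. A rough sketch in Perl, under the made-up assumption that each new message starts with an ISO timestamp (the message-boundary test would need adjusting to the real log format):

#!/usr/bin/perl
# Sketch: fold multi-line log messages into single lines by joining the
# lines of one message with a literal "\n" escape sequence.
use strict;
use warnings;

my @buf;
while (my $line = <STDIN>) {
    chomp $line;
    # a line starting with a timestamp begins a new message
    if ($line =~ /^\d{4}-\d{2}-\d{2}/ && @buf) {
        print join('\n', @buf), "\n";    # emit the previous message
        @buf = ();
    }
    push @buf, $line;
}
print join('\n', @buf), "\n" if @buf;    # emit the last buffered message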
Re: [Simple-evcorr-users] SEC CPU utilization
Hello guys,

thank you for the many tips. I have carefully read through the e-mails from Clayton, Risto, John, David, and Dusan, and I am reacting to all of them together in this e-mail, as several reactions were in a similar spirit.

To my primary question: I am satisfied that log messages are not lost under "heavy load", but just (gradually) delayed (currently, delay is not a problem in our case, as CPU consumption is not at 100% all the time, and SEC has the time "to catch the train in the next station"). These are application logs and our message load is variable, but not of the same order as Dusan's, and I am still not monitoring this flow (discussed here: https://sourceforge.net/p/simple-evcorr/mailman/message/36910235/, a medium-priority task in my Jira).

To my secondary questions about finding bottlenecks and optimizations, I'll summarize the recommendations (with my comments), as some were mentioned by several of you:

- placing the most "heavy-load" rules first (problem: unpredictability; potential solution: re-ordering rules dynamically, according to the latest stats)
- hierarchical / tree-like setup (Jump rule in Risto's examples, or a GoTo section ended with Suppress, as in our case)
- more instances of SEC (but there is also the risk of consuming not only 1 CPU, but as many CPUs as there are SEC instances running)
- preprocessing of log messages with some kind of classification / branching (outside of SEC):
  - LogZilla (Clayton, http://demo.logzilla.net/help/event_correlation/intro_to_event_correlation )
  - rsyslog mmnormalize (David)
  - syslog-ng PatternDB (Dusan)
- RegExp optimization
- reducing the count of contexts, with active deletion where suitable

As some of you are interested in the design, I would need to ask for my employer's permission to publish it, as I am not the author of the rules design; I just created the automation of their generation (a topic discussed here: https://sourceforge.net/p/simple-evcorr/mailman/message/36867012/) and minor optimizations. Although I consulted a lot on this mailing list, mainly with Risto, the original pre-consultation design remained unchanged. For now, I will at least try to describe it at a high level, but first it is important to mention how it originated:

We were doing a log monitoring migration from HPOM to an open-source monitoring tool, using SEC for duplicate event flow reduction before passing events to the monitoring agent, in the manner the HPOM agent with built-in correlations was used, so the design of rules and correlations is tributary to how it was implemented in HPOM. There were hundreds to thousands of pattern conditions in HPOM per host, and the structure of their sections was as follows:

- HPOM: suppress unmatched conditions -> SEC: Suppress with NRegExp
- HPOM: suppress matched conditions -> SEC: Suppress with RegExp
- HPOM: message conditions (with configured time-based correlations) -> SEC: Single with RegExp and GoTo -> duplicate-suppress time-based correlations, each consisting of 3-4 subsequent rules (Single, PairWithWindow, SingleWithSuppress, depending on the duplicate-suppress correlation type)

We decided to automate the conversion of HPOM configurations to SEC rules, so there was not much space for conceptual improvements over the HPOM concepts (e.g. by doing deeper analysis of configurations and actual log traffic), and we relied on the premise that those HPOM configurations are OK and tuned by years of development and operations, so the automated conversion was 1:1.
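For illustration, one converted section of that shape might look roughly like this (the patterns and descriptions are hypothetical, not the actual generated rules):

# HPOM "suppress unmatched conditions" -> SEC Suppress with NRegExp:
# drop every line that matches none of the monitored message patterns
type=Suppress
ptype=NRegExp
pattern=ERROR|WARNING|FATAL
desc=suppress unmatched conditions

# HPOM "suppress matched conditions" -> SEC Suppress with RegExp:
# drop known noise before the message conditions are tried
type=Suppress
ptype=RegExp
pattern=WARNING: disk check skipped
desc=suppress matched conditions

# HPOM "message condition" with a time-based duplicate-suppress
# correlation (here SingleWithSuppress: report once per 60 seconds)
type=SingleWithSuppress
ptype=RegExp
pattern=ERROR: service (\S+) failed
desc=service failure $1
action=write - service failure $1
window=60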
About 50 log files per host are of several types (according to message structure), but each file was monitored in HPOM independently of the others; therefore, after the 1:1 conversion, each file is monitored independently in SEC as well. However, there is some perhaps ugly "configuration redundancy" for log files of the same type, as there was in HPOM. The static order of conditions in HPOM is preserved in the generated SEC rules.

I think the branching multi-level cascade structure with GoTo is OK. What is harder is that the majority of the log files have multi-line messages, and some logs are multi-file (discussed here: https://sourceforge.net/p/simple-evcorr/mailman/message/36861921/). Only one instance of SEC is running per host, so at most a single CPU can be consumed. No other preprocessing/classification is used; SEC is doing all the work.

Anyway, I am happy with Risto's answer that 100% CPU utilization is not causing losses, just delays. The migration is over now, and we'll see if there will be demand for further optimizations in the future, so I don't want to burden you guys with rules analysis and review, or concrete optimization advice. But I am also interested in Risto's and Clayton's initiatives about rule sharing. I think that a further step toward building a user community with rule sharing could be some kind of formalization, "best practice rule templates", a "rulesets / templates catalogue", possibly with the possibility to choose
Re: [Simple-evcorr-users] SEC CPU utilization
I set up SEC using the omprog option in rsyslog, so rsyslog started SEC and fed the logs in via stdin rather than writing to disk.

On Thu, 26 Mar 2020, Dusan Sovic wrote:

> I am using a similar approach to the one David mentions.
> [...]
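In rsyslog's RainerScript configuration, that hookup might look roughly as follows. This is a sketch, not a tested setup: the paths are made up, the SEC invocation assumes the --input=- form for reading standard input, and passing arguments inside binary= requires a reasonably recent rsyslog:

# Load the omprog output module and let rsyslog start SEC as a child
# process; matching messages are written to SEC's standard input.
module(load="omprog")
action(type="omprog"
       binary="/usr/bin/sec --conf=/etc/sec/rules/*.sec --input=-")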
Re: [Simple-evcorr-users] SEC CPU utilization
I am using a similar approach to the one David mentions. I process syslog messages from network devices (from various vendors like Arista, Cisco, Juniper, etc.). We have about 7k patterns defined for matching. As SEC cannot handle such a regexp volume, I use syslog-ng PatternDB for pattern matching, classification, and tagging. Then all matched logs are written in a normalized format into a file monitored by SEC, and a pre-defined correlation schema is applied. This helps me reduce the flow into SEC to 30 msg/s, while the incoming syslog message flow is about ~20k msg/s (from about 20k unique network devices). My SEC instance is eating ~20-30% of a CPU (with about 500 active event correlation operations and ~4k active contexts on average).

From: David Lang
Sent: Thursday, 26 March 2020 1:24
To: Richard Ostrochovský
Cc: simple-evcorr-users@lists.sourceforge.net
Subject: Re: [Simple-evcorr-users] SEC CPU utilization

Much of the time you can trivially split your rules and then run multiple copies of SEC, each processing a subset of the rules on the logs that will match them.

Rsyslog has a very efficient parsing capability (mmnormalize); you can use it to just classify the logs, or you can go a step further and have it extract the relevant fields and pass them to SEC in ways that make it cheaper to parse and process.

Where you do have rules that correlate across different types of logs, you may be able to just combine those logs into one ruleset, or you can have parallel instances output messages to say that their part of the correlation has tested true, and then have another instance that processes these partial-correlation messages and decides if the combined correlation is matched.

David Lang
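The partial-correlation handoff David describes could be sketched with SEC rules like these (the event names, patterns, and fifo path are made up):

# Instance A: report that its half of the correlation has tested true
# by writing a synthetic event to a fifo read by instance B
type=Single
ptype=RegExp
pattern=ERROR: db connection lost on (\S+)
desc=database error on $1
action=write /var/run/sec-partials.fifo PARTIAL_DB_DOWN $1

# Instance B (started with --input=/var/run/sec-partials.fifo):
# if the matching application-side partial arrives within 60 seconds,
# action2 reports the combined correlation; otherwise action fires
type=PairWithWindow
ptype=RegExp
pattern=PARTIAL_DB_DOWN (\S+)
desc=db failure on $1 without application impact
action=write - db failure on $1, no application impact seen
ptype2=RegExp
pattern2=PARTIAL_APP_DOWN $1
desc2=correlated db and application failure on $1
action2=write - correlated db and application failure on $1
window=60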
Re: [Simple-evcorr-users] SEC CPU utilization
Richard Ostrochovský writes:

> Hello friends,
>
> I have SEC monitoring over 50 log files with various correlations, and
> it is consuming 100% of a single CPU (luckily on a 10-CPU machine, so
> the whole system is not affected, as SEC is a single-CPU application).
>
> This could mean that SEC does not keep up with processing of all rules,
> and I am curious what the possible effects are: whether this means
> increasing delays (first in, processing, first out), or skipping some
> lines from input files, or anything else (?).
>
> And how to troubleshoot, finding bottlenecks? I can see quantities of
> log messages per context or log file in sec.dump, which is some
> indicator. Are there also other indicators? Is it possible, somehow, to
> also see the processing times of patterns (per rule)?

I have found two operations that are large users of CPU in SEC.

1) If you have a lot of contexts, context lookups can eat CPU. I solved a problem where SEC used 50% of a CPU by actively deleting contexts as soon as I could rather than letting them time out. You can use the USR1 signal (IIRC, see the manpage) to get a dump of all contexts. I don't remember how many I had, but it was a lot. Once I started actively deleting contexts, the CPU dropped to 1-2% with ~100-500 events per second through the system. See the sketch after this message for an example of the active-deletion idea.

2) Inefficient regexps that require a lot of backtracking can kill your CPU. I don't know how you would find these.

You don't say how your rulesets are constructed, but if you have, say, 100 rules and the most often hit rules are at positions 99 and 100, you are going to waste a lot of time applying regexps (even if they are efficient). Again, a dump file will tell you which rules are being hit. Move the most used rules so they are applied earlier in your rulesets.

Also, you can create rules so they are applied in a tree-like structure. Look at the manpage for rulesets. This structure allows you to have a layer of filter rules (e.g. look for ssh in the log line) and only apply the rulesets for ssh to the log event. This can cut down dramatically on the number of regexps that have to be applied and discarded. Also, you can parse the event in one rule and then use more efficient means to trigger rules (matching on a subset of the original event).

I mentioned as an offhand remark to Risto a profile mode that would count not only every rule that led to an action, but every time the rule executed its regular expression. Having some sort of profile mode (not to be run in production) would help identify these sorts of issues.

Good luck, and when you figure it out, let us know how you found and fixed it.

--
-- rouilj
John Rouillard
===
My employers don't acknowledge my existence much less my opinions.
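The active-deletion idea from point 1 can be illustrated with a pair of SEC rules (the sshd patterns, the lifetime, and the context name are hypothetical):

# Create a context when a session opens, with a long fallback lifetime
type=Single
ptype=RegExp
pattern=sshd\[\d+\]: session opened for user (\S+)
desc=track ssh session of $1
action=create SSH_SESSION_$1 86400

# Delete the context as soon as the session closes, instead of leaving
# thousands of stale contexts around until they time out
type=Single
ptype=RegExp
pattern=sshd\[\d+\]: session closed for user (\S+)
context=SSH_SESSION_$1
desc=drop ssh session context of $1
action=delete SSH_SESSION_$1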