germanlancioni, I am a N00b! I would appreciate it if a knowledgeable member could answer this query. Thank you.
On Tuesday, March 8, 2022 at 6:05:09 PM UTC-6 Gman wrote:
> Hi,
>
> I'm trying to get the maximum possible performance out of YARA, and to that end I've been studying the code and algorithms to make sure I have considered everything:
>
> 1) My understanding is that the Aho-Corasick algorithm helps build the Atoms tree, which is then used to efficiently apply only the rules that have Atoms matching the scanned file. This is a great start because not all the rules will be executed for each file.
> 2) I also believe there is a short-circuit logic capability, so that once a condition is not satisfied, the subsequent ones will not even try to execute.
> 3) The -f option (as seen in the command line tool) runs in "fast" mode and reports the first occurrence, without wasting time on subsequent checks/rules.
> 4) Precompiling rules is a good practice as it saves time, given that the scanner won't need to compile them before starting a scan.
> 5) Writing the rules in smart ways yields better performance, including: using non-trivial hex sequences, replacing some strings with hex representations, rewriting regexes to be more efficient, (sorting the conditions?), etc.
> 6) You can run YARA in multi-threaded mode. There is a drastic difference between running with 1 thread and running with 16 threads (most likely because it also takes advantage of overlapping I/O-bound and CPU-bound operations).
>
> With these in mind, I tried to measure the performance of YARA when scanning a given directory (e.g. containing 10k assorted files) using artificial sets of 5k, 10k, 20k and even 40k rules. To my surprise, YARA is quite fast up to 5k rules, and after that performance degrades drastically (almost in a linear fashion). Note: I ran the benchmark multiple times to eliminate the effect of hard disk I/O (hence, having everything in cache/memory).
>
> - Am I missing any possible optimization trick or Best-Known-Method?
> - Does YARA suffer from some limitation in terms of performance related to the number of rules or the number of files?
> - Based on my basic understanding of the source code, modules such as "pe" and "dotnet" actually parse the entire file (within the module Load) regardless of which rules actually use these modules. Let's say a rule just needs to check pe.is_pe: do we need to parse the entire file just for that? Isn't parsing the imported/exported functions or certificates slowing down the scan unnecessarily? (I'm not even sure this is the reason for the performance degradation, just a thought.)
>
> Any tip or suggestion is much appreciated, and I'm happy to contribute back if there is an opportunity to do so.
>
> Regards,
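
For anyone following point 1 of the quoted message, here is a minimal Python sketch of the general idea of Aho-Corasick prefiltering: scan the data once, collect every rule that owns at least one matching atom, and only evaluate those candidates. This is an illustration under stated assumptions, not YARA's actual implementation; the names `build_automaton` and `candidate_rules`, the atom-to-rule mapping, and the rule names are all hypothetical.

```python
from collections import deque

def build_automaton(atoms):
    """Build a simple Aho-Corasick automaton.

    atoms: dict mapping atom bytes -> set of rule names (hypothetical layout).
    Returns (goto, fail, out) lists indexed by state number.
    """
    goto = [{}]     # goto[state][byte] -> next state
    fail = [0]      # failure link for each state
    out = [set()]   # rule names whose atom ends at this state

    # Phase 1: insert every atom into a trie.
    for atom, rules in atoms.items():
        state = 0
        for b in atom:
            nxt = goto[state].get(b)
            if nxt is None:
                nxt = len(goto)
                goto[state][b] = nxt
                goto.append({})
                fail.append(0)
                out.append(set())
            state = nxt
        out[state] |= set(rules)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        for b, nxt in goto[state].items():
            f = fail[state]
            while f and b not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(b, 0)
            # Inherit matches that end at the failure target (shorter suffixes).
            out[nxt] |= out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out

def candidate_rules(automaton, data):
    """Scan data once and return the rules with at least one atom hit."""
    goto, fail, out = automaton
    state = 0
    hits = set()
    for b in data:
        while state and b not in goto[state]:
            state = fail[state]
        state = goto[state].get(b, 0)
        hits |= out[state]
    return hits

auto = build_automaton({
    b"evil": {"rule_a", "rule_b"},
    b"vile": {"rule_c"},
    b"MZ\x90": {"pe_rule"},
})
print(sorted(candidate_rules(auto, b"xxevilexx")))  # ['rule_a', 'rule_b', 'rule_c']
```

In a scheme like this the single pass over the data is cheap, but every candidate still has to be verified afterwards. If many rules share short or common atoms, most of them become candidates for most files, which is one plausible (not confirmed here) reason for scan time to grow roughly linearly with the number of rules.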
