In my experience, doing mostly in-memory YARA scanning, I use up to as many threads as the system has physical cores. I doubt logical HT/SMT cores will help much, given the following observations.
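A minimal sketch of that sizing rule in Python, not anything taken from YARA itself. The standard library only reports logical CPUs, so the sketch assumes two hardware threads per physical core; `physical_core_estimate`, `smt_ways`, `scan_buffer` and `scan_all` are hypothetical names, and `scan_buffer` is a plain byte search standing in for a real YARA scan call.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def physical_core_estimate(logical_cpus=None, smt_ways=2):
    """Estimate physical cores from the logical CPU count.

    Assumption: every core exposes `smt_ways` hardware threads. Use a
    platform tool for the real topology when accuracy matters.
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus // smt_ways)


def scan_buffer(buf, patterns):
    """Stand-in for a YARA scan: list the byte runs that occur in buf."""
    return [p for p in patterns if p in buf]


def scan_all(buffers, patterns, workers=None):
    """Scan in-memory buffers with one worker per estimated physical core."""
    workers = workers or physical_core_estimate()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input buffers.
        return list(pool.map(lambda b: scan_buffer(b, patterns), buffers))
```

The pure-Python stand-in holds the interpreter lock, so it only shows the structure of the pool. Any real scaling depends on the scanner you plug in.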
With mostly simple rules that only do hex (byte-run) scanning, with no wildcards and a single condition, extra threads give only about a 10% performance increase. This makes sense because the Aho-Corasick algorithm, though efficient, is memory-intensive and therefore memory-bound: the extra threads just max out the memory bandwidth. At that point there is little difference between using one core or 64, because memory bandwidth has hit its limit.

For more complex rule sets, I see about a 20% to 30% performance increase. This makes sense too, since some of the memory-bound work is traded for compute (the extra rule logic), and that work gets distributed to other cores that can compute simultaneously. And of course memory bandwidth still gets maxed out.

See: https://github.com/Neo23x0/YARA-Performance-Guidelines/

There is a problem with your logic. Caching doesn't eliminate drive I/O overhead. You'd have to preload all of your files into memory for that. And are you sure your OS cache is big enough to hold all of your files to begin with? There is still a lot going on in trips from user mode to kernel mode and back again: at a minimum, a memcpy() (even if done by DMA hardware) from the OS cache into your process, and so on.

A key thing here is that you say "cache". This means "memory". The OS's use of its file buffer memory (and CPU cache too) is competing with the intense Aho-Corasick scanning thread(s).

Hopefully someday a new memory technology will come along that matches the CPU in clock speed (in whole values, not just bits).

It reads like you are on the right track, though. You have to find where the real bottlenecks are (using system and process performance tools) and mitigate them as best you can.

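One rough way to see where throughput stops scaling is to time the same scan at several thread counts. This is a sketch only, in Python, with hypothetical helper names (`measure_throughput`, `scaling_table`); `scan` is whatever callable scans one in-memory buffer, and the sketch makes no claim about what numbers you will get.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(scan, buffers, workers):
    """Return bytes scanned per second with the given worker count."""
    total = sum(len(b) for b in buffers)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(scan, buffers))  # drain the iterator so all scans run
    elapsed = time.perf_counter() - start
    return total / elapsed if elapsed > 0 else float("inf")


def scaling_table(scan, buffers, worker_counts):
    """Map each worker count to its speedup over the first count listed."""
    rates = {w: measure_throughput(scan, buffers, w) for w in worker_counts}
    base = rates[worker_counts[0]]
    return {w: rates[w] / base for w in worker_counts}
```

If the speedup flattens well before the thread count reaches the physical core count, that points at something other than compute (memory bandwidth, I/O, kernel transitions), which is where the system and process performance tools come in.
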
On Tuesday, March 8, 2022 at 7:05:09 PM UTC-5 Gman wrote:
> Hi,
>
> I'm trying to get the maximum possible performance out of YARA, and for that goal I've been studying the code and algorithms to ensure everything is contemplated:
>
> 1) My understanding is that the Aho-Corasick algorithm helps build the Atoms tree to then efficiently apply just the rules that have Atoms matching the scanned file. This is a great start because not all the rules will be executed for each file.
> 2) I also believe there is a short-circuit logic capability so that once a condition is not satisfied, the subsequent ones will not even try to execute.
> 3) The -f option (as seen in the command line tool) will also run in "fast" mode and report the first occurrence, without wasting time on subsequent checks/rules.
> 4) Precompiling rules is a good practice as it saves time, given that the scanner won't need to compile them before starting a scan.
> 5) Writing the rules in smart ways yields better performance, including: using non-trivial hex sequences, replacing some strings with hex representations, re-writing regexes to be more efficient, (sorting the conditions?), etc.
> 6) You can run YARA in multi-thread mode. There is a drastic difference between running with 1 thread vs running with 16 threads (most likely as it also takes advantage of I/O vs CPU-bound operations).
>
> With these in mind, I tried to measure the performance of YARA for scanning a given directory (e.g. containing 10k assorted files) using an artificial set of 5k, 10k, 20k and even 40k rules. To my surprise, YARA is quite fast up to 5k rules, and after that performance degrades drastically (almost in a linear fashion). Note: I run the benchmark multiple times to eliminate the effect of hard disk I/O (hence, having everything in cache/memory).
>
> - Am I missing any possible optimization trick or Best-Known-Method?
> - Does YARA suffer from some limitation in terms of performance related to # of rules or # of files?
> - Based on my basic understanding of the source code, the modules such as "pe" and "dotnet" are actually parsing the entire file (within the module Load) regardless of the rules actually using these modules. Let's say a rule just needs to do the check pe.is_pe, do we need to parse the entire file just for that? Aren't the imported/exported functions or certificates parsing slowing down the scan unnecessarily? (I'm not even sure this is the reason for performance degradation, just a thought).
>
> Any tip or suggestion is much appreciated, and happy to contribute back if there is an opportunity to do so.
>
> Regards,
