Re: What's YARA Maximum Achievable Performance

PC Houseboy Tue, 08 Mar 2022 21:17:20 -0800

germanlancioni,

I am a N00b! I would appreciate it if a knowledgeable member could answer 
this query. Thank you.


On Tuesday, March 8, 2022 at 6:05:09 PM UTC-6 Gman wrote:

> Hi,
>
> I'm trying to get the maximum possible performance out of YARA, and to 
> that end I've been studying the code and algorithms to make sure everything 
> is accounted for:
>
> 1) My understanding is that the Aho-Corasick algorithm helps build the 
> Atoms tree to then efficiently apply just the rules that have Atoms 
> matching the scanned file. This is a great start because not all the rules 
> will be executed for each file.
> 2) I also believe there is a short-circuit logic capability so that once a 
> condition is not satisfied, the subsequent ones will not even try to 
> execute.
> 3) The -f option (as seen in the command line tool) will also run in 
> "fast" mode and report the first occurrence, without wasting time on 
> subsequent checks/rules.
> 4) Precompiling rules is a good practice as it saves time, given that the 
> scanner won't need to compile them before starting a scan.
> 5) Writing the rules in smart ways yields better performance, including: 
> using non-trivial hex sequences, replacing some strings with hex 
> representations, rewriting regexes to be more efficient, (sorting the 
> conditions?), etc.
> 6) You can run YARA in multi-threaded mode. There is a drastic difference 
> between running with 1 thread and running with 16 threads (most likely 
> because it can also overlap I/O-bound and CPU-bound operations).
>
> With these in mind, I tried to measure the performance of YARA for 
> scanning a given directory (e.g. containing 10k assorted files) using an 
> artificial set of 5k, 10k, 20k and even 40k rules. To my surprise, YARA is 
> quite fast up to 5k rules, and after that performance degrades drastically 
> (almost in a linear fashion). Note: I ran the benchmark multiple times to 
> eliminate the effect of hard disk I/O (so that everything was already in 
> cache/memory).
>
> - Am I missing any possible optimization trick or best-known method? 
> - Does YARA suffer from some performance limitation related to the number 
> of rules or the number of files?
> - Based on my basic understanding of the source code, modules such as 
> "pe" and "dotnet" actually parse the entire file (within the module 
> Load) regardless of which rules actually use these modules. Suppose a 
> rule only needs to check pe.is_pe: do we need to parse the entire 
> file just for that? Isn't parsing the imported/exported functions or 
> certificates slowing down the scan unnecessarily? (I'm not even sure this 
> is the reason for the performance degradation, just a thought.)
>
> Any tip or suggestion is much appreciated, and I'm happy to contribute back 
> if there is an opportunity to do so.
>
> Regards,
>
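Point 1 in the quoted message (using Aho-Corasick over atoms to decide which rules are worth evaluating) can be illustrated with a toy sketch. This is not YARA's implementation: the atoms, rule names and function names below are hypothetical, and the real scanner is written in C with many more details. It only shows the idea that a single pass over the data yields the set of candidate rules.

```python
from collections import deque

def build_automaton(atom_to_rules):
    """Build a small Aho-Corasick automaton from {atom_bytes: rule_names}."""
    goto = [{}]      # goto[state][byte] -> next state
    fail = [0]       # failure link for each state
    out = [set()]    # rule names whose atom ends at this state

    # Phase 1: insert every atom into a trie.
    for atom, rules in atom_to_rules.items():
        state = 0
        for byte in atom:
            if byte not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][byte] = len(goto) - 1
            state = goto[state][byte]
        out[state] |= set(rules)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for byte, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and byte not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(byte, 0)
            # Inherit matches that end at the longest proper suffix.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def candidate_rules(data, automaton):
    """Scan data once and return the rules that have at least one atom hit."""
    goto, fail, out = automaton
    state, hits = 0, set()
    for byte in data:
        while state and byte not in goto[state]:
            state = fail[state]
        state = goto[state].get(byte, 0)
        hits |= out[state]
    return hits

# Hypothetical atoms and rule names, for illustration only.
ATOMS = {
    b"MZ\x90": {"rule_pe_dropper"},
    b"evil": {"rule_evil_string"},
    b"vile": {"rule_vile"},
}
AUTOMATON = build_automaton(ATOMS)
```

With these made-up atoms, candidate_rules(b"xxevilexx", AUTOMATON) returns only the two rules whose atoms occur in the data, so the third rule's condition would never need to be evaluated for that file.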

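The benchmark setup described in the quoted message (precompiled rules, multiple threads, repeated runs so that the files are served from cache) could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the rules are assumed to have been compiled beforehand with yarac, and the only CLI options used are -C (compiled rules), -r (recursive), -p (threads) and -f (fast scan).

```python
import statistics
import time

def yara_command(compiled_rules, target_dir, threads=16, fast=False):
    """Build a yara CLI invocation that loads precompiled rules."""
    cmd = ["yara", "-C", "-r", "-p", str(threads)]
    if fast:
        cmd.append("-f")
    return cmd + [compiled_rules, target_dir]

def time_scan(run_scan, warmups=1, repeats=5):
    """Return the median wall-clock seconds of run_scan() after warm-up runs."""
    for _ in range(warmups):
        run_scan()  # warm the OS file cache; this timing is discarded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_scan()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

A run for one rule-set size might then look like this, where rules_5k.yarc and samples/ are placeholder names:

    time_scan(lambda: subprocess.run(yara_command("rules_5k.yarc", "samples/"),
                                     stdout=subprocess.DEVNULL, check=True))

Taking the median of several timed runs after a discarded warm-up makes the numbers for 5k, 10k, 20k and 40k rules easier to compare, since a single slow run does not skew the result.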
