Thanks. I believe I have tried all the suggestions already. Magic doesn't 
help: not only is it slow, it isn't supported on Windows, and the scan runs 
anyway. You can achieve a similar effect by simply checking for specific 
bytes using the uint notation (e.g. uint16(0) == 0x5A4D for the MZ header), 
but the problem remains: the moment you import a module such as "pe" or 
"dotnet", that module will parse the entire file anyway. This means that 
the following scenario is never optimized:

import "pe"
condition:
    filesize <= 0xFF0000 and pe.number_of_sections == 8

You may think that when the filesize (or magic, or single-byte) check is 
False, the condition will short-circuit and exit. While this is correct, 
all the time has already been wasted, so you get no performance benefit: 
the PE module parsed the entire file anyway!
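To make the pre-flight idea concrete, here is a minimal Python sketch of 
the kind of header check a front-end could do before invoking YARA at all. 
Offsets follow the PE/COFF layout; the function names are my own, not part 
of YARA or yara-python:

```python
import struct

# Hypothetical pre-flight check, done by a front-end before YARA runs.
# It reads only a few header fields -- the equivalent of YARA's
# uint16(0) == 0x5A4D -- instead of letting the "pe" module parse everything.

IMAGE_DOS_SIGNATURE = 0x5A4D  # "MZ", little-endian
IMAGE_FILE_DLL = 0x2000       # Characteristics flag in the COFF header

def looks_like_pe(data: bytes) -> bool:
    """Cheap check: DOS signature plus the PE signature at e_lfanew."""
    if len(data) < 0x40:
        return False
    if struct.unpack_from("<H", data, 0)[0] != IMAGE_DOS_SIGNATURE:
        return False
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]
    return len(data) >= e_lfanew + 4 and data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"

def is_dll(data: bytes) -> bool:
    """Read only the COFF Characteristics field, not the whole PE."""
    if not looks_like_pe(data):
        return False
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]
    if len(data) < e_lfanew + 24:
        return False
    characteristics = struct.unpack_from("<H", data, e_lfanew + 22)[0]
    return bool(characteristics & IMAGE_FILE_DLL)
```

A front-end could skip a DLL-only rule set entirely whenever is_dll() is 
False, which is exactly the early exit YARA cannot take today once "pe" is 
imported.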

I have also tried the custom-scanning front-end approach we discussed 
before, with a subset of pre-compiled rules for PNG, another for PE, and 
another for DOCX. To my surprise, this resulted in even slower scan times. 
This might be because the Aho-Corasick automaton is already as good as it 
gets when all the rules are combined: when you split them into three 
scanners/rule sets, you don't gain much and, on the contrary, you may 
introduce extra overhead in the filetype checking/conditional-scanning 
logic.
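For reference, the bucketing part of the front-end I benchmarked looked 
roughly like the following simplified Python sketch (the prefix table and 
names are illustrative, not the real code; each bucket was then scanned 
with its own pre-compiled rule set via yara-python):

```python
# Simplified sketch of the filetype bucketing I benchmarked. The real code
# read only the first few bytes of each file; here the headers are passed
# in as byte strings. The prefix table is illustrative, not exhaustive.

MAGIC_PREFIXES = {
    b"MZ": "pe",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "docx",  # DOCX is a ZIP container
}

def bucket_for(header: bytes) -> str:
    """Pick a rule-set bucket from a file's leading bytes."""
    for prefix, name in MAGIC_PREFIXES.items():
        if header.startswith(prefix):
            return name
    return "generic"

def bucket_files(samples: dict) -> dict:
    """Group {path: header_bytes} into {rule_set_name: [paths]}."""
    buckets = {}
    for path, header in samples.items():
        buckets.setdefault(bucket_for(header), []).append(path)
    return buckets
```

Even with this split, the single combined automaton was faster in my 
measurements, which is why I suspect the traversal cost doesn't shrink 
proportionally when you divide the rules.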

I would love to hear from the original YARA authors, to at least 
understand whether we are using YARA incorrectly or whether this is indeed 
a (very common?) scenario that is not covered/optimized for.

Thanks
On Friday, March 18, 2022 at 9:31:07 PM UTC-7 Joe Neighbor wrote:

> Also looking through the manual again, there is a "magic" module.
> "The Magic module allows you to identify the type of the file based on the 
> output of file, the standard Unix command."
> Then apparently you can write rule conditions like magic.mime_type() == 
> "application/pdf".
>
> But still, each rule in a set is going to be loaded and compiled 
> regardless, as you say.
> So for maximum performance and control, I still think you will want to 
> go with a custom scanning front end.
>
> And the sky's the limit here: you could have a central controlling app 
> (again maybe made with Python) and then send scan jobs off to machines 
> on your LAN or cloud instances, etc.
> Sure, a lot more development work for this kind of scheme, though.
>
>
> On Saturday, March 19, 2022 at 12:04:29 AM UTC-4 Joe Neighbor wrote:
>
>> Have you considered making a custom version of the console/front end?
>>
>> Maybe you can use "Rule tags" (YARA manual, page 27)?
>> You could have a "DLL" tag and/or an "EXE" tag; then you can at least 
>> separate them.
>> Although it says: "Those tags can be used later to filter YARA’s
>> output", which might mean they all get scanned anyhow and you then 
>> filter them in the YARA match callback, which is not going to help 
>> performance.
>>
>> I know, it really slows down depending on the rules and the size of the 
>> search.
>>
>> Here is what I think will be a good solution for you:
>> 1) Separate your rules into a series of files:
>> a set of rules for PNG files, one for DLLs, one for DOCX, etc.
>> Compiling is fast, but you could sort and batch your inputs per 
>> compiled rule set.
>>
>>
>> You might have to tool up a solution to help manage the rules if you 
>> have a lot of them.
>> You might have to put them into a DB, etc., so that if you change a 
>> rule that is duplicated across different sets, the change propagates 
>> automatically.
>> Completely automated, it could help you zero in on bad/slow rules too.
>> If you need a lot of control, you could do something like putting all 
>> your rules in a JSON format so the YARA rule files can be created on 
>> the fly.
>> At first, though, I'd just set them all up manually for development.
>>
>> 2) In your controlling app (it can still be a console/terminal app) you 
>> first enumerate all the files to scan and put them in buckets.
>> 3) Then you scan each set in bulk with the per-type rule set, compiling 
>> only once per set.
>> As you find matches, you either dump them to the screen or save them to 
>> JSON for later processing.
>>
>> There is: https://github.com/VirusTotal/yara-python
>> I think you can get this sort of setup up and running pretty fast with 
>> Python.
>> And since yara-python is just a binary wrapper around libyara, you are 
>> still going to get near C/C++ performance.
>> The Python part is mostly just file wrangling, unless you need 
>> additional features.
>>
>> On Friday, March 18, 2022 at 12:45:13 PM UTC-4 Gman wrote:
>>
>>> Thanks for your insights.
>>>
>>> I'm not too concerned about drive I/O overhead (I consider this part 
>>> the benchmark pre-stage, so I'm not measuring it). In my tests, I'm 
>>> simply loading all the files first (e.g. via file mapping), and hence 
>>> the only performance bottleneck I observe when profiling is the pure 
>>> CPU-bound YVM execution of the instructions.
>>>
>>> One of the things I have noticed is that, by design, YARA will process 
>>> each sample entirely regardless of whether any rules apply to it. I 
>>> have seen many GitHub threads explaining why this is a "long term 
>>> investment" (as you are likely to have at least one rule that will 
>>> need it anyway), but I don't fully agree with the justification. While 
>>> there are scenarios in which your set of files and your set of rules 
>>> will mix well together, in my experience this is more the exception. 
>>> Let me explain with an example:
>>>
>>> I have 5000 PE-EXE files to scan, and 300 PE rules for DLLs only. Now 
>>> imagine how the scan goes...
>>>
>>> a) The Aho-Corasick automaton will combine the atoms of the 300 
>>> rules. All the PE rules start with an "is_pe / is_dll" condition that 
>>> can quickly exit via short-circuit.
>>> b) YARA will open and fully parse the 5k EXE files, regardless of how 
>>> many "potential" rules apply.
>>> c) Since all my files are EXEs and NOT DLLs, no PE rule will actually 
>>> match. However, I have wasted a lot of time fully parsing 5k files 
>>> (because the PE module parses the entire file anyway).
>>>
>>> You can quickly see how performance drops significantly in such 
>>> scenarios. Now consider an unbalanced mix of files such as PE, PNG and 
>>> DOCX, and then a disproportionate number of rules (e.g. 1 PE rule and 
>>> 100 PNG rules but no DOCX rules). It is clear at this point that some 
>>> rules will "heavily penalize" the overall performance of YARA simply 
>>> because they unnecessarily "overload" the scan by running on "will 
>>> never match" filetypes (in this example, 100 string-heavy PNG rules 
>>> unnecessarily running over 20k DOCX files). Even if you use something 
>>> like the "magic" module to identify the type, YARA will still go over 
>>> the file anyway trying to find the atoms.
>>>
>>> Real-world scenarios are, in my experience, a lot more like the case 
>>> I'm describing: you have an assorted and unpredictable collection of 
>>> files to be scanned, and it's likely that your rules will only ever 
>>> matter for a small portion of them. If your collection of files is big 
>>> and "misaligned" with the filetypes your rules are designed for, then 
>>> you have a performance nightmare on your hands.
>>>
>>> So I guess I'm describing two different issues or improvement 
>>> opportunities (or asking if there is another way to address the described 
>>> scenarios):
>>>
>>> 1) It gives me the impression that YARA is missing an essential 
>>> "selective tree" mechanism. For example, imagine if you could have one 
>>> Aho-Corasick automaton for PE files but a different one for PNG files. 
>>> That would drastically reduce scan time by focusing on a "this makes 
>>> sense to be scanned" selector and skipping the rest. Traversing a set 
>>> of 5 atoms is not the same as traversing a set of 5000 atoms...
>>> 2) Alternatively or complementarily, it seems YARA would drastically 
>>> benefit from a "pre-flight" check, for example for PE files. What if 
>>> all my rules are for DLLs but I'm scanning 100 EXE files? Why should I 
>>> waste precious time fully parsing each PE file if I could simply check 
>>> the header/first few bytes to determine whether the file is a DLL?
>>>
>>> Thanks,
>>> On Friday, March 11, 2022 at 10:11:03 PM UTC-8 Joe Neighbor wrote:
>>>
>>>> In my experience doing mostly in-memory YARA scanning:
>>>> I use up to as many threads as the system has physical cores.
>>>> I doubt logical HT/SMT cores are going to help much (based on the 
>>>> following observations).
>>>>
>>>> With mostly simple rules that just do hex (byte-run) scanning with no 
>>>> wildcards and a single condition, extra threads only give about a 10% 
>>>> performance increase.
>>>> This makes sense because the intensive Aho-Corasick algorithm is 
>>>> memory constrained; the extra threads are just maxing out the memory 
>>>> bandwidth.
>>>> At this point there is little difference between using one core or 
>>>> 64, because memory bandwidth hits its limit.
>>>>
>>>> For more complex rule sets, I see about a 20% to 30% performance 
>>>> increase.
>>>> This makes sense too, as now some of the memory bandwidth gets traded 
>>>> for increased compute (from the extra rules logic).
>>>> Here the work gets distributed to other cores that can compute 
>>>> simultaneously, and of course memory bandwidth still gets maxed out.
>>>>
>>>> See:
>>>> https://github.com/Neo23x0/YARA-Performance-Guidelines/
>>>>
>>>> There is a problem with your logic.
>>>> Caching doesn't eliminate drive I/O overhead; you'd have to preload 
>>>> all of your files into memory for that.
>>>> And are you sure your OS cache is big enough to hold all of your 
>>>> files to begin with?
>>>> There is still a lot going on in trips from UM to KM and back again: 
>>>> at minimum, a memcpy() (even if done in DMA hardware) from the OS 
>>>> cache into your process.
>>>> A key thing here is that you say "cache". This means "memory". You 
>>>> are competing with the OS's use of its file buffer memory (and CPU 
>>>> cache too) vs the intense Aho-Corasick scanning thread(s).
>>>>
>>>> Hopefully someday a new memory technology will come along that 
>>>> matches the CPU in clock speed (in whole values, not just bits).
>>>>
>>>> Reads like you are on the right track though.
>>>> You have to find where the bottlenecks actually are (using system and 
>>>> process performance tools) and mitigate them as best you can.
>>>>
>>>> On Tuesday, March 8, 2022 at 7:05:09 PM UTC-5 Gman wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to get the maximum possible performance out of YARA, and 
>>>>> for that goal I've been studying the code and algorithms to ensure 
>>>>> everything is contemplated:
>>>>>
>>>>> 1) My understanding is that the Aho-Corasick algorithm helps build the 
>>>>> Atoms tree to then efficiently apply just the rules that have Atoms 
>>>>> matching the scanned file. This is a great start because not all the 
>>>>> rules 
>>>>> will be executed for each file.
>>>>> 2) I also believe there is a short-circuit logic capability so that 
>>>>> once a condition is not satisfied, the subsequent ones will not even try 
>>>>> to 
>>>>> execute.
>>>>> 3) The -f option (as seen in the command-line tool) will run in 
>>>>> "fast" mode and report the first occurrence, without wasting time on 
>>>>> subsequent checks/rules.
>>>>> 4) Precompiling rules is a good practice, as it saves time: the 
>>>>> scanner won't need to compile them before starting a scan.
>>>>> 5) Writing the rules in smart ways yields better performance, 
>>>>> including: using non-trivial hex sequences, replacing some strings 
>>>>> with hex representations, rewriting regexes to be more efficient, 
>>>>> (sorting the conditions?), etc.
>>>>> 6) You can run YARA in multi-thread mode. There is a drastic 
>>>>> difference between running with 1 thread vs running with 16 threads (most 
>>>>> likely as it also takes advantage of I/O vs CPU-bound operations).
>>>>>
>>>>> With these in mind, I tried to measure the performance of YARA for 
>>>>> scanning a given directory (e.g. containing 10k assorted files) 
>>>>> using an artificial set of 5k, 10k, 20k and even 40k rules. To my 
>>>>> surprise, YARA is quite fast up to 5k rules, and after that 
>>>>> performance degrades drastically (almost linearly). Note: I ran the 
>>>>> benchmark multiple times to eliminate the effect of hard-disk I/O 
>>>>> (hence, having everything in cache/memory).
>>>>>
>>>>> - Am I missing any possible optimization trick or best-known method?
>>>>> - Does YARA suffer from some limitation in performance related to 
>>>>> the # of rules or # of files?
>>>>> - Based on my basic understanding of the source code, modules such 
>>>>> as "pe" and "dotnet" actually parse the entire file (within the 
>>>>> module load) regardless of whether any rules actually use these 
>>>>> modules. Say a rule just needs to check pe.is_pe: do we need to 
>>>>> parse the entire file just for that? Isn't the parsing of 
>>>>> imported/exported functions or certificates slowing down the scan 
>>>>> unnecessarily? (I'm not even sure this is the reason for the 
>>>>> performance degradation, just a thought.)
>>>>>
>>>>> Any tip or suggestion is much appreciated, and happy to contribute 
>>>>> back if there is an opportunity to do so.
>>>>>
>>>>> Regards,
>>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"YARA" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/yara-project/328613b7-acd5-4abd-be86-164668c41463n%40googlegroups.com.
