You're welcome.
As a final note, speaking from experience here: the rules themselves 
can make a huge difference.

A real world example:
In my setup I have about 1200 mostly simple rules.
By simple I mean they are direct binary search conditions with no wildcards 
(think a memcmp() kind of search).

Example:
rule CRC_8_TABLE
{
        meta:
                description = "CRC-8 table"
        strings:
                $c0 = { 00 57 AE F9 0B 5C A5 F2 16 41 B8 EF 1D 4A B3 E4
                        2C 7B 82 D5 27 70 89 DE 3A 6D 94 C3 31 66 9F C8
                        58 0F F6 A1 53 04 FD AA 4E 19 E0 B7 45 12 EB BC
                        74 23 DA 8D 7F 28 D1 86 62 35 CC 9B 69 3E C7 90
                        B0 E7 1E 49 BB EC 15 42 A6 F1 08 5F AD FA 03 54
                        9C CB 32 65 97 C0 39 6E 8A DD 24 73 81 D6 2F 78
                        E8 BF 46 11 E3 B4 4D 1A FE A9 50 07 F5 A2 5B 0C
                        C4 93 6A 3D CF 98 61 36 D2 85 7C 2B D9 8E 77 20
                        37 60 99 CE 3C 6B 92 C5 21 76 8F D8 2A 7D 84 D3
                        1B 4C B5 E2 10 47 BE E9 0D 5A A3 F4 06 51 A8 FF
                        6F 38 C1 96 64 33 CA 9D 79 2E D7 80 72 25 DC 8B
                        43 14 ED BA 48 1F E6 B1 55 02 FB AC 5E 09 F0 A7
                        87 D0 29 7E 8C DB 22 75 91 C6 3F 68 9A CD 34 63
                        AB FC 05 52 A0 F7 0E 59 BD EA 13 44 B6 E1 18 4F
                        DF 88 71 26 D4 83 7A 2D C9 9E 67 30 C2 95 6C 3B
                        F3 A4 5D 0A F8 AF 56 01 E5 B2 4B 1C EE B9 40 17 }
        condition:
                $c0
}

95% of the rules are like this. Some are large, like the CRC-32 table, 
which is 1024 bytes, etc.
The remaining 5% are a bit more complex and require a custom "module" I 
added to libyara.
On a very large executable, it takes ~3 seconds to scan through all these 
rules.

Now compare this to:  
https://github.com/Yara-Rules/rules/blob/master/crypto/crypto_signatures.yar
There are about 125 rules in total.
While some are simple like the example above, the majority are relatively 
complex, using regexes, etc.

Example:
rule Big_Numbers0
{
        meta:
                author = "_pusher_"
                description = "Looks for big numbers 20:sized"
                date = "2016-07"
        strings:
                $c0 = /[0-9a-fA-F]{20}/ fullword ascii
        condition:
                $c0
}

*In the same setup these 125 rules took 1.45 minutes!*

So even though the more complex set is only ~10% of the count of the simple 
rules, it took 28 times longer to scan.
Also note these signatures had way too many false positives, which could be 
a major factor (without isolating further) in why the scan takes so long.
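A rough stdlib Python sketch (not YARA internals, just an illustration of the principle): a no-wildcard hex string can be matched with a straight byte comparison the scanner can anchor on, while a character-class regex like the one in Big_Numbers0 offers no fixed bytes, so every offset is a potential match start.

```python
import re

# One literal needle vs. a character-class regex over the same buffer.
# The data layout here is an arbitrary test fixture.
data = b"\x00" * 1_000_000 + b"0123456789abcdef0123" + b"\x00" * 1_000_000

# Literal search: similar in spirit to a no-wildcard hex string rule.
literal = b"0123456789abcdef0123"
lit_hit = data.find(literal)

# Regex search: similar in spirit to the Big_Numbers0 rule above
# (fullword modifier omitted for simplicity).
rx = re.compile(rb"[0-9a-fA-F]{20}")
rx_hit = rx.search(data)

# Both find the run at offset 1_000_000; the literal search does it with
# far less per-byte work.
print(lit_hit, rx_hit.start())
```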

Note: I'm not knocking YARA in any way.
Is there another public/open-source technology like YARA, or anything else 
that is the de facto signature system for malware, etc.? Not that I know of.
It's not perfect, but overall it's great IMHO.

On Monday, March 21, 2022 at 7:03:02 PM UTC-4 Gman wrote:

> Thanks. I believe I have tried all the suggestions already. Magic doesn't 
> help: not only is it slow and will run anyway, but it's also not supported 
> on Windows. You can achieve a similar effect by simply checking for 
> specific bytes using the uint notation, but the problem remains in the fact 
> that at the moment you include a module such as "PE" or "DOTNET", these 
> modules will parse the entire file anyways. This means that the following 
> scenario is never optimized:
>
> import "pe"
> condition:
>     filesize <= 0xFF0000 and pe.number_of_sections == 8
>
> You may think that when the filesize (or magic or a single byte check) 
> condition is False it will short-circuit and exit. While this is correct, 
> it has also wasted all the time so you don't get any performance benefit. 
> This is because the PE module parsed the entire file anyways!
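A toy model in plain Python (not YARA's code) of the ordering problem described above: the module's load step walks the whole file before the condition ever gets a chance to short-circuit, so the parse cost is paid regardless. `full_parse` and its fields are illustrative stand-ins.

```python
# Counter to observe whether the expensive parse ran.
PARSE_CALLS = 0

def full_parse(data: bytes) -> dict:
    """Stand-in for a module load step that walks the entire file."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    return {"number_of_sections": data.count(b".sect")}

def scan(data: bytes) -> bool:
    pe = full_parse(data)  # always happens up front, before the condition
    # The filesize check short-circuits the *condition* evaluation,
    # but the full parse above has already been paid for.
    return len(data) <= 0xFF0000 and pe["number_of_sections"] == 8

result = scan(b"\x00" * 0x1000000)  # oversized file: condition is False...
# ...yet PARSE_CALLS is 1: the full parse was not avoided.
```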
>
> I have also tried the custom-scanning front-end approach we discussed 
> before, having a subset of pre-compiled rules for PNG, another for PE, 
> another for DOCX. To my surprise, this resulted in even slower scan times. 
> This might be related to the fact that the Aho-Corasick automaton is already 
> as good as it gets when all the rules are combined together, which means 
> when you split into 3 "scanners/rule sets", you don't get much benefit and, 
> on the contrary, you may introduce more overhead in the filetype 
> checking/conditional scanning logic.
>
> I would love to hear from the original YARA authors, to at least 
> understand whether we are using YARA incorrectly or this is indeed a (very 
> common?) scenario that is not covered/optimized for.
>
> Thanks
> On Friday, March 18, 2022 at 9:31:07 PM UTC-7 Joe Neighbor wrote:
>
>> Also looking through the manual again, there is a "magic" module.
>> "The Magic module allows you to identify the type of the file based on 
>> the output of file, the standard Unix command."
>> Then apparently you can make rule conditions like magic.mime_type() == 
>> "application/pdf".
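For reference, a minimal sketch of what such a rule might look like (the rule name and string are made up; this assumes YARA was built with the magic module enabled, which is Unix-only since it relies on libmagic):

```yara
import "magic"

rule PDF_Only_Example
{
    strings:
        $s = "JavaScript"
    condition:
        magic.mime_type() == "application/pdf" and $s
}
```

Note that, as discussed in this thread, the mime_type() check does not stop YARA from running the string search over non-PDF files; it only gates the condition.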
>>
>> But still each rule in a set is going to be loaded and compiled 
>> regardless like you say.
>> So for max performance potential and control I still think you will want 
>> to go with a custom scanning front end.
>>
>> And the sky's the limit here: you could have a central controlling app 
>> (again maybe made using Python) and then send scan jobs off to machines 
>> on your LAN or cloud instances, etc.
>> Sure, it's a lot more development work for this kind of scheme though.
>>
>>
>> On Saturday, March 19, 2022 at 12:04:29 AM UTC-4 Joe Neighbor wrote:
>>
>>> Have you considered making a custom version of the console/front end?
>>>
>>> Maybe you can use "Rule tags" (yara manual page 27)?
>>> You could have a "DLL" tag and/or an "EXE" tag; then you can separate them 
>>> at least.
>>> Although it says: "Those tags can be used later to filter YARA's
>>> output", which might mean they all get scanned anyhow and you then 
>>> filter them in the YARA match callback, which is not going to help 
>>> performance.
>>>
>>> I know, it really slows down depending on the rules and the size of the 
>>> search.
>>>
>>> Here is what I think will be a good solution for you:
>>> 1) Separate your rules into a series of files.
>>> A set of rules for PNG files, DLLs, DOCX, etc.
>>> Compiling is fast, but you could sort and batch your inputs per 
>>> compiled rule set.
>>>
>>>
>>> You might have to tool up a solution to help manage the rules if you 
>>> have a lot of them.
>>> You might have to put them into a DB, etc., so that if you change a rule 
>>> that needs to be duplicated across different sets, the changes can be 
>>> automatic.
>>> Completely automated, it could help you zero in on bad/slow rules too.
>>> If you need a lot of control, you could do something like putting all 
>>> your rules in a JSON format; then the YARA rule files can be created on 
>>> the fly.
>>> At first, though, I'd just set them all up manually for development.
>>>
>>> 2) In your controlling app (it can still be a console/terminal app) you 
>>> first enumerate all the files to scan and put them in buckets.
>>> 3) Then you scan each set in bulk with the per-type rule set, only 
>>> having to compile once per set.
>>> As you find matches you either dump them to the screen or save them in 
>>> JSON for later processing et al.
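The bucketing stage of steps 2-3 can be sketched in stdlib Python (the magic-byte list and bucket names are illustrative assumptions; each bucket would then be handed to yara-python with its own pre-compiled rule set):

```python
# Map leading file bytes to a bucket name. Hypothetical type list;
# extend with whatever filetypes your rule sets actually cover.
MAGICS = [
    (b"MZ", "pe"),                   # PE EXE/DLL
    (b"\x89PNG\r\n\x1a\n", "png"),   # PNG image
    (b"PK\x03\x04", "zip"),          # ZIP container (DOCX et al.)
]

def classify(path: str) -> str:
    """Return a bucket name based on the file's first few bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, bucket in MAGICS:
        if head.startswith(magic):
            return bucket
    return "other"

def bucketize(paths):
    """Group candidate files into per-type buckets for bulk scanning."""
    buckets = {}
    for p in paths:
        buckets.setdefault(classify(p), []).append(p)
    return buckets
```

Each bucket is then scanned in bulk, compiling the matching rule set only once per bucket.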
>>>
>>> There is:  https://github.com/VirusTotal/yara-python
>>> Which I think you can get this sort of setup up and running pretty fast 
>>> with Python.
>>> And since yara-python is just a binary wrapper around libyara you are 
>>> still going to get just sub C/C++ performance.
>>> The Python part is just mostly to do the file wrangling. Unless you need 
>>> additional features.
>>>
>>> On Friday, March 18, 2022 at 12:45:13 PM UTC-4 Gman wrote:
>>>
>>>> Thanks for your insights.
>>>>
>>>> I'm not too concerned about drive I/O overhead (this part I consider 
>>>> the benchmark pre-stage, so not measuring it). In my tests, I'm simply 
>>>> loading all the files first (e.g. file mapping) and hence the only 
>>>> performance bottleneck I observe when profiling is the pure CPU-bound YVM 
>>>> execution of the instructions.
>>>>
>>>> One of the things I have noticed is that the way in which YARA is 
>>>> designed, it will process each sample entirely regardless of having rules 
>>>> that apply to them. I have seen many GitHub threads explaining why this is 
>>>> a "long term investment" (as you are likely to have 1 rule that will need 
>>>> it anyways), but I don't fully agree with the justification. While there 
>>>> are scenarios in which you will have a set of files and a set of rules 
>>>> that 
>>>> will mix well together, in my experience this is more like an exception. 
>>>> Let me explain with an example:
>>>>
>>>> I have 5000 PE-EXE files to scan, and 300 PE rules for DLLs only. Now 
>>>> imagine how the scan goes...
>>>>
>>>> a) The Ahocorasick tree will fit together 300 rules' atoms. All the PE 
>>>> rules start with the "is_pe / is_dll" condition that can quickly exit via 
>>>> short-circuit.
>>>> b) YARA will open and fully parse 5k EXE files, regardless of how many 
>>>> "potential" rules I have to apply.
>>>> c) Since all my files are EXE and NOT DLL, no PE rule will actually 
>>>> match. However, I have wasted a lot of time fully parsing 5k files 
>>>> (because 
>>>> the PE module will parse the entire file anyways).
>>>>
>>>> You can quickly observe how the performance drops significantly in such 
>>>> scenarios. Now consider that you have an unbalanced mix of files such as 
>>>> PE, PNG and DOCX. And then think about a disproportionate number of rules 
>>>> (e.g. 1 PE rule and 100 PNG rules but no DOCX rules). It is clear at this 
>>>> point that some rules will "heavily penalize" the overall performance of 
>>>> YARA simply because they will unnecessarily "overload" the scan by running 
>>>> on "will never match" filetypes (in this example, 100 PNG rules that are 
>>>> heavy on strings will unnecessarily run on 20k DOCX files). Even if you 
>>>> implement something like the "Magic" module to identify it, YARA will go 
>>>> over the file anyways trying to find the atoms.
>>>>
>>>> Real world scenarios are in my experience a lot more like the case I'm 
>>>> describing: you have an assorted and unpredictable collection of files to 
>>>> be scanned, and it's likely that your rules will only ever matter for a 
>>>> small portion of your files. If your collection of files is big and 
>>>> "misaligned" with the "filetypes" that your rules are designed for, then 
>>>> you have a performance nightmare in your hands.
>>>>
>>>> So I guess I'm describing two different issues or improvement 
>>>> opportunities (or asking if there is another way to address the described 
>>>> scenarios):
>>>>
>>>> 1) It gives me the impression that YARA is missing an essential 
>>>> "selective tree" mechanism. For example, imagine if you could have one 
>>>> Ahocorasick Tree for PE files, but a different one for PNG files. That 
>>>> would drastically reduce the scan time by simply focusing on a "this makes 
>>>> sense to be scanned" selector and skip the rest. Traversing a set of 5 
>>>> atoms is not the same as traversing a set of 5000 atoms...
>>>> 2) Alternatively or complementarily, it seems YARA would drastically 
>>>> benefit from a "pre-flight" check for example for PE files. What if all my 
>>>> rules are for DLL files but I'm scanning 100 EXE files? Why should I waste 
>>>> precious time fully parsing each entire PE file if I could simply check 
>>>> the 
>>>> header/first few bytes to determine that the file is a DLL?
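The cheap header check proposed in point 2 can be sketched in stdlib Python (an illustration of the idea, not YARA's implementation): the DLL flag lives in the Characteristics field of the COFF file header, so only a few dozen bytes need to be read.

```python
import struct

IMAGE_FILE_DLL = 0x2000  # Characteristics bit set for DLL images

def is_dll(data: bytes) -> bool:
    """Pre-flight check using only the headers, never the whole file."""
    # DOS header: 'MZ' magic, e_lfanew (offset of PE header) at 0x3C.
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    # 'PE\0\0' signature, then the 20-byte COFF header;
    # Characteristics sits 22 bytes past the signature start.
    if len(data) < e_lfanew + 24 or data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return False
    (characteristics,) = struct.unpack_from("<H", data, e_lfanew + 22)
    return bool(characteristics & IMAGE_FILE_DLL)
```

A front end could run this on the first few hundred bytes of each file and skip the DLL-only rule set entirely for plain EXEs.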
>>>>
>>>> Thanks,
>>>> On Friday, March 11, 2022 at 10:11:03 PM UTC-8 Joe Neighbor wrote:
>>>>
>>>>> In my experience doing mostly in memory YARA scanning:
>>>>> I use up to as many threads as the system has physical cores.
>>>>> I doubt logical HT/SMT cores are going to help much (from the 
>>>>> following observations).
>>>>>
>>>>> With mostly simple rules that just do hex (byte run) scanning with no 
>>>>> wildcards and a single condition extra threads only give about a 10% 
>>>>> performance increase.
>>>>> This makes sense because the efficient but memory-intensive Aho-Corasick 
>>>>> algorithm is memory-bandwidth constrained. The extra threads are just 
>>>>> maxing out the memory bandwidth.
>>>>> At this point there is little difference between using one core or 64, 
>>>>> because memory bandwidth hits its limit.
>>>>>
>>>>> For more complex rule sets, I see about a 20% to 30% performance 
>>>>> increase. 
>>>>> This makes sense too as now some of the memory bandwidth gets traded 
>>>>> with increased compute (from the extra rules logic).
>>>>> Here this work gets distributed to other cores that can simultaneously 
>>>>> compute. And of course memory bandwidth still gets maxed out.
>>>>>
>>>>> See:
>>>>> https://github.com/Neo23x0/YARA-Performance-Guidelines/
>>>>>
>>>>> There is a problem with your logic.
>>>>> Caching doesn't eliminate drive I/O overhead. You'd have to preload 
>>>>> all of your files into memory for that.
>>>>> And are you sure your OS cache is big enough to hold all of your files 
>>>>> to begin with?
>>>>> There is still a lot going on in trips from UM to KM and back again: 
>>>>> at minimum, a memcpy() (even if done in DMA hardware) from the OS cache 
>>>>> into your process et al.
>>>>> A key thing here is you say "cache". This means "memory". You are 
>>>>> competing with the OS's use of its file buffer memory (and CPU cache 
>>>>> too) vs the intense Aho-Corasick scanning thread(s).
>>>>>
>>>>> Hopefully someday a new memory technology will come along that 
>>>>> matches the CPU in clock speed (in whole values, not just bits).
>>>>>
>>>>> Reads like you are on the right track though. 
>>>>> You have to find where the bottlenecks actually are (using 
>>>>> system and process performance tools) and mitigate them the best you can. 
>>>>>
>>>>> On Tuesday, March 8, 2022 at 7:05:09 PM UTC-5 Gman wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to get the maximum possible performance out of YARA, and 
>>>>>> for that goal I've been studying the code and algorithms to ensure 
>>>>>> everything is contemplated:
>>>>>>
>>>>>> 1) My understanding is that the Aho-Corasick algorithm helps build 
>>>>>> the Atoms tree to then efficiently apply just the rules that have Atoms 
>>>>>> matching the scanned file. This is a great start because not all the 
>>>>>> rules 
>>>>>> will be executed for each file.
>>>>>> 2) I also believe there is a short-circuit logic capability so that 
>>>>>> once a condition is not satisfied, the subsequent ones will not even try 
>>>>>> to 
>>>>>> execute.
>>>>>> 3) The -f option (as seen in the command line tool) will also run in 
>>>>>> "fast" mode and report the first occurrence, without wasting time on 
>>>>>> subsequent checks/rules.
>>>>>> 4) Precompiling rules is a good practice as it saves time, given that 
>>>>>> the scanner won't need to compile them before starting a scan.
>>>>>> 5) Writing the rules in smart ways yields better performance, 
>>>>>> including: using non-trivial hex sequences, replacing some strings with 
>>>>>> hex 
>>>>>> representations, re-writing regexs to be more efficient, (sorting the 
>>>>>> conditions?), etc.
>>>>>> 6) You can run YARA in multi-thread mode. There is a drastic 
>>>>>> difference between running with 1 thread vs running with 16 threads 
>>>>>> (most 
>>>>>> likely as it also takes advantage of I/O vs CPU-bound operations).
>>>>>>
>>>>>> With these in mind, I tried to measure the performance of YARA for 
>>>>>> scanning a given directory (e.g. containing 10k assorted files) using an 
>>>>>> artificial set of 5k, 10k, 20k and even 40k rules. To my surprise, YARA 
>>>>>> is 
>>>>>> quite fast up to 5k rules, and after that performance degrades 
>>>>>> drastically 
>>>>>> (almost in a linear fashion). Note: I run the benchmark multiple times 
>>>>>> to 
>>>>>> eliminate the effect of hard disk I/O (hence, having everything in 
>>>>>> cache/memory).
>>>>>>
>>>>>> - Am I missing any possible optimization trick or Best-Known-Method? 
>>>>>> - Does YARA suffer from some limitation in terms of performance 
>>>>>> related to the # of rules or # of files?
>>>>>> - Based on my basic understanding of the source code, the modules 
>>>>>> such as "pe" and "dotnet" are actually parsing the entire file (within 
>>>>>> the 
>>>>>> module Load) regardless of the rules actually using these modules. Let's 
>>>>>> say a rule just needs to do the check pe.is_pe, do we need to parse the 
>>>>>> entire file just for that? Isn't the parsing of imported/exported 
>>>>>> functions or certificates slowing down the scan unnecessarily? (I'm not even 
>>>>>> sure this is the reason for performance degradation, just a thought).
>>>>>>
>>>>>> Any tip or suggestion is much appreciated, and happy to contribute 
>>>>>> back if there is an opportunity to do so.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"YARA" group.