I think this is expected behavior, though whether it is optimal behavior is 
obviously open for debate. ;)

Here's what I think is happening. Every time you scan a file your 
"yara_callback" function is called once for each rule, even if the rule didn't 
match. So if you scan 10 files with 5 rules your callback will be called 50 
times, even if none of those 5 rules match. You can check this by adding the 
following right after you compile your rules:

print(len([rule for rule in self.rules]))

That will print the number of rules that compiled. I'm guessing that if you 
multiply that number by the number of files scanned, the result will equal the 
number of times your callback was hit.
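
As a rough sanity check of that multiplication, here is a small sketch using 
the hit counts from the profiler output quoted below (12319 calls to 
match_rules, 151991822 calls to yara_callback). The variable names are mine, 
and it assumes every scan called the callback once per rule.

```python
# Hit counts copied from the quoted line_profiler output.
files_scanned = 12319        # hits on the first line of match_rules
callback_hits = 151991822    # hits on the first line of yara_callback

# If the callback fires once per rule per file, the division is exact.
rules_per_file, remainder = divmod(callback_hits, files_scanned)
print(rules_per_file, remainder)  # prints: 12338 0
```

The division comes out exact, which is consistent with one callback per rule 
per file. If that reading is right, 12338 should also be the number the print 
above reports.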

Now, you could certainly argue that there should be a flag to the match 
function which indicates if you want your callback called for matches, 
non-matches or both. That would likely eliminate a lot of extra work 
translating things into Python objects. This also explains why the "native" 
yara is much faster. It doesn't have to do any of the C-to-Python conversion 
for the data, and it also has a "don't show me non-matches" flag on by default.
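
To make the idea concrete, here is a toy model in pure Python. It does not use 
yara-python at all: the scan function, the flag names and the rule names are 
all invented for this sketch, and it only illustrates how such a flag would 
cut down the number of callback invocations.

```python
# Hypothetical flag values, invented for this sketch.
CALLBACK_MATCHES = 1
CALLBACK_NON_MATCHES = 2
CALLBACK_ALL = CALLBACK_MATCHES | CALLBACK_NON_MATCHES


def scan(rule_results, callback, which=CALLBACK_ALL):
    """Toy stand-in for a scan: rule_results maps rule name -> matched?"""
    for rule, matched in rule_results.items():
        wanted = CALLBACK_MATCHES if matched else CALLBACK_NON_MATCHES
        if which & wanted:
            # Only here would the C-to-Python conversion cost be paid.
            callback({"rule": rule, "matches": matched})


# Five made-up rules, one of which matches.
results = {"rule_a": False, "rule_b": True, "rule_c": False,
           "rule_d": False, "rule_e": False}

calls = []
scan(results, calls.append)                          # current behavior
all_calls = len(calls)

calls.clear()
scan(results, calls.append, which=CALLBACK_MATCHES)  # proposed flag
match_calls = len(calls)

print(all_calls, match_calls)  # prints: 5 1
```

With the flag set to matches only, the callback runs once instead of five 
times for this one file; across many files and many rules the saving would 
scale the same way.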

If Victor agrees, you could file this as an issue on GitHub so it doesn't get 
lost. I've got some experience in this area and may be able to take it on too.

-- WXS

> On May 17, 2017, at 4:59 PM, [email protected] wrote:
> 
> The profiling results were not added correctly:
> 
> Total time: 928.533 s
> File: /app/filters/yaraPOC.py
> Function: match_rules at line 70
> 
> Line #      Hits         Time  Per Hit   % Time  Line Contents
> ==============================================================
>     70                                               @profile
>     71                                               def match_rules(self,file):
>     72                                                   """
>     73                                                   Matches yara rules against the file
>     74                                                   :param file: relative path to the files_folder specified for the YaraFilter
>     75                                                   :return: returns dictionary with matching information
>     76                                                   """
>     77     12319        12086      1.0      0.0          self.matching_results = []
>     78     12319         8847      0.7      0.0          if not self.rules:
>     79                                                       print("Rules not initialised")
>     80                                                       return
>     81     12319         4209      0.3      0.0          try:
>     82     12319    928508227  75372.0    100.0              self.rules.match( str(file),callback=self.yara_callback, fast = True)
>     83
>     84                                                   except Exception as e :
>     85                                                     print("Error occured trying to match yara rules on file " + str(file) + ':' +  str(e))
> 
> Total time: 351.386 s
> File: /app/filters/yaraPOC.py
> Function: yara_callback at line 87
> 
> Line #      Hits         Time  Per Hit   % Time  Line Contents
> ==============================================================
>     87                                               @profile
>     88                                               def yara_callback(self,matching_data):
>     89                                                   """
>     90                                                   Callback function that gets called for yara rule that matches
>     91                                                   :param matching_data:
>     92                                                   :return:
>     93                                                   """
>     94                                                   # Currently we do not add the strings from the matching rule
>     95 151991822     43182861      0.3     12.3          if matching_data['matches'] :
>     96        27         1777     65.8      0.0              print ('%s matches %s' %(matching_data['rule'],self.current_file))
>     97
>     98 151991822    308201707      2.0     87.7          yara.CALLBACK_CONTINUE
> 
> On Wednesday, May 17, 2017 at 22:56:41 UTC+2, [email protected] wrote:
> Hey Wesley,
> thanks for your reply.
> 
> Here's a trimmed-down version of my code; profiling this function gives me 
> the same results when applied to the same set of files.
> After the code I've added some profiling results.
> Most of the rules I'm using come from the public repository:
> https://github.com/Yara-Rules/rules
> 
> FYI, my yara-python is dynamically linked against libyara from my 'native' 
> yara install.
> I did some testing with native yara and there is no comparison in speed; 
> it's way faster.
> 
> 
> import yara
> import os
> import logging
> class YaraPOC():
>     ALLOWED_EXTENSIONS = (r".yar",r".yara")
> 
>     def __init__(self):
>         self.current_file = ""
>         self.rules = None  # set by load_rules(); lets match_rules() detect missing rules
> 
>     def walk_directory_tree(self,directory, extension_filter=None, recursive=True):
>         file_list_res = []
>         if not recursive:
>             file_list_res = [os.path.join(directory, f) for f in os.listdir(directory) if
>                              os.path.isfile(os.path.join(directory, f))]
>         else:
>             for path, subdirs, files in os.walk(directory):
>                 for name in files:
>                     file_list_res.append(os.path.join(path, name))
> 
>         if extension_filter is not None:
>             file_list_res = [f for f in file_list_res if f.endswith(extension_filter)]
> 
>         return file_list_res
> 
>     def load_rules(self, rules_folder):
> 
>         print("Loading yararules from: %s" %rules_folder)
>         rules_file_list = self.walk_directory_tree(rules_folder,YaraPOC.ALLOWED_EXTENSIONS,recursive=True)
>         # For each rule we want the path relative to our main folder to use as a namespace in yara
>         namespaces = []
>         remove_index = rules_folder.rfind(os.sep) + 1
>         # For the namespaces we remove this "prefix" from all our paths, and create a separate list for it
>         for rule in rules_file_list:
>             namespaces.append(rule[remove_index::])
> 
>         filepaths_dict = {}
>         for indx, namespace in enumerate(namespaces):
>             filepaths_dict[namespace] = rules_file_list[indx]
>         try:
>             self.rules = yara.compile(filepaths=filepaths_dict)
>         except Exception as e:
>             print("Compilation error in Yara rules. Are you missing an import ? ")
>             print(str(e))
> 
>         print("Loaded %s Yararules" % str(len(namespaces)))
> 
> 
>     @profile
>     def match_rules(self,file):
>         self.matching_results = []
>         if not self.rules:
>             print("Rules not initialised")
>             return
> 
>         self.rules.match( str(file),callback=self.yara_callback, fast = True)
> 
>     @profile
>     def yara_callback(self,matching_data):
>         if matching_data['matches'] :
>             print ('%s matches %s' %(matching_data['rule'],self.current_file))
> 
>         return yara.CALLBACK_CONTINUE  # return the constant rather than just evaluating it
> 
> # Entrypoint
> if __name__ == "__main__":
>     yaraPoc = YaraPOC()
>     yaraPoc.load_rules("/rules/yara")
>     for file in os.listdir("/files"):
>         yaraPoc.current_file = file
>         yaraPoc.match_rules("/files/" + str(file))
> 
> 
