Wesley's description of the issue is accurate. If you are interested only
in the matching rules, you are better off using the results returned by the
"match" function instead of a callback. Allowing users to configure in which
cases the callback is called would be a nice addition, though.
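
A minimal sketch of that suggestion, assuming "rules" is the compiled rules
object returned by yara.compile(); the helper name matching_rule_names is
hypothetical and only there for illustration:

```python
def matching_rule_names(rules, path):
    """Return the names of the rules that matched the file at `path`.

    `rules` is assumed to be a compiled yara.Rules object, e.g. the result
    of yara.compile(...). matching_rule_names is a hypothetical helper name.
    """
    # Without a callback argument, match() returns a list holding only the
    # rules that matched, so no Python callback runs for non-matching rules.
    matches = rules.match(path)
    return [m.rule for m in matches]
```

In the code from the original post, this would replace the callback-based
call, for example matching_rule_names(self.rules, str(file)).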

On Thu, May 18, 2017 at 10:24 AM, <tofbaas...@gmail.com> wrote:

> I did the test, and the number of times the callback function gets called
> is indeed the total number of rules x the total number of files (12345 x 12319).
> That gives about 152 million function calls, plus additional data conversions,
> for only 30 hits. So adding the described functionality would definitely be
> great! I'll make an issue on GitHub about it.
> Thanks
>
>
>
> On Thursday, May 18, 2017 at 03:56:33 UTC+2, Wesley Shields wrote:
>>
>> I think this is expected behavior, though whether it is optimal behavior
>> is obviously open for debate. ;)
>>
>> Here's what I think is happening. Every time you scan a file your
>> "yara_callback" function is called once for each rule, even if the rule
>> didn't match. So if you scan 10 files with 5 rules your callback will be
>> called 50 times, even if none of those 5 rules match. You can check this by
>> adding the following right after you compile your rules:
>>
>> print(len([rule for rule in self.rules]))
>>
>> That will print the number of rules that compiled. I'm guessing that if
>> you multiply that number by the number of files scanned, the result
>> should equal the number of times your callback was called.
>>
>> Now, you could certainly argue that there should be a flag to the match
>> function which indicates if you want your callback called for matches,
>> non-matches or both. That would likely eliminate a lot of extra work to
>> translate things into Python objects. This also explains why the "native"
>> yara is much faster: it doesn't have to do any of the C-to-Python
>> conversion for the data, and it also has a "don't show me non-matches" flag
>> on by default.
>>
>> If Victor agrees, you could make this an issue on github so it doesn't
>> get lost. I've got some experience in this area and may be able to take it
>> on too.
>>
>> -- WXS
>>
>> > On May 17, 2017, at 4:59 PM, tofba...@gmail.com wrote:
>> >
>> > The profiling results were not added correctly:
>> >
>> > Total time: 928.533 s
>> > File: /app/filters/yaraPOC.py
>> > Function: match_rules at line 70
>> >
>> > Line #      Hits         Time  Per Hit   % Time  Line Contents
>> > ==============================================================
>> >     70                                               @profile
>> >     71                                               def match_rules(self,file):
>> >     72                                                   """
>> >     73                                                   Matches yara rules against the file
>> >     74                                                   :param file: relative path to the files_folder specified for the YaraFilter
>> >     75                                                   :return: returns dictionary with matching information
>> >     76                                                   """
>> >     77     12319        12086      1.0      0.0          self.matching_results = []
>> >     78     12319         8847      0.7      0.0          if not self.rules:
>> >     79                                                       print("Rules not initialised")
>> >     80                                                       return
>> >     81     12319         4209      0.3      0.0          try:
>> >     82     12319    928508227  75372.0    100.0              self.rules.match( str(file),callback=self.yara_callback, fast = True)
>> >     83
>> >     84                                                   except Exception as e :
>> >     85                                                     print("Error occured trying to match yara rules on file " + str(file) + ':' +  str(e))
>> >
>> > Total time: 351.386 s
>> > File: /app/filters/yaraPOC.py
>> > Function: yara_callback at line 87
>> >
>> > Line #      Hits         Time  Per Hit   % Time  Line Contents
>> > ==============================================================
>> >     87                                               @profile
>> >     88                                               def yara_callback(self,matching_data):
>> >     89                                                   """
>> >     90                                                   Callback function that gets called for yara rule that matches
>> >     91                                                   :param matching_data:
>> >     92                                                   :return:
>> >     93                                                   """
>> >     94                                                   # Currently we do not add the strings from the matching rule
>> >     95 151991822     43182861      0.3     12.3          if matching_data['matches'] :
>> >     96        27         1777     65.8      0.0              print ('%s matches %s' %(matching_data['rule'],self.current_file))
>> >     97
>> >     98 151991822    308201707      2.0     87.7          yara.CALLBACK_CONTINUE
>> >
>> >
>> > On Wednesday, May 17, 2017 at 22:56:41 UTC+2, tofba...@gmail.com wrote:
>> > Hey Wesley,
>> > thanks for your reply.
>> >
>> > Here's a trimmed-down version of my code; profiling this function gives
>> > me the same results when applied to the same set of files.
>> > After the code I've added some profiling results.
>> > Most of the rules I'm using come from the public repository:
>> > https://github.com/Yara-Rules/rules
>> >
>> > FYI: my yara-python is dynamically linked against libyara from my
>> > 'native' yara install.
>> > I did some testing with native yara and there is no comparison in speed;
>> > it's way faster.
>> >
>> > import yara
>> > import os
>> > import logging
>> >
>> >
>> > class YaraPOC():
>> >     ALLOWED_EXTENSIONS = (r".yar", r".yara")
>> >
>> >     def __init__(self):
>> >         self.current_file = ""
>> >         # Set by load_rules(); stays None if compilation fails
>> >         self.rules = None
>> >
>> >     def walk_directory_tree(self, directory, extension_filter=None, recursive=True):
>> >         file_list_res = []
>> >         if not recursive:
>> >             file_list_res = [os.path.join(directory, f) for f in os.listdir(directory)
>> >                              if os.path.isfile(os.path.join(directory, f))]
>> >         else:
>> >             for path, subdirs, files in os.walk(directory):
>> >                 for name in files:
>> >                     file_list_res.append(os.path.join(path, name))
>> >
>> >         if extension_filter is not None:
>> >             file_list_res = [f for f in file_list_res if f.endswith(extension_filter)]
>> >
>> >         return file_list_res
>> >
>> >     def load_rules(self, rules_folder):
>> >         print("Loading yararules from: %s" % rules_folder)
>> >         rules_file_list = self.walk_directory_tree(rules_folder, YaraPOC.ALLOWED_EXTENSIONS, recursive=True)
>> >         # For each rule we want the path relative to our main folder to use as a namespace in yara
>> >         namespaces = []
>> >         remove_index = rules_folder.rfind(os.sep) + 1
>> >         # For the namespaces we remove this "prefix" from all our paths, and create a separate list for it
>> >         for rule in rules_file_list:
>> >             namespaces.append(rule[remove_index:])
>> >
>> >         filepaths_dict = {}
>> >         for indx, namespace in enumerate(namespaces):
>> >             filepaths_dict[namespace] = rules_file_list[indx]
>> >         try:
>> >             self.rules = yara.compile(filepaths=filepaths_dict)
>> >         except Exception as e:
>> >             print("Compilation error in Yara rules. Are you missing an import ? ")
>> >             print(str(e))
>> >
>> >         print("Loaded %s Yararules" % str(len(namespaces)))
>> >
>> >     # @profile is injected by kernprof (line_profiler) at run time
>> >     @profile
>> >     def match_rules(self, file):
>> >         self.matching_results = []
>> >         if not self.rules:
>> >             print("Rules not initialised")
>> >             return
>> >
>> >         self.rules.match(str(file), callback=self.yara_callback, fast=True)
>> >
>> >     @profile
>> >     def yara_callback(self, matching_data):
>> >         if matching_data['matches']:
>> >             print('%s matches %s' % (matching_data['rule'], self.current_file))
>> >
>> >         # Return the constant so yara is explicitly told to keep scanning
>> >         return yara.CALLBACK_CONTINUE
>> >
>> > # Entrypoint
>> > if __name__ == "__main__":
>> >     yaraPoc = YaraPOC()
>> >     yaraPoc.load_rules("/rules/yara")
>> >     for file in os.listdir("/files"):
>> >         yaraPoc.current_file = file
>> >         yaraPoc.match_rules("/files/" + str(file))
>> >
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "YARA" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to yara-project+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
