I ran the test, and the number of times the callback function gets called is indeed the total number of rules × the total number of files (12345 × 12319). That comes to roughly 150 million function calls, plus the associated data conversions, for only 30 hits. So adding the described functionality would definitely be great! I'll open an issue on GitHub about it. Thanks.
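As a rough illustration of the count described above, here is a toy model in plain Python. It does not use yara-python: `scan_all` and the rule and file names are made up for the example, and it simply assumes what this thread suggests, namely that the callback fires once per rule for every scanned file. The last lines redo the arithmetic on the figures from the thread, including the 151991822 callback hits and 12319 scanned files reported in the profiler output.

```python
# Toy model only: mimics "callback once per rule per file", not yara-python itself.

def scan_all(files, rules, callback):
    """Call `callback` once per rule for every file, matched or not."""
    for path in files:
        for rule_name, matcher in rules.items():
            callback({"rule": rule_name, "file": path, "matches": matcher(path)})


calls = 0   # how often the callback ran
hits = []   # (rule, file) pairs that actually matched


def callback(data):
    global calls
    calls += 1
    if data["matches"]:
        hits.append((data["rule"], data["file"]))


# 10 files and 5 rules; only rule_0 matches, and only on file_3.
files = ["file_%d" % i for i in range(10)]
rules = {
    "rule_%d" % i: (lambda path, i=i: i == 0 and path == "file_3")
    for i in range(5)
}

scan_all(files, rules, callback)
print("callback calls:", calls)
print("matches:", hits)

# Arithmetic on the figures quoted in the thread.
print("rules x files as stated:", 12345 * 12319)
print("profiled callback hits per file:", 151991822 // 12319)
print("remainder:", 151991822 % 12319)
```

In this model a single match still costs 50 callback invocations, which is the pattern the thread describes at a much larger scale.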
On Thursday, 18 May 2017 at 03:56:33 UTC+2, Wesley Shields wrote:
>
> I think this is expected behavior, though whether it is optimal behavior
> is obviously open for debate. ;)
>
> Here's what I think is happening. Every time you scan a file your
> "yara_callback" function is called once for each rule, even if the rule
> didn't match. So if you scan 10 files with 5 rules your callback will be
> called 50 times, even if none of those 5 rules match. You can check this by
> adding the following right after you compile your rules:
>
> print(len([rule for rule in self.rules]))
>
> That will print the number of rules that compiled. I'm guessing that if you
> multiply that number by the number of files scanned, the result should
> equal the number of times your callback was called.
>
> Now, you could certainly argue that there should be a flag to the match
> function which indicates if you want your callback called for matches,
> non-matches or both. That would likely eliminate a lot of extra work to
> translate things into Python objects. This also explains why the "native"
> yara is much faster. It doesn't have to do any of the C to Python
> conversion for the data and it also has a "don't show me non-matches" flag
> on by default.
>
> If Victor agrees, you could make this an issue on GitHub so it doesn't get
> lost. I've got some experience in this area and may be able to take it on
> too.
>
> -- WXS
>
> > On May 17, 2017, at 4:59 PM, tofba...@gmail.com wrote:
> >
> > The profiling results were not added correctly:
> >
> > Total time: 928.533 s
> > File: /app/filters/yaraPOC.py
> > Function: match_rules at line 70
> >
> > Line #      Hits         Time  Per Hit   % Time  Line Contents
> > ==============================================================
> >     70                                               @profile
> >     71                                               def match_rules(self,file):
> >     72                                                   """
> >     73                                                   Matches yara rules against the file
> >     74                                                   :param file: relative path to the files_folder specified for the YaraFilter
> >     75                                                   :return: returns dictionary with matching information
> >     76                                                   """
> >     77     12319        12086      1.0      0.0          self.matching_results = []
> >     78     12319         8847      0.7      0.0          if not self.rules:
> >     79                                                       print("Rules not initialised")
> >     80                                                       return
> >     81     12319         4209      0.3      0.0          try:
> >     82     12319    928508227  75372.0    100.0              self.rules.match( str(file),callback=self.yara_callback, fast = True)
> >     83
> >     84                                                   except Exception as e :
> >     85                                                       print("Error occured trying to match yara rules on file " + str(file) + ':' + str(e))
> >
> > Total time: 351.386 s
> > File: /app/filters/yaraPOC.py
> > Function: yara_callback at line 87
> >
> > Line #      Hits         Time  Per Hit   % Time  Line Contents
> > ==============================================================
> >     87                                               @profile
> >     88                                               def yara_callback(self,matching_data):
> >     89                                                   """
> >     90                                                   Callback function that gets called for yara rule that matches
> >     91                                                   :param matching_data:
> >     92                                                   :return:
> >     93                                                   """
> >     94                                                   # Currently we do not add the strings from the matching rule
> >     95 151991822     43182861      0.3     12.3          if matching_data['matches'] :
> >     96        27         1777     65.8      0.0              print ('%s matches %s' %(matching_data['rule'],self.current_file))
> >     97
> >     98 151991822    308201707      2.0     87.7          yara.CALLBACK_CONTINUE
> >
> > On Wednesday, 17 May 2017 at 22:56:41 UTC+2, tofba...@gmail.com wrote:
> > Hey Wesley,
> > thanks for your reply.
> >
> > Here's a trimmed-down version of my code; profiling this function gives
> > me the same results when applied to the same set of files.
> > After the code I've added some profiling results.
> > Most of the rules I'm using come from the public repository:
> > https://github.com/Yara-Rules/rules
> >
> > FYI, my yara-python is dynamically linked against libyara from my
> > 'native' yara install.
> > I did some testing with native yara and there is no comparison in speed;
> > it's way faster ...
> >
> >
> > import os
> >
> > import yara
> >
> > # Note: @profile is injected by kernprof (line_profiler); run this
> > # script with "kernprof -l -v", or remove the decorators.
> >
> > class YaraPOC():
> >     ALLOWED_EXTENSIONS = (r".yar", r".yara")
> >
> >     def __init__(self):
> >         self.current_file = ""
> >         self.rules = None            # set by load_rules()
> >         self.matching_results = []
> >
> >     def walk_directory_tree(self, directory, extension_filter=None, recursive=True):
> >         file_list_res = []
> >         if not recursive:
> >             file_list_res = [os.path.join(directory, f) for f in os.listdir(directory)
> >                              if os.path.isfile(os.path.join(directory, f))]
> >         else:
> >             for path, subdirs, files in os.walk(directory):
> >                 for name in files:
> >                     file_list_res.append(os.path.join(path, name))
> >
> >         if extension_filter is not None:
> >             file_list_res = [f for f in file_list_res if f.endswith(extension_filter)]
> >
> >         return file_list_res
> >
> >     def load_rules(self, rules_folder):
> >         print("Loading yararules from: %s" % rules_folder)
> >         rules_file_list = self.walk_directory_tree(
> >             rules_folder, YaraPOC.ALLOWED_EXTENSIONS, recursive=True)
> >
> >         # For each rule file we want the path relative to our main folder
> >         # to use as a namespace in yara
> >         namespaces = []
> >         remove_index = rules_folder.rfind(os.sep) + 1
> >         # For the namespaces we remove this "prefix" from all our paths
> >         # and create a separate list for it
> >         for rule in rules_file_list:
> >             namespaces.append(rule[remove_index:])
> >
> >         filepaths_dict = {}
> >         for indx, namespace in enumerate(namespaces):
> >             filepaths_dict[namespace] = rules_file_list[indx]
> >         try:
> >             self.rules = yara.compile(filepaths=filepaths_dict)
> >         except Exception as e:
> >             print("Compilation error in Yara rules. Are you missing an import?")
> >             print(str(e))
> >             return
> >
> >         print("Loaded %s Yara rule files" % len(namespaces))
> >
> >     @profile
> >     def match_rules(self, file):
> >         self.matching_results = []
> >         if self.rules is None:
> >             print("Rules not initialised")
> >             return
> >
> >         self.rules.match(str(file), callback=self.yara_callback, fast=True)
> >
> >     @profile
> >     def yara_callback(self, matching_data):
> >         # Called once per rule for every scanned file, match or not
> >         if matching_data['matches']:
> >             print('%s matches %s' % (matching_data['rule'], self.current_file))
> >
> >         # The original had a bare expression here; return it explicitly
> >         return yara.CALLBACK_CONTINUE
> >
> > # Entrypoint
> > if __name__ == "__main__":
> >     yaraPoc = YaraPOC()
> >     yaraPoc.load_rules("/rules/yara")
> >     for file in os.listdir("/files"):
> >         yaraPoc.current_file = file
> >         yaraPoc.match_rules("/files/" + str(file))
> >
> >
> > Total time: 928.533 s
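To put the profiler figures quoted in this thread in proportion, the following sketch just redoes the arithmetic on them. The constants are copied from the line_profiler output above; the variable names are made up for this example. Since line_profiler adds its own overhead to every profiled line, the results are best read as rough proportions rather than precise timings.

```python
# Figures copied from the line_profiler output quoted in this thread.
MATCH_TOTAL_S = 928.533      # total time reported for match_rules
CALLBACK_TOTAL_S = 351.386   # total time reported for yara_callback
CALLBACK_HITS = 151991822    # hits on the first line of yara_callback
FILES_SCANNED = 12319        # hits on the rules.match(...) line
MATCHING_CALLS = 27          # hits on the print line, i.e. actual matches

# Share of the match_rules time spent inside the Python callback body.
callback_share = CALLBACK_TOTAL_S / MATCH_TOTAL_S

# Average time per callback invocation, in microseconds.
per_call_us = CALLBACK_TOTAL_S / CALLBACK_HITS * 1e6

# Callback invocations per scanned file (one per compiled rule, per the thread).
calls_per_file = CALLBACK_HITS / FILES_SCANNED

# Fraction of callback invocations that reported a match.
match_fraction = MATCHING_CALLS / CALLBACK_HITS

print("callback share of match time: %.1f%%" % (callback_share * 100))
print("time per callback call: %.2f us" % per_call_us)
print("callback calls per file: %.0f" % calls_per_file)
print("fraction of calls that were matches: %.2e" % match_fraction)
```

Each individual call is cheap, so the cost comes almost entirely from the number of calls, which is why limiting the callback to matching rules looks attractive here.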