Hey Victor,

Thank you for your answer. In fact, you've solved my problem :) I wasn't aware that the results from the match function are objects containing fields like meta, strings, namespace, etc. To be honest, this is not clearly stated in the documentation (I had to let Python show me all the fields they contain), but it allows me to match without using the callback and its related delay, and still have all the info I need!
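For anyone who finds this thread later, here is roughly what that looks like. This is only a minimal sketch: `summarize_matches` is a name I made up, the paths in the usage comment are placeholders, and I'm assuming each result object exposes `rule`, `namespace`, `tags` and `meta`, which is what I saw when inspecting them. I left `strings` out because I haven't checked how its layout differs between yara-python versions.

```python
def summarize_matches(rules, path):
    """Scan one file and return a plain dict per matching rule.

    `rules` is assumed to be a compiled yara.Rules object (or anything with
    a compatible match() method). Only rules that matched come back, so no
    per-rule callback is involved.
    """
    results = []
    for match in rules.match(path):
        results.append({
            "file": path,
            "rule": match.rule,
            "namespace": match.namespace,
            "tags": list(match.tags),
            "meta": dict(match.meta),
        })
    return results


# Hypothetical usage with real rules (both paths are placeholders):
#   import yara
#   rules = yara.compile(filepaths={"example": "/rules/yara/example.yar"})
#   for hit in summarize_matches(rules, "/files/sample.bin"):
#       print("%s matches %s" % (hit["rule"], hit["file"]))
```

Since the loop only ever sees the rules that matched, the work per file scales with the number of hits instead of the number of compiled rules.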
Thanks!

On Thursday, 18 May 2017 15:20:19 UTC+2, Víctor Manuel Álvarez García wrote:
>
> Wesley's description of the issue is accurate. If you are interested only
> in the matching rules, you had better use the results from the "match" function
> instead of using the callback. Allowing users to configure in which cases the
> callback is called would be a nice addition, though.
>
> On Thu, May 18, 2017 at 10:24 AM, <[email protected]> wrote:
>
>> I did the test, and the number of times the callback function gets called
>> is indeed the total number of rules × the total number of files (12345 × 12319).
>> That gives about 152 million function calls, plus additional data conversions,
>> for only 30 hits. So adding the described functionality would definitely be
>> great! I'll make an issue on GitHub about it ...
>> thx
>>
>> On Thursday, 18 May 2017 03:56:33 UTC+2, Wesley Shields wrote:
>>>
>>> I think this is expected behavior, though whether it is optimal behavior
>>> is obviously open for debate. ;)
>>>
>>> Here's what I think is happening. Every time you scan a file, your
>>> "yara_callback" function is called once for each rule, even if the rule
>>> didn't match. So if you scan 10 files with 5 rules, your callback will be
>>> called 50 times, even if none of those 5 rules match. You can check this by
>>> adding the following right after you compile your rules:
>>>
>>> print(len([rule for rule in self.rules]))
>>>
>>> That will print the number of rules that compiled. I'm guessing that if
>>> you take that number and multiply it by the number of files scanned, the
>>> result should equal the number of callback calls.
>>>
>>> Now, you could certainly argue that there should be a flag to the match
>>> function which indicates whether you want your callback called for matches,
>>> non-matches, or both. That would likely eliminate a lot of extra work to
>>> translate things into Python objects. This also explains why the "native"
>>> yara is much faster.
>>> It doesn't have to do any of the C-to-Python
>>> conversion for the data, and it also has a "don't show me non-matches" flag
>>> on by default.
>>>
>>> If Victor agrees, you could make this an issue on GitHub so it doesn't
>>> get lost. I've got some experience in this area and may be able to take it
>>> on too.
>>>
>>> -- WXS
>>>
>>> > On May 17, 2017, at 4:59 PM, [email protected] wrote:
>>> >
>>> > The profiling results were not added correctly:
>>> >
>>> > Total time: 928.533 s
>>> > File: /app/filters/yaraPOC.py
>>> > Function: match_rules at line 70
>>> >
>>> > Line #      Hits         Time  Per Hit   % Time  Line Contents
>>> > ==============================================================
>>> >     70                                               @profile
>>> >     71                                               def match_rules(self,file):
>>> >     72                                                   """
>>> >     73                                                   Matches yara rules against the file
>>> >     74                                                   :param file: relative path to the files_folder specified for the YaraFilter
>>> >     75                                                   :return: returns dictionary with matching information
>>> >     76                                                   """
>>> >     77     12319        12086      1.0      0.0          self.matching_results = []
>>> >     78     12319         8847      0.7      0.0          if not self.rules:
>>> >     79                                                       print("Rules not initialised")
>>> >     80                                                       return
>>> >     81     12319         4209      0.3      0.0          try:
>>> >     82     12319    928508227  75372.0    100.0              self.rules.match( str(file),callback=self.yara_callback, fast = True)
>>> >     83
>>> >     84                                                   except Exception as e :
>>> >     85                                                       print("Error occurred trying to match yara rules on file " + str(file) + ':' + str(e))
>>> >
>>> > Total time: 351.386 s
>>> > File: /app/filters/yaraPOC.py
>>> > Function: yara_callback at line 87
>>> >
>>> > Line #      Hits         Time  Per Hit   % Time  Line Contents
>>> > ==============================================================
>>> >     87                                               @profile
>>> >     88                                               def yara_callback(self,matching_data):
>>> >     89                                                   """
>>> >     90                                                   Callback function that gets called for yara rule that matches
>>> >     91                                                   :param matching_data:
>>> >     92                                                   :return:
>>> >     93                                                   """
>>> >     94                                                   # Currently we do not add the strings from the matching rule
>>> >     95 151991822     43182861      0.3     12.3          if matching_data['matches'] :
>>> >     96        27         1777     65.8      0.0              print ('%s matches %s' %(matching_data['rule'],self.current_file))
>>> >     97
>>> >     98 151991822    308201707      2.0     87.7          yara.CALLBACK_CONTINUE
>>> >
>>> > On Wednesday, 17 May 2017 22:56:41 UTC+2, [email protected] wrote:
>>> > Hey Wesley,
>>> > thanks for your reply.
>>> >
>>> > Here's a trimmed-down version of my code, but profiling this
>>> > function gives me the same results when applied to the same set of files.
>>> > After the code I've added some profiling results.
>>> > Most of the rules I'm using come from the public repository:
>>> > https://github.com/Yara-Rules/rules
>>> >
>>> > FYI, my yara-python is dynamically linked against libyara from my
>>> > 'native' yara install.
>>> > I did some testing with native yara and there is no comparison in
>>> > speed; it's way faster ...
>>> >
>>> > import yara
>>> > import os
>>> > import logging
>>> >
>>> > class YaraPOC():
>>> >     ALLOWED_EXTENSIONS = (r".yar",r".yara")
>>> >
>>> >     def __init__(self):
>>> >         self.current_file = ""
>>> >
>>> >     def walk_directory_tree(self,directory, extension_filter=None, recursive=True):
>>> >         file_list_res = []
>>> >         if not recursive:
>>> >             file_list_res = [os.path.join(directory, f) for f in os.listdir(directory) if
>>> >                              os.path.isfile(os.path.join(directory, f))]
>>> >         else:
>>> >             for path, subdirs, files in os.walk(directory):
>>> >                 for name in files:
>>> >                     file_list_res.append(os.path.join(path, name))
>>> >
>>> >         if not extension_filter is None:
>>> >             file_list_res = [f for f in file_list_res if f.endswith(extension_filter)]
>>> >
>>> >         return file_list_res
>>> >
>>> >     def load_rules(self, rules_folder):
>>> >
>>> >         print("Loading yararules from: %s" %rules_folder)
>>> >         rules_file_list = self.walk_directory_tree(rules_folder,YaraPOC.ALLOWED_EXTENSIONS,recursive=True)
>>> >
>>> >         # For each rule we want the path relative to our main folder to use as a namespace in yara
>>> >         namespaces = []
>>> >         remove_index = rules_folder.rfind(os.sep) + 1
>>> >         # For the namespaces we remove this "prefix" from all our paths, and create a separate list for it
>>> >         for rule in rules_file_list:
>>> >             namespaces.append(rule[remove_index::])
>>> >
>>> >         filepaths_dict = {}
>>> >         for indx, namespace in enumerate(namespaces):
>>> >             filepaths_dict[namespace] = rules_file_list[indx]
>>> >         try:
>>> >             self.rules = yara.compile(filepaths=filepaths_dict)
>>> >         except Exception as e:
>>> >             print("Compilation error in Yara rules. Are you missing an import ? ")
>>> >             print(str(e))
>>> >
>>> >         print("Loaded %s Yararules" % str(len(namespaces)))
>>> >
>>> >
>>> >     @profile
>>> >     def match_rules(self,file):
>>> >         self.matching_results = []
>>> >         if not self.rules:
>>> >             print("Rules not initialised")
>>> >             return
>>> >
>>> >         self.rules.match( str(file),callback=self.yara_callback, fast = True)
>>> >
>>> >     @profile
>>> >     def yara_callback(self,matching_data):
>>> >         if matching_data['matches'] :
>>> >             print ('%s matches %s' %(matching_data['rule'],self.current_file))
>>> >
>>> >         yara.CALLBACK_CONTINUE
>>> >
>>> > # Entrypoint
>>> > if __name__ == "__main__":
>>> >     yaraPoc = YaraPOC()
>>> >     yaraPoc.load_rules("/rules/yara")
>>> >     for file in os.listdir("/files"):
>>> >         yaraPoc.current_file = file
>>> >         yaraPoc.match_rules("/files/" + str(file))
>>> >
>>
>> --
>> You received this message because you are subscribed to the Google Groups "YARA" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
