Hey Wesley,
thanks for your reply.

Here's a trimmed-down version of my code. Profiling this version gives me
the same results when it is run against the same set of files.
I've added some profiling results after the code.
Most of the rules I'm using come from the public repository:
https://github.com/Yara-Rules/rules

FYI, my yara-python is dynamically linked against libyara from my 'native'
yara install.
I did some testing with native yara and there is no comparison in speed;
it's way faster...


import yara
import os
import logging


class YaraPOC():
    ALLOWED_EXTENSIONS = (r".yar",r".yara")

    def __init__(self):
        self.current_file = ""
        # Set here so match_rules() can test it even if compilation fails
        self.rules = None

    def walk_directory_tree(self,directory, extension_filter=None, recursive=True):
        file_list_res = []
        if not recursive:
            file_list_res = [os.path.join(directory, f) for f in os.listdir(directory) if
                             os.path.isfile(os.path.join(directory, f))]
        else:
            for path, subdirs, files in os.walk(directory):
                for name in files:
                    file_list_res.append(os.path.join(path, name))

        if extension_filter is not None:
            file_list_res = [f for f in file_list_res if f.endswith(extension_filter)]

        return file_list_res

    def load_rules(self, rules_folder):

        print("Loading yararules from: %s" %rules_folder)
        rules_file_list = self.walk_directory_tree(rules_folder,YaraPOC.ALLOWED_EXTENSIONS,recursive=True)
        # For each rule we want the path relative to our main folder to use as a namespace in yara
        namespaces = []
        remove_index = rules_folder.rfind(os.sep) + 1
        # For the namespaces we remove this "prefix" from all our paths, and create a separate list for it
        for rule in rules_file_list:
            namespaces.append(rule[remove_index:])

        filepaths_dict = {}
        for indx, namespace in enumerate(namespaces):
            filepaths_dict[namespace] = rules_file_list[indx]
        try:
            self.rules = yara.compile(filepaths=filepaths_dict)
        except Exception as e:
            print("Compilation error in Yara rules. Are you missing an import?")
            print(str(e))

        print("Loaded %s Yararules" % str(len(namespaces)))


    # @profile is injected into builtins by line_profiler (run via kernprof -l)
    @profile
    def match_rules(self,file):
        self.matching_results = []
        if not self.rules:
            print("Rules not initialised")
            return

        self.rules.match( str(file),callback=self.yara_callback, fast = True)

    @profile
    def yara_callback(self,matching_data):
        if matching_data['matches'] :
            print ('%s matches %s' %(matching_data['rule'],self.current_file))

        # Return the constant explicitly; the bare expression had no effect
        return yara.CALLBACK_CONTINUE

# Entrypoint
if __name__ == "__main__":
    yaraPoc = YaraPOC()
    yaraPoc.load_rules("/rules/yara")
    for file in os.listdir("/files"):
        yaraPoc.current_file = file
        yaraPoc.match_rules("/files/" + str(file))
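
A side note on the namespace derivation in load_rules above: slicing at the last os.sep keeps the final folder name in the namespace for "/rules/yara" but drops it for "/rules/yara/". Below is a minimal sketch of a variant built on os.path.relpath that gives the same namespace with or without the trailing separator. The helper name namespace_for and the example rule path are hypothetical and not part of the original code.

```python
import os


def namespace_for(rule_path, rules_folder):
    """Return rule_path relative to the parent of rules_folder."""
    # normpath drops a trailing separator, so both spellings of the folder agree
    parent = os.path.dirname(os.path.normpath(rules_folder))
    return os.path.relpath(rule_path, parent)


# Hypothetical rule file, used only for illustration
rule = os.path.join(os.sep, "rules", "yara", "malware", "apt.yar")
folder = os.path.join(os.sep, "rules", "yara")

print(namespace_for(rule, folder))
print(namespace_for(rule, folder + os.sep))
```

Both calls print the same relative path, so the namespaces passed to yara.compile(filepaths=...) would not change depending on how the rules folder was typed.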



Total time: 928.533 s
File: /app/filters/yaraPOC.py
Function: match_rules at line 70

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    70                                               @profile
    71                                               def match_rules(self, file):
    72                                                   """
    73                                                   Matches yara rules against the file
    74                                                   :param file: relative path to the files_folder specified for the YaraFilter
    75                                                   :return: returns dictionary with matching information
    76                                                   """
    77     12319        12086      1.0      0.0          self.matching_results = []
    78     12319         8847      0.7      0.0          if not self.rules:
    79                                                       print("Rules not initialised")
    80                                                       return
    81     12319         4209      0.3      0.0          try:
    82     12319    928508227  75372.0    100.0              self.rules.match( str(file),callback=self.yara_callback, fast = True)
    83
    84                                                   except Exception as e :
    85                                                     print("Error occured trying to match yara rules on file " + str(file) + ':' +  str(e))



Total time: 351.386 s
File: /app/filters/yaraPOC.py
Function: yara_callback at line 87

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    87                                               @profile
    88                                               def yara_callback(self, matching_data):
    89                                                   """
    90                                                   Callback function that gets called for yara rule that matches
    91                                                   :param matching_data:
    92                                                   :return:
    93                                                   """
    94                                                   # Currently we do not add the strings from the matching rule
    95 151991822     43182861      0.3     12.3          if matching_data['matches'] :
    96        27         1777     65.8      0.0              print ('%s matches %s' %(matching_data['rule'],self.current_file))
    97
    98 151991822    308201707      2.0     87.7          yara.CALLBACK_CONTINUE
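
As a rough check on the two tables above, the hit counts can be divided out to see how often the callback fires per scanned file. This is only arithmetic on the reported figures; the variable names are illustrative, and the times are assumed to be in microseconds (which is consistent with the reported totals in seconds).

```python
# Figures copied from the two profiler tables above.
files_scanned = 12319        # hits on the rules.match() line
callback_calls = 151991822   # hits on the first line of yara_callback
match_time_us = 928508227    # time spent in rules.match(), assumed microseconds
callback_time_s = 351.386    # total time reported for yara_callback
total_time_s = 928.533       # total time reported for match_rules

# Callback invocations per scanned file, and whether the division is exact
calls_per_file, remainder = divmod(callback_calls, files_scanned)
print(calls_per_file, remainder)

# Average time per file in milliseconds
print(round(match_time_us / files_scanned / 1000, 1))

# Share of the matching time spent inside the Python callback, in percent
print(round(100 * callback_time_s / total_time_s, 1))
```

The division is exact: 12338 callback invocations per file with no remainder. That suggests the callback runs a fixed number of times for every file, matching or not, and 12338 is plausibly the total number of rules contained in the compiled rule files (an assumption; the "1000 to 1500" figure may count rule files rather than rules). The callback's share of the run time is inflated by profiler overhead on roughly 152 million calls. Newer yara-python releases document a which_callbacks argument to match() (for example yara.CALLBACK_MATCHES) that restricts the callback to matching rules; whether the installed version supports it would need checking.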






On Wednesday, 17 May 2017 at 16:30:51 UTC+2, Wesley Shields wrote:
>
> Based upon my understanding I don't think this is expected behavior. Can 
> you share a minimal proof of concept which shows this happening? 
>
> -- WXS 
>
> > On May 17, 2017, at 8:18 AM, [email protected] wrote: 
> > 
> > Hello again, 
> > 
> > I'm using yara-python to match rules against a lot of files. The 
> > problem is that when the number of files gets big, the performance is 
> > really horrible. 
> > 
> > When doing some profiling I noticed that when running 12319 files 
> > against about 1000 to 1500 rules, the yara callback function gets 
> > called 151991822 times. 
> > Does the callback function get called for each matching string of a 
> > rule? 
> > In my test I only had 28 matches, so it's not that I'm doing any heavy 
> > lifting when there's a match. 
> > Is there anything I can do about this, or is this behaviour to be 
> > expected with this number of files? I haven't compared against native 
> > yara yet... 
> > 
> > Thank you 
> > 
> > 
> >   
> > 
> > -- 
> > You received this message because you are subscribed to the Google 
> > Groups "YARA" group. 
> > To unsubscribe from this group and stop receiving emails from it, send 
> > an email to [email protected]. 
> > For more options, visit https://groups.google.com/d/optout. 
>
>
