Hey Wesley,

thanks for your reply.

Here's a trimmed-down version of my code; profiling it still gives me the
same results when run against the same set of files.
After the code I've added some profiling results.

Most of the rules I'm using come from the public repository:
https://github.com/Yara-Rules/rules

FYI, my yara-python is dynamically linked against libyara from my 'native'
yara install.
I did some testing with native yara and there is no comparison in speed:
it's way faster...

import yara
import os
import logging


class YaraPOC:
    ALLOWED_EXTENSIONS = (".yar", ".yara")

    def __init__(self):
        self.current_file = ""
        self.rules = None  # set by load_rules()
        self.matching_results = []

    def walk_directory_tree(self, directory, extension_filter=None, recursive=True):
        file_list_res = []
        if not recursive:
            file_list_res = [os.path.join(directory, f) for f in os.listdir(directory)
                             if os.path.isfile(os.path.join(directory, f))]
        else:
            for path, subdirs, files in os.walk(directory):
                for name in files:
                    file_list_res.append(os.path.join(path, name))
        if extension_filter is not None:
            file_list_res = [f for f in file_list_res if f.endswith(extension_filter)]
        return file_list_res

    def load_rules(self, rules_folder):
        print("Loading yararules from: %s" % rules_folder)
        rules_file_list = self.walk_directory_tree(
            rules_folder, YaraPOC.ALLOWED_EXTENSIONS, recursive=True)
        # For each rule we want the path relative to our main folder to use as
        # a namespace in yara
        namespaces = []
        remove_index = rules_folder.rfind(os.sep) + 1
        # For the namespaces we remove this "prefix" from all our paths, and
        # create a separate list for it
        for rule in rules_file_list:
            namespaces.append(rule[remove_index:])
        filepaths_dict = {}
        for indx, namespace in enumerate(namespaces):
            filepaths_dict[namespace] = rules_file_list[indx]
        try:
            self.rules = yara.compile(filepaths=filepaths_dict)
        except Exception as e:
            print("Compilation error in Yara rules. Are you missing an import?")
            print(str(e))
        print("Loaded %s Yararules" % str(len(namespaces)))

    @profile  # injected into builtins by kernprof (line_profiler) at run time
    def match_rules(self, file):
        self.matching_results = []
        if not self.rules:
            print("Rules not initialised")
            return
        self.rules.match(str(file), callback=self.yara_callback, fast=True)

    @profile
    def yara_callback(self, matching_data):
        if matching_data['matches']:
            print('%s matches %s' % (matching_data['rule'], self.current_file))
        return yara.CALLBACK_CONTINUE


# Entrypoint
if __name__ == "__main__":
    yaraPoc = YaraPOC()
    yaraPoc.load_rules("/rules/yara")
    for file in os.listdir("/files"):
        yaraPoc.current_file = file
        yaraPoc.match_rules(os.path.join("/files", str(file)))

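For comparison, here is a rough sketch of the same scan without a callback, relying on the list of match objects that rules.match() returns. The helper name matching_rule_names and the FakeRules/FakeMatch stand-ins are made up for illustration so the snippet runs without libyara installed; with the real library, `rules` would be the object returned by yara.compile(). I have not measured whether this is actually faster on my file set.

```python
def matching_rule_names(rules, path):
    """Scan one file and return the names of the rules that matched.

    `rules` is assumed to behave like the object returned by yara.compile().
    No callback is passed, so no Python function is invoked per rule.
    """
    return [match.rule for match in rules.match(path, fast=True)]


# Hypothetical stand-ins (not part of yara-python) so the sketch is runnable
# on a machine without libyara.
class FakeMatch:
    def __init__(self, rule):
        self.rule = rule


class FakeRules:
    def match(self, filepath, fast=False):
        # Pretend exactly one rule matched the file.
        return [FakeMatch("example_rule")]


print(matching_rule_names(FakeRules(), "/files/sample.bin"))
```

With the class above, match_rules() could then print or store the returned names instead of doing that work inside yara_callback().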

Total time: 928.533 s
File: /app/filters/yaraPOC.py
Function: match_rules at line 70

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    70                                               @profile
    71                                               def match_rules(self, file):
    72                                                   """
    73                                                   Matches yara rules against the file
    74                                                   :param file: relative path to the files_folder specified for the YaraFilter
    75                                                   :return: returns dictionary with matching information
    76                                                   """
    77     12319        12086      1.0      0.0          self.matching_results = []
    78     12319         8847      0.7      0.0          if not self.rules:
    79                                                       print("Rules not initialised")
    80                                                       return
    81     12319         4209      0.3      0.0          try:
    82     12319    928508227  75372.0    100.0              self.rules.match( str(file),callback=self.yara_callback, fast = True)
    83
    84                                                   except Exception as e :
    85                                                       print("Error occured trying to match yara rules on file " + str(file) + ':' + str(e))

Total time: 351.386 s
File: /app/filters/yaraPOC.py
Function: yara_callback at line 87

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    87                                               @profile
    88                                               def yara_callback(self, matching_data):
    89                                                   """
    90                                                   Callback function that gets called for yara rule that matches
    91                                                   :param matching_data:
    92                                                   :return:
    93                                                   """
    94                                                   # Currently we do not add the strings from the matching rule
    95 151991822     43182861      0.3     12.3          if matching_data['matches'] :
    96        27         1777     65.8      0.0              print ('%s matches %s' %(matching_data['rule'],self.current_file))
    97
    98 151991822    308201707      2.0     87.7          yara.CALLBACK_CONTINUE

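A quick sanity check on those numbers, using only the figures from the profiler output above (the variable names are mine). The callback count divides evenly by the number of files scanned, which would be consistent with the callback firing once per compiled rule per file, matching or not. Note that my "Loaded ... Yararules" message counts rule files, so the number of individual rules can be much higher than the 1000 to 1500 I quoted.

```python
# Figures copied from the line_profiler output above.
callback_hits = 151991822      # hits on the first line of yara_callback
files_scanned = 12319          # hits on the rules.match() line
match_total_s = 928.533        # total time reported for match_rules
callback_total_s = 351.386     # total time reported for yara_callback

# Callback invocations per scanned file, and whatever is left over.
calls_per_file, remainder = divmod(callback_hits, files_scanned)
print("callback calls per file:", calls_per_file, "remainder:", remainder)

# Share of the match time spent inside the Python callback
# (profiler overhead is included in both totals).
callback_share = callback_total_s / match_total_s
print("callback share of match time: %.1f%%" % (callback_share * 100))

# Average wall time per scanned file, in milliseconds.
ms_per_file = match_total_s / files_scanned * 1000
print("ms per file: %.1f" % ms_per_file)
```

This gives 12338 callback calls per file with a remainder of 0, and a bit over a third of the scan time is attributed to the callback itself.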
On Wednesday, 17 May 2017 at 16:30:51 UTC+2, Wesley Shields wrote:
>
> Based upon my understanding I don't think this is expected behavior. Can
> you share a minimal proof of concept which shows this happening?
>
> -- WXS
>
> > On May 17, 2017, at 8:18 AM, [email protected] wrote:
> >
> > Hello again,
> >
> > I'm using yara-python to match rules against a lot of files. The problem
> > is that when the number of files gets big, the performance is really
> > horrible.
> >
> > When doing some profiling I noticed that when running 12319 files against
> > about 1000 to 1500 rules, the yara callback function gets called
> > 151991822 times?
> > Does the callback function get called for each matching string of a rule?
> > In my test I only had 28 matches, so it's not that I'm doing any heavy
> > lifting when there's a match.
> > Is there anything I can do about this, or is this behaviour to be expected
> > with this number of files? I haven't compared against native yara yet...
> >
> > Thank you
> >
> >
> >
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "YARA" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> > an email to [email protected].
> > For more options, visit https://groups.google.com/d/optout.
>
>