Taras,

On Wed, Aug 18, 2010 at 7:52 AM, Taras <ox...@oxdef.info> wrote:
> Andres,
>
> The tactic of increasing the grep plugins' performance by changing the
> parsing engine only works for plugins that request parsed (not plain) data
> by calling getDocumentParserFor(), doesn't it?
This new approach has two different implications. The first one is directly
related to plugins like this one [0], where I'm doing the following:

- Removing the ugly regular expressions that "tried to match tags" and were
applied to the HTML response:

    self._object = re.compile(r'< *object([^>]*)>', re.IGNORECASE)
    self._applet = re.compile(r'< *applet([^>]*)>', re.IGNORECASE)

- Adding nicer code, which is less false positive/negative prone and faster:

    self._tag_names.append('object')
    self._tag_names.append('applet')
    ...
    dom = response.getDOM()
    if dom != None:
        for tag_name in self._tag_names:
            element_list = dom.findall( tag_name )

At this moment, only four plugins use the new DOM feature provided by the
framework:

- ajax.py
- feeds.py
- fileUpload.py
- objects.py

The idea is that in the future we'll use this in most of the situations where
we now use regular expressions. In some cases (of course!) we won't be able
to replace the regular expressions, but we're going to be able to reduce the
length of the string the regex is applied to [1] and the regex complexity
(more complex == more false positives == more frustrated/mad users).

> But at the same time we have only 4 such plugins:
>
> $ grep -R getDocumentParserFor *.py
> findComments.py:        dp = dpCache.dpc.getDocumentParserFor( response )
> getMails.py:            dp = dpCache.dpc.getDocumentParserFor( response )
> metaTags.py:            dp = dpCache.dpc.getDocumentParserFor( response )
> strangeParameters.py:   dp = dpCache.dpc.getDocumentParserFor( response )
>
> So is that enough (4 plugins) to justify changing the parsing engine?

The second part of the change is that now, instead of using BeautifulSoup
inside the htmlParser to sanitize the potentially broken HTML, we're using
libxml2's "recover" feature. In order to achieve this, I changed the
htmlParser to look like this:

    def _preParse( self, httpResponse ):
        '''
        @parameter httpResponse: The HTTP response document that contains the
        HTML document inside its body.
        '''
        assert self._baseUrl != '', 'The base URL must be set.'

        HTMLDocument = httpResponse.getBody()

        if self._normalizeMarkup:
            # In some cases, the parsing library could fail.
            if httpResponse.getDOM() != None:
                HTMLDocument = etree.tostring( httpResponse.getDOM() )

        # Now we are ready to work
        self._parse( HTMLDocument )

And this is the implementation of the getDOM() method:

    def getDOM( self ):
        '''
        I don't want to calculate the DOM for all responses, only for those
        which are needed. This method will calculate the DOM on the first
        call, and then save it for other calls to this method.

        @return: The DOM, or None if the HTML normalization failed.
        '''
        if self._dom == None:
            try:
                parser = etree.XMLParser(recover=True)
                self._dom = etree.fromstring( self._body, parser )
            except Exception, e:
                print e
                msg = 'The HTTP body for "' + self.getURL() + '" could NOT be'
                msg += ' parsed by libxml2.'
                om.out.debug( msg )
        return self._dom

I still need to check whether it's better to use the XMLParser or the
HTMLParser in this case.

This second part of the change will make the whole discovery/grep process
faster, because all the plugins that used getDocumentParserFor() before were
using the slow BeautifulSoup.

What do you guys think about the change? Have you been able to test it? Any
problems with the new dependency (apt-get install python-libxml2)?
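In case anyone wants to play with the XMLParser vs. HTMLParser question
without checking out the branch, here is a minimal standalone sketch. It
assumes the etree used above is lxml.etree, and the broken HTML sample and
the tag names are just made-up test data:

    from lxml import etree

    # Made-up, intentionally broken HTML: unclosed tags, unquoted
    # attributes and mixed-case tag names.
    broken_html = '<html><body><OBJECT data=movie.swf><p>hello<applet code=Foo.class></body>'

    # The parser that getDOM() uses above; recover=True tells libxml2 to
    # keep whatever it can salvage instead of raising on the first error.
    xml_dom = etree.fromstring( broken_html, etree.XMLParser(recover=True) )
    if xml_dom is not None:
        print 'XMLParser recovered:', etree.tostring( xml_dom )

    # The HTML parser alternative; it recovers by default, knows about HTML
    # quirks like implied end tags, and lower-cases tag names, so a plain
    # findall() for 'object' should also match the <OBJECT> above.
    html_dom = etree.fromstring( broken_html, etree.HTMLParser(recover=True) )
    if html_dom is not None:
        for tag_name in [ 'object', 'applet' ]:
            # Like the objects.py snippet, but searching all descendants.
            print tag_name, len( html_dom.findall( './/' + tag_name ) )

If the HTML parser turns out to be just as fast, it might be the better fit
for what getDOM() needs.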
[0] http://w3af.svn.sourceforge.net/viewvc/w3af/trunk/plugins/grep/objects.py?r1=3142&r2=3509
[1] http://w3af.svn.sourceforge.net/viewvc/w3af/trunk/plugins/grep/ajax.py?r1=3142&r2=3509

Regards,

>> I've been working on the performance of the grep plugins. I basically
>> found that some of them used regular expressions heavily and those
>> regular expressions were far from being fast. After some hours of trying
>> to enhance the performance of each particular regex, I decided to move on
>> and change the tactic. I tried the following:
>>
>> 1- Load the HTML into a xml.minidom
>> 2- Load the HTML into a BeautifulSoup
>> 3- Load the HTML into a libxml2
>>
>> The first one was fast, but... it failed to parse broken HTML. The second
>> one was GREAT at handling broken HTML, but made my tests run in DOUBLE the
>> time! Finally, libxml2 gave us a good balance between speed and broken
>> HTML handling. With #3 I reduced my test time from 10sec to 4sec. The
>> attached file shows the functions that consume the most CPU time.
>> Tomorrow I'll be working on enhancing the grep plugins even more; if you
>> want to help, please join the #w3af IRC channel and we'll work together!
>
> --
> Taras
> http://oxdef.info

--
Andrés Riancho
Founder, Bonsai - Information Security
http://www.bonsai-sec.com/
http://w3af.sf.net/