Taras,

On Wed, Aug 18, 2010 at 7:52 AM, Taras <ox...@oxdef.info> wrote:
> Andres,
>
> The tactic of increasing grep plugin performance by changing the
> parsing engine only helps plugins in which the parsed (not plain) data
> is requested (by calling getDocumentParserFor()), doesn't it?

    This new approach has two different implications. The first one is
directly related to plugins like this [0], where I'm doing the
following:

- Removing the ugly regular expressions that tried to match tags and
were applied to the HTML response:

         self._object = re.compile(r'< *object([^>]*)>', re.IGNORECASE)         
         self._applet = re.compile(r'< *applet([^>]*)>', re.IGNORECASE)

- Adding nicer code, which is less prone to false positives/negatives
and faster (see the standalone sketch after this snippet):

        self._tag_names.append('object')
        self._tag_names.append('applet')
        ...
        dom = response.getDOM()
        if dom is not None:
            for tag_name in self._tag_names:
                element_list = dom.findall( tag_name )
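
To make that concrete, here is a minimal, self-contained sketch of the
same idea. It is NOT the plugin code itself: the helper names and the
sample HTML are made up, and it builds its own DOM instead of going
through response.getDOM():

    # A hedged sketch, not the committed plugin code.
    from lxml import etree

    def get_dom(html_body):
        # Same idea as response.getDOM(): libxml2 in recovery mode.
        return etree.fromstring(html_body, etree.XMLParser(recover=True))

    def grep_objects(html_body, tag_names=('object', 'applet')):
        dom = get_dom(html_body)
        findings = []
        if dom is not None:
            for tag_name in tag_names:
                # Only real elements match, so an "<object" fragment inside
                # a comment or a JavaScript string no longer produces a
                # false positive.
                for element in dom.findall('.//' + tag_name):
                    findings.append(etree.tostring(element))
        return findings

    print grep_objects('<html><body><applet code="A.class"></applet>'
                       '</body></html>')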

At this moment, only four plugins use the new DOM feature provided by
the framework:
    - ajax.py
    - feeds.py
    - fileUpload.py
    - objects.py

The idea is that in the future, we'll use this in most situations where
we now use regular expressions. In some cases (of course!) we won't
be able to replace the regular expressions, but we're going to be able
to reduce the length of the string they're applied to [1] and the
regex complexity (more complex == more false positives == more
frustrated/mad users).
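
For example (a sketch with a made-up regex and function name, not the
actual ajax.py code), the "shorter string" part means running the
pattern only over the text of the <script> elements that the DOM hands
us, instead of over the whole response body:

    # Illustrative only: the regex and the function name are not the
    # plugin's own.
    import re
    from lxml import etree

    AJAX_RE = re.compile(r'XMLHttpRequest|ActiveXObject', re.IGNORECASE)

    def looks_like_ajax(html_body):
        dom = etree.fromstring(html_body, etree.XMLParser(recover=True))
        if dom is None:
            return False
        for script in dom.findall('.//script'):
            # The regex now scans a few KB of script text, not the full
            # HTML page.
            if script.text and AJAX_RE.search(script.text):
                return True
        return False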

> But at the same time we have only 4 such plugins:
> $ grep -R getDocumentParserFor *.py
> findComments.py:                    dp = dpCache.dpc.getDocumentParserFor( response )
> getMails.py:            dp = dpCache.dpc.getDocumentParserFor( response )
> metaTags.py:                    dp = dpCache.dpc.getDocumentParserFor( response )
> strangeParameters.py:            dp = dpCache.dpc.getDocumentParserFor( response )
>
> So are 4 plugins enough to justify changing the parsing engine?

    The second part of the change is that instead of using
BeautifulSoup inside the htmlParser to sanitize the potentially broken
HTML, we're now using libxml2's "recover" feature. To achieve this, I
changed the htmlParser to look like this:

    def _preParse( self, httpResponse ):
        '''
        @parameter httpResponse: The HTTP response document that contains
        the HTML document inside its body.
        '''
        assert self._baseUrl != '', 'The base URL must be set.'

        HTMLDocument = httpResponse.getBody()

        if self._normalizeMarkup:
            # In some cases, the parsing library could fail.
            if httpResponse.getDOM() is not None:
                HTMLDocument = etree.tostring( httpResponse.getDOM() )

        # Now we are ready to work
        self._parse ( HTMLDocument )

And this is the implementation of the getDOM() method:

    def getDOM( self ):
        '''
        I don't want to calculate the DOM for all responses, only for
        those which need it. This method calculates the DOM on the first
        call and caches it for subsequent calls.

        @return: The DOM, or None if the HTML normalization failed.
        '''
        if self._dom is None:
            try:
                parser = etree.XMLParser(recover=True)
                self._dom = etree.fromstring( self._body, parser )
            except Exception, e:
                msg = 'The HTTP body for "' + self.getURL() + '" could NOT be'
                msg += ' parsed by libxml2: ' + str(e)
                om.out.debug( msg )
        return self._dom

I still need to check if it's better to use the XMLParser or the
HTMLParser in this case.
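
For reference, this is roughly what the two options look like side by
side (just a local experiment, not committed code). Both parsers come
from lxml; the HTML one knows about HTML's implied-close rules, while
the XML one in recovery mode is more literal:

    from lxml import etree

    broken = '<html><body><p>unclosed <b>tags<table><tr><td>x</body>'

    # Option 1: what getDOM() uses today. recover=True salvages a tree
    # from broken markup, but it can still raise for some inputs (hence
    # the try/except in getDOM()).
    xml_dom = etree.fromstring(broken, etree.XMLParser(recover=True))

    # Option 2: the dedicated HTML parser, which applies tag-soup rules.
    html_dom = etree.fromstring(broken, etree.HTMLParser())

    print etree.tostring(xml_dom)
    print etree.tostring(html_dom)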

This second part of the change will make the whole discovery/grep
process faster, because all the plugins that used
getDocumentParserFor() before were using the slow BeautifulSoup.
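
If anyone wants to reproduce the numbers locally, something along these
lines should do it (illustrative only: the file name and iteration
count are made up, and the import assumes the BeautifulSoup 3.x
package):

    import timeit

    # 'saved_response.html' is a placeholder for any saved HTTP body.
    setup = ("from BeautifulSoup import BeautifulSoup\n"
             "from lxml import etree\n"
             "html = open('saved_response.html').read()")

    print timeit.Timer('BeautifulSoup(html)', setup).timeit(number=100)
    print timeit.Timer(
        'etree.fromstring(html, etree.XMLParser(recover=True))',
        setup).timeit(number=100)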

What do you guys think about the change? Have you been able to test
it? Any problems with the new dependency (apt-get install
python-libxml2)?

[0] 
http://w3af.svn.sourceforge.net/viewvc/w3af/trunk/plugins/grep/objects.py?r1=3142&r2=3509
[1] 
http://w3af.svn.sourceforge.net/viewvc/w3af/trunk/plugins/grep/ajax.py?r1=3142&r2=3509

Regards,

>>     I've been working on the performance of the grep plugins, I
>> basically found that some of them used regular expressions heavily and
>> those regular expressions were far from being fast. After some hours
>> of trying to enhance the performance of each particular regex, I
>> decided to move on and change the tactic. I tried the following:
>>
>> 1- Load the HTML into xml.minidom
>> 2- Load the HTML into BeautifulSoup
>> 3- Load the HTML into libxml2
>>
>>     The first one was fast, but... it failed to parse broken HTML. The
>> second one was GREAT at handling broken HTML, but made my tests run in
>> DOUBLE the time! Finally, libxml2 gave us a good balance between speed
>> and broken HTML handling. With #3 I reduced my test time from 10sec to
>> 4sec. The attached file shows the functions that consume the most CPU
>> time. Tomorrow I'll be working on enhancing the grep plugins even
>> more, if you want to help, please join the #w3af IRC channel, and
>> we'll work together!
>
>
> --
> Taras
> http://oxdef.info
>



-- 
Andrés Riancho
Founder, Bonsai - Information Security
http://www.bonsai-sec.com/
http://w3af.sf.net/
