Javier,

On Tue, Jul 19, 2011 at 12:41 PM, Javier Andalia
<javier_anda...@rapid7.com> wrote:
> On 07/19/2011 04:29 PM, Andres Riancho wrote:
>>
>> Javier,
>>
>> On Tue, Jul 19, 2011 at 12:18 PM, Javier Andalia
>> <javier_anda...@rapid7.com>  wrote:
>>>
>>> On 07/19/2011 02:54 PM, Andres Riancho wrote:
>>>>
>>>> Javier,
>>>>
>>>> On Tue, Jul 19, 2011 at 10:21 AM, Javier Andalia
>>>> <javier_anda...@rapid7.com>    wrote:
>>>>>
>>>>> List,
>>>>>
>>>>> This is my attempt to improve the performance of the xpath evaluation
>>>>> given
>>>>> a DOM Element.
>>>>> The original (and current) version is in httpResponse.py. Examples of
>>>>> how
>>>>> this is used can be found at:
>>>>> ajax.py, fileUpload.py, formAutocomplete.py, etc
>>>>>
>>>>>
>>>>>    def getDOM2(self):
>>>>>        '''
>>>>>        Return a DOM wrapper whose xpath() walks the parsed body one
>>>>>        <tag> element at a time and yields those matching the given
>>>>>        relative XPath predicate.
>>>>>        '''
>>>>>        class DOM(object):
>>>>>            def xpath(self, tag, xpathpredicate='.'):
>>>>>                root = etree.fromstring(self.body,
>>>>>                                        etree.HTMLParser(recover=True))
>>>>>                # Visit only the elements with the requested tag and
>>>>>                # evaluate the relative predicate on each of them.
>>>>>                context = etree.iterwalk(root, events=('start',), tag=tag)
>>>>>                try:
>>>>>                    # Compiled here so an invalid expression is caught
>>>>>                    # by the except clause below
>>>>>                    xpath = etree.XPath(xpathpredicate)
>>>>>                    for evt, elem in context:
>>>>>                        if xpath(elem):
>>>>>                            yield elem
>>>>>                        # Drop already-visited siblings to keep memory low
>>>>>                        while elem.getprevious() is not None:
>>>>>                            del elem.getparent()[0]
>>>>>                except etree.XPathSyntaxError:
>>>>>                    om.out.debug('Invalid XPath expression: "%s"' %
>>>>>                                 xpathpredicate)
>>>>>                    raise
>>>>>                del context
>>>>
>>>>     Are you sure that this is equivalent to the old implementation?
>>>
>>> What do you mean? It is certainly a little more complex but still
>>> equivalent.
>>
>> Sorry for not being clear enough! My question was: is your
>> implementation going to return the same result as the old
>> implementation for ALL inputs?
>>
>
> Pretty sure! Note that there's a slight variation in the way the 'xpath'
> method is called in the experimental implementation, though.
>
> Typical lines such as:
>
> dom.xpath("//input[translate(@type,'PASWORD','pasword')='password']")
>
>
> were converted to:
>
> dom.xpath(tag='input',
>          xpathpredicate="translate(@type,'PASWORD','pasword')='password'")

Interesting.
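
A quick (though by no means exhaustive) way to sanity-check that is to run
both call styles against the same document and compare what they select.
A minimal sketch, assuming lxml and the test page attached to this thread:

    from lxml import etree

    with open('index-form-two-fields.html') as f:
        html = f.read()
    root = etree.fromstring(html, etree.HTMLParser(recover=True))

    # Old style: one absolute XPath evaluated from the root
    old = root.xpath(
        "//input[translate(@type,'PASWORD','pasword')='password']")

    # New style: walk the <input> elements and keep those matching the
    # relative predicate (same idea as getDOM2, minus the memory cleanup)
    pred = etree.XPath("translate(@type,'PASWORD','pasword')='password'")
    new = [elem for _evt, elem in
           etree.iterwalk(root, events=('start',), tag='input')
           if pred(elem)]

    assert ([etree.tostring(e) for e in old] ==
            [etree.tostring(e) for e in new])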

I still would like to know where the majority of the CPU time goes.
gprof2dot can tell you that in a very visual way.
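
A minimal recipe for that (bench_getdom2.py below is just a placeholder
name for whatever script runs the res.getDOM2().xpath(...) loop):

    # 1. Profile the benchmark and dump the stats:
    #        python -m cProfile -o getdom2.pstats bench_getdom2.py
    #
    # 2. Render the call graph with gprof2dot + graphviz:
    #        gprof2dot.py -f pstats getdom2.pstats | dot -Tpng -o getdom2.png
    #
    # 3. Or just print the top consumers straight from Python:
    import pstats
    stats = pstats.Stats('getdom2.pstats')
    stats.sort_stats('cumulative').print_stats(20)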

>
>>>>     I'm guessing that the old implementation is faster because it's C
>>>> with a Python wrapper, while this is "Python calling many different
>>>
>>> That makes sense. Additionally, I think it is slower because the xpath
>>> evaluation occurs *only once* in the original implementation. I definitely
>>> misunderstood what was explained in the "Finding elements quickly" section
>>> of [1], where they focus on the use of 'find' and 'findall' vs. more
>>> efficient alternatives. In our code we use simple, direct xpath
>>> evaluation; it seems nothing can be faster than that.
>>>
>>> Javier
>>>
>>>
>>> [1] http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
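
Right. If there is anything left to shave off in that direction, I think it
is compiling the expression once with etree.XPath and reusing it across
responses, instead of passing the string to xpath() on every call.
Something along these lines (just a sketch, the names are made up):

    # Compiled once, e.g. at module level:
    PASSWORD_INPUTS_XPATH = etree.XPath(
        "//input[translate(@type,'PASWORD','pasword')='password']")

    # Reused for every response; the whole traversal still happens in C:
    for elem in PASSWORD_INPUTS_XPATH(response.getDOM()):
        pass  # process the matching <input> element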
>>>
>>>> C functions many times"? Have you tested [0] to see WHERE the CPU is consumed?
>>>>
>>>> [0] http://code.google.com/p/jrfonseca/wiki/Gprof2Dot
>>>>
>>>>>        dom = DOM()
>>>>>        dom.body = self.body
>>>>>        return dom
>>>>>
>>>>>
>>>>>
>>>>> Unfortunately this didn't work out as expected. It is slower.
>>>>>
>>>>>>>>  import timeit
>>>>>>>>  code = '''
>>>>> f = open("index-form-two-fields.html")
>>>>> html = f.read()
>>>>> f.close()
>>>>> u = url_object('http://w3af.com')
>>>>> res = core.data.url.httpResponse.httpResponse(
>>>>>     200, html, {'content-type': 'text/html'}, u, u)
>>>>> for i in res.getDOM2().xpath(
>>>>>         'input', "translate(@type,'PASWORD','pasword')='password'"):
>>>>>     pass
>>>>> '''
>>>>>>>>  setup = '''import sys
>>>>> sys.path.append('/home/jandalia/workspace/w3af.unicode')
>>>>> from core.data.parsers.urlParser import url_object
>>>>> import core.data.url.httpResponse
>>>>> '''
>>>>>>>>  t = timeit.Timer(code, setup)
>>>>>>>>  min(t.repeat(repeat=3, number=10000))
>>>>> 27.584304094314575
>>>>>
>>>>>
>>>>> Using the original version:
>>>>>
>>>>>>>>  code = '''
>>>>> f = open("/home/jandalia/Desktop/index-form-two-fields.html")
>>>>> html = f.read()
>>>>> f.close()
>>>>> u = url_object('http://w3af.com')
>>>>> res = core.data.url.httpResponse.httpResponse(
>>>>>     200, html, {'content-type': 'text/html'}, u, u)
>>>>> dom = res.getDOM()
>>>>> for i in dom.xpath("//input[translate(@type,'PASWORD','pasword')='password']"):
>>>>>     pass
>>>>> '''
>>>>>>>>  t = timeit.Timer(code, setup)
>>>>>>>>  min(t.repeat(repeat=3, number=10000))
>>>>> 3.8396580219268799
>>>>>
>>>>>
>>>>> In other words, it is about 7 times slower.
>>>>> If anyone has an idea on how to improve this code it would be very much
>>>>> appreciated. The HTML doc used for the tests is attached.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Javier
>>>>>
>>>>> Note: Some useful info can be found here:
>>>>> http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>



-- 
Andrés Riancho
Director of Web Security at Rapid7 LLC
Founder at Bonsai Information Security
Project Leader at w3af
