[Python-Dev] Speeding up 2to3: Results from a GSOC Project

2010-07-30 Thread George Boutsioukis
Hi everyone, I haven't had a chance to introduce myself yet. I'm George
Boutsioukis, a CS student from Greece and I'm currently enrolled as a GSOC
student for the PSF. The task I was involved with for the past few weeks was
speeding up the 2to3 tool. 

For those who are not aware of 2to3's internals, the tool matches a series 
of tree patterns to a python syntax tree and applies a code transformation 
on each match. The real bottleneck of this tool is the tree pattern matching
algorithm, since the current version traverses all nodes of the tree,
top-to-bottom, and checks each node against a set of tree patterns.

The new algorithm I've implemented reduces the candidate nodes for matching.
The process works in three steps:

1) Each pattern tree is reduced to the most 'characteristic' path, i.e. the
rarest path you'll encounter on a real code tree. (this is simpler than it
sounds)

2) From the set of the above linear patterns an Aho-Corasick automaton is
constructed that is capable of inclusively matching all patterns(meaning 
that it may mark a wrong candidate but will never miss an instance)

3) The automaton is run on each leaf until the root of the tree is reached 
and the resulting match set is returned. And finally we apply each fixer 
for each node in the match set. Of course the process is a bit more 
complicated since we have to recheck the transformed code for new matches etc.

Since it is quite a hard concept to illustrate in a post, tell me if you
have any questions or need more info.

The benefit of this somewhat complex process is substantial; the total run 
time is reduced by roughly 50%. The new module is still experimental code, 
but mature enough to pass all tests. I propose to include the new module
with the next version of Python, perhaps as an explicit option for 2to3 
until it can be thoroughly tested.

The code repository can be found here: 
https://code.google.com/p/2to3-speedup2/source/browse/
(the code is a fork from py3k's 2to3 and was tested with version 3.1.2)

The blog I've been using for GSOC(containing more details on the matching
algorithm): http://george-gsoc.blogspot.com/

Regards,
George


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Speeding up 2to3: Results from a GSOC Project

2010-07-30 Thread Guido van Rossum
Great! CS FTW!

--Guido

On Fri, Jul 30, 2010 at 1:28 PM, George Boutsioukis
gboutsiou...@gmail.com wrote:
 Hi everyone, I haven't had a chance to introduce myself yet. I'm George
 Boutsioukis, a CS student from Greece and I'm currently enrolled as a GSOC
 student for the PSF. The task I was involved with for the past few weeks was
 speeding up the 2to3 tool.

 For those who are not aware of 2to3's internals, the tool matches a series
 of tree patterns to a python syntax tree and applies a code transformation
 on each match. The real bottleneck of this tool is the tree pattern matching
 algorithm, since the current version traverses all nodes of the tree,
 top-to-bottom, and checks each node against a set of tree patterns.

 The new algorithm I've implemented reduces the candidate nodes for matching.
 The process works in three steps:

 1) Each pattern tree is reduced to the most 'characteristic' path, i.e. the
 rarest path you'll encounter on a real code tree. (this is simpler than it
 sounds)

 2) From the set of the above linear patterns an Aho-Corasick automaton is
 constructed that is capable of inclusively matching all patterns(meaning
 that it may mark a wrong candidate but will never miss an instance)

 3) The automaton is run on each leaf until the root of the tree is reached
 and the resulting match set is returned. And finally we apply each fixer
 for each node in the match set. Of course the process is a bit more
 complicated since we have to recheck the transformed code for new matches etc.

 Since it is quite a hard concept to illustrate in a post, tell me if you
 have any questions or need more info.

 The benefit of this somewhat complex process is substantial; the total run
 time is reduced by roughly 50%. The new module is still experimental code,
 but mature enough to pass all tests. I propose to include the new module
 with the next version of Python, perhaps as an explicit option for 2to3
 until it can be thoroughly tested.

 The code repository can be found here:
 https://code.google.com/p/2to3-speedup2/source/browse/
 (the code is a fork from py3k's 2to3 and was tested with version 3.1.2)

 The blog I've been using for GSOC(containing more details on the matching
 algorithm): http://george-gsoc.blogspot.com/

 Regards,
 George


 ___
 Python-Dev mailing list
 Python-Dev@python.org
 http://mail.python.org/mailman/listinfo/python-dev
 Unsubscribe: 
 http://mail.python.org/mailman/options/python-dev/guido%40python.org




-- 
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Speeding up 2to3: Results from a GSOC Project

2010-07-30 Thread Alexandre Vassalotti
Love it!

BTW, it's not a good idea to have an import statement under 3
level of loops:

https://code.google.com/p/2to3-speedup2/source/browse/trunk/lib2to3/refactor.py#427

-- Alexandre
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com