Re: regular expressions ... slow

2008-11-19 Thread Kay Schluehr
On 18 Nov., 18:47, Stefan Behnel [EMAIL PROTECTED] wrote:
 Kay Schluehr wrote:
  All of this is prototyped in Python and is still work in progress.
  As long as development has not reached a stable state, I refuse to
  rebuild the system in an optimized C version.

 And rightfully so:

 1) the approach is algorithmically better, so it may even beat the current
 C implementation by design.

 2) switching languages before finishing and benchmarking the prototype is a
 premature optimisation. It wouldn't be the first prototype going into
 production.

 3) even before considering a reimplementation, you should throw it into
 Cython to translate it into C code, and then benchmark that.

 Stefan

I fully agree, and in fact the Trail parser generator contains a single
extension module called cyTrail which is written in Cython - it's not
just a trivial recompilation of Python in Cython but uses all kinds
of cdefs.

There is just a difference between optimizing existing code using the
best techniques available and writing code from scratch for speed. I
consider the latter a different, subsequent project ( call it cTrail ),
and I want finer control than is possible with Cython - actually I do
want to understand the code in a language as simple as C. I have no
idea what the code generated by Cython does. There are entire layers
that can be stripped off when one is not dealing with Python types and
dynamic memory allocation.

Kay
--
http://mail.python.org/mailman/listinfo/python-list


Re: regular expressions ... slow

2008-11-19 Thread Lawrence D'Oliveiro
Uwe Schmitt wrote:

 Are there any plans to speed up Python's regular expression module? Or
 is the example in this article too far from reality ???

http://regex.info/blog/2006-09-15/247
--
http://mail.python.org/mailman/listinfo/python-list


Re: regular expressions ... slow

2008-11-18 Thread Kay Schluehr
On 17 Nov., 22:37, Uwe Schmitt [EMAIL PROTECTED] wrote:
 Hi,

 Is anybody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html ?

 Are there any plans to speed up Python's regular expression module? Or
 is the example in this article too far from reality ???

 Greetings, Uwe

Some remarks.

The basic idea outlined in the article is something that is re-
discovered from time to time. Last year I started working on a parse
tree validator which is a tool that checks whether a parse tree P on
which transformative actions were performed belongs to a language
described by a grammar G. To check this G was initially transformed
into a set of NFAs representing grammars rules. The checker is a
tracing tool that selects nodes in a parse tree in order to step
through the transitions described in the NFA. If it can't select a
state the validation fails.

Generally there are rules which are not left-factored, such as

R: a b c | a b d

The NFA transition table accordingly looks roughly like this:

R:
  (R,0) -> (a,1), (a,2)
  (a,1) -> (b,1)
  (b,1) -> (c,1)
  (c,1) -> (None,'-')
  (a,2) -> (b,2)
  (b,2) -> (d,2)
  (d,2) -> (None,'-')
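
In Python such a table can be sketched as a dict mapping each state to
its follow states ( a minimal illustration, not the actual Trail data
structure ):

    # NFA transition table for rule R: a b c | a b d.
    # States are (label, trace-id) pairs; (None,'-') is the EXIT symbol.
    EXIT = (None, '-')

    nfa_R = {
        ('R', 0): [('a', 1), ('a', 2)],
        ('a', 1): [('b', 1)],
        ('b', 1): [('c', 1)],
        ('c', 1): [EXIT],
        ('a', 2): [('b', 2)],
        ('b', 2): [('d', 2)],
        ('d', 2): [EXIT],
    }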

A valid parse tree P which corresponds to rule R has one of two
forms:

P1 = [R, a, b, c] or P2 = [R, a, b, d]

Suppose we want to validate P1 with our NFA. First we select the NFA R
using the root node-id R. R corresponds to the start state (R,0). Next
we select the follow states, which are [(a,1), (a,2)]. The projection
on the first argument yields a.

Obviously we can shift to a in P1 and match the follow states (a,1)
and (a,2) in R:

P1 = [R, a, b, c]
         ^

Next we shift to b:

P1 = [R, a, b, c]
            ^

Now there are two traces in R we can follow: the one assigned by (a,1)
and the other by (a,2). Since we can't decide which one to take, we
follow *both at the same time*. This yields the follow states (b,1)
and (b,2), which can be projected on b.

Next we shift to c:

P1 = [R, a, b, c]
               ^

Again we follow both traces and get (c,1) and (d,2). (c,1) fits well.
Now we have to check that termination of P1 in this state is at least
an option. The follow state of (c,1) in R is (None,'-'), which is the
EXIT symbol of the NFA. The validation was successful.
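
The whole walkthrough condenses into a small validation loop ( a
minimal sketch assuming the nfa_R dict above, not the actual Trail
code ):

    def validate(tree, nfa):
        # tree is a flat parse tree like ['R', 'a', 'b', 'c']; nfa maps
        # each state to its follow states, as in nfa_R above.
        root, nodes = tree[0], tree[1:]
        states = nfa[(root, 0)]      # follow states of the start state
        for node in nodes:
            # keep only the traces that can shift to this node ...
            matching = [s for s in states if s[0] == node]
            if not matching:
                return False
            # ... and advance all of them at the same time
            states = [f for s in matching for f in nfa[s]]
        # termination must be an option: EXIT among the follow states
        return (None, '-') in states

    assert validate(['R', 'a', 'b', 'c'], nfa_R)   # P1 is valid
    assert validate(['R', 'a', 'b', 'd'], nfa_R)   # P2 is valid
    assert not validate(['R', 'a', 'b'], nfa_R)    # premature end
    assert not validate(['R', 'a', 'c'], nfa_R)    # no trace fits

Dropping the traces that cannot shift is exactly the cancellation of
unused traces described below.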


The same NFAs used to validate parse trees can be used as input tables
within a top-down parser generator. Such a parser generator handles
left-factoring automatically and is LL(*). It is also O(n), where n is
the length of the input token stream. Unlike parser generators such as
ANTLR, it achieves LL(*) without any advanced lookahead scheme.
Instead it produces traces of NFA states and cancels unused traces
( e.g. there are two initial traces in our example, ((R,0)(a,1)(b,1))
and ((R,0)(a,2)(b,2)); the first one is canceled once d is selected,
otherwise the second one gets canceled when c gets selected ).

A full parser generator for CFGs is more complex in many respects than
a regular-expression matcher, and AFAIK the one I've built over the
last year is unique in its category. It is also simpler in a few
aspects, because full regexp matchers also deal with lots of
context-sensitive issues that are not expressed in a CFG.

I used the same parsing scheme to produce a lexer. There is
considerable overhead, though, because it produces one parse tree per
character; I intend to avoid this in the future by using a hybrid
lexer that contains some very fast and some slower components. The
reason I used the parser generator in the lexer is that it has one
significant advantage over classic regexp parsing schemes: it avoids
dependence on order. Whether you write R = ab|abc or R = abc|ab is
insignificant; the longest match is always preferred unless otherwise
stated. This makes extending the grammar of the lexer very simple.
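
Python's re module, by contrast, is order-dependent: alternation takes
the leftmost branch that succeeds rather than the longest match, as a
quick check shows:

    import re

    # Backtracking alternation commits to the first alternative that
    # matches, so swapping the alternatives changes the result:
    print(re.match('ab|abc', 'abc').group())   # -> ab
    print(re.match('abc|ab', 'abc').group())   # -> abc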

All of this is prototyped in Python and is still work in progress.
As long as development has not reached a stable state, I refuse to
rebuild the system in an optimized C version. That said, if someone
- e.g. a student - intends to write his master's thesis about a
next-generation regexp engine built on that stuff, or even to
contribute to the mentioned lexer, I would of course cooperate.

Kay

www.fiber-space.de
[EMAIL PROTECTED]

--
http://mail.python.org/mailman/listinfo/python-list


Re: regular expressions ... slow

2008-11-18 Thread Stefan Behnel
Kay Schluehr wrote:
 All of this is prototyped in Python and is still work in progress.
 As long as development has not reached a stable state, I refuse to
 rebuild the system in an optimized C version.

And rightfully so:

1) the approach is algorithmically better, so it may even beat the current
C implementation by design.

2) switching languages before finishing and benchmarking the prototype is a
premature optimisation. It wouldn't be the first prototype going into
production.

3) even before considering a reimplementation, you should throw it into
Cython to translate it into C code, and then benchmark that.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


regular expressions ... slow

2008-11-17 Thread Uwe Schmitt
Hi,

Is anybody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html ?

Are there any plans to speed up Python's regular expression module? Or
is the example in this article too far from reality ???

Greetings, Uwe
--
http://mail.python.org/mailman/listinfo/python-list


Re: regular expressions ... slow

2008-11-17 Thread Jerry Hill
On Mon, Nov 17, 2008 at 4:37 PM, Uwe Schmitt
[EMAIL PROTECTED] wrote:
 Hi,

 Is anybody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html ?

Yes, it's been brought up here, on python-dev and python-ideas several
times in the past year and a half.

 Are there any plans to speed up Python's regular expression module? Or
 is the example in this article too far from reality ???

I don't think anyone has taken any concrete steps towards re-writing
the regular expression module.  My understanding from previous threads
on the topic is that the core developers would be willing to accept a
re-written regular expression engine, but none of them are interested
in doing it themselves.  The general consensus seemed to be that the
pathological cases highlighted in that article are not very common in the
real world, and that simply switching to the alternative approach
advocated there would require giving up things like backreferences
that are actually used in the real world, which is probably
unacceptable.

Some references:
http://mail.python.org/pipermail/python-dev/2007-March/072241.html
http://mail.python.org/pipermail/python-list/2007-February/427604.html
http://mail.python.org/pipermail/python-ideas/2007-April/000405.html

Personally, I know very little about the nitty gritty of regular
expression engines, but there's some reference material for you to
chew on.

-- 
Jerry
--
http://mail.python.org/mailman/listinfo/python-list


Re: regular expressions ... slow

2008-11-17 Thread Terry Reedy

Uwe Schmitt wrote:

Hi,

Is anybody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html ?


Near the end:

While writing the text editor sam [6] in the early 1980s, Rob Pike wrote 
a new regular expression implementation, which Dave Presotto extracted 
into a library that appeared in the Eighth Edition. Pike's 
implementation incorporated submatch tracking into an efficient NFA 
simulation but, like the rest of the Eighth Edition source, was not 
widely distributed. Pike himself did not realize that his technique was 
anything new. Henry Spencer reimplemented the Eighth Edition library 
interface from scratch, but using backtracking, and released his 
implementation into the public domain. It became very widely used, 
eventually serving as the basis for the slow regular expression 
implementations mentioned earlier: Perl, PCRE, Python, and so on. (In 
his defense, Spencer knew the routines could be slow, and he didn't know 
that a more efficient algorithm existed. He even warned in the 
documentation, “Many users have found the speed perfectly adequate, 
although replacing the insides of egrep with this code would be a 
mistake.”) Pike's regular expression implementation, extended to support 
Unicode, was made freely available with sam in late 1992, but the 
particularly efficient regular expression search algorithm went 
unnoticed. The code is now available in many forms: as part of sam, as 
Plan 9's regular expression library, or packaged separately for Unix. 
Ville Laurikari independently discovered Pike's algorithm in 1999, 
developing a theoretical foundation as well [2].


So, depending on the license, there appears to be a potential basis for
a Python Unicode version.

Are there any plans to speed up Python's regular expression module?


Yes, but I don't know how much people with such plans have considered 
the adaptive approach recommended.

is the example in this article too far from reality ???


The example is, but I don't think the problem illustrated is.

tjr

--
http://mail.python.org/mailman/listinfo/python-list


Re: regular expressions ... slow

2008-11-17 Thread Terry Reedy

Jerry Hill wrote:

On Mon, Nov 17, 2008 at 4:37 PM, Uwe Schmitt
[EMAIL PROTECTED] wrote:

Hi,

Is anybody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html ?


Yes, it's been brought up here, on python-dev and python-ideas several
times in the past year and a half.


Are there any plans to speed up Python's regular expression module? Or
is the example in this article too far from reality ???


I don't think anyone has taken any concrete steps towards re-writing
the regular expression module.  My understanding from previous threads
on the topic is that the core developers would be willing to accept a
re-written regular expression engine, but none of them are interested
in doing it themselves.  The general consensus seemed to be that the
pathological cases highlighted in that article are not very common in the
real world, and that simply switching to the alternative approach
advocated there would require giving up things like backreferences
that are actually used in the real world, which is probably
unacceptable.

Some references:
http://mail.python.org/pipermail/python-dev/2007-March/072241.html
http://mail.python.org/pipermail/python-list/2007-February/427604.html
http://mail.python.org/pipermail/python-ideas/2007-April/000405.html

Personally, I know very little about the nitty gritty of regular
expression engines, but there's some reference material for you to
chew on.


Searching the tracker for open items with 'regular expression' in the 
text brings up about 20 items to also consider.



--
http://mail.python.org/mailman/listinfo/python-list


Re: regular expressions ... slow

2008-11-17 Thread MRAB
On Nov 17, 10:24 pm, Terry Reedy [EMAIL PROTECTED] wrote:
 Jerry Hill wrote:
  On Mon, Nov 17, 2008 at 4:37 PM, Uwe Schmitt
  [EMAIL PROTECTED] wrote:
  Hi,

  Is anybody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html ?

  Yes, it's been brought up here, on python-dev and python-ideas several
  times in the past year and a half.

  Are there any plans to speed up Python's regular expression module? Or
  is the example in this article too far from reality ???

  I don't think anyone has taken any concrete steps towards re-writing
  the regular expression module.  My understanding from previous threads
  on the topic is that the core developers would be willing to accept a
  re-written regular expression engine, but none of them are interested
  in doing it themselves.  The general consensus seemed to be that the
  pathological cases highlighted in that article are not very common in the
  real world, and that simply switching to the alternative approach
  advocated there would require giving up things like backreferences
  that are actually used in the real world, which is probably
  unacceptable.

  Some references:
 http://mail.python.org/pipermail/python-dev/2007-March/072241.html
 http://mail.python.org/pipermail/python-list/2007-February/427604.html
 http://mail.python.org/pipermail/python-ideas/2007-April/000405.html

  Personally, I know very little about the nitty gritty of regular
  expression engines, but there's some reference material for you to
  chew on.

 Searching the tracker for open items with 'regular expression' in the
 text brings up about 20 items to also consider.

Work is currently being done on the re module.

I don't think the DFA approach permits backreferences, capture
groups or non-greedy repetition, but it certainly could be used when
those features aren't required by the regular expression, so the
answer is definitely maybe! :-)
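
Backreferences are the sticking point: a pattern like (\w+) \1 matches
the language of doubled words, which is not regular, so no DFA/NFA
construction can express it. A quick illustration with plain re usage:

    import re

    # \1 must repeat whatever group 1 matched; the language { ww }
    # is not regular, so a finite automaton cannot implement this.
    m = re.search(r'\b(\w+) \1\b', 'it was the the cat')
    print(m.group())   # -> the the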
--
http://mail.python.org/mailman/listinfo/python-list


Re: regular expressions ... slow

2008-11-17 Thread Gabriel Genellina
On Mon, 17 Nov 2008 19:37:18 -0200, Uwe Schmitt
[EMAIL PROTECTED] wrote:

Is anybody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html ?

Are there any plans to speed up Python's regular expression module? Or
is the example in this article too far from reality ???


It's a pathological case. There are some known cases of horrible
behaviour that are explained in many books on regular expressions. If
you avoid those constructs when writing the RE, you're reasonably
safe. I myself avoid using REs at all :)
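
The article's pathological case is easy to reproduce, though: n
optional 'a's followed by n mandatory 'a's, matched against a string
of n 'a's, forces a backtracking engine through an exponential number
of paths before it succeeds. A small timing sketch:

    import re
    import time

    # Russ Cox's worst case: a?^n a^n matched against a^n.
    # A backtracking engine tries ~2^n combinations of the optionals,
    # so the time roughly doubles with every increment of n.
    for n in range(14, 23, 2):
        pattern = 'a?' * n + 'a' * n
        t0 = time.time()
        re.match(pattern, 'a' * n)
        print(n, time.time() - t0)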


--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list