[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-08-28 Thread Steven D'Aprano

Steven D'Aprano steve+pyt...@pearwood.info added the comment:

I'm not sure if this belongs here, or on the Google code project page, so I'll 
add it in both places :)

Feature request: please change the NEW flag to something else. In five or six 
years (give or take), the re module will be long forgotten, compatibility with 
it will not be needed, so-called new features will no longer be new, and the 
NEW flag will just be silly.

If you care about future compatibility, some sort of version specification 
would be better, e.g. VERSION=0 (current re module), VERSION=1 (this regex 
module), VERSION=2 (next generation). You could then default to VERSION=0 for 
the first few releases, and potentially change to VERSION=1 some time in the 
future.

Otherwise, I suggest swapping the sense of the flag: instead of re behaviour 
unless NEW flag is given, I'd say re behaviour only if OLD flag is given. 
(Old semantics will, of course, remain old even when the new semantics are no 
longer new.)

--
nosy: +stevenjd

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-07-11 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

The new regex implementation is hosted here:
https://code.google.com/p/mrab-regex-hg/

The span of m['a_thing'] is m.span('a_thing'), if that helps.

The named groups are listed on the pattern object, which can be accessed via 
m.re:

>>> m.re
<_regex.Pattern object at 0x0161DE30>
>>> m.re.groupindex
{'another_thing': 3, 'a_thing': 1}

so you can use that to create a reverse dict to go from the index to the name 
or None. (Perhaps the pattern object should have such a .group_name attribute.)
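
The reverse mapping described above can be sketched as follows. This is only an illustration: it uses the standard library re module (which exposes the same m.re, groupindex and span attributes, but needs the (?P<name>...) spelling for named groups), and the variable name index_to_name is made up for the example.

```python
import re

# Same example as in the thread, spelled with (?P<name>...) for the stdlib re module.
m = re.search(r"(?P<a_thing>\w+) (\w+) (?P<another_thing>\w+)",
              "This is a demo sentence.")

# groupindex maps name -> index; invert it to go from index -> name.
index_to_name = {index: name for name, index in m.re.groupindex.items()}

for index in range(1, m.re.groups + 1):
    # Unnamed groups are simply missing from the dict, so .get() yields None.
    print(index, index_to_name.get(index), m.span(index))
```

Unnamed groups fall out naturally as None, which is the "name or None" behaviour mentioned above.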

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-07-11 Thread Brian Curtin

Changes by Brian Curtin br...@python.org:


--
nosy:  -brian.curtin




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-07-11 Thread Alec Koumjian

Alec Koumjian akoumj...@gmail.com added the comment:

Thanks, Matthew. I did not realize I could access either of those. I should be 
able to build a helper function now to do what I want.
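
A helper along those lines might look like the sketch below. It is an assumption about what such a function could be, not code from the thread: the name tag_groups is invented, it runs against the standard library re module (hence the (?P<name>...) syntax), and it skips groups that are empty or did not take part in the match.

```python
import re

def tag_groups(m):
    """Wrap every matched group in <label>...</label>, where label is the
    group's name if it has one and its index otherwise."""
    names = {index: name for name, index in m.re.groupindex.items()}
    inserts = []
    for index in range(1, m.re.groups + 1):
        start, end = m.span(index)
        if start == -1 or start == end:
            continue  # group did not participate, or matched the empty string
        label = names.get(index, index)
        # Sort keys: at the same position, closing tags (0) come before opening
        # tags (1); outer groups open first and inner groups close first.
        inserts.append((start, 1, -end, index, "<%s>" % label))
        inserts.append((end, 0, -start, -index, "</%s>" % label))

    text = m.string
    pieces = []
    pos = 0
    for at, _kind, _tie, _idx, tag in sorted(inserts):
        pieces.append(text[pos:at])
        pieces.append(tag)
        pos = at
    pieces.append(text[pos:])
    return "".join(pieces)

m = re.search(r"(?P<a_thing>\w+) (\w+) (?P<another_thing>\w+)",
              "This is a demo sentence.")
print(tag_groups(m))
```

Collecting all insertions first and sorting them avoids having to adjust offsets after each insertion.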

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-07-11 Thread Collin Winter

Changes by Collin Winter coll...@gmail.com:


--
nosy:  -collinwinter




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-07-11 Thread Eric Snow

Changes by Eric Snow ericsnowcurren...@gmail.com:


--
nosy: +ericsnow




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-07-10 Thread Alec Koumjian

Alec Koumjian akoumj...@gmail.com added the comment:

I apologize if this is the wrong place for this message. I did not see the link 
to a separate list.

First let me explain what I am trying to accomplish. I would like to be able to 
take an unknown regular expression that contains both named and unnamed groups 
and tag their location in the original string where a match was found. Take the 
following redundantly simple example:

>>> a_string = r"This is a demo sentence."
>>> pattern = r"(?<a_thing>\w+) (\w+) (?<another_thing>\w+)"
>>> m = regex.search(pattern, a_string)

What I want is a way to insert named/numbered tags into the original string, so 
that it looks something like this:

r"<a_thing>This</a_thing> <2>is</2> <another_thing>a</another_thing> demo 
sentence."

The syntax doesn't have to be exactly like that, but you get the idea. I have 
inserted the names and/or indices of the groups into the original string, 
around the span that the groups occupy.

This task is exceedingly difficult with the current implementation, unless I am 
missing something obvious. We could call the groups by index, the groups as a 
tuple, or the groupdict:

>>> m.group(1)
'This'
>>> m.groups()
('This', 'is', 'a')
>>> m.groupdict()
{'another_thing': 'a', 'a_thing': 'This'}

If all I wanted was to tag the groups by index, it would be a simple function. 
I would be able to call m.spans() for each index in the length of m.groups() 
and insert the <> and </> tags around the right indices.

The hard part is finding out how to find the spans of the named groups. Do any 
of you have a suggestion?

It would make more sense from my perspective, if each group was an object that 
had its own .span property. It would work like this with the above example:

>>> first = m.group(1)
>>> first.name()
'a_thing'
>>> second = m.group(2)
>>> second.name()
None


You could still call .spans() on the Match object itself, but it would query 
its children group objects for the data. Overall I think this would be a much 
more Pythonic approach, especially given that you have added subscripting and 
key lookup.

So instead of this:
>>> m['a_thing']
'This'
>>> type(m['a_thing'])
<type 'str'>

You could have:
>>> m['a_thing']
'This'
>>> type(m['a_thing'])
'regex.Match.Group object'

With the noted benefit of this:
>>> m['a_thing'].span()
(0, 4)
>>> m['a_thing'].index()
1


Maybe I'm missing a major point or functionality here, but I've been poring 
over the docs and don't currently think what I'm trying to achieve is possible.

Thank you for taking the time to read all this.

-Alec

--
nosy: +akoumjian
versions:  -Python 3.3




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-05-10 Thread Jonathan Halcrow

Jonathan Halcrow jonathan.halc...@gmail.com added the comment:

I'm having a problem using the current version (0.1.20110504) with python 2.5 
on OSX 10.5.  When I try to import regex I get the following import error:

dlopen(<snipped>/python2.5/site-packages/_regex.so, 2): Symbol not found: _re_is_same_char_ign
  Referenced from: <snipped>/python2.5/site-packages/_regex.so
  Expected in: dynamic lookup

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-05-10 Thread Jonathan Halcrow

Jonathan Halcrow jonathan.halc...@gmail.com added the comment:

It seems that _regex_unicode.c is missing from setup.py; adding it to 
ext_modules fixes my previous issue.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-05-10 Thread Brian Curtin

Brian Curtin br...@python.org added the comment:

Issues with Regexp should probably be handled on the Regexp tracker.

--
nosy: +brian.curtin




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-03-15 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

I've fixed the problem with iterators for both Python 3 and Python 2. They can 
now be shared safely across threads.

I've updated the release on PyPI.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-03-14 Thread Gregory P. Smith

Gregory P. Smith g...@krypto.org added the comment:

Could you add me as a member or admin on the mrab-regex-hg project?  I've got a 
few things I want to fix in the code as I start looking into the state of this 
module.  gpsmith at gmail dot com is my google account.

There are some fixes in the upstream python that haven't made it into this code 
that I want to merge in, among other things.  I may also add a setup.py file and 
some scripts to make building and testing this standalone easier.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-03-14 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

@Gregory: I've added you to the project.

I'm currently trying to fix a problem with iterators shared across threads. As 
a temporary measure, the current release on PyPI doesn't enable multithreading 
for them.

The mrab-regex-hg project doesn't have those sources yet. I'll update them 
later today, either to the release on PyPI, or to a fixed version if all goes 
well...

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-03-14 Thread Gregory P. Smith

Gregory P. Smith g...@krypto.org added the comment:

Okay. Can you push your setup.py and README and such as well?  Your pypi
release tarballs should match the hg repo and ideally include a mention of
what hg revision they are generated from. :)

-gps

On Mon, Mar 14, 2011 at 5:25 PM, Matthew Barnett rep...@bugs.python.org wrote:

> Matthew Barnett pyt...@mrabarnett.plus.com added the comment:
>
> @Gregory: I've added you to the project.
>
> I'm currently trying to fix a problem with iterators shared across threads.
> As a temporary measure, the current release on PyPI doesn't enable
> multithreading for them.
>
> The mrab-regex-hg project doesn't have those sources yet. I'll update them
> later today, either to the release on PyPI, or to a fixed version if all
> goes well...


--
Added file: http://bugs.python.org/file21144/unnamed




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-03-11 Thread Alex

Changes by Alex alex.gay...@gmail.com:


--
nosy: +alex




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-03-08 Thread Davide Rizzo

Changes by Davide Rizzo sor...@gmail.com:


--
nosy: +davide.rizzo




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-01-25 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

I've reduced the size of some internal tables.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-01-16 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

That line crept in somehow.

As it's been there since the 2010-12-24 release and you're the first one to 
have a problem with it (and you've already fixed it), it looks like a new 
upload isn't urgently needed (I don't have any other changes to make at 
present).

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-01-14 Thread ronnix

ronnix ronan.ami...@gmail.com added the comment:

The regex 0.1.20110106 package fails to install with Python 2.6, due to the use 
of 2.7 string formatting syntax in setup.py:

print("Copying {} to {}".format(unicodedata_db_h, SRC_DIR))

This line should be changed to:

print("Copying {0} to {1}".format(unicodedata_db_h, SRC_DIR))
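
For reference, a minimal sketch of the portable spelling: automatic field numbering ("{}") was only added to str.format in Python 2.7 and 3.1, whereas explicit indices also work on 2.6. The two values below are placeholders chosen for the example, not the ones setup.py actually uses.

```python
# Placeholder values; the real setup.py computes these itself.
unicodedata_db_h = "unicodedata_db.h"
SRC_DIR = "src"

# Explicit positional indices are accepted by every version that has str.format.
message = "Copying {0} to {1}".format(unicodedata_db_h, SRC_DIR)
print(message)
```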

Reference: http://docs.python.org/library/string.html#formatstrings

--
nosy: +ronnix




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2011-01-03 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

I've just done a bug fix. The issue is at:

https://code.google.com/p/mrab-regex-hg/

BTW, Jacques, I trust that your regression tests don't test how long a regex 
takes to fail to match, because a bug could cause such a non-match to occur too 
quickly, before the regex has tried all that it should! :-)

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-31 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

Thanks for putting up the hg repo, makes it much easier to follow.

Getting back to the performance regression I reported in msg124904:

I've verified that if I take the hg commit 7abd9f9bb1 and back out the 
guards changes manually, while leaving the FAST_INIT changes in, the 
performance is back to normal on my full regression suite (i.e. the 30-40% 
penalty disappears).

I've repeated my tests a few times to make sure I'm not mistaken, since the 
guard changes don't look like they should impact performance much, but they do.

I've attached the diff that restored the speed for me (as usual, using Python 
2.6.5 on Linux x86_64)

BTW, now that we have the code on google code, can we log individual issues 
over there?  Might make it easier for those interested to follow certain issues 
than trying to comb through every individual detail in this 
super-issue-thread...?

--
Added file: http://bugs.python.org/file20203/remove_guards.diff




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-31 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

Why not? :-)

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-31 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

Just to check, does this still work with your changes of msg124959?

regex.search(r'\d{4}(\s*\w)?\W*((?!\d)\w){2}', "XX")

For me it fails to match!

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-31 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

You're correct, after the change:

regex.search(r'\d{4}(\s*\w)?\W*((?!\d)\w){2}', "XX")

doesn't match (i.e. as before commit 7abd9f9bb1).

I was, however, just trying to narrow down which part of the code change killed 
the performance on my regression tests :-)

Happy new year to all out there.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-30 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

Hearty +1.  I have the hope of putting this in 3.3, and for that I'd like to 
see how the code matures, which is much easier when in version control.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-30 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

The project is now at:

https://code.google.com/p/mrab-regex/

Unfortunately it doesn't have the revision history. I don't know why not.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-30 Thread Robert Xiao

Robert Xiao nneon...@gmail.com added the comment:

Do you have it in any kind of repository at all? Even a private SVN repo or 
something like that?

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-30 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

msg124904: It would, of course, be slower on first use, but I'm surprised that 
it's (that much) slower afterwards.

msg124905, msg124906: I have those matching now.

msg124931: The sources are in TortoiseBzr, but I couldn't upload, so I exported 
to TortoiseSVN.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-30 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

Even after much uninstalling and reinstalling (and reboots) I never got 
TortoiseSVN to work properly, so I switched to TortoiseHg. The sources are now 
at:

https://code.google.com/p/mrab-regex-hg/

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-29 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

More an observation than a bug:

I understand that we're trading memory for performance, but I've noticed that 
the peak memory usage is rather high, e.g.:

$ cat test.py
import os
import regex as re

def resident():
    for line in open('/proc/%d/status' % os.getpid(), 'r').readlines():
        if line.startswith("VmRSS:"):
            return line.split(":")[-1].strip()

cache = {}

print resident()
for i in xrange(0,1000):
    cache[i] = re.compile(str(i)+"(abcd12kl|efghlajsdf|ijkllakjsdf|mnoplasjdf|qrstljasd|sdajdwxyzlasjdf|kajsdfjkasdjkf|kasdflkasjdflkajsd|klasdfljasdf)")

print resident()

Execution output on my machine (Linux x86_64, Python 2.6.5):
4328 kB
32052 kB

with the standard regex library:
3688 kB
5428 kB

So, it looks like around 16x the memory per pattern vs standard regex module
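
The "around 16x" figure follows from the four numbers above, assuming the growth in resident memory is attributed entirely to the 1000 cached patterns (the variable names are just for this worked example):

```python
patterns = 1000

# Resident set size in kB before and after compiling, as reported above.
regex_before, regex_after = 4328, 32052
re_before, re_after = 3688, 5428

regex_per_pattern = (regex_after - regex_before) / patterns  # kB per pattern
re_per_pattern = (re_after - re_before) / patterns           # kB per pattern
ratio = regex_per_pattern / re_per_pattern

print("regex: %.2f kB, re: %.2f kB, ratio: %.1fx"
      % (regex_per_pattern, re_per_pattern, ratio))
```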

Now the example is pretty silly; the difference is even larger for more complex 
regexes.  I also understand that once the patterns are GC-ed, Python can 
reuse the memory (pymalloc doesn't return it to the OS, unfortunately).  
However, I have some applications that use large numbers (many thousands) of 
regexes and need to keep them cached (compiled) indefinitely (especially 
because compilation is expensive).  This causes some pain (long story).

I've played around with increasing RE_MIN_FAST_LENGTH, and it makes a 
significant difference, e.g.:

RE_MIN_FAST_LENGTH = 10:
4324 kB
25976 kB

In my use-cases, having a larger RE_MIN_FAST_LENGTH doesn't make a huge 
performance difference, so that might be the way I'll go.
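
For anyone wanting to repeat this kind of measurement without reading /proc, a rough sketch using the standard library's tracemalloc is given below. It is a swapped-in technique, not what was used above: tracemalloc only sees memory obtained through Python's own allocators, so it is not comparable to VmRSS and would miss anything an extension module allocates with plain malloc. The helper name and the shortened pattern are made up for the example.

```python
import re
import tracemalloc

ALTERNATIVES = "(abcd12kl|efghlajsdf|ijkllakjsdf|mnoplasjdf|qrstljasd)"

def bytes_per_pattern(module, count=200):
    """Estimate traced memory retained per compiled pattern for `module`,
    which must provide compile() and purge() (re does; regex does as well)."""
    module.purge()  # start from an empty pattern cache
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    # Keep every compiled pattern alive, as the long-lived cache above does.
    cache = {i: module.compile(str(i) + ALTERNATIVES) for i in range(count)}
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (after - before) / len(cache)

print("approx. bytes per pattern: %.0f" % bytes_per_pattern(re))
```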

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-29 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101230.zip is a new version of the regex module.

I've delayed the building of the tables for fast searching until their first 
use, which, hopefully, will mean that fewer will be actually built.

--
Added file: http://bugs.python.org/file20192/issue2636-20101230.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-29 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

Yeah, issue2636-20101230.zip DOES reduce memory usage significantly (30-50%) in 
my use cases;  however, it also tanks performance overall by 35% for me, so 
I'll prefer to stick with issue2636-20101229.zip (or some variant of it).

Maybe a regex compile-time option, although that's not necessary.

Thanks for the effort.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-29 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

re.search('\d{4}(\s*\w)?\W*((?!\d)\w){2}', "XX")

matches on stock 2.6.5 regex module, but not on issue2636-20101230.zip or 
issue2636-20101229.zip (which I've fallen back to for now)

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-29 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

Another one that diverges between stock regex and issue2636-20101229.zip:

re.search('A\s*?.*?(\n+.*?\s*?){0,2}\(X', 'A\n1\nS\n1 (X')

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-29 Thread Gregory P. Smith

Gregory P. Smith g...@krypto.org added the comment:

As belopolsky said... *please* move this development into version control.  Put 
it up in an Hg repo on code.google.com, or put it on GitHub: *anything* other 
than repeatedly posting entire zip file source code drops to a bugtracker.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-28 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101228a.zip is a new version of the regex module.

It now compiles the pattern quickly.

--
Added file: http://bugs.python.org/file20182/issue2636-20101228a.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-28 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Thanks, issue2636-20101228a.zip also resolves the compilation speed issues I had 
on other (very) complex regexes.

Found this one:

re.search("(X.*?Y\s*){3}(X\s*)+AB:", "XY\nX Y\nX  Y\nXY\nXX AB:")

produces a search hit with stock python 2.6.5 regex library, but not with 
issue2636-20101228a.zip.

re.search("(X.*?Y\s*){3,}(X\s*)+AB:", "XY\nX Y\nX  Y\nXY\nXX AB:")

matches on both, however.
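
Reports like this one lend themselves to a small differential test. The sketch below is only an illustration with invented helper names: it runs each (pattern, text) pair through the standard re module and through a candidate engine, and lists the cases where the match spans differ. If the third-party regex module is not installed, it falls back to re so that the script still runs (and then, of course, finds nothing).

```python
import re

try:
    import regex as candidate  # the third-party module under discussion, if available
except ImportError:
    candidate = re  # fallback so the sketch still runs; nothing can diverge then

TEXT = "XY\nX Y\nX  Y\nXY\nXX AB:"
CASES = [
    (r"(X.*?Y\s*){3}(X\s*)+AB:", TEXT),
    (r"(X.*?Y\s*){3,}(X\s*)+AB:", TEXT),
]

def span_or_none(module, pattern, text):
    """Return the span of the first match, or None if there is no match."""
    match = module.search(pattern, text)
    return match.span() if match else None

def divergences(reference, other, cases):
    """List the cases on which the two engines disagree."""
    found = []
    for pattern, text in cases:
        expected = span_or_none(reference, pattern, text)
        actual = span_or_none(other, pattern, text)
        if expected != actual:
            found.append((pattern, expected, actual))
    return found

for pattern, expected, actual in divergences(re, candidate, CASES):
    print("DIVERGES %r: re=%r candidate=%r" % (pattern, expected, actual))
```

Comparing spans rather than just hit or miss also catches cases where both engines match but at different positions.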

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-28 Thread Jacques Grove

Jacques Grove aquara...@gmail.com added the comment:

Here is a somewhat crazy pattern (slimmed down from something much larger and 
more complex, which didn't finish compiling even after several minutes): 

re.compile((?:(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)|(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)(?:(?:[\-\s\.,/]){0,4}?)(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??)\W*(?:[79][0-9]|2[0-4]|\d)(?:[\.:Aa])?(?:[0-5][0-9])\W*(?:(?:[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{3})?)?|[Aa]{3}(?:[Aa]{5}[Aa])?|[Aa]{3}(?:[Aa](?:[Aa]{4})?)?|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{3})?)|(?:[Aa][Aa](?:[Aa](?:[Aa]{3})?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{4})?)?)?|[Aa][Aa](?:[Aa](?:[Aa]{3}(?:[Aa](?:[Aa]{3})?)?)?)?
 
|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?|[Aa]{3}(?:[Aa](?:[Aa](?:[Aa]{4})?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?))\s*(\-\s*)?(?:(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??(?:(?:[\-\s\.,/]){0,4}?)(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)|(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)(?:(?:[\-\s\.,/]){0,4}?)(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??)(?:(?:(?:[\-\s\.,/]){0,4}?)(?:(?:68)?[7-9]\d|(?:2[79])?\d{2}))?\W*(?:[79][0-9]|2[0-4]|\d)(?:[\.:Aa])?(?:[0-5][0-9]))


Runs about 10.5 seconds on my machine with issue2636-20101228a.zip, less than 
0.03 seconds with stock Python 2.6.5 regex engine.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-28 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101229.zip is a new version of the regex module.

It now compiles the pattern quickly.

--
Added file: http://bugs.python.org/file20185/issue2636-20101229.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-27 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Testing issue2636-20101224.zip:

Nested modifiers seem to hang the regex compilation when used in a 
non-capturing group, e.g.:

re.compile("(?:(?i)foo)")

or

re.compile("(?:(?u)foo)")


No problem on stock Python 2.6.5 regex engine.

The unnested version of the same regex compiles fine.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-27 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101228.zip is a new version of the regex module.

Sorry for the delay, the fix took me a bit longer than I expected. :-)

--
Added file: http://bugs.python.org/file20176/issue2636-20101228.zip

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-27 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Another re.compile performance issue (I've seen a couple of others, but I'm 
still trying to simplify the test-cases):

re.compile("(?ui)(a\s?b\s?c\s?d\s?e\s?f\s?g\s?h\s?i\s?j\s?k\s?l\s?m\s?n\s?o\s?p\s?q\s?r\s?s\s?t\s?u\s?v\s?w\s?y\s?z\s?a\s?b\s?c\s?d)")

completes in around 0.01s on my machine using Python 2.6.5 standard regex 
library, but takes around 12 seconds using issue2636-20101228.zip
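
A minimal sketch for reproducing this kind of compile-time measurement, assuming Python 3 and only the stock re module. time_compile is a hypothetical helper, not part of either library; the compile function is a parameter so the same call could be pointed at regex.compile where that module is installed. re.purge() clears re's pattern cache so a cached pattern doesn't hide the compile cost.

```python
import re
import time

# Rebuild the reported pattern: the letters joined by an optional whitespace.
LETTERS = "abcdefghijklmnopqrstuvwyzabcd"
PATTERN = r"(?ui)(" + r"\s?".join(LETTERS) + r")"


def time_compile(pattern, compile_func=re.compile):
    """Return the wall-clock seconds taken to compile `pattern` once."""
    re.purge()  # drop re's compiled-pattern cache before timing
    start = time.perf_counter()
    compile_func(pattern)
    return time.perf_counter() - start


elapsed = time_compile(PATTERN)
print("compiled in %.4f s" % elapsed)
```

Timings from a single run are noisy; repeating the call a few times and taking the minimum would give a steadier number.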

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-24 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

I've been trying to push the history to Launchpad, completely without success; 
it just won't authenticate (no such account, even though I can log in!).

I doubt that the history would be much use to you anyway.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-24 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

I suspect it would help if there are more changes, though.

I believe that to push to launchpad you have to upload an ssh key.  Not sure 
why you'd get no such account, though.  Barry would probably know :)

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-24 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

It does have an SSH key. It's probably something simple that I'm missing.

I think that the only change I'm likely to make is to a support script I use; 
it currently uses hard-coded paths, etc, to do its magic. :-)

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-23 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101224.zip is a new version of the regex module.

Case-insensitive matching is now faster.

The matching functions and methods now accept a keyword argument to release the 
GIL during matching to enable other Python threads to run concurrently:

matches = regex.findall(pattern, string, concurrent=True)

This should be used only when it's guaranteed that the string won't change 
during matching.

The GIL is always released when working on instances of the builtin (immutable) 
string classes because that's known to be safe.
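
A small sketch of how the keyword might be used defensively, assuming Python 3. find_all is a hypothetical wrapper: it passes concurrent=True when the third-party regex module can be imported and otherwise falls back to the stock re module, which has no such argument.

```python
import re

try:
    import regex  # the third-party module discussed in this thread
except ImportError:
    regex = None  # fall back to the stock engine


def find_all(pattern, text):
    """Find all matches, asking regex to release the GIL when available."""
    if regex is not None:
        # Only safe if `text` is guaranteed not to change during matching.
        return regex.findall(pattern, text, concurrent=True)
    return re.findall(pattern, text)


print(find_all(r"\d+", "a1 b22 c333"))
```

For an immutable str the flag should make no observable difference to the result; it only matters for mutable buffers shared with other threads.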

--
Added file: http://bugs.python.org/file20154/issue2636-20101224.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-23 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

I would like to start reviewing this code, but dated zip files on a tracker 
make a very inefficient VC setup.  Would you consider exporting your 
development history to some public VC system?

--
nosy: +belopolsky




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-23 Thread Jeffrey C. Jacobs

Jeffrey C. Jacobs timeho...@users.sourceforge.net added the comment:

+1 on VC

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-13 Thread Éric Araujo

Changes by Éric Araujo mer...@netwok.org:


--
stage:  -> patch review
type: compile error -> feature request
versions: +Python 3.3 -Python 2.6




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-10 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101210.zip is a new version of the regex module.

I've extended the additional checks of the previous version.

It has been tested with Python 2.5 to Python 3.2b1.

--
Added file: http://bugs.python.org/file20001/issue2636-20101210.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-06 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101207.zip is a new version of the regex module.

It includes additional checks against pathological regexes.

--
Added file: http://bugs.python.org/file19965/issue2636-20101207.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-12-06 Thread Zach Dwiel

Zach Dwiel zdw...@gmail.com added the comment:

Here is the terminal log of what happens when I try to install and then import 
regex.  Any ideas what is going on?

$ python setup.py install
running install
running build
running build_py
creating build
creating build/lib.linux-i686-2.6
copying Python2/regex.py -> build/lib.linux-i686-2.6
copying Python2/_regex_core.py -> build/lib.linux-i686-2.6
running build_ext
building '_regex' extension
creating build/temp.linux-i686-2.6
creating build/temp.linux-i686-2.6/Python2
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall 
-Wstrict-prototypes -fPIC -I/usr/include/python2.6 -c Python2/_regex.c -o 
build/temp.linux-i686-2.6/Python2/_regex.o
Python2/_regex.c:109: warning: ‘struct RE_State’ declared inside parameter list
Python2/_regex.c:109: warning: its scope is only this definition or 
declaration, which is probably not what you want
Python2/_regex.c:110: warning: ‘struct RE_State’ declared inside parameter list
Python2/_regex.c:538: warning: initialization from incompatible pointer type
Python2/_regex.c:539: warning: initialization from incompatible pointer type
Python2/_regex.c:679: warning: initialization from incompatible pointer type
Python2/_regex.c:680: warning: initialization from incompatible pointer type
Python2/_regex.c:1217: warning: initialization from incompatible pointer type
Python2/_regex.c:1218: warning: initialization from incompatible pointer type
Python2/_regex.c: In function ‘try_match’:
Python2/_regex.c:3153: warning: passing argument 1 of 
‘state->encoding->at_boundary’ from incompatible pointer type
Python2/_regex.c:3153: note: expected ‘struct RE_State *’ but argument is of 
type ‘struct RE_State *’
Python2/_regex.c:3184: warning: passing argument 1 of 
‘state->encoding->at_default_boundary’ from incompatible pointer type
Python2/_regex.c:3184: note: expected ‘struct RE_State *’ but argument is of 
type ‘struct RE_State *’
Python2/_regex.c: In function ‘search_start’:
Python2/_regex.c:3535: warning: assignment from incompatible pointer type
Python2/_regex.c:3581: warning: assignment from incompatible pointer type
Python2/_regex.c: In function ‘basic_match’:
Python2/_regex.c:3995: warning: assignment from incompatible pointer type
Python2/_regex.c:3996: warning: assignment from incompatible pointer type
Python2/_regex.c: At top level:
Python2/unicodedata_db.h:241: warning: ‘nfc_first’ defined but not used
Python2/unicodedata_db.h:448: warning: ‘nfc_last’ defined but not used
Python2/unicodedata_db.h:550: warning: ‘decomp_prefix’ defined but not used
Python2/unicodedata_db.h:2136: warning: ‘decomp_data’ defined but not used
Python2/unicodedata_db.h:3148: warning: ‘decomp_index1’ defined but not used
Python2/unicodedata_db.h:: warning: ‘decomp_index2’ defined but not used
Python2/unicodedata_db.h:4122: warning: ‘comp_index’ defined but not used
Python2/unicodedata_db.h:4241: warning: ‘comp_data’ defined but not used
Python2/unicodedata_db.h:5489: warning: ‘get_change_3_2_0’ defined but not used
Python2/unicodedata_db.h:5500: warning: ‘normalization_3_2_0’ defined but not 
used
Python2/_regex.c: In function ‘basic_match’:
Python2/_regex.c:4106: warning: ‘info.captures_count’ may be used uninitialized 
in this function
Python2/_regex.c:4720: warning: ‘info.captures_count’ may be used uninitialized 
in this function
Python2/_regex.c: In function ‘splitter_split’:
Python2/_regex.c:8076: warning: ‘result’ may be used uninitialized in this 
function
gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions 
build/temp.linux-i686-2.6/Python2/_regex.o -o build/lib.linux-i686-2.6/_regex.so
running install_lib
copying build/lib.linux-i686-2.6/_regex.so -> 
/usr/local/lib/python2.6/dist-packages
copying build/lib.linux-i686-2.6/_regex_core.py -> 
/usr/local/lib/python2.6/dist-packages
copying build/lib.linux-i686-2.6/regex.py -> 
/usr/local/lib/python2.6/dist-packages
byte-compiling /usr/local/lib/python2.6/dist-packages/_regex_core.py to 
_regex_core.pyc
byte-compiling /usr/local/lib/python2.6/dist-packages/regex.py to regex.pyc
running install_egg_info
Writing /usr/local/lib/python2.6/dist-packages/regex-0.1.20101123.egg-info
$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/regex-0.1.20101207-py2.6-linux-i686.egg/regex.py", line 273, in <module>
    from _regex_core import *
  File "/usr/local/lib/python2.6/dist-packages/regex-0.1.20101207-py2.6-linux-i686.egg/_regex_core.py", line 54, in <module>
    import _regex
ImportError: /usr/local/lib/python2.6/dist-packages/regex-0.1.20101207-py2.6-linux-i686.egg/_regex.so: undefined symbol: max

--
nosy: +zdwiel
type: feature request -> compile error
versions: +Python 2.6 -Python 3.2


[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-29 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101130.zip is a new version of the regex module.

Added 'special_only' keyword parameter (default False) to regex.escape. When 
True, regex.escape escapes only 'special' characters, such as '?'.
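
To illustrate the idea, here is a pure-Python sketch of "escape only the special characters", assuming Python 3. escape_special_only and the SPECIAL set are hypothetical; which characters the module itself counts as special is an assumption here, not taken from its source.

```python
import re

# Assumed set of regex metacharacters; the module's own list may differ.
SPECIAL = set(r".^$*+?{}[]\|()")


def escape_special_only(text):
    """Backslash-escape metacharacters and leave everything else alone."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)


escaped = escape_special_only("1+1=2?")
print(escaped)
print(re.fullmatch(escaped, "1+1=2?") is not None)
```

The point of the option is readability: letters, digits and spaces pass through unchanged, so the escaped pattern stays close to the original text.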

--
Added file: http://bugs.python.org/file19881/issue2636-20101130.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-23 Thread Steve Moran

Steve Moran s...@uw.edu added the comment:

Forgive me if this is just a stupid oversight. 

I'm a linguist and use UTF-8 for special characters for linguistics data. 
This often includes multi-byte Unicode character sequences that are composed as 
one grapheme. For example the í̵ (if it's displaying correctly for you) is a 
LATIN SMALL LETTER I WITH STROKE \u0268 combined with COMBINING ACUTE ACCENT 
\u0301. E.g. a word I'm parsing:

jí̵-e-gɨ

I was pretty excited to find out that this regex library implements the 
grapheme match \X (equivalent to \P{M}\p{M}*). For the above example I needed 
to evaluate which sequences of characters can occur across syllable boundaries 
(here the hyphen -), so I'm aiming for:

í̵-e
e-g

When regex couldn't get any better, you awesome developers implemented an 
overlapped=True flag with findall and finditer. 

Python 3.1.2 (r312:79147, May 19 2010, 11:50:28) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
>>> import regex
>>> s = "jí̵-e-gɨ"
>>> s
'jí̵-e-gɨ'
>>> m = regex.compile("(\X)(-)(\X)")
>>> m.findall(s, overlapped=False)
[('í̵', '-', 'e')]

But these results are weird to me:

>>> m.findall(s, overlapped=True)
[('í̵', '-', 'e'), ('í̵', '-', 'e'), ('e', '-', 'g'), ('e', '-', 'g'), ('e', 
'-', 'g')]

Why the extra matches? At first I figured this had something to do with the 
overlapping match of the grapheme, since it's multiple characters. So I tried 
it without the grapheme match:

>>> m = regex.compile("(.)(-)(.)")
>>> s2 = "a-b-cd-e-f"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b'), ('d', '-', 'e')]

That's right. But with overlap...

>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c'), ('d', '-', 'e'), ('d', '-', 
'e'), ('d', '-', 'e'), ('e', '-', 'f'), ('e', '-', 'f')]

Those 'extra' matches are confusing me. 2x b-c, 3x d-e, 2x e-f? Or even more 
simply:

>>> s2 = "a-b-c"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b')]
>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c')]

Thanks!

--
nosy: +stiv
type: feature request -> behavior




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-23 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

Please don't change the type, this issue is about the feature request of adding 
this regex engine to the stdlib.

I'm sure Matthew will get back to you about your question.

--
type: behavior -> feature request




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-23 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101123.zip is a new version of the regex module.

Oops, sorry, the weird behaviour of msg11 was a bug. :-(
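
For cross-checking, the duplicate-free result one would expect from an overlapped search can be approximated with the stock re module by wrapping the pattern in a lookahead. This is only a sketch, assuming Python 3; overlapped_findall is a hypothetical helper and is not how regex implements overlapped=True.

```python
import re


def overlapped_findall(pattern, text):
    """Try `pattern` at every position; return the groups of each hit."""
    # A lookahead consumes nothing, so finditer advances one character
    # at a time and matches may overlap.
    lookahead = re.compile("(?=" + pattern + ")")
    return [m.groups() for m in lookahead.finditer(text)]


print(overlapped_findall(r"(.)(-)(.)", "a-b-c"))
print(overlapped_findall(r"(.)(-)(.)", "a-b-cd-e-f"))
```

Each pair across a hyphen should appear exactly once, which is the behaviour the earlier report was expecting.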

--
Added file: http://bugs.python.org/file19786/issue2636-20101123.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-20 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101121.zip is a new version of the regex module.

The captures didn't work properly with lookarounds or atomic groups.

--
Added file: http://bugs.python.org/file19723/issue2636-20101121.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-19 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101120.zip is a new version of the regex module.

The match object now supports additional methods which return information on 
all the successful matches of a repeated capture group.

The API was inspired by that of .Net:

matchobject.captures([group1, ...])

Returns a tuple of the strings matched in a group or groups. Compare 
with matchobject.group([group1, ...]).

matchobject.starts([group])

Returns a tuple of the start positions. Compare with 
matchobject.start([group]).

matchobject.ends([group])

Returns a tuple of the end positions. Compare with 
matchobject.end([group]).

matchobject.spans([group])

Returns a tuple of the spans. Compare with matchobject.span([group]).
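
To show what these methods add, here is a sketch using only the stock re module, assuming Python 3. With re, a repeated group keeps just its last capture; repeated_captures is a hypothetical helper that recovers every repetition by re-scanning the matched span. It assumes the unit pattern contains no capture groups of its own and that re-scanning splits the span the same way the original match did.

```python
import re


def repeated_captures(unit, text):
    """Return (last_capture, all_captures) for the pattern (unit)+."""
    m = re.match("(" + unit + ")+", text)
    if m is None:
        return None, ()
    # group(1) holds only the final repetition in the stock engine.
    last = m.group(1)
    # Re-scan the whole match to recover every repetition.
    every = tuple(re.findall(unit, m.group(0)))
    return last, every


print(repeated_captures(r"\w\d", "a1b2c3"))
```

With the regex module, matchobject.captures(1) is meant to expose the second element directly, without the re-scan.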

--
Added file: http://bugs.python.org/file19651/issue2636-20101120.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-13 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I'd have liked to suggest updating the underlying unicode data to the latest 
standard 6.0, but it turns out, it might be problematic with the cross-version 
compatibility;
according to the clarification in 
http://bugs.python.org/issue10400
the 3... versions are going to be updated, while it is not allowed in the 2.x 
series.
I guess it would cause maintenance problems (as the needed properties are not 
available via unicodedata).
Anyway, while I'd like the recent unicode data to be supported (new characters, 
ranges, scripts, and corrected individual properties...),
I'm much happier, that there is support for the 2 series in regex...
vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-13 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101113.zip is a new version of the regex module.

It now supports Unicode 6.0.0.

--
Added file: http://bugs.python.org/file19597/issue2636-20101113.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-13 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Thank you very much!
a quick test with my custom unicodedata with 6.0 on py 2.7 seems ok.
I hope, there won't be problems with cooperation of the more recent internal 
data with the original 5.2 database in python 2.x releases.

vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-11 Thread Alex Willmer

Alex Willmer a...@moreati.org.uk added the comment:

The re module throws an exception for re.compile(r'[\A\w]'). The latest
regex doesn't, but I don't think the pattern is matching correctly.
Shouldn't findall(r'[\A]\w', 'a b c') return ['a'] and
findall(r'[\A\s]\w', 'a b c') return ['a', ' b', ' c'] ?

Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> for s in [r'\A\w', r'[\A]\w', r'[\A\s]\w']: print re.findall(s, 'a b c')
...
['a']
[]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
  File "/usr/lib/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression
sre_constants.error: internal: unsupported set operator
>>> import regex
>>> for s in [r'\A\w', r'[\A]\w', r'[\A\s]\w']: print regex.findall(s, 'a b c')
...
['a']
[]
[' b', ' c']

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-11 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Maybe I am missing something, but the results in regex seem OK to me:
\A is treated like A in a character set; when the test string is changed to 
'A b c', or in a case-insensitive search, the A is matched.

[\A\s]\w doesn't match the starting a, as it is not followed by any word 
character:

>>> for s in [r'\A\w', r'[\A]\w', r'[\A\s]\w']: print regex.findall(s, 'A b c')
... 
['A']
[]
[' b', ' c']
>>> for s in [r'\A\w', r'(?i)[\A]\w', r'[\A\s]\w']: print regex.findall(s, 'a b c')
... 
['a']
[]
[' b', ' c']

In the original re there seems to be a bug/limitation in this regard (\A and 
also \Z in character sets aren't supported in some combinations).

vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-11 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

It looks like a similar problem to msg116252 and msg116276.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-11 Thread Alex Willmer

Alex Willmer a...@moreati.org.uk added the comment:

On Thu, Nov 11, 2010 at 10:20 PM, Vlastimil Brom rep...@bugs.python.org wrote:
> Maybe I am missing something, but the result in regex seem ok to me:
> \A is treated like A in a character set;

I think it's me who missed something. I'd assumed that all backslash
patterns (including \A for beginning of string) maintain their meaning
in a character class. AFAICT that assumption was wrong.
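
For the record, the "start of string or whitespace, then a word character" match I was after can be written with an alternation instead of a character class. A minimal sketch, assuming Python 3 and the stock re module:

```python
import re

# \A is an anchor, not a character, so it goes in an alternation
# rather than inside [...].
START_OR_SPACE = re.compile(r"(?:\A|\s)\w")

matches = START_OR_SPACE.findall("a b c")
print(matches)  # ['a', ' b', ' c']
```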

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-05 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101106.zip is a new version of the regex module.

Fix for issue 10328, which regex also shared.

--
Added file: http://bugs.python.org/file19514/issue2636-20101106.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-02 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

There seems to be a bug in the handling of numbered backreferences in sub() in
issue2636-20101102.zip
I believe it would be a fairly new regression, as it would have been noticed 
rather soon.
(tested on Python 2.7; winXP)

>>> re.sub("([xy])", "-\\1-", "abxc")
'ab-x-c'
>>> regex.sub("([xy])", "-\\1-", "abxc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 176, in sub
    return _compile(pattern, flags).sub(repl, string, count, pos, endpos)
  File "C:\Python27\lib\regex.py", line 375, in _compile_replacement
    compiled.extend(items)
TypeError: 'int' object is not iterable


vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-02 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Sorry for the noise, please forget my previous msg120215;
I somehow managed to keep an older version of _regex_core.py along with the new 
regex.py in the Lib directory, which are obviously incompatible.
After updating the files correctly, the mentioned examples work correctly.

vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-02 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101102a.zip is a new version of the regex module.

msg120204 relates to issue #1519638 "Unmatched group in replacement". In 
'regex' an unmatched group is treated as an empty string in a replacement 
template. This behaviour is more in keeping with regex implementations in other 
languages.

msg120206 was caused by not all group references being made case-insensitive 
when they should be.
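
Code that has to behave the same way under either engine can sidestep the template question by using a replacement function. A minimal sketch, assuming Python 3; swap is a hypothetical helper that maps an unmatched group to an empty string explicitly:

```python
import re

PATTERN = re.compile(r"(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?")


def swap(match):
    """Equivalent of the template '\\2 \\1, \\4 \\3', tolerating None."""
    def group(n):
        # group(n) is None when the optional part did not participate.
        return match.group(n) or ""
    return "{} {}, {} {}".format(group(2), group(1), group(4), group(3))


print(repr(PATTERN.sub(swap, "test: 2")))
print(repr(PATTERN.sub(swap, "test: 2, TEST: 3")))
```

The function form makes the choice explicit at the call site instead of relying on how the engine expands templates.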

--
Added file: http://bugs.python.org/file19469/issue2636-20101102a.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-01 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101101.zip is a new version of the regex module.

I hope it's finally fixed this time! :-)

--
Added file: http://bugs.python.org/file19456/issue2636-20101101.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-01 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

OK, I think this might be the last one I will find for the moment:

$ cat test.py
import re, regex

text = "test?"
regexp = "test\?"
sub_value = "result\?"
print repr(re.sub(regexp, sub_value, text))
print repr(regex.sub(regexp, sub_value, text))


$ python test.py
'result\\?'
'result?'

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-01 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101102.zip is a new version of the regex module.

--
Added file: http://bugs.python.org/file19460/issue2636-20101102.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-01 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Spoke too soon, although this might be a valid divergence in behavior:

$ cat test.py 
import re, regex

text = "test: 2"

print regex.sub('(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?', '\\2 \\1, \\4 \\3', text)
print re.sub('(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?', '\\2 \\1, \\4 \\3', text)


$ python test.py 
2 test,  
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print re.sub('(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?', '\\2 \\1, \\4 \\3', text)
  File "/usr/lib64/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib64/python2.7/re.py", line 278, in filter
    return sre_parse.expand_template(template, match)
  File "/usr/lib64/python2.7/sre_parse.py", line 787, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-01 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Another, with backreferences:

import re, regex

text = "TEST, BEST; LEST ; Lest 123 Test, Best"
regexp = "(?i)(.{1,40}?),(.{1,40}?)(?:;)+(.{1,80}).{1,40}?\\3(\ |;)+(.{1,80}?)\\1"
print re.findall(regexp, text)
print regex.findall(regexp, text)

$ python test.py 
[('TEST', ' BEST', ' LEST', ' ', '123 ')]
[('T', ' BEST', ' ', ' ', 'Lest 123 ')]

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-31 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

And another, bit less pathological, testcase.  Sorry for the ugly testcase;  it 
was much worse before I boiled it down :-)

$ cat test.py 
import re, regex

text = "\nTest\nxyz\nxyz\nEnd"

regexp = '(\nTest(\n+.+?){0,2}?)?\n+End'
print re.findall(regexp, text)
print regex.findall(regexp, text)

$ python test.py
[('\nTest\nxyz\nxyz', '\nxyz')]
[('', '')]

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-30 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101030a.zip is a new version of the regex module.

This bug was a bit more difficult to fix, but I think it's OK now!

--
Added file: http://bugs.python.org/file19435/issue2636-20101030a.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-30 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Here's one that really falls in the category of "don't do that";  but I found 
this because I was limiting the system recursion level to somewhat less than 
the standard 1000 (for other reasons), and I had some shorter duplicate 
patterns in a big regex.  Here is the simplest case to make it blow up with the 
standard recursion settings:

$ cat test.py
import re, regex
regexp = 
'(abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ|abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ)'
re.compile(regexp)
regex.compile(regexp)

$ python test.py
<snip big traceback except for last few lines>

  File "/tmp/test/src/lib/_regex_core.py", line 2024, in optimise
    subpattern = subpattern.optimise(info)
  File "/tmp/test/src/lib/_regex_core.py", line 1552, in optimise
    branches = [_Branch(branches)]
RuntimeError: maximum recursion depth exceeded

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-29 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Do we expect this to work on 64-bit Linux and Python 2.6.5?  I've compiled and 
run some of my code through this, and there seem to be issues with non-greedy 
quantifier matching (at least relative to the old re module):

$ cat test.py
import re, regex

text = "(MY TEST)"
regexp = '\((?P<test>.{0,5}?TEST)\)'
print re.findall(regexp, text)
print regex.findall(regexp, text)


$ python test.py
['MY TEST']
[]

python 2.7 produces the same results for me.

However, making the quantifier greedy (removing the '?') gives the same result 
for both re and regex modules.

--
nosy: +jacques




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-29 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

That's a bug. I'll fix it as soon as I've reinstalled the SDK. <sigh/>

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-29 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101029.zip is a new version of the regex module.

I've also added to the unit tests.

--
Added file: http://bugs.python.org/file19419/issue2636-20101029.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-29 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

Here's another inconsistency (same setup as before, running 
issue2636-20101029.zip code):

$ cat test.py
import re, regex

text = "\n  S"

regexp = '[^a]{2}[A-Z]'
print re.findall(regexp, text)
print regex.findall(regexp, text)

$ python test.py
['  S']
[]


I might flush out some more as I exercise this over the next few days.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-29 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101030.zip is a new version of the regex module.

I've also added yet more to the unit tests.

--
Added file: http://bugs.python.org/file19422/issue2636-20101030.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-29 Thread Jacques Grove

Jacques Grove jacq...@tripitinc.com added the comment:

And another (with issue2636-20101030.zip):

$ cat test.py 
import re, regex
text = "XYABCYPPQ\nQ DEF"
regexp = 'X(Y[^Y]+?){1,2}(\ |Q)+DEF'
print re.findall(regexp, text)
print regex.findall(regexp, text)

$ python test.py 
[('YPPQ\n', ' ')]
[]

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-14 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I wanted to give the 64-bit version a try, but I might have encountered a more 
general difficulty.
I tested this on Windows 7 Home Premium (Czech); the system is 64-bit (or so I 
had hoped so far :-), according to System info: x64-based PC.
I installed
Python 2.7 Windows X86-64 installer
from http://www.python.org/download/
which ran OK, but the header in the Python shell contains "win32":

Python 2.7 (r27:82525, Jul  4 2010, 07:43:08) [MSC v.1500 64 bit (AMD64)] on 
win32
Type "help", "copyright", "credits" or "license" for more information.

Consequently, after copying the respective files from issue2636-20101009.zip
I get an import error:

>>> import regex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python_64bit_27\lib\regex.py", line 253, in <module>
    from _regex_core import *
  File "C:\Python_64bit_27\lib\_regex_core.py", line 53, in <module>
    import _regex
ImportError: DLL load failed: %1 nenÝ platnß aplikace typu Win32.

>>>

(The last part of the message is in Czech with broken diacritics:
 "%1 is not a valid Win32 type application.")

Is there something I can do in this case? I'd think the installer would refuse 
to install 64-bit software on a 32-bit OS or 32-bit architecture, or am I 
missing something obvious in the naming peculiarities (x64, 64bit etc.)?
That being said, I probably don't need to use the 64-bit version of Python. 
Obviously, it isn't the wide Unicode build mentioned earlier, hence
>>> len(u"\U00010333") # is still:
2

And I currently don't have special memory requirements, which might be better 
addressed on a 64-bit system.

If there is something I can do to test regex in this environment, please let 
me know.
On the same machine the 32-bit version is OK:
Python 2.7 (r27:82525, Jul  4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on 
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex


regards
   vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-14 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

Vlastimil, what makes you think that issue2636-20101009.zip is a 64-bit version? 
I can only find 32-bit DLLs in it.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-14 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Well, it seemed so to me too.
I happened to read the last post from Matthew, msg118243, in the sense that he 
made some updates which need testing on a 64-bit system (I am unsure whether 
hardware architecture, OS type, Python build or something else was meant); I 
assumed the 64-bit part must have been separated somehow into a new directory in 
issue2636-20101009.zip, which is not the case.

More generally, I was somehow confused about the "win32" in the shell header in 
the mentioned install.
vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-14 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

I am not able to build or test a 64-bit version. The update was to the source 
files to ensure that if it is compiled for 64 bits then the string positions 
will also be 64-bit.

This change was prompted by a poster who tried to use the re module of a 64-bit 
Python build on a 30GB memmapped file but found that the string positions were 
still limited to 32 bits.

It looked like a 64-bit build of the regex module would have the same 
limitation.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-14 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Sorry for the noise,
it seems, I can go back to the 32-bit python for now then...
vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-08 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20101009.zip is a new version of the regex module.

It appears from a posting in python-list and a closer look at the docs that 
string positions in the 're' module are limited to 32 bits, even on 64-bit 
builds. I think it's because of things like:

Py_BuildValue("i", ...)

where 'i' indicates the size of a C int, which, at least in Windows compilers, 
is 32 bits in both 32-bit and 64-bit builds.

The regex module shared the same problem. I've changed such code to:

Py_BuildValue("n", ...)

and so forth, which indicates Py_ssize_t.

Unfortunately I'm not able to confirm myself that this will fix the problem on 
64 bits.
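
The effect of the two format codes can be approximated from Python itself. This is only a rough sketch, not the module's C code: it assumes that ctypes.c_int and ctypes.c_ssize_t have the same widths as the C types behind 'i' and 'n', and the position value is made up for illustration.

```python
import ctypes

# A hypothetical string position just past the 2 GiB mark.
position = 2**31 + 5

# ctypes integer types do no overflow checking, so a value that does not
# fit is silently truncated to the width of the C type.
as_c_int = ctypes.c_int(position).value
as_ssize_t = ctypes.c_ssize_t(position).value

print(ctypes.sizeof(ctypes.c_int), as_c_int)
print(ctypes.sizeof(ctypes.c_ssize_t), as_ssize_t)
```

On a 64-bit build the second line should show the position unchanged, while the C int version wraps around to a negative number.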

--
Added file: http://bugs.python.org/file19168/issue2636-20101009.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-21 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

I use Python 3, where len("\U00010337") == 2 on a narrow build.

Yes, wide Unicode on a narrow build is a problem:

>>> regex.findall("\\U00010337", "a\U00010337bc")
[]
>>> regex.findall("(?i)\\U00010337", "a\U00010337bc")
[]

I'm not sure how (or whether!) to handle surrogate pairs. It _would_ make 
things more complicated.

I suppose the moral is that if you want to use wide Unicode then you really 
should use a wide build.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-21 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Well, of course, the surrogates probably shouldn't be handled separately in one 
module independently of the rest of the standard library. (I actually don't 
know of any such narrow implementation, although it is mentioned in those Unicode 
guidelines: 
http://unicode.org/reports/tr18/#Supplementary_Characters )

The main surprise on my part was due to the compile error rather than the empty 
match, as was the case with re; 
but now I see that it is a consequence of the newly introduced wide Unicode 
notation; the matching behaviour changed consistently.

(For my part, the workarounds I found seem to be sufficient in the cases where I 
work with wide Unicode; most likely I am not going to compile a wide Unicode 
build on Windows myself in the near future :-)
 vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-20 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I like the idea of a general new flag introducing the reasonable, backwards 
incompatible behaviour; one doesn't have to remember a list of non-standard 
flags to get these features.

While I recognise that the module probably can't work correctly with wide 
Unicode characters on a narrow Python build (py 2.7, win XP in this case), I 
noticed a difference to re in this regard (it might be based on the absence of 
the wide Unicode literal in the latter).

>>> re.findall(u"\\U00010337", u"a\U00010337bc")
[]
>>> re.findall(u"(?i)\\U00010337", u"a\U00010337bc")
[]
>>> regex.findall(u"\\U00010337", u"a\U00010337bc")
[]
>>> regex.findall(u"(?i)\\U00010337", u"a\U00010337bc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 203, in findall
    return _compile(pattern, flags).findall(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 310, in _compile
    parsed = parsed.optimise(info)
  File "C:\Python27\lib\_regex_core.py", line 1735, in optimise
    if self.is_case_sensitive(info):
  File "C:\Python27\lib\_regex_core.py", line 1727, in is_case_sensitive
    return char_type(self.value).lower() != char_type(self.value).upper()
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

I.e. re fails to match this pattern (as it actually looks for the literal text 
"U00010337"); regex doesn't recognise the wide Unicode character as a surrogate 
pair either, but it also raises an error from the narrow unichr. I'm not sure 
whether or how it should be fixed, but the difference based on the i-flag seems 
unusual.

Of course it would be nice if surrogate pairs were interpreted, but I can 
imagine that it would open a whole can of worms, as this is not thoroughly 
supported in the builtin unicode type either (len, indices, slicing).

I am trying to make wide Unicode characters somehow usable in my app, mainly 
with hacks like an extended unichr:
("\\U"+hex(67)[2:].zfill(8)).decode("unicode-escape") 
or likewise for ord:
surrog_ord = (ord(first) - 0xD800) * 0x400 + (ord(second) - 0xDC00) + 0x10000
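
The surrogate arithmetic above can be written out in both directions. The following is a minimal sketch in Python 3, not code from the regex module; the function names combine_surrogates and split_code_point are made up for illustration, and U+10337 is simply the character used in the examples in this comment.

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 surrogate pair."""
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000


def split_code_point(code_point):
    """Return the (high, low) surrogate pair for a code point above U+FFFF."""
    offset = code_point - 0x10000
    # The top 10 bits go into the high surrogate, the low 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = split_code_point(0x10337)
print(hex(high), hex(low))
print(hex(combine_surrogates(high, low)))
```

The pair produced this way should agree with what the UTF-16 codec emits for the same character, which is one way to check the arithmetic.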

Actually, using regex, one can work around some of these limitations of len, 
index or slice by using a list form of the string containing surrogates:

>>> regex.findall(ur"(?s)(?:\p{inHighSurrogates}\p{inLowSurrogates})|.", u"ab\U00010337\U00010338\U00010339cd")
[u'a', u'b', u'\U00010337', u'\U00010338', u'\U00010339', u'c', u'd']

but apparently things like wide Unicode literals or character sets (even 
extending the shorthands like \w etc.) are much more complicated.

regards,
   vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-17 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

issue2636-20100918.zip is a new version of the regex module.

I've added 'pos' and 'endpos' arguments to regex.sub and regex.subn and 
refactored a little.

I can't think of any other features that need to be added or see any more speed 
improvements.

Have I missed anything important? :-)

--
Added file: http://bugs.python.org/file18913/issue2636-20100918.zip




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-12 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

(?flags) are still scoping by default... a new flag to activate that behavior 
would really be helpful  :)

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-12 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

Another flag? Hmm.

How about this instead: if a scoped flag appears at the end of a regex (and 
would therefore normally have no effect) then it's treated as though it's at 
the start of the regex. Thus:

foo(?i)

is treated like:

(?i)foo

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-12 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Not that my opinion matters, but for what it is worth, I find it rather unusual 
to have to use special flags to get normal (for some definition of normal) 
behaviour, while keeping defaults that are buggy in some way (like ZEROWIDTH). I 
would think that backwards compatibility would not be needed under these 
circumstances, in such probably marginal cases (or is setting global flags at 
the end, or anywhere other than at the beginning of the pattern, that frequent?). 
It seems that, with many new features and enhancements for previously impossible 
patterns, chances are that code using regular expressions in a more 
advanced way might benefit from reviewing the patterns (where the flags 
for historical behaviour could also be adjusted if really needed).

Anyway, thanks for the further improvements! (Although they broke my custom 
function that previously misused the internal data of the regex module to get 
the Unicode script property (currently unavailable via unicodedata) :-).

Best regards,
   vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-12 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

The tests for re include these regexes:

a.b(?s)
a.*(?s)b

I understand what Georg said previously about some people preferring to put 
them at the end, but I personally wouldn't do that because some regex 
implementations support scoped inline flags, although others, like re, don't.
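
To make the distinction concrete, here is a minimal sketch of a scoped versus an unscoped inline flag. It assumes Python 3.6 or later, where the standard re module accepts the (?i:...) group form; that form postdates this discussion and is not necessarily how the regex module under discussion scopes its flags. The sample strings are made up for illustration.

```python
import re

# Scoped form: IGNORECASE applies only inside the (?i:...) group,
# so "bar" must still match in lower case.
scoped = re.findall(r"(?i:foo)bar", "FOObar fooBAR")

# Global form: a leading (?i) applies to the whole pattern.
unscoped = re.findall(r"(?i)foobar", "FOObar fooBAR")

print(scoped)
print(unscoped)
```

With the scoped form only the first word should match; with the global form both should.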

I think that second regex is a bit perverse, though! :-)

On the other matter, I could make the Unicode script and block available 
through a couple of functions if you need them, eg:

# Using Python 3 here
>>> regex.script("A")
'Latin'
>>> regex.block("A")
'BasicLatin'

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-12 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

Matthew, I understand why you want to have these flags scoped, and if you 
designed a regex dialect from scratch, that would be the way to go.  However, 
if we want to integrate this in Python 3.2 or 3.3, this is an absolute killer 
if it's not backwards compatible.

I can live with behavior changes that really are bug fixes, and of course with 
new features that were invalid syntax before, but this is changing an aspect 
that was designed that way (as the test case shows), and that really is not 
going to happen without an explicit new flag. Special-casing the "flags at the 
end" case is too magical to be of any help.

It will be hard enough to get your code into Python -- it is a huge new 
codebase for an absolutely essential module.  I'm nevertheless optimistic that 
it is going to happen at some point or other.  Of course, you would have to 
commit to maintaining it within Python for the foreseeable future.

The script and block functions really belong in unicodedata; you'll have 
to coordinate that with Marc-Andre.

@Vlastimil: backwards compatibility is needed very much here.  Nobody wants to 
review all their regexes when switching from Python 3.1 to Python 3.2.  Many 
people will not care about the improved engine, they just expect their regexes 
to work as before, and that is a perfectly fine attitude.

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-12 Thread Matthew Barnett

Matthew Barnett pyt...@mrabarnett.plus.com added the comment:

OK, so would it be OK if there was, say, a NEW (N) flag which made the inline 
flags (?flags) scoped and allowed splitting on zero-width matches?
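
For readers unfamiliar with the second half of that proposal, this is roughly what splitting on a zero-width match looks like. The sketch uses the standard re module and assumes Python 3.7 or later, where re.split splits on empty matches; as far as I know, the versions discussed in this thread did not split on such a pattern. The pattern and sample string are made up for illustration.

```python
import re

# The lookahead matches the empty string just before each capital letter,
# so every split point is a zero-width match.
parts = re.split(r"(?=[A-Z])", "fooBarBaz")
print(parts)
```

Because no characters are consumed at the split points, joining the parts back together should reproduce the original string.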

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-12 Thread Brian Curtin

Changes by Brian Curtin cur...@acm.org:


--
nosy:  -brian.curtin



