[issue1693050] \w not helpful for non-Roman scripts

2013-05-29 Thread Matthew Barnett
Matthew Barnett added the comment: You could've obtained it from msg76556 or msg190100: >>> print(ascii('हिन्दी')) '\u0939\u093f\u0928\u094d\u0926\u0940' >>> import re, regex >>> print(ascii(re.match(r"\w+", >>>

[issue1693050] \w not helpful for non-Roman scripts

2013-05-28 Thread Matthew Barnett
Matthew Barnett added the comment: I'm not sure what you're saying. The re module in Python 3.3 matches only the first codepoint, treating the second codepoint as not part of a word, whereas the regex module matches all 6 codepoints, treating them all as part of a s

[issue7940] re.finditer and re.findall should support negative end positions

2013-05-26 Thread Matthew Barnett
Matthew Barnett added the comment: Like the OP, I would've expected it to handle negative indexes the way that strings do. In practice, I wouldn't normally provide negative indexes; I'd use some string or regex method to determine the search limits, and then pass them to findit

[issue1693050] \w not helpful for non-Roman scripts

2013-05-26 Thread Matthew Barnett
Matthew Barnett added the comment: I had to check what re does in Python 3.3: >>> print(len(re.match(r'\w+', 'हिन्दी').group())) 1 Regex does this: >>> print(len(regex.match(r'\w+', 'हिन्दी').group())) 6 -- ___

[issue814253] Grouprefs in lookbehind assertions

2013-05-25 Thread Matthew Barnett
Matthew Barnett added the comment: Issue #2636 resulted in the regex module, which supports variable-length look-behinds. I don't know how much work it would take even to put a limited fixed-length look-behind fix for this into the re module, so I'm afraid the issue must r

[issue7940] re.finditer and re.findall should support negative end positions

2013-05-25 Thread Matthew Barnett
Matthew Barnett added the comment: I've attached a patch. -- keywords: +patch Added file: http://bugs.python.org/file30377/issue7940.patch ___ Python tracker <http://bugs.python.org/i

[issue7940] re.finditer and re.findall should support negative end positions

2013-05-23 Thread Matthew Barnett
Matthew Barnett added the comment: Yes. As msg99456 suggests, I fixed it in my source code before posting. Compare re in Python 3.3.2: >>> re.compile('x').findall('', 1, 3) ['x', 'x'] >>> re.compile('x').findall('x

[issue17998] internal error in regular expression engine

2013-05-17 Thread Matthew Barnett
Matthew Barnett added the comment: Here are some simpler examples of the bug: re.compile('.*yz', re.S).findall('xyz') re.compile('.?yz', re.S).findall('xyz') re.compile('.+yz', re.S).findall('xyz') Unfortunately I find it difficult to

[issue17668] re.split loses characters matching ungrouped parts of a pattern

2013-04-08 Thread Matthew Barnett
Matthew Barnett added the comment: It's not a bug. The documentation says """Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."
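
A minimal sketch of the documented behaviour quoted above, using made-up input strings: only text matched by capturing groups is returned alongside the split pieces, while ungrouped parts of the separator are consumed.

```python
import re

text = "2013-05-29"

# No capturing group: the separators are dropped from the result.
plain = re.split(r"-", text)

# Capturing group: the text matched by the group is returned as well.
grouped = re.split(r"(-)", text)

# Only the grouped part comes back; the ungrouped "x" is consumed.
partial = re.split(r"x(-)", "ax-b")

print(plain, grouped, partial)
```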

[issue17447] str.identifier shouldn't accept Python keywords

2013-03-17 Thread Matthew Barnett
Matthew Barnett added the comment: I already use it in the regex module for named groups. I don't think it would ever be a problem in practice because the names are invariably handled as strings. -- nosy: +mrabarnett ___ Python tracker

[issue17426] \0 in re.sub substitutes to space

2013-03-15 Thread Matthew Barnett
Matthew Barnett added the comment: The regex behaves the same as re. The reason it isn't supported is that \0 starts an octal escape sequence. -- ___ Python tracker <http://bugs.python.org/is
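
As an illustration of the point above (the input strings are invented for the example), `\0` in a replacement template is read as an octal escape, so it inserts a NUL character; `\g<0>` is the documented way to refer to the whole match.

```python
import re

# \0 starts an octal escape, so the replacement is chr(0), not the match.
nul = re.sub(r"b", r"\0", "abc")

# \g<0> substitutes the entire matched text.
whole = re.sub(r"b", r"[\g<0>]", "abc")

print(repr(nul), repr(whole))
```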

[issue17381] IGNORECASE breaks unicode literal range matching

2013-03-11 Thread Matthew Barnett
Matthew Barnett added the comment: In issue #3511 the range was slightly unusual, so closing it seemed a reasonable approach, but the range in this issue is less clearly a problem. My preference would be to fix it, if possible. -- ___ Python

[issue17381] IGNORECASE breaks unicode literal range matching

2013-03-07 Thread Matthew Barnett
Matthew Barnett added the comment: The way the re handles ranges is to convert the two endpoints to lowercase and then check whether the lowercase form of the character in the text is in that range. For example, [A-Z] is converted to the range [\x41-\x5A], and the lowercase form of

[issue8402] Add a function to escape metacharacters in glob/fnmatch

2013-03-07 Thread Matthew Barnett
Matthew Barnett added the comment: I've attached fnmatch_implementation.py, which is a simple pure-Python implementation of the fnmatch function. It's not as susceptible to catastrophic backtracking as the current re-based one. For example: fnmatch('a' * 50, '*a*
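
The attached file is not reproduced in this listing. Purely as a sketch of how a regex-free matcher can avoid catastrophic backtracking, the hypothetical `wildcard_match` below handles only `*` and `?` (no character classes) and remembers just the most recent `*`, so its worst case is roughly the product of the two lengths. It is an assumption-laden illustration, not the patch's code.

```python
def wildcard_match(name, pattern):
    """Match name against a pattern containing only '*' and '?' wildcards."""
    n = p = 0
    star = -1   # index of the most recent '*' in pattern, or -1
    mark = 0    # position in name when that '*' was seen
    while n < len(name):
        if p < len(pattern) and pattern[p] == "*":
            # Remember the star and first try matching it against nothing.
            star, mark = p, n
            p += 1
        elif p < len(pattern) and pattern[p] in ("?", name[n]):
            n += 1
            p += 1
        elif star != -1:
            # Let the last '*' absorb one more character and retry.
            mark += 1
            n, p = mark, star + 1
        else:
            return False
    # Any trailing stars can match the empty string.
    while p < len(pattern) and pattern[p] == "*":
        p += 1
    return p == len(pattern)


print(wildcard_match("a" * 50, "*a*a*a*a*b"))
```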

[issue17297] Issue with return in recursive functions

2013-02-25 Thread Matthew Barnett
Matthew Barnett added the comment: This question should've been posted to python-l...@python.org, not here. Your functions are calling themselves, but not returning the result of the call to their own callers. -- ___ Python tracker
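
The original poster's functions are not shown here; the hypothetical pair below is only a sketch of the mistake described: a recursive call whose result is not returned leaves the outer caller with `None`.

```python
def countdown_broken(n):
    if n == 0:
        return "done"
    countdown_broken(n - 1)  # result is discarded, so this call returns None


def countdown_fixed(n):
    if n == 0:
        return "done"
    return countdown_fixed(n - 1)  # pass the result back up the call chain


print(countdown_broken(3), countdown_fixed(3))
```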

[issue694374] Recursive regular expressions

2013-02-23 Thread Matthew Barnett
Matthew Barnett added the comment: FYI, I did eventually add it to my regex implementation. It was quite challenging! -- ___ Python tracker <http://bugs.python.org/issue694

[issue17184] re.VERBOSE doesn't respect whitespace in '( ?P...)'

2013-02-11 Thread Matthew Barnett
Matthew Barnett added the comment: It does look like a duplicate to me. -- ___ Python tracker <http://bugs.python.org/issue17184> ___ ___ Python-bugs-list mailin

[issue17047] Fix double double words words

2013-02-06 Thread Matthew Barnett
Matthew Barnett added the comment: These are the ones that I think are wrong: Doc/c-api/long.rst:206 Return a C :c:type:`size_t` representation of of *pylong*. *pylong* must be Doc/c-api/long.rst:218 Return a C :c:type:`unsigned PY_LONG_LONG` representation of of *pylong*. Doc

[issue16203] Proposal: add re.fullmatch() method

2013-02-05 Thread Matthew Barnett
Matthew Barnett added the comment: 3 of the tests expect None when using 'fullmatch'; they won't return None when using 'match'. -- ___ Python tracker <http:

[issue16203] Proposal: add re.fullmatch() method

2013-02-04 Thread Matthew Barnett
Matthew Barnett added the comment: I've attached a patch. -- Added file: http://bugs.python.org/file28955/issue16203_mrab.patch ___ Python tracker <http://bugs.python.org/is

[issue13169] Regular expressions with 0 to 65536 repetitions raises OverflowError

2013-01-23 Thread Matthew Barnett
Matthew Barnett added the comment: IMHO, I don't think that MAXREPEAT should be defined in sre_constants.py _and_ SRE_MAXREPEAT defined in sre_constants.h. (In the latter case, why is it in decimal?) I think that it should be defined in one place, namely sre_constants.h, perhaps as: #d

[issue17016] _sre: avoid relying on pointer overflow

2013-01-23 Thread Matthew Barnett
Matthew Barnett added the comment: You're checking "int offset", but what happens with "unsigned int offset"? -- ___ Python tracker <http:

[issue17016] _sre: avoid relying on pointer overflow

2013-01-22 Thread Matthew Barnett
Matthew Barnett added the comment: Lines 1000 and 1084 will be a problem only if you're near the top of the address space. This is because: 1. ctx->pattern[1] will always be <= ctx->pattern[2]. 2. A value of 65535 in ctx->pattern[2] means unlimited, even though SRE_CODE i

[issue9669] regexp: zero-width matches in MIN_UNTIL

2013-01-15 Thread Matthew Barnett
Matthew Barnett added the comment: I've attached my attempt at a patch. -- keywords: +patch Added file: http://bugs.python.org/file28744/issue9669.patch ___ Python tracker <http://bugs.python.org/i

[issue13899] re pattern r"[\A]" should work like "A" but matches nothing. Ditto B and Z.

2013-01-07 Thread Matthew Barnett
Matthew Barnett added the comment: I've attached a patch. -- keywords: +patch Added file: http://bugs.python.org/file28614/issue13899.patch ___ Python tracker <http://bugs.python.org/is

[issue16870] re fails to match ^ when start index is specified ?

2013-01-05 Thread Matthew Barnett
Matthew Barnett added the comment: The semantics of '^' are common to many different regex implementations, including those of Perl and C#. The 'pos' argument merely gives the starting position of the search (C# also lets you provide a starting position, and behaves in
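
A small sketch of the behaviour in question, with an invented two-character string: `'^'` is anchored to the real start of the string, so supplying `pos` is not the same as slicing.

```python
import re

pattern = re.compile(r"^b")

# Searching from index 1: '^' still refers to index 0, so nothing matches.
with_pos = pattern.search("ab", 1)

# Slicing creates a new string whose start is the 'b'.
sliced = pattern.search("ab"[1:])

print(with_pos, sliced)
```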

[issue16741] `int()`, `float()`, etc think python strings are null-terminated

2012-12-30 Thread Matthew Barnett
Matthew Barnett added the comment: I've attached a small additional patch for truncating the UTF-8. I don't know whether it's strictly necessary, but I don't know that it's unnecessary either! (Better safe than sorry.) -- Added file: http://bugs.python.org/fil

[issue16741] `int()`, `float()`, etc think python strings are null-terminated

2012-12-29 Thread Matthew Barnett
Matthew Barnett added the comment: I've attached a patch. It now reports an invalid literal as-is: >>> int("#\N{ARABIC-INDIC DIGIT ONE}") Traceback (most recent call last): File "", line 1, in int("#\N{ARABIC-INDIC DIGIT ONE}") ValueError:

[issue16741] `int()`, `float()`, etc think python strings are null-terminated

2012-12-23 Thread Matthew Barnett
Matthew Barnett added the comment: It occurred to me that the truncation of the string when building the error message could cause a UnicodeDecodeError: >>> int("1".ljust(199) + "\u0100") Traceback (most recent call last): File "", line

[issue16741] `int()`, `float()`, etc think python strings are null-terminated

2012-12-21 Thread Matthew Barnett
Matthew Barnett added the comment: Python takes a long way round when converting strings to int. It does the following (I'll be talking about Python 3.3 here): 1. In function 'fix_decimal_and_space_to_ascii', the different kinds of spaces are converted to " " and the

[issue1075356] exceeding obscure weakproxy bug

2012-12-19 Thread Matthew Barnett
Matthew Barnett added the comment: The patch "issue1075356.patch" is my attempt to fix this bug. 'PyArg_ParseTuple', etc, eventually call 'convertsimple'. What this patch does is to insert some code at the start of 'convertsimple' that checks whether the

[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Matthew Barnett
Changes by Matthew Barnett : Removed file: http://bugs.python.org/file28330/issue16688#3.patch ___ Python tracker <http://bugs.python.org/issue16688> ___ ___ Python-bug

[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Matthew Barnett
Matthew Barnett added the comment: Oops! :-( Now corrected. -- Added file: http://bugs.python.org/file28332/issue16688#3.patch ___ Python tracker <http://bugs.python.org/issue16

[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Matthew Barnett
Matthew Barnett added the comment: Here are some tests for the issue. -- Added file: http://bugs.python.org/file28330/issue16688#3.patch ___ Python tracker <http://bugs.python.org/issue16

[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Matthew Barnett
Matthew Barnett added the comment: I haven't found any other issues, so here's the second patch. -- Added file: http://bugs.python.org/file28325/issue16688#2.patch ___ Python tracker <http://bugs.python.o

[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Matthew Barnett
Matthew Barnett added the comment: I found another bug while looking through the source. On line 495 in function SRE_COUNT: if (maxcount < end - ptr && maxcount != 65535) end = ptr + maxcount*state->charsize; where 'end' and 'ptr' are of type

[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Matthew Barnett
Matthew Barnett added the comment: OK, here's a patch. -- keywords: +patch Added file: http://bugs.python.org/file28321/issue16688.patch ___ Python tracker <http://bugs.python.org/is

[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread Matthew Barnett
Matthew Barnett added the comment: In function SRE_MATCH, the code for SRE_OP_GROUPREF (line 1290) contains this: while (p < e) { if (ctx->ptr >= end || SRE_CHARGET(state, ctx->ptr, 0) != SRE_CHARGET(state, p, 0)) RETURN_FAILURE; p += sta

[issue16619] LOAD_GLOBAL used to load `None` under certain circumstances

2012-12-05 Thread Matthew Barnett
Matthew Barnett added the comment: The same problem occurs with both `False` and `True`. -- nosy: +mrabarnett ___ Python tracker <http://bugs.python.org/issue16

[issue11204] re module: strange behaviour of space inside {m, n}

2012-12-02 Thread Matthew Barnett
Matthew Barnett added the comment: The question is whether re should always treat 'b{1, 3}a' as a literal, even with the VERBOSE flag. I've checked with Perl 5.14.2, and it agrees with re: adding a space _always_ makes it a literal, even with the 'x' flag (/b{1, 3}a/x
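
A sketch of the non-VERBOSE case only, with made-up test strings: once a space appears inside the braces, `re` stops treating them as a quantifier and matches the characters literally.

```python
import re

# Without a space, {1,3} is a quantifier on 'b'.
quantified = re.fullmatch(r"b{1,3}a", "bba")

# With a space, the braces and their contents are literal text.
literal = re.fullmatch(r"b{1, 3}a", "b{1, 3}a")
not_quantified = re.fullmatch(r"b{1, 3}a", "bba")

print(bool(quantified), bool(literal), bool(not_quantified))
```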

[issue11204] re module: strange behaviour of space inside {m, n}

2012-12-02 Thread Matthew Barnett
Matthew Barnett added the comment: Interesting. In my regex module (http://pypi.python.org/pypi/regex) I have: bool(regex.match(pat, "bb", regex.VERBOSE)) # True bool(regex.match(pat, "b{1,3}", regex.VERBOSE)) # False because I thought that when the VERBOSE flag is turned

[issue16203] Proposal: add re.fullmatch() method

2012-10-16 Thread Matthew Barnett
Matthew Barnett added the comment: OK, in order to avoid bikeshedding, "fullmatch" it is. -- ___ Python tracker <http://bugs.python.org/issue16203> ___ ___

[issue16203] Proposal: add re.fullmatch() method

2012-10-16 Thread Matthew Barnett
Matthew Barnett added the comment: re2's FullMatch method contrasts with its PartialMatch method, which re doesn't have! -- ___ Python tracker <http://bugs.python.o

[issue16203] Proposal: add re.fullmatch() method

2012-10-16 Thread Matthew Barnett
Matthew Barnett added the comment: I'm about to add this to my regex implementation and, naturally, I want it to have the same name for compatibility. However, I'm not that keen on "fullmatch" and would prefer "matchall"

[issue16203] Proposal: add re.fullmatch() method

2012-10-13 Thread Matthew Barnett
Matthew Barnett added the comment: It certainly appears to ignore the whitespace, even if the "(?x)" is at the end of the pattern or in the middle of a group. Another point we need to consider is that the user might want to use a pre-compil

[issue16203] Proposal: add re.fullmatch() method

2012-10-13 Thread Matthew Barnett
Matthew Barnett added the comment: Tim, my point is that if the MULTILINE flag happens to be turned on, '$' won't just match at the end of the string (or slice), it'll also match at a newline, so wrapping the pattern in (?:...)$ in that case could give the wrong answer,
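
To make the concern concrete (the sample text is an assumption), the sketch below compares the `(?:...)$` wrapping trick with `re.fullmatch`, which was eventually added in Python 3.4, when MULTILINE is on.

```python
import re

text = "abc\ndef"

# With MULTILINE, '$' also matches just before the newline, so the
# wrapped pattern "succeeds" after consuming only the first line.
wrapped = re.match(r"(?:abc)$", text, re.MULTILINE)

# fullmatch requires the whole string to match, whatever the flags.
full = re.fullmatch(r"abc", text, re.MULTILINE)

print(wrapped, full)
```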

[issue16203] Proposal: add re.fullmatch() method

2012-10-12 Thread Matthew Barnett
Matthew Barnett added the comment: '$' will match at the end of the string or just before the final '\n': >>> re.match(r'abc$', 'abc\n') <_sre.SRE_Match object at 0x00F15448> So shouldn't you be using r'\Z' instea
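
The difference can be sketched as follows, reusing the string from the comment: `'$'` tolerates a trailing newline, whereas `\Z` matches only at the very end of the string.

```python
import re

dollar = re.match(r"abc$", "abc\n")    # '$' matches before the final '\n'
strict = re.match(r"abc\Z", "abc\n")   # '\Z' needs the true end of string

print(dollar, strict)
```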

[issue15956] backreference to named group does not work

2012-09-18 Thread Matthew Barnett
Matthew Barnett added the comment: There needed to be a way of referring to named groups in the replacement template. The existing form \groupnumber clearly wouldn't work. Other regex implementations, such as Perl, do have \g and also \k (for named groups). In my implementation I
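
For reference, a short sketch of the `\g<name>` form in a replacement template; the group names and input are invented for the example.

```python
import re

swapped = re.sub(
    r"(?P<first>\w+) (?P<second>\w+)",
    r"\g<second> \g<first>",   # refer to groups by name in the template
    "hello world",
)

print(swapped)
```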

[issue10076] Regex objects became uncopyable in 2.5

2012-08-26 Thread Matthew Barnett
Matthew Barnett added the comment: Is it necessary to actually copy it? Isn't the pattern object immutable? -- nosy: +mrabarnett ___ Python tracker <http://bugs.python.org/is

[issue15606] re.VERBOSE doesn't ignore certain whitespace

2012-08-10 Thread Matthew Barnett
Matthew Barnett added the comment: Ideally, yes, that whitespace should be ignored. The question is whether it's worth fixing the code for the small case of when there's whitespace within "tokens", such as within "(?:". Usually those who use verbose mode use whit

[issue15537] MULTILINE confuses re.split

2012-08-02 Thread Matthew Barnett
Matthew Barnett added the comment: There are actually 2 issues here: 1. The third argument is 'maxsplit', the fourth is 'flags'. 2. It never splits on a zero-width match. See issue 3262. -- ___ Python tracker <http://bug
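
The first point can be sketched like this (example strings are assumptions): because `maxsplit` comes before `flags` in `re.split`, a flag passed as the third positional argument is silently read as a split limit. Passing both by keyword avoids the mix-up. The zero-width-match point is not illustrated here.

```python
import re

# re.MULTILINE is just an integer, so as a third positional argument
# it would be interpreted as maxsplit=8.
flag_value = int(re.MULTILINE)

# Keyword arguments make the intent explicit.
limited = re.split(r",", "a,b,c,d", maxsplit=1)
by_line = re.split(r"b$\n", "ab\ncb\nd", flags=re.MULTILINE)

print(flag_value, limited, by_line)
```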

[issue15515] Regular expression match does not return

2012-07-31 Thread Matthew Barnett
Matthew Barnett added the comment: It's probably inappropriate for me to mention that the alternative 'regex' module on PyPI completes promptly, so I won't. :-) -- ___ Python tracker <http://bug

[issue15515] Regular expression match does not return

2012-07-31 Thread Matthew Barnett
Matthew Barnett added the comment: That's because it uses a pathological regular expression (catastrophic backtracking). The problem lies here: (\\?[\w\.\-]+)+ -- ___ Python tracker <http://bugs.python.org/is

[issue13592] repr(regex) doesn't include actual regex

2012-07-19 Thread Matthew Barnett
Matthew Barnett added the comment: Python 2.7 is the end of the Python 2 line, and it's closed except for security fixes. -- ___ Python tracker <http://bugs.python.org/is

[issue15372] Python is missing alternative for common quoting character

2012-07-16 Thread Matthew Barnett
Matthew Barnett added the comment: A codepoint such as "é" ("\N{LATIN SMALL LETTER E WITH ACUTE}") can be decomposed to "\u0065\u0301" ("\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}"), but "\u201c" ("\N{LEFT DOUBLE QUOTATION
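
The decomposition of "é" mentioned above can be checked with the standard `unicodedata` module; this sketch covers only that first half of the sentence.

```python
import unicodedata

composed = "\N{LATIN SMALL LETTER E WITH ACUTE}"

# NFD splits the precomposed letter into base letter + combining accent.
decomposed = unicodedata.normalize("NFD", composed)

# NFC recombines them into the single codepoint.
recomposed = unicodedata.normalize("NFC", decomposed)

print(ascii(decomposed), ascii(recomposed))
```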

[issue15216] Support setting the encoding on a text stream after creation

2012-06-30 Thread Matthew Barnett
Matthew Barnett added the comment: Would a "set_encoding" method be Pythonic? I would've preferred an "encoding" property which flushes the output when it's changed. -- nosy: +mrabarnett ___ Python tracker <

[issue15077] Regexp match goes into infinite loop

2012-06-28 Thread Matthew Barnett
Matthew Barnett added the comment: It's not a bug, it's a pathological regex (i.e. it causes catastrophic backtracking). It also works correctly in the "regex" module. -- ___ Python tracker <http://bug

[issue14991] Option for regex groupdict() to show only matching names

2012-06-17 Thread Matthew Barnett
Matthew Barnett added the comment: @rhettinger: The problem with "nodefault" is that it's negative, so that "nodefault=False" means that you don't not want the default, if you see what I mean. I think that "suppress" would be better: mo.groupdict(
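
Independently of whichever keyword name is chosen, the requested effect can be approximated today by filtering the result; the pattern and group names below are assumptions made for the sketch.

```python
import re

m = re.match(r"(?P<num>\d+)|(?P<word>[a-z]+)", "abc")

# groupdict() reports every named group, using None for non-participants.
everything = m.groupdict()

# Keep only the groups that actually took part in the match.
matched_only = {k: v for k, v in m.groupdict().items() if v is not None}

print(everything, matched_only)
```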

[issue14462] In re's named group the name cannot contain unicode characters

2012-04-29 Thread Matthew Barnett
Matthew Barnett added the comment: It doesn't work in regex, but it probably should. IMHO, if it's a valid identifier, then it should be allowed. -- ___ Python tracker <http://bugs.python.o

[issue14510] Regular Expression "+" perform wrong repeat

2012-04-05 Thread Matthew Barnett
Matthew Barnett added the comment: If a capture group is repeated, as in r'(\$.)+', only its last match is returned. -- ___ Python tracker <http://bugs.python.o
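
A brief sketch with an invented input string: the repeated group keeps only its final iteration, so collecting every occurrence needs something like `findall` on the inner pattern.

```python
import re

m = re.match(r"(\$.)+", "$a$b$c")

# The group was matched three times; only the last one is kept.
last = m.group(1)

# To get all occurrences, search for the inner pattern on its own.
every = re.findall(r"\$.", "$a$b$c")

print(last, every)
```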

[issue14343] In re's examples the example with re.split() shadows builtin input()

2012-03-16 Thread Matthew Barnett
Changes by Matthew Barnett : -- title: In re's examples the example with re.split() overlaps builtin input() -> In re's examples the example with re.split() shadows builtin input() ___ Python tracker <http://bugs.pytho

[issue14342] In re's examples the example with recursion doesn't work

2012-03-16 Thread Matthew Barnett
Matthew Barnett added the comment: As far as I can tell, back in 2003, changes were made to replace the recursive scheme which used stack allocation with a non-recursive scheme which used heap allocation in order to improve the behaviour. To me it looks like an oversight and that the

[issue1519638] Unmatched Group issue - workaround

2012-03-15 Thread Matthew Barnett
Matthew Barnett added the comment: The replacement can be a callable, so you could do this: re.sub(r'(?:\((?:(\d+)|.*?)\)\s*)+$', lambda m: m.group(1) or '', 'avatar (special edition)') -- ___ Python tracker <ht
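
Spelled out as a runnable sketch, the workaround looks like this; the second input string is an added assumption to show the case where the group does participate.

```python
import re

pattern = r"(?:\((?:(\d+)|.*?)\)\s*)+$"


def keep_number(m):
    # group(1) is None when the parenthesised text was not a number.
    return m.group(1) or ""


no_number = re.sub(pattern, keep_number, "avatar (special edition)")
with_number = re.sub(pattern, keep_number, "avatar (2009)")

print(repr(no_number), repr(with_number))
```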

[issue14260] re.groupindex available for modification and continues to work, having incorrect data inside it

2012-03-12 Thread Matthew Barnett
Matthew Barnett added the comment: It appears I was wrong. :-( The simplest solution in that case is for it to return a _copy_ of the dict. -- ___ Python tracker <http://bugs.python.org/issue14

[issue14260] re.groupindex available for modification and continues to work, having incorrect data inside it

2012-03-12 Thread Matthew Barnett
Matthew Barnett added the comment: The re module creates the dict purely for the benefit of the user, and as it's a normal dict, it's mutable. An alternative would be to use an immutable dict or dict-like object, but Python doesn't have such a class, and it's probably not wo

[issue14237] Special sequences \A and \Z don't work in character set []

2012-03-09 Thread Matthew Barnett
Matthew Barnett added the comment: \s matches a character, whereas \A and \Z don't. Within a character set \s makes sense, but \A and \Z don't, so they should be treated as literals. -- ___ Python tracker <http://bugs.python.o

[issue14237] Special sequences \A and \Z don't work in character set []

2012-03-09 Thread Matthew Barnett
Matthew Barnett added the comment: Within a character set \A and \Z should behave like, say, \C; in other words, they should be the literals "A" and "Z". -- ___ Python tracker <http://bug

[issue14212] Segfault when using re.finditer over mmap

2012-03-07 Thread Matthew Barnett
Matthew Barnett added the comment: In the function "getstring" in _sre.c, the code obtains a pointer to the characters of the buffer and then releases the buffer. There's a comment before the release: /* Release the buffer immediately --- possibly dangerous but

[issue14212] Segfault when using re.finditer over mmap

2012-03-06 Thread Matthew Barnett
Matthew Barnett added the comment: It segfaults because it attempts to access the buffer of an mmap that has been closed. It would certainly be more friendly if it checked whether the mmap was still open and, if not, raised an exception instead. -- nosy: +mrabarnett

[issue13169] Regular expressions with 0 to 65536 repetitions raises OverflowError

2012-02-29 Thread Matthew Barnett
Matthew Barnett added the comment: Ideally, it should raise an exception (or a warning) because the behaviour is unexpected. -- ___ Python tracker <http://bugs.python.org/issue13

[issue13998] Lookbehind assertions go behind the start position for the match

2012-02-13 Thread Matthew Barnett
Matthew Barnett added the comment: The documentation says of the 'pos' parameter "This is not completely equivalent to slicing the string" and of the 'endpos' parameter "it will be as if the string is endpos characters long". In other words, it st
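
A sketch of what "not completely equivalent to slicing" means in practice, using short invented strings: a lookbehind can still see the text before `pos`, while `endpos` does make the string behave as if it ended there.

```python
import re

lookbehind = re.compile(r"(?<=a)b")

# Starting at index 1, the lookbehind still sees the 'a' at index 0.
with_pos = lookbehind.match("ab", 1)

# After slicing, there is no 'a' left to look back at.
sliced = lookbehind.match("ab"[1:])

# endpos=2: '$' matches as if the string were only 2 characters long.
truncated = re.compile(r"ab$").search("abab", 0, 2)

print(with_pos, sliced, truncated)
```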

[issue13899] re pattern r"[\A]" should work like "A" but matches nothing. Ditto B and Z.

2012-02-04 Thread Matthew Barnett
Matthew Barnett added the comment: In re, "\A" within a character set should be similar to "\C", but instead it's still interpreted as meaning the start of the string. That's definitely a bug. If it doesn't do what it's supposed to do, then it's a

[issue13899] re pattern r"[\A]" should work like "A" but matches nothing. Ditto B and Z.

2012-02-03 Thread Matthew Barnett
Matthew Barnett added the comment: This should answer that question: >>> re.findall(r"[\A\C]", r"\AC") ['C'] >>> regex.findall(r"[\A\C]", r"\AC") ['A', 'C

[issue13652] Creating lambda functions in a loop has unexpected results when resolving variables used as arguments

2011-12-22 Thread Matthew Barnett
Matthew Barnett added the comment: That's not a bug. This might help to explain what's going on: What do (lambda) function closures capture in Python? http://stackoverflow.com/questions/2295290/what-do-lambda-function-closures-capture-in-python -- nosy: +
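
The effect described in the linked question can be reproduced in a few lines. This is a generic sketch rather than the reporter's code: a lambda looks up the loop variable when it is called, and a default argument is one common way to capture the value at definition time.

```python
# Each lambda refers to the same variable i, which ends up as 2.
late = [lambda: i for i in range(3)]
late_results = [f() for f in late]

# A default argument is evaluated when the lambda is created.
bound = [lambda i=i: i for i in range(3)]
bound_results = [f() for f in bound]

print(late_results, bound_results)
```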

[issue13592] repr(regex) doesn't include actual regex

2011-12-22 Thread Matthew Barnett
Matthew Barnett added the comment: I'm just adding this to the regex module and I've come up against a possible issue. The regex module supports named lists, which could be very big. Should the entire contents of those lists also be shown in the repr? They would have to be if the

[issue13592] repr(regex) doesn't include actual regex

2011-12-13 Thread Matthew Barnett
Matthew Barnett added the comment: Actually, one possibility that occurs to me is to provide the flags within the pattern. The .pattern attribute gives the original pattern, but repr could give the flags in-line at the start of the pattern: >>> # Assuming Python 3. >>>

[issue13592] repr(regex) doesn't include actual regex

2011-12-13 Thread Matthew Barnett
Matthew Barnett added the comment: In reply to Ezio, the repr of a large string, list, tuple or dict is also long. The repr of a compiled regex should probably also show the flags, but should it just be the numeric value? -- ___ Python tracker

[issue13169] Regular expressions with 0 to 65536 repetitions raises OverflowError

2011-10-14 Thread Matthew Barnett
Matthew Barnett added the comment: The limit is an implementation detail. The pattern is compiled into codes which are then interpreted, and it just happens that the codes are (usually) 16 bits, giving a range of 0..65535, but it uses 65535 to represent no limit and doesn't warn i

[issue13169] Regular expressions with 0 to 65536 repetitions raises OverflowError

2011-10-13 Thread Matthew Barnett
Matthew Barnett added the comment: The quantifiers use 65535 to represent no upper limit, so ".{0,65535}" is equivalent to ".*". For example: >>> re.match(".*", "x" * 10).span() (0, 10) >>> re.match(".{0,65535}", &

[issue2636] Adding a new regex module (compatible with re)

2011-09-02 Thread Matthew Barnett
Matthew Barnett added the comment: So, VERSION0 and VERSION1, with "(?V0)" and "(?V1)" in the pattern? -- ___ Python tracker <http://bu

[issue2636] Adding a new regex module (compatible with re)

2011-09-02 Thread Matthew Barnett
Matthew Barnett added the comment: The least disruptive change would be to have a NEW flag for the new behaviour, as at present, and an OLD flag for the old behaviour. Currently the default is old behaviour, but in the future it will be new behaviour. The differences would be: Old

[issue2636] Adding a new regex module (compatible with re)

2011-09-01 Thread Matthew Barnett
Matthew Barnett added the comment: I think I need a show of hands. Should the default be old behaviour (like re) or new behaviour? (It might be old now, new later.) Should there be a NEW flag (as at present), or an OLD flag, or a VERSION parameter (0=old, 1=new, 2

[issue2636] Adding a new regex module (compatible with re)

2011-09-01 Thread Matthew Barnett
Matthew Barnett added the comment: The regex module supports nested sets and set operations, eg. r"[[a-z]--[aeiou]]" (the letters from 'a' to 'z', except the vowels). This means that literal '[' in a set needs to be escaped. For example, re module s

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-28 Thread Matthew Barnett
Matthew Barnett added the comment: The regex module currently uses simple case-folding, although I'm working towards full case-folding, as listed in http://www.unicode.org/Public/UNIDATA/CaseFolding.txt. -- ___ Python tracker

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Matthew Barnett
Matthew Barnett added the comment: There are some oddities in Unicode case-folding. Under full case-folding, both "\N{LATIN CAPITAL LETTER SHARP S}" and "\N{LATIN SMALL LETTER SHARP S}" fold to "ss", which means that those codepoints match each other. Howe
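
The first observation can be checked with `str.casefold()`, which applies full case folding in Python 3.3 and later; the sketch covers only the part of the comment that is visible here.

```python
capital = "\N{LATIN CAPITAL LETTER SHARP S}"
small = "\N{LATIN SMALL LETTER SHARP S}"

# Under full case folding both sharp-s codepoints become "ss".
folded = (capital.casefold(), small.casefold())
match_each_other = capital.casefold() == small.casefold()

print(folded, match_each_other)
```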

[issue12789] re.Scanner don't support more then 2 groups on regex

2011-08-20 Thread Matthew Barnett
Matthew Barnett added the comment: Even if this bug is fixed, it still won't work as you expect, and this is why. The Scanner function accepts a list of 2-tuples. The first item of the tuple is a regex and the second is a function. For example: re.Scanner([(r"\d+", number)
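
A minimal sketch of the calling convention described above. `re.Scanner` is undocumented, and the `number` callback and the whitespace rule here are assumptions made for the example.

```python
import re


def number(scanner, token):
    # Each action receives the scanner and the matched text.
    return ("NUMBER", int(token))


scanner = re.Scanner([
    (r"\d+", number),   # (regex, function) pair
    (r"\s+", None),     # an action of None discards the match
])

tokens, remainder = scanner.scan("12 345")

print(tokens, repr(remainder))
```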

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-19 Thread Matthew Barnett
Matthew Barnett added the comment: For the "Line_Break" property, one of the possible values is "Inseparable", with 2 permitted aliases, the shorter "IN" (which is reasonable) and "Inseperable" (ouch!). -- _

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-15 Thread Matthew Barnett
Matthew Barnett added the comment: For what it's worth, I've had an idea about string storage, roughly based on how *nix stores data on disk. If a string is small, point to a block of codepoints. If a string is medium-sized, point to a block of pointers to codepoint blocks. If a

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Matthew Barnett
Matthew Barnett added the comment: Have a look here: http://98.245.80.27/tcpc/OSCON2011/gbu/index.html -- ___ Python tracker <http://bugs.python.org/issue12

[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)

2011-08-14 Thread Matthew Barnett
Matthew Barnett added the comment: On a narrow build, "\N{MATHEMATICAL SCRIPT CAPITAL A}" is stored as 2 code units, and neither re nor regex recombine them when compiling a regex or looking for a match. regex supports \xNN, \u and \U and \N{XYZ} itself, so they can be
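
Narrow builds no longer exist from Python 3.3 onward, but the two-code-unit representation can still be seen by encoding to UTF-16. This is only a sketch of the storage issue, not of how `re` or `regex` behaved at the time.

```python
ch = "\N{MATHEMATICAL SCRIPT CAPITAL A}"   # U+1D49C, outside the BMP

# UTF-16 needs a surrogate pair: two 16-bit code units.
code_units = len(ch.encode("utf-16-le")) // 2

# On Python 3.3+ the string itself is a single codepoint.
code_points = len(ch)

print(code_units, code_points)
```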

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Matthew Barnett
Matthew Barnett added the comment: You're right about starting the second search from where the first finished. Caching the position would be an advantage there. The memory cost of extra pointers wouldn't be so bad if UTF-8 took less space than the current format. Regex isn'

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Matthew Barnett
Matthew Barnett added the comment: There are occasions when you want to do string slicing, often of the form: pos = my_str.index(x) endpos = my_str.index(y) substring = my_str[pos : endpos] To me that suggests that if UTF-8 is used then it may be worth profiling to see whether caching the

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-12 Thread Matthew Barnett
Matthew Barnett added the comment: In a narrow build, a codepoint in the astral plane is encoded as a surrogate pair. I could implement a workaround for it in the regex module, but I think that the proper place to fix it is in the language as a whole, perhaps by implementing PEP 393 ("Fle

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-12 Thread Matthew Barnett
Changes by Matthew Barnett : -- nosy: +mrabarnett ___ Python tracker <http://bugs.python.org/issue12736> ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue12735] request full Unicode collation support in std python library

2011-08-12 Thread Matthew Barnett
Changes by Matthew Barnett : -- nosy: +mrabarnett ___ Python tracker <http://bugs.python.org/issue12735> ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue12734] Request for property support in Python re lib

2011-08-12 Thread Matthew Barnett
Changes by Matthew Barnett : -- nosy: +mrabarnett ___ Python tracker <http://bugs.python.org/issue12734> ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue12733] Request for grapheme support in Python re lib

2011-08-12 Thread Matthew Barnett
Changes by Matthew Barnett : -- nosy: +mrabarnett ___ Python tracker <http://bugs.python.org/issue12733> ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue12732] Can't portably use Unicode in Python identifiers

2011-08-12 Thread Matthew Barnett
Changes by Matthew Barnett : -- nosy: +mrabarnett ___ Python tracker <http://bugs.python.org/issue12732> ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-12 Thread Matthew Barnett
Changes by Matthew Barnett : -- nosy: +mrabarnett ___ Python tracker <http://bugs.python.org/issue12731> ___ ___ Python-bugs-list mailing list Unsubscribe:
