Matthew Barnett added the comment:
You could've obtained it from msg76556 or msg190100:
>>> print(ascii('हिन्दी'))
'\u0939\u093f\u0928\u094d\u0926\u0940'
>>> import re, regex
>>> print(ascii(re.match(r"\w+",
>>>
Matthew Barnett added the comment:
I'm not sure what you're saying.
The re module in Python 3.3 matches only the first codepoint, treating the
second codepoint as not part of a word, whereas the regex module matches all 6
codepoints, treating them all as part of a s
Matthew Barnett added the comment:
Like the OP, I would've expected it to handle negative indexes the way that
strings do.
In practice, I wouldn't normally provide negative indexes; I'd use some string
or regex method to determine the search limits, and then pass them to findit
Matthew Barnett added the comment:
I had to check what re does in Python 3.3:
>>> print(len(re.match(r'\w+', 'हिन्दी').group()))
1
Regex does this:
>>> print(len(regex.match(r'\w+', 'हिन्दी').group()))
6
--
___
Matthew Barnett added the comment:
Issue #2636 resulted in the regex module, which supports variable-length
look-behinds.
I don't know how much work it would take even to put a limited fixed-length
look-behind fix for this into the re module, so I'm afraid the issue must
r
Matthew Barnett added the comment:
I've attached a patch.
--
keywords: +patch
Added file: http://bugs.python.org/file30377/issue7940.patch
___
Python tracker
<http://bugs.python.org/i
Matthew Barnett added the comment:
Yes. As msg99456 suggests, I fixed it in my source code before posting.
Compare re in Python 3.3.2:
>>> re.compile('x').findall('', 1, 3)
['x', 'x']
>>> re.compile('x').findall('x
Matthew Barnett added the comment:
Here are some simpler examples of the bug:
re.compile('.*yz', re.S).findall('xyz')
re.compile('.?yz', re.S).findall('xyz')
re.compile('.+yz', re.S).findall('xyz')
Unfortunately I find it difficult to
Matthew Barnett added the comment:
It's not a bug.
The documentation says """Split string by the occurrences of pattern. If
capturing parentheses are used in pattern, then the text of all groups in the
pattern are also returned as part of the resulting list."
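The documented behaviour quoted above can be seen with a small sketch. The sample string is a made-up illustration, not taken from the report:

```python
import re

text = "a,b,c"  # hypothetical sample input

# Without a capturing group, only the text between separators is returned.
plain = re.split(r",", text)

# With a capturing group, the matched separators are returned as well.
with_group = re.split(r"(,)", text)

print(plain)       # ['a', 'b', 'c']
print(with_group)  # ['a', ',', 'b', ',', 'c']
```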
Matthew Barnett added the comment:
I already use it in the regex module for named groups. I don't think it would
ever be a problem in practice because the names are invariably handled as
strings.
--
nosy: +mrabarnett
Matthew Barnett added the comment:
The regex behaves the same as re.
The reason it isn't supported is that \0 starts an octal escape sequence.
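A short sketch of the point, assuming the comment is about using \0 in a replacement template to refer to the whole match (the sample text is an illustration only):

```python
import re

# In a replacement template, "\0" is read as an octal escape (NUL),
# not as a reference to the whole match.
as_octal = re.sub(r"\d+", r"<\0>", "a12b")

# "\g<0>" is the documented way to refer to the whole match.
whole_match = re.sub(r"\d+", r"<\g<0>>", "a12b")

print(repr(as_octal))     # 'a<\x00>b'
print(repr(whole_match))  # 'a<12>b'
```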
Matthew Barnett added the comment:
In issue #3511 the range was slightly unusual, so closing it seemed a
reasonable approach, but the range in this issue is less clearly a problem. My
preference would be to fix it, if possible.
Matthew Barnett added the comment:
The way that re handles ranges is to convert the two endpoints to lowercase and
then check whether the lowercase form of the character in the text is in that
range.
For example, [A-Z] is converted to the range [\x41-\x5A], and the lowercase
form of
Matthew Barnett added the comment:
I've attached fnmatch_implementation.py, which is a simple pure-Python
implementation of the fnmatch function.
It's not as susceptible to catastrophic backtracking as the current re-based
one. For example:
fnmatch('a' * 50, '*a*&
Matthew Barnett added the comment:
This question should've been posted to python-l...@python.org, not here.
Your functions are calling themselves, but not returning the result of the call
to their own callers.
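The original poster's code isn't shown here, so the following is only a sketch of the general mistake, using hypothetical function names:

```python
def find_broken(items, target, index=0):
    """Recursive search that forgets to return the recursive result."""
    if index == len(items):
        return -1
    if items[index] == target:
        return index
    find_broken(items, target, index + 1)  # result is computed, then discarded
    # Falling off the end here means the caller receives None.


def find(items, target, index=0):
    """Same search, but each call passes the result back to its caller."""
    if index == len(items):
        return -1
    if items[index] == target:
        return index
    return find(items, target, index + 1)


letters = ["a", "b", "c"]
print(find_broken(letters, "c"))  # None
print(find(letters, "c"))         # 2
```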
Matthew Barnett added the comment:
FYI, I did eventually add it to my regex implementation. It was quite
challenging!
Matthew Barnett added the comment:
It does look like a duplicate to me.
--
___
Python tracker
<http://bugs.python.org/issue17184>
Matthew Barnett added the comment:
These are the ones that I think are wrong:
Doc/c-api/long.rst:206
Return a C :c:type:`size_t` representation of of *pylong*. *pylong* must be
Doc/c-api/long.rst:218
Return a C :c:type:`unsigned PY_LONG_LONG` representation of of *pylong*.
Doc
Matthew Barnett added the comment:
3 of the tests expect None when using 'fullmatch'; they won't return None when
using 'match'.
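The tests in question aren't shown here, but the difference they depend on can be sketched with an illustrative pattern (re.fullmatch is available from Python 3.4):

```python
import re

pattern = re.compile(r"\d+")

prefix_match = pattern.match("123abc")    # matches the leading "123"
full_fail = pattern.fullmatch("123abc")   # the whole string must match
full_ok = pattern.fullmatch("123")

print(prefix_match.group())  # 123
print(full_fail)             # None
print(full_ok.group())       # 123
```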
Matthew Barnett added the comment:
I've attached a patch.
--
Added file: http://bugs.python.org/file28955/issue16203_mrab.patch
Matthew Barnett added the comment:
IMHO, I don't think that MAXREPEAT should be defined in sre_constants.py _and_
SRE_MAXREPEAT defined in sre_constants.h. (In the latter case, why is it in
decimal?)
I think that it should be defined in one place, namely sre_constants.h, perhaps
as:
#d
Matthew Barnett added the comment:
You're checking "int offset", but what happens with "unsigned int offset"?
Matthew Barnett added the comment:
Lines 1000 and 1084 will be a problem only if you're near the top of the
address space. This is because:
1. ctx->pattern[1] will always be <= ctx->pattern[2].
2. A value of 65535 in ctx->pattern[2] means unlimited, even though SRE_CODE i
Matthew Barnett added the comment:
I've attached my attempt at a patch.
--
keywords: +patch
Added file: http://bugs.python.org/file28744/issue9669.patch
Matthew Barnett added the comment:
I've attached a patch.
--
keywords: +patch
Added file: http://bugs.python.org/file28614/issue13899.patch
Matthew Barnett added the comment:
The semantics of '^' are common to many different regex implementations,
including those of Perl and C#.
The 'pos' argument merely gives the starting position of the search (C# also lets
you provide a starting position, and behaves in
Matthew Barnett added the comment:
I've attached a small additional patch for truncating the UTF-8.
I don't know whether it's strictly necessary, but I don't know that it's
unnecessary either! (Better safe than sorry.)
--
Added file: http://bugs.python.org/fil
Matthew Barnett added the comment:
I've attached a patch.
It now reports an invalid literal as-is:
>>> int("#\N{ARABIC-INDIC DIGIT ONE}")
Traceback (most recent call last):
File "", line 1, in
int("#\N{ARABIC-INDIC DIGIT ONE}")
ValueError:
Matthew Barnett added the comment:
It occurred to me that the truncation of the string when building the error
message could cause a UnicodeDecodeError:
>>> int("1".ljust(199) + "\u0100")
Traceback (most recent call last):
File "", line
Matthew Barnett added the comment:
Python takes a long way round when converting strings to int. It does the
following (I'll be talking about Python 3.3 here):
1. In function 'fix_decimal_and_space_to_ascii', the different kinds of spaces
are converted to " " and the
Matthew Barnett added the comment:
The patch "issue1075356.patch" is my attempt to fix this bug.
'PyArg_ParseTuple', etc, eventually call 'convertsimple'. What this patch does
is to insert some code at the start of 'convertsimple' that checks whether the
Changes by Matthew Barnett :
Removed file: http://bugs.python.org/file28330/issue16688#3.patch
___
Python tracker
<http://bugs.python.org/issue16688>
Matthew Barnett added the comment:
Oops! :-( Now corrected.
--
Added file: http://bugs.python.org/file28332/issue16688#3.patch
Matthew Barnett added the comment:
Here are some tests for the issue.
--
Added file: http://bugs.python.org/file28330/issue16688#3.patch
Matthew Barnett added the comment:
I haven't found any other issues, so here's the second patch.
--
Added file: http://bugs.python.org/file28325/issue16688#2.patch
Matthew Barnett added the comment:
I found another bug while looking through the source.
On line 495 in function SRE_COUNT:
if (maxcount < end - ptr && maxcount != 65535)
end = ptr + maxcount*state->charsize;
where 'end' and 'ptr' are of type &
Matthew Barnett added the comment:
OK, here's a patch.
--
keywords: +patch
Added file: http://bugs.python.org/file28321/issue16688.patch
Matthew Barnett added the comment:
In function SRE_MATCH, the code for SRE_OP_GROUPREF (line 1290) contains this:
while (p < e) {
if (ctx->ptr >= end ||
SRE_CHARGET(state, ctx->ptr, 0) != SRE_CHARGET(state, p, 0))
RETURN_FAILURE;
p += sta
Matthew Barnett added the comment:
The same problem occurs with both `False` and `True`.
--
nosy: +mrabarnett
Matthew Barnett added the comment:
The question is whether re should always treat 'b{1, 3}a' as a literal, even
with the VERBOSE flag.
I've checked with Perl 5.14.2, and it agrees with re: adding a space _always_
makes it a literal, even with the 'x' flag (/b{1, 3}a/x
Matthew Barnett added the comment:
Interesting.
In my regex module (http://pypi.python.org/pypi/regex) I have:
bool(regex.match(pat, "bb", regex.VERBOSE)) # True
bool(regex.match(pat, "b{1,3}", regex.VERBOSE)) # False
because I thought that when the VERBOSE flag is turned
Matthew Barnett added the comment:
OK, in order to avoid bikeshedding, "fullmatch" it is.
--
___
Python tracker
<http://bugs.python.org/issue16203>
Matthew Barnett added the comment:
re2's FullMatch method contrasts with its PartialMatch method, which re doesn't
have!
Matthew Barnett added the comment:
I'm about to add this to my regex implementation and, naturally, I want it to
have the same name for compatibility.
However, I'm not that keen on "fullmatch" and would prefer "matchall"
Matthew Barnett added the comment:
It certainly appears to ignore the whitespace, even if the "(?x)" is at the end
of the pattern or in the middle of a group.
Another point we need to consider is that the user might want to use a
pre-compil
Matthew Barnett added the comment:
Tim, my point is that if the MULTILINE flag happens to be turned on, '$' won't
just match at the end of the string (or slice), it'll also match at a newline,
so wrapping the pattern in (?:...)$ in that case could give the wrong answer,
Matthew Barnett added the comment:
'$' will match at the end of the string or just before the final '\n':
>>> re.match(r'abc$', 'abc\n')
<_sre.SRE_Match object at 0x00F15448>
So shouldn't you be using r'\Z' instea
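The two comments above about '$' can be checked with a short sketch. The sample strings are illustrative assumptions, and the (?:...) wrapping mirrors the approach being discussed:

```python
import re

# '$' also matches just before a trailing newline.
dollar = re.match(r"abc$", "abc\n")

# '\Z' matches only at the very end of the string.
end_only = re.match(r"abc\Z", "abc\n")

# With MULTILINE, '$' additionally matches before any newline,
# so wrapping a pattern in (?:...)$ can accept a partial match.
multiline_dollar = re.match(r"(?:abc)$", "abc\ndef", re.MULTILINE)
multiline_end = re.match(r"(?:abc)\Z", "abc\ndef", re.MULTILINE)

print(dollar is not None)            # True
print(end_only is not None)          # False
print(multiline_dollar is not None)  # True
print(multiline_end is not None)     # False
```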
Matthew Barnett added the comment:
There needed to be a way of referring to named groups in the replacement
template. The existing form \groupnumber clearly wouldn't work. Other regex
implementations, such as Perl, do have \g and also \k (for named groups).
In my implementation I
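For reference, a small sketch of the \g<...> form in a replacement template, as re supports it. The pattern and input strings are made up for illustration:

```python
import re

# Named groups are referenced as \g<name> in the replacement template.
swapped = re.sub(
    r"(?P<first>\w+) (?P<second>\w+)",
    r"\g<second> \g<first>",
    "hello world",
)

# \g<1>0 is group 1 followed by a literal "0"; \10 would mean group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a1b")

print(swapped)  # world hello
print(padded)   # a10b
```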
Matthew Barnett added the comment:
Is it necessary to actually copy it? Isn't the pattern object immutable?
--
nosy: +mrabarnett
Matthew Barnett added the comment:
Ideally, yes, that whitespace should be ignored.
The question is whether it's worth fixing the code for the small case of when
there's whitespace within "tokens", such as within "(?:". Usually those who use
verbose mode use whit
Matthew Barnett added the comment:
There are actually 2 issues here:
1. The third argument is 'maxsplit', the fourth is 'flags'.
2. It never splits on a zero-width match. See issue 3262.
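The first point can be sketched as follows. The sample text is an assumption, and the warning filter is there only because newer Python versions may warn about (or eventually reject) passing maxsplit positionally:

```python
import re
import warnings

text = "aXbXc"  # hypothetical sample input

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # The third positional argument is maxsplit, so the flag is taken as a
    # split limit and the pattern stays case-sensitive.
    wrong = re.split("x", text, re.IGNORECASE)

# Passing the flag by keyword gives the intended case-insensitive split.
right = re.split("x", text, flags=re.IGNORECASE)

print(wrong)  # ['aXbXc']
print(right)  # ['a', 'b', 'c']
```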
Matthew Barnett added the comment:
It's probably inappropriate for me to mention that the alternative 'regex'
module on PyPI completes promptly, so I won't. :-)
Matthew Barnett added the comment:
That's because it uses a pathological regular expression (catastrophic
backtracking).
The problem lies here: (\\?[\w\.\-]+)+
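A sketch of why that sub-pattern is ambiguous, together with one possible rewrite. The rewrite and the trailing '$' are my assumptions for illustration, not something proposed in the thread, and the samples are kept short so the nested form stays fast:

```python
import re

# The shape from the comment above: a '+' group whose body also ends in '+',
# so a run of word characters can be divided between repeats in many ways.
nested = re.compile(r"(\\?[\w\.\-]+)+$")

# Possible rewrite: every repeat after the first must start with a
# backslash, so there is only one way to divide the text.
flat = re.compile(r"\\?[\w.\-]+(?:\\[\w.\-]+)*$")

samples = ["abc", r"\abc", r"a.b\c-d", "abc!", r"a\\b"]
for s in samples:
    # Both patterns should accept and reject the same short inputs.
    assert bool(nested.match(s)) == bool(flat.match(s))
    print(repr(s), bool(flat.match(s)))
```

With a long run of word characters followed by a character that cannot match, the nested form has to try a rapidly growing number of ways to divide the run before failing.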
Matthew Barnett added the comment:
Python 2.7 is the end of the Python 2 line, and it's closed except for security
fixes.
Matthew Barnett added the comment:
A codepoint such as "é" ("\N{LATIN SMALL LETTER E WITH ACUTE}") can be
decomposed to "\u0065\u0301" ("\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE
ACCENT}"), but "\u201c" ("\N{LEFT DOUBLE QUOTATION
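The rest of this comment is cut off, so the contrast it draws is assumed here: the accented letter has a canonical decomposition, while the quotation mark has none. A sketch using the standard unicodedata module:

```python
import unicodedata

e_acute = "\N{LATIN SMALL LETTER E WITH ACUTE}"
decomposed = unicodedata.normalize("NFD", e_acute)
code_points = [hex(ord(c)) for c in decomposed]

quote = "\N{LEFT DOUBLE QUOTATION MARK}"
# Neither canonical nor compatibility decomposition changes the quote.
quote_unchanged = unicodedata.normalize("NFKD", quote) == quote

print(code_points)      # ['0x65', '0x301']
print(quote_unchanged)  # True
```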
Matthew Barnett added the comment:
Would a "set_encoding" method be Pythonic? I would've preferred an "encoding"
property which flushes the output when it's changed.
--
nosy: +mrabarnett
Matthew Barnett added the comment:
It's not a bug, it's a pathological regex (i.e. it causes catastrophic
backtracking).
It also works correctly in the "regex" module.
Matthew Barnett added the comment:
@rhettinger: The problem with "nodefault" is that it's negative, so that
"nodefault=False" means that you don't not want the default, if you see what I
mean. I think that "suppress" would be better:
mo.groupdict(
Matthew Barnett added the comment:
It doesn't work in regex, but it probably should. IMHO, if it's a valid
identifier, then it should be allowed.
Matthew Barnett added the comment:
If a capture group is repeated, as in r'(\$.)+', only its last match is
returned.
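A short sketch of this, using the pattern from the comment and a made-up input string:

```python
import re

m = re.match(r"(\$.)+", "$a$b$c")
whole = m.group(0)
last_only = m.group(1)

# To collect every occurrence, match the unrepeated pattern with findall.
every = re.findall(r"\$.", "$a$b$c")

print(whole)      # $a$b$c
print(last_only)  # $c
print(every)      # ['$a', '$b', '$c']
```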
Changes by Matthew Barnett :
--
title: In re's examples the example with re.split() overlaps builtin input() ->
In re's examples the example with re.split() shadows builtin input()
Matthew Barnett added the comment:
As far as I can tell, back in 2003, changes were made to replace the recursive
scheme which used stack allocation with a non-recursive scheme which used heap
allocation in order to improve the behaviour.
To me it looks like an oversight and that the
Matthew Barnett added the comment:
The replacement can be a callable, so you could do this:
re.sub(r'(?:\((?:(\d+)|.*?)\)\s*)+$', lambda m: m.group(1) or '',
       'avatar (special edition)')
Matthew Barnett added the comment:
It appears I was wrong. :-(
The simplest solution in that case is for it to return a _copy_ of the dict.
Matthew Barnett added the comment:
The re module creates the dict purely for the benefit of the user, and as it's
a normal dict, it's mutable.
An alternative would be to use an immutable dict or dict-like object, but Python
doesn't have such a class, and it's probably not wo
Matthew Barnett added the comment:
\s matches a character, whereas \A and \Z don't. Within a character set \s
makes sense, but \A and \Z don't, so they should be treated as literals.
Matthew Barnett added the comment:
Within a character set \A and \Z should behave like, say, \C; in other words,
they should be the literals "A" and "Z".
Matthew Barnett added the comment:
In the function "getstring" in _sre.c, the code obtains a pointer to the
characters of the buffer and then releases the buffer.
There's a comment before the release:
/* Release the buffer immediately --- possibly dangerous
but
Matthew Barnett added the comment:
It segfaults because it attempts to access the buffer of an mmap that has been
closed. It would certainly be more friendly if it checked whether the mmap
was still open and, if not, raised an exception instead.
--
nosy: +mrabarnett
Matthew Barnett added the comment:
Ideally, it should raise an exception (or a warning) because the behaviour is
unexpected.
Matthew Barnett added the comment:
The documentation says of the 'pos' parameter "This is not completely
equivalent to slicing the string" and of the 'endpos' parameter "it will be as
if the string is endpos characters long".
In other words, it st
Matthew Barnett added the comment:
In re, "\A" within a character set should be similar to "\C", but instead it's
still interpreted as meaning the start of the string. That's definitely a bug.
If it doesn't do what it's supposed to do, then it's a
Matthew Barnett added the comment:
This should answer that question:
>>> re.findall(r"[\A\C]", r"\AC")
['C']
>>> regex.findall(r"[\A\C]", r"\AC")
['A', 'C
Matthew Barnett added the comment:
That's not a bug.
This might help to explain what's going on:
What do (lambda) function closures capture in Python?
http://stackoverflow.com/questions/2295290/what-do-lambda-function-closures-capture-in-python
--
nosy: +
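The reported code isn't shown here, so this is only a sketch of the closure behaviour the linked question describes, using made-up lambdas:

```python
# Each lambda looks up 'i' when it is called, not when it is created,
# so all of them see the final value of the loop variable.
late = [lambda: i for i in range(3)]
late_results = [f() for f in late]

# Binding the current value as a default argument captures it per lambda.
bound = [lambda i=i: i for i in range(3)]
bound_results = [f() for f in bound]

print(late_results)   # [2, 2, 2]
print(bound_results)  # [0, 1, 2]
```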
Matthew Barnett added the comment:
I'm just adding this to the regex module and I've come up against a possible
issue. The regex module supports named lists, which could be very big. Should
the entire contents of those lists also be shown in the repr? They would have to
be if the
Matthew Barnett added the comment:
Actually, one possibility that occurs to me is to provide the flags within the
pattern. The .pattern attribute gives the original pattern, but repr could give
the flags in-line at the start of the pattern:
>>> # Assuming Python 3.
>>>
Matthew Barnett added the comment:
In reply to Ezio, the repr of a large string, list, tuple or dict is also long.
The repr of a compiled regex should probably also show the flags, but should it
just be the numeric value?
Matthew Barnett added the comment:
The limit is an implementation detail. The pattern is compiled into codes which
are then interpreted, and it just happens that the codes are (usually) 16 bits,
giving a range of 0..65535, but it uses 65535 to represent no limit and doesn't
warn i
Matthew Barnett added the comment:
The quantifiers use 65535 to represent no upper limit, so ".{0,65535}" is
equivalent to ".*".
For example:
>>> re.match(".*", "x" * 10).span()
(0, 10)
>>> re.match(".{0,65535}", &
Matthew Barnett added the comment:
So, VERSION0 and VERSION1, with "(?V0)" and "(?V1)" in the pattern?
Matthew Barnett added the comment:
The least disruptive change would be to have a NEW flag for the new behaviour,
as at present, and an OLD flag for the old behaviour.
Currently the default is old behaviour, but in the future it will be new
behaviour.
The differences would be:
Old
Matthew Barnett added the comment:
I think I need a show of hands.
Should the default be old behaviour (like re) or new behaviour? (It might be
old now, new later.)
Should there be a NEW flag (as at present), or an OLD flag, or a VERSION
parameter (0=old, 1=new, 2
Matthew Barnett added the comment:
The regex module supports nested sets and set operations, eg.
r"[[a-z]--[aeiou]]" (the letters from 'a' to 'z', except the vowels). This
means that literal '[' in a set needs to be escaped.
For example, re module s
Matthew Barnett added the comment:
The regex module currently uses simple case-folding, although I'm working
towards full case-folding, as listed in
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt.
Matthew Barnett added the comment:
There are some oddities in Unicode case-folding.
Under full case-folding, both "\N{LATIN CAPITAL LETTER SHARP S}" and "\N{LATIN
SMALL LETTER SHARP S}" fold to "ss", which means that those codepoints match
each other.
Howe
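The folding described at the start of this comment can be checked with str.casefold, which applies full case-folding. This is a sketch using built-in string methods only, not the regex module's own implementation:

```python
capital = "\N{LATIN CAPITAL LETTER SHARP S}"
small = "\N{LATIN SMALL LETTER SHARP S}"

folded_capital = capital.casefold()
folded_small = small.casefold()

print(folded_capital, folded_small)    # ss ss
print(folded_capital == folded_small)  # True
```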
Matthew Barnett added the comment:
Even if this bug is fixed, it still won't work as you expect, and this is why.
The Scanner function accepts a list of 2-tuples. The first item of the tuple is
a regex and the second is a function. For example:
re.Scanner([(r"\d+", number)
Matthew Barnett added the comment:
For the "Line_Break" property, one of the possible values is "Inseparable",
with 2 permitted aliases, the shorter "IN" (which is reasonable) and
"Inseperable" (ouch!).
Matthew Barnett added the comment:
For what it's worth, I've had an idea about string storage, roughly based on how
*nix stores data on disk.
If a string is small, point to a block of codepoints.
If a string is medium-sized, point to a block of pointers to codepoint blocks.
If a
Matthew Barnett added the comment:
Have a look here: http://98.245.80.27/tcpc/OSCON2011/gbu/index.html
Matthew Barnett added the comment:
On a narrow build, "\N{MATHEMATICAL SCRIPT CAPITAL A}" is stored as 2 code
units, and neither re nor regex recombine them when compiling a regex or
looking for a match.
regex supports \xNN, \u and \U and \N{XYZ} itself, so they can be
Matthew Barnett added the comment:
You're right about starting the second search from where the first finished.
Caching the position would be an advantage there.
The memory cost of extra pointers wouldn't be so bad if UTF-8 took less space
than the current format.
Regex isn'
Matthew Barnett added the comment:
There are occasions when you want to do string slicing, often of the form:
pos = my_str.index(x)
endpos = my_str.index(y)
substring = my_str[pos : endpos]
To me that suggests that if UTF-8 is used then it may be worth profiling to see
whether caching the
Matthew Barnett added the comment:
In a narrow build, a codepoint in the astral plane is encoded as a surrogate pair.
I could implement a workaround for it in the regex module, but I think that the
proper place to fix it is in the language as a whole, perhaps by implementing
PEP 393 ("Fle
Changes by Matthew Barnett :
--
nosy: +mrabarnett
___
Python tracker
<http://bugs.python.org/issue12736>
___
___
Python-bugs-list mailing list
Unsubscribe:
Changes by Matthew Barnett :
--
nosy: +mrabarnett
___
Python tracker
<http://bugs.python.org/issue12735>
Changes by Matthew Barnett :
--
nosy: +mrabarnett
___
Python tracker
<http://bugs.python.org/issue12734>
Changes by Matthew Barnett :
--
nosy: +mrabarnett
___
Python tracker
<http://bugs.python.org/issue12733>
Changes by Matthew Barnett :
--
nosy: +mrabarnett
___
Python tracker
<http://bugs.python.org/issue12732>
Changes by Matthew Barnett :
--
nosy: +mrabarnett
___
Python tracker
<http://bugs.python.org/issue12731>