[issue40980] group names of bytes regexes are strings

2020-06-17 Thread Quentin Wenger


Quentin Wenger  added the comment:

bytes are _not_ Unicode code points, not even in the 256 range. End of the 
story.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue40980] group names of bytes regexes are strings

2020-06-17 Thread Quentin Wenger


Quentin Wenger  added the comment:

If I don't have to think about the str -> bytes direction, re should first stop 
going in the other direction.

When I have bytes regexes I actually don't care about strings and would happily 
receive group names as bytes. But no, re decides that latin-1 is the way to go, 
and this way it 1) reduces my freedom in the choice of the group names, 2) 
makes me need to go read the internals to understand that the encoding it 
arbitrarily chose is latin-1, so that I can undo it properly and get back what 
I always wanted - a bytes group name.
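
A minimal sketch of that "undo" step, assuming the names were decoded from the bytes pattern with latin-1 as described above. The helper name `bytes_groupdict` is hypothetical and not part of `re`; the example sticks to an ASCII group name, which should behave the same across Python versions.

```python
import re

def bytes_groupdict(match):
    """Return match.groupdict() with the keys re-encoded to bytes.

    Assumption: re produced the str names by latin-1-decoding the
    bytes pattern, so latin-1-encoding recovers the original bytes.
    """
    return {name.encode("latin-1"): value
            for name, value in match.groupdict().items()}

m = re.match(b"(?P<word>[a-z]+)", b"hello")
plain = m.groupdict()          # keys are str, values are bytes
restored = bytes_groupdict(m)  # keys are bytes again
```

The values are untouched; only the keys change type.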

--




[issue40980] group names of bytes regexes are strings

2020-06-17 Thread Quentin Wenger


Quentin Wenger  added the comment:

Because utf-8 is Python's default encoding, e.g. in source files, decode() and 
encode(). Literally everywhere.

If you ask around "I have a bytestring, I need a string, what do I do?", using 
latin-1 will not be the first answer (and moreover, the correct answer should 
be "it depends on the encoding", which re happily ignores by just asserting 
one).

Saying "just strip that b prefix, it's fine" cannot be taken seriously.

Yes latin-1 will never give an error on converting a bytestring, because it has 
full coverage of the 256 byte values, but saying that this is the reason why it 
should be used instead of another is forgetting why we have Unicode in the 
first place. **It is just pretending that Unicode never was a thing**. It is 
not because it can decode any bytestring that it will not return garbage _when 
the bytestring is not latin-1-encoded in the first place_.
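
To make the "never fails, but may return garbage" point concrete, here is a small sketch using only `bytes.decode` and `str.encode`. The variable names are illustrative.

```python
data = "ú".encode("utf-8")            # two bytes: b'\xc3\xba'

as_utf8 = data.decode("utf-8")        # the character that was actually encoded
as_latin1 = data.decode("latin-1")    # never raises, but yields two other characters

# latin-1 maps each byte to the code point with the same value,
# so re-encoding with latin-1 always gives the original bytes back.
round_trip = as_latin1.encode("latin-1")
```

The latin-1 result is lossless as a byte container, yet it is not the text the bytes were meant to represent.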

Take a look at the documentation: https://docs.python.org/3/howto/unicode.html
7 references to latin-1, none saying that latin-1 is the way to go because it 
is so much better than anything else.

latin-1 used to be prominent in the 2.x world, it should slowly be time to 
recognize that this is over, and we cannot ignore anymore that encoding is a 
thing.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin


Ma Lin  added the comment:

Why do you always want to use a "utf-8"-encoded identifier as the group name in 
a `bytes` pattern?

The direction is: a group name is written in a `bytes` pattern and is converted 
to `str`.
Not this direction: `str` group name -(utf8)-> `bytes` pattern -> `str` group 
name

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

I just had an "aha moment": What re claims is that, rather than doing as I 
suggested:

> ```
> # consider the following bytestring pattern
> >>> p = b"(?P<\xc3\xba>)"
> 
> # what character does the group name correspond to?
> # maybe we can try to infer it by decoding the bytestring?
> # let's try to do it with the default encoding... that's natural, right?
> >>> p.decode()
> '(?P<ú>)'
> ```

the actual way to know what group name is represented would be to look at the 
(unicode) string with the same "graphical representation":

```
# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"

# what character does the group name correspond to?
# to discover it, we instead consider the string that "looks the same":
>>> "(?P<\xc3\xba>)"
'(?P<Ãº>)'

# ok so the group name will be "Ãº"
```

This way of going from bytes to strings _naively_ (which happens to be called 
latin-1) makes IMHO as much sense as saying that 0x10, 0b10 and 0o10 should be 
the same value, just because they "look the same" in the source code.

This is like throwing away everything we ever learned about Unicode and how a 
code point is fundamentally different from what is stored in memory.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

You questioned my knowledge of encodings. Let's quote from one of the most 
famous introductory articles on the subject 
(https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/):

> It does not make sense to have a string without knowing what encoding it uses

So I have that bytestring that comes from somewhere, maybe it was originally 
utf-8 or cp1250 or ... encoded, but I won't tell or don't know, the only thing 
I swear is that it originally was a valid Python identifier.
Now I pass it as a group name in re.match (it was a valid Python identifier, so 
that has to be alright per the docs) and I get back a (unicode) string.
re.match, how dare you give me back a string when _you have no clue what my 
bytestring originally represented, or what it originally was encoded with_?
Maybe re.match will even crash, because it wrongly assumes the bytestring 
to have been latin-1 encoded!

So: latin-1 is an arbitrary choice that is no better than any other, and the 
fact that it "naturally" converts bytes to unicode code points is an 
implementation detail.
If you want to keep it so, it ought (cf. the quote above) to be made clear in 
the docs that group names come out as latin-1-decoded strings, with all the 
restrictions that follow from that choice.
But the more logical way would be to renounce this arbitrary encoding 
altogether.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

The problem can also be played in reverse, maybe it is more telling:

```
# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"

# what character does the group name correspond to?
# maybe we can try to infer it by decoding the bytestring?
# let's try to do it with the default encoding... that's natural, right?
>>> p.decode()
'(?P<ú>)'

# so we can reasonably expect the group name to be ú, right?
>>> list(re.compile(p).groupindex.keys()).pop()
'Ãº'

# Fail.
```

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

And there's no need for a cryptic encoding like cp1250 for this problem to 
arise. Here is a simple example with Python's default encoding utf-8:

```
>>> a = "ú"
>>> b = list(re.match(b"(?P<" + a.encode() + b">)", b"").groupdict())[0]
>>> a.isidentifier()
True
>>> b.isidentifier()
True
>>> b
'Ãº'
>>> a.encode() == b.encode("latin1")
True
```

For reference, here is the very source of the issue: 
https://github.com/python/cpython/blob/master/Lib/sre_parse.py#L228

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

> > this limitation to the latin-1 subset is not compatible with the 
> > documentation, which says that valid Python identifiers are valid group 
> > names.
> 
> Not all latin-1 characters are valid identifiers, for example:
> 
> >>> '\x94'.encode('latin1')
> b'\x94'
> >>> '\x94'.isidentifier()
> False

True but that's not the point. Δ is a valid Python identifier but not a valid 
group name in bytes regexes, because it is not in the latin-1 plane. The 
documentation does not mention this.


> There is a workaround, you can convert `bytes` to `str` with "latin-1" 
> decoder before processing, IIRC there will be no extra overhead 
> (memory/speed) during processing, then the name and content are the same 
> type. :)

I am not searching for a workaround for my current code.

And the simplest workaround is to latin-1-convert back to bytes, because re 
should not latin-1-convert to string in the first place.

Are you saying that the proper way to use bytes regexes is to use string 
regexes instead?


> Please look at these:
> 
> >>> orig_name = "Ř"
> >>> orig_ch = orig_name.encode("cp1250") # Because why not?
> >>> orig_ch
> b'\xd8'
> >>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
> >>> name
> 'Ø'  # '\xd8'
> >>> name == orig_name
> False
> >>> name.encode("latin-1")
> b'\xd8'
> >>> name.encode("latin-1") == orig_ch
> True
> 
> "Ř" (\u0158) --cp1250--> b'\xd8'
> "Ø" (\u00d8) --latin-1--> b'\xd8'

That's no surprise, I carefully crafted this example. :-)

Rather, that is exactly my point: several different strings (which can all be 
valid Python identifiers) can have the same single-byte representation, simply 
by means of different encodings (duh).

So why convert group names to strings when outputting them from matches, when 
you don't know where the bytes come from, or even whether they ever were 
strings? That should be left to the programmer.
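
A short sketch of that ambiguity, independent of `re`. It reuses the two characters from the example quoted above; the variable names are illustrative.

```python
cp1250_bytes = "Ř".encode("cp1250")    # U+0158 under cp1250
latin1_bytes = "Ø".encode("latin-1")   # U+00D8 under latin-1

# The same single byte decodes to different characters per codec.
back_cp1250 = cp1250_bytes.decode("cp1250")
back_latin1 = cp1250_bytes.decode("latin-1")
```

Given only the byte, nothing indicates which of the two characters was meant.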

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin

Ma Lin  added the comment:

Please look at these:

>>> orig_name = "Ř"
>>> orig_ch = orig_name.encode("cp1250") # Because why not?
>>> orig_ch
b'\xd8'
>>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
>>> name
'Ø'  # '\xd8'
>>> name == orig_name
False
>>> name.encode("latin-1")
b'\xd8'
>>> name.encode("latin-1") == orig_ch
True

"Ř" (\u0158) --cp1250--> b'\xd8'
"Ø" (\u00d8) --latin-1--> b'\xd8'

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin


Ma Lin  added the comment:

> this limitation to the latin-1 subset is not compatible with the 
> documentation, which says that valid Python identifiers are valid group names.

Not all latin-1 characters are valid identifiers, for example:

>>> '\x94'.encode('latin1')
b'\x94'
>>> '\x94'.isidentifier()
False

There is a workaround, you can convert `bytes` to `str` with "latin-1" decoder 
before processing, IIRC there will be no extra overhead (memory/speed) during 
processing, then the name and content are the same type. :)
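
A rough sketch of this workaround, with made-up sample data and group names. Passing `re.ASCII` is an assumption added here so that `\w` matches only ASCII, as it does in a bytes pattern. No claim is made about the overhead mentioned above.

```python
import re

raw = b"key=value"

# Decode losslessly: every byte becomes the code point with the same value.
text = raw.decode("latin-1")

# A str pattern now yields str names and str values alike.
m = re.match(r"(?P<name>\w+)=(?P<val>\w+)", text, re.ASCII)
groups = m.groupdict()

# Encode a captured value back to bytes when the caller needs bytes.
val_bytes = groups["val"].encode("latin-1")
```

Names and contents end up with the same type, at the price of decoding on the way in and encoding on the way out.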

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

I prove my point that the decoding to string is arbitrary:

```
>>> import re
>>> orig_name = "Ř"
>>> orig_ch = orig_name.encode("cp1250") # Because why not?
>>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
>>> name == orig_name
False
>>> name
'Ø'
>>> name.encode("latin-1") == orig_ch
True
```

For any dynamically-constructed bytes regex pattern, a string group name as 
output is unusable. Only after latin-1-reencoding can it be safely compared. 
This latin-1 choice is arbitrary.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

> It seems you don't know some knowledge of encoding yet.

I don't have to be ashamed of my knowledge of encoding. Yet you are right that 
I was missing a subtlety, which is that latin-1 is a strict subset of Unicode 
rather than a completely arbitrary encoding. Thank you for that.

So what you are saying is that group names in bytes regexes can only be 
specified directly (without -explicit- encoding), so de facto they are limited 
to the latin-1 subset.

Very well.

But then, once again:

1) why convert them to string when spitting them out? bytes they were when 
going in, bytes they should remain... **By converting them you are choosing an 
arbitrary encoding, even if it is the "natural" one.**
2) this limitation to the latin-1 subset is not compatible with the 
documentation, which says that valid Python identifiers are valid group names. 
If this was really the case, then I would expect to be able to use any string 
for which .isidentifier() is true as a group name, programmatically.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin


Ma Lin  added the comment:

It seems you don't know some knowledge of encoding yet.

Naturally, `bytes` cannot contain characters whose Unicode code point is greater 
than \u00ff. So you can only use the "latin1" encoding, which maps from character 
to byte (or the reverse) directly.
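
The direct mapping can be checked with a few lines. This is only an illustration of the codec's behaviour, with illustrative variable names:

```python
all_bytes = bytes(range(256))

# Decoding with latin-1 gives one character per byte, and each
# character's code point equals the byte value it came from.
decoded = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in decoded]

# A character above U+00FF has no latin-1 representation.
try:
    "Δ".encode("latin-1")
    delta_fits = True
except UnicodeEncodeError:
    delta_fits = False
```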

"utf-8", "utf-16" and "utf-32" are all encoding codecs, "utf-8" should not have 
a special status in this context.

--
nosy:  -ezio.melotti, mrabarnett




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

The issue with the second variant is that utf-8 is an arbitrary (although 
default) choice.

But: re is already making that same arbitrary choice when decoding the group 
names into a string, which is my original complaint!

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

Sorry, b"(?P<\xce\x94>)"

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

But Δ has no latin-1 representation. So Δ currently cannot be used as a group 
name in bytes regex, although it is a valid Python identifier. So that's a bug.

I mean, if you insist on having group names as strings even for bytes regexes, 
then it is not reasonable to prevent them from going _in_.

b"(??<\xce\x94>)" is a valid utf-8-encoded bytestring, why wouldn't you accept 
it as a valid re pattern?

IMHO, either

- group names from byte regexes should be returned as bytes
- or any utf-8-encoded representation of a valid Python identifier should be 
accepted as a group name of a bytes regex pattern.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin

Ma Lin  added the comment:

In this case, you can only use 'latin1', which directly maps one character 
(\u0000-\u00FF) to/from one byte.

If you use 'utf-8', it may map one character to multiple bytes, such as 'Δ' -> 
b'\xce\x94'

'\x94' is an invalid identifier, it will raise an error:

>>> '\xce'.isidentifier()   # '\xce' is 'Î'
True
>>> '\x94'.isidentifier()
False

You may close this issue (I can't close it), we can continue the discussion.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

> So b'\xe9' is mapped to \u00e9, it is `é`.

Yes but \xe9 is not strictly valid utf-8, or say not the canonical 
representation of "é". So there is no way to get \xe9 starting from é without 
leaving utf-8. So starting with é as group name, I cannot programmatically 
encode it into a bytes pattern.

> Of course, characters with Unicode code point greater than 0xff are 
> impossible to appear in `bytes`.

But \xce and \x94 are both lower than \xff, yet using \xce\x94 ("Δ".encode()) 
in a group name fails.

According to the doc, the sole constraint on group names is that they have to 
be valid and unique Python identifiers. So this should work:

```
# Δ is a valid identifier
>>> "Δ".isidentifier()
True
>>> Δ = 1
>>> Δ
1
>>> import re
>>> name = "Δ"
>>> re.match(b"(?P<" + name.encode() + b">)", b"")
Traceback (most recent call last):
  File "", line 1, in 
re.match(b"(?P<" + name.encode() + b">)", b"")
  File "/usr/lib/python3.8/re.py", line 191, in match
return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'Î\x94' at position 4
re.match(b'(?P<\xce\x94>)', b'').groupdict()
```

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin

Ma Lin  added the comment:

`latin1` is the character set covering the Unicode code points from \u0000 to 
\u00ff, and its characters are mapped directly from/to bytes.

So b'\xe9' is mapped to \u00e9, it is `é`.

Of course, characters with Unicode code point greater than 0xff are impossible 
to appear in `bytes`.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

Of course an inconvenience in my program is not per se the reason to change the 
language. I just wanted to motivate that the current situation gives unexpected 
results.

"\xe9" doesn't look like proper utf-8 to me:

```
>>> "é".encode("latin-1")
b'\xe9'
>>> "é".encode()
b'\xc3\xa9'
```

Let's try another one: how would you go for Δ ("\u0394") as a group name?


```
>>> "Δ".encode()
b'\xce\x94'
>>> "Δ".encode("latin-1")
Traceback (most recent call last):
  File "", line 1, in 
"Δ".encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0394' in position 
0: ordinal not in range(256)
>>> re.match(b'(?P<\xce\x94>)', b'').groupdict()
Traceback (most recent call last):
  File "", line 1, in 
re.match(b'(?P<\xce\x94>)', b'').groupdict()
  File "/usr/lib/python3.8/re.py", line 191, in match
return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'Î\x94' at position 4
>>> re.match(b'(?P<\u0394>)', b'').groupdict()
Traceback (most recent call last):
  File "", line 1, in 
re.match(b'(?P<\u0394>)', b'').groupdict()
  File "/usr/lib/python3.8/re.py", line 191, in match
return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
raise source.error(msg, len(name) + 1)
re.error: bad character in group name '\\u0394' at position 4
```

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Ma Lin

Ma Lin  added the comment:

> a non-ascii group name will raise an error in bytes, even if encoded

Looks like this is a language limitation:

>>> b'é'
  File "", line 1
SyntaxError: bytes can only contain ASCII literal characters.

No problem if you use escaped character:

>>> re.match(b'(?P<\xe9>)', b'').groupdict()
{'é': b''}

There may be some inconveniences in your program, but IMO there is nothing 
wrong, maybe this issue can be closed.

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger


Quentin Wenger  added the comment:

should *be a valid name

--




[issue40980] group names of bytes regexes are strings

2020-06-16 Thread Quentin Wenger

Quentin Wenger  added the comment:

Agreed to some extent, but there is the difference that group names are 
embedded in the pattern, which has to be bytes if the target is bytes.

My use case is in an all-bytes, no-string project where I construct a large 
regular expression at startup, with semi-dynamical group names.

So it seems natural to have everything in bytes to concatenate the regular 
expression, incl. the group names.

But then the group names that I receive back are strings, so I cannot look them 
up directly in the set of group names that I used to create the expression in 
the first place.

Of course I can live with it by storing them as strings in the first place and 
encode()'ing them during concatenation, but it does not feel "natural".
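
For illustration, that "store as strings, encode during concatenation" approach could look like the sketch below. The group names and the sample input are made up, and ASCII-only names are assumed.

```python
import re

# Hypothetical group names, kept as str so they compare equal
# to the names that re hands back.
names = ["year", "month"]

# Encode each name only at the moment it is spliced into the bytes pattern.
pattern = b"-".join(
    b"(?P<" + name.encode("ascii") + b">[0-9]+)" for name in names
)

m = re.match(pattern, b"2020-06")
found = {name: m.group(name) for name in names}
```

The lookup works because both sides are str, but the pattern construction needs an explicit encode for every name.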

Furthermore, even if it is "just a name", a non-ascii group name will raise an 
error in bytes, even if encoded...:

```
>>> re.compile("(?P<" + "é" + ">)")
re.compile('(?P<é>)')
>>> re.compile(b"(?P<" + "é".encode() + b">)")
Traceback (most recent call last):
  File "", line 1, in 
re.compile(b"(?P<" + "é".encode() + b">)")
  File "/usr/lib/python3.8/re.py", line 252, in compile
return _compile(pattern, flags)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'é' at position 4
```

So no, it's not really "just a name", considering that in Python "é" should is 
a valid name.

--




[issue40980] group names of bytes regexes are strings

2020-06-15 Thread Ma Lin


Ma Lin  added the comment:

Group names being `str` is very reasonable. Essentially a group name is just a 
name, it has nothing to do with `bytes`.

Other names in Python are also of type `str`, such as codec names and hashlib 
names.

--
nosy: +Ma Lin




[issue40980] group names of bytes regexes are strings

2020-06-15 Thread Quentin Wenger


Quentin Wenger  added the comment:

This also affects functions/methods expecting a group name as parameter (e.g. 
match.group): the group name has to be passed as a string.
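
A small sketch of what this looks like in practice, using an ASCII group name. The `bytes_name_ok` flag is only there to record the outcome.

```python
import re

m = re.match(b"(?P<a>x)", b"x")

by_str = m.group("a")     # works: the name is given as str

try:
    m.group(b"a")         # the bytes spelling of the same name is not found
    bytes_name_ok = True
except IndexError:
    bytes_name_ok = False
```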

--




[issue40980] group names of bytes regexes are strings

2020-06-14 Thread Quentin Wenger


New submission from Quentin Wenger :

I noticed that match.groupdict() returns string keys, even for a bytes regex:

```
>>> import re
>>> re.match(b"(?P<a>)", b"").groupdict()
{'a': b''}
```

This seems somewhat strange, because string and bytes matching in re are kind 
of two separate parts, cf. doc:

> Both patterns and strings to be searched can be Unicode strings (str) as well 
> as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot 
> be mixed: that is, you cannot match a Unicode string with a byte pattern or 
> vice-versa; similarly, when asking for a substitution, the replacement string 
> must be of the same type as both the pattern and the search string.

--
components: Regular Expressions
messages: 371516
nosy: ezio.melotti, matpi, mrabarnett
priority: normal
severity: normal
status: open
title: group names of bytes regexes are strings
type: behavior
versions: Python 3.8
