Re: [Python-Dev] Misc re.match() complaint
Hi, On Wed, Jul 17, 2013 at 6:15 AM, Stephen J. Turnbull step...@xemacs.org wrote: BTW, I suggest that Terry's usage of string (to mean str or bytes in 3.x, unicode or str in 2.x) be adopted, and Guido's stringish be given expanded meaning, including buffer objects. string means str, bytes means bytes, bytes-like object means any object that supports the buffer protocol [0] (including bytes). string and bytes-like object includes all of them. I don't think we need to introduce new terms. Best Regards, Ezio Melotti [0]: http://docs.python.org/3/glossary.html#term-bytes-like-object Then we can say informally that in searching and matching a target is a stringish, the pattern is a stringish (?) or compiled re, but the group method returns a string. Steve ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
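For illustration only (not part of the original message): a minimal sketch of the three terms as used above. The helper name classify is hypothetical, and using a successful memoryview() call as the test for "supports the buffer protocol" is an approximation of the glossary definition, not the definition itself.

```python
import array


def classify(obj):
    """Label obj with the thread's terminology (hypothetical helper)."""
    if isinstance(obj, str):
        return "string"
    if isinstance(obj, bytes):
        return "bytes"
    try:
        # memoryview() raises TypeError for objects that do not
        # export the buffer protocol.
        memoryview(obj).release()
    except TypeError:
        return "other"
    return "bytes-like object"


for sample in ("abc", b"abc", bytearray(b"abc"), array.array("b", [97]), 42):
    print(type(sample).__name__, "->", classify(sample))
```

Note that bytes is itself a bytes-like object; the helper reports it separately only to mirror the wording above.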
Re: [Python-Dev] Misc re.match() complaint
On Thu, Jul 18, 2013 at 6:15 AM, Ezio Melotti ezio.melo...@gmail.com wrote: I don't think we need to introduce new terms. +1 -- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Misc re.match() complaint
On 7/18/2013 9:15 AM, Ezio Melotti wrote: In 3.x string means str, bytes means bytes, bytes-like object means any object that supports the buffer protocol [0] (including bytes). string and bytes-like object includes all of them. I don't think we need to introduce new terms. I agree. We just need to use them consistently, and update docs carried over without change from 2.x (like re doc), where 'string' meant 'unicode or str (bytes)' or even 'unicode and bytes-like'.
Re: [Python-Dev] Misc re.match() complaint
On 7/17/2013 12:15 AM, Stephen J. Turnbull wrote: Terry Reedy writes: On 7/15/2013 10:20 PM, Guido van Rossum wrote: Or is this something deeper, that a group *is* a new object in principle? No, I just think of it as returning a string That is exactly what the doc says it does. See my other post. The problem is that IIUC 'a string' is intentionally *not* referring to the usual str or bytes objects (at least that's one of the standard uses for scare quotes, to indicate an unusual usage). There are no 'scare quotes' in the doc. I put quote marks on things to indicate that I was quoting. I do not know how Guido regarded his marks. Either the docstring is using string in a similarly ambiguous way, or else it's incorrect under the interpretation that buffer objects are *not* strings, so they should be inadmissible as targets. Saying that input arguments 'can be Unicode strings as well as 8-bit strings' (the wording is from 2.x, carried over to 3.x) does not necessarily exclude other inputs. CPython is sometimes more permissive than the doc requires. If the doc said str, bytes, bytearray, or memoryview, then other implementations would have to do the same to be conforming. I do not know if that is intended or not. The question is whether CPython should be just as permissive as to the output types of .group(). (And what requirement, if any, should be imposed on other implementations.) Something should be fixed, and I suppose it should be the return type of group(). BTW, I suggest that Terry's usage of string (to mean str or bytes in 3.x, unicode or str in 2.x) be adopted, and Guido's stringish This word is an adjective, not a noun. be given expanded meaning, including buffer objects. Then we can say informally that in searching and matching a target is a stringish, the pattern is a stringish (?) or compiled re, but the group method returns a string. Guido's idea to fix (tighten up) the output in 3.4 is fine with me.
-- Terry Jan Reedy
Re: [Python-Dev] Misc re.match() complaint
Terry Reedy writes: stringish This word is an adjective, not a noun. Ah, a strict grammarian. That trumps any cards I could possibly play.
Re: [Python-Dev] Misc re.match() complaint
16.07.13 20:21, Guido van Rossum wrote: The situation is most egregious if the target string is a bytearray, where there is currently no way to get the result as an immutable bytes object without an extra copy. (There's no API that lets you create a bytes object directly from a slice of a bytearray.)

    m = memoryview(data)
    if m:
        return m.cast('B')[low:high].tobytes()
    else:
        # cast() doesn't work for empty memoryview
        return b''
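For illustration only: the fragment above wrapped in a function so it can be run as-is. The function name bytes_from_slice and the sample data are assumptions; the body follows the fragment, including its guard for an empty buffer.

```python
def bytes_from_slice(data, low, high):
    """Copy data[low:high] into a new bytes object (hypothetical helper).

    The memoryview slice itself does not copy; the only copy is the
    one made by tobytes().
    """
    m = memoryview(data)
    if m:
        return m.cast('B')[low:high].tobytes()
    # Guard kept from the fragment above for the empty-buffer case.
    return b''


buf = bytearray(b"hello world")
word = bytes_from_slice(buf, 0, 5)
print(word, type(word).__name__)
```

Because cast('B') views the buffer as raw bytes, low and high are byte offsets, which matters if data is something other than a bytearray.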
Re: [Python-Dev] Misc re.match() complaint
On 17/07/2013 05:15, Stephen J. Turnbull wrote: Terry Reedy writes: On 7/15/2013 10:20 PM, Guido van Rossum wrote: Or is this something deeper, that a group *is* a new object in principle? No, I just think of it as returning a string That is exactly what the doc says it does. See my other post. The problem is that IIUC 'a string' is intentionally *not* referring to the usual str or bytes objects (at least that's one of the standard uses for scare quotes, to indicate an unusual usage). Either the docstring is using string in a similarly ambiguous way, or else it's incorrect under the interpretation that buffer objects are *not* strings, so they should be inadmissible as targets. Something should be fixed, and I suppose it should be the return type of group(). BTW, I suggest that Terry's usage of string (to mean str or bytes in 3.x, unicode or str in 2.x) be adopted, and Guido's stringish be given expanded meaning, including buffer objects. Then we can say informally that in searching and matching a target is a stringish, the pattern is a stringish (?) or compiled re, but the group method returns a string. Instead of stringish, how about stringoid? To me, stringish is an adjective, but stringoid can be a noun or an adjective. According to http://dictionary.reference.com: -oid —suffix forming adjectives, —suffix forming nouns indicating likeness, resemblance, or similarity: anthropoid ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Misc re.match() complaint
Guido van Rossum writes: I'm not sure I understand you. :-( My apologies. This: Or is this something deeper, that a group *is* a new object in principle? No, I just think of it as returning a string and I think it's most useful if that is always an immutable object, even if the target string is some other bytes buffer. is exactly the kind of answer I was looking for.
Re: [Python-Dev] Misc re.match() complaint
On 16 July 2013 14:53, Guido van Rossum gu...@python.org wrote: Hm. I'd still like to change this, but I understand it's debatable... Is the group() method written in C or Python? If it's in C it should be simple enough to let it just do a little bit of pointer math and construct a bytes object from the given area of memory -- after all, it must have a pointer to that memory area in order to do the matching in the first place (although I realize the code may be separated by a gulf of abstraction :-). It shouldn't be too bad - I tracked it down through sre_compile, and everything seems to funnel into match_getslice_by_index [1], so it should be possible to detect the non-bytes, non-strings there and coerce them to bytes. OTOH, you can already get the same effect by explicitly wrapping the input in memoryview before passing it to re, and then converting the output to bytes to release the reference to the underlying data, and doing that doesn't raise ugly backwards compatibility concerns Cheers, Nick. [1] http://hg.python.org/cpython/file/daf9ea42b610/Modules/_sre.c#l3198 -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Misc re.match() complaint
On Mon, 15 Jul 2013 21:53:42 -0700, Guido van Rossum gu...@python.org wrote: Hm. I'd still like to change this, but I understand it's debatable... Is the group() method written in C or Python? Is there a strong enough use case to change it? I can't say the current behaviour seems very useful either, but some people may depend on it. I already find it a bit weird that you're passing a bytearray or memoryview to re.match(), to be honest :-) Regards Antoine. If it's in C it should be simple enough to let it just do a little bit of pointer math and construct a bytes object from the given area of memory -- after all, it must have a pointer to that memory area in order to do the matching in the first place (although I realize the code may be separated by a gulf of abstraction :-). --Guido On Mon, Jul 15, 2013 at 8:03 PM, Nick Coghlan ncogh...@gmail.com wrote: On 16 July 2013 12:20, Guido van Rossum gu...@python.org wrote: On Mon, Jul 15, 2013 at 7:03 PM, Stephen J. Turnbull step...@xemacs.org wrote: Or is this something deeper, that a group *is* a new object in principle? No, I just think of it as returning a string and I think it's most useful if that is always an immutable object, even if the target string is some other bytes buffer. FWIW, it feels as if the change in behavior is probably just due to how slices work. I took a look at the way the 2.7 re code works, and the change does indeed appear to be due to the difference in the way slices work for buffer and memoryview objects: Slicing a buffer creates an 8-bit string:

    >>> buffer(b"abc")[0:1]
    'a'

Slicing a memoryview creates another memoryview:

    >>> memoryview(b"abc")[0:1]
    <memory at 0x7f3320541b98>

Unfortunately, memoryview doesn't currently allow subclasses, so it isn't easy to create a derivative that coerces to bytes on slicing :( Cheers, Nick.
-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Misc re.match() complaint
On 7/15/2013 7:14 PM, Guido van Rossum wrote: In a discussion about mypy I discovered that the Python 3 version of the re module's Match object behaves subtly different from the Python 2 version when the target string (i.e. the haystack, not the needle) is a buffer object. In Python 2, the type of the return value of group() is always either a Unicode string or an 8-bit string, and the type is determined by looking at the target string -- if the target is unicode, group() returns a unicode string, otherwise, group() returns an 8-bit string. In particular, if the target is a buffer object, group() returns an 8-bit string. I think this is the appropriate behavior: otherwise using regular expression matching to extract a small substring from a large target string would unnecessarily keep the large target string alive as long as the substring is alive. But in Python 3, the behavior of group() has changed so that its return type always matches that of the target string. I think this is bad -- apart from the lifetime concern, it means that if your target happens to be a bytearray, the return value isn't even hashable! Does anyone remember whether this was a conscious decision? Is it too late to fix? In both Python 2 and Python 3, the second sentence of the docs is Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings. The Python 3 version goes on to say that patterns and targets must match. However, Unicode strings and 8-bit strings cannot be mixed. I normally consider '8-bit string' to mean 'bytes'. It certainly meant that in Python 2. We use 'buffer object' or 'object satisfying the buffer protocol' to mean 'bytes, byte_arrays, or memoryviews'. I wonder if the change was an artifact of changing the code to prohibit mixing Unicode and bytes. Going on match.group([group1, ...]) Returns one or more subgroups of the match. 
If there is a single argument, the result is a single string; In both 2.x and 3.x docs, I usually understand generic 'string' to mean 'Unicode or bytes'. In any case, the sentence and a half from 'Returns' to 'string' is *exactly the same* as in the 2.x docs. As near as I could tell, the rest of the entry for match.group is unchanged from 2.x to 3.x. So it is easy to think that the behavior change is an unintended regression. -- Terry Jan Reedy
Re: [Python-Dev] Misc re.match() complaint
On 16 July 2013 19:18, Terry Reedy tjre...@udel.edu wrote: I wonder if the change was an artifact of changing the code to prohibit mixing Unicode and bytes. I'm pretty sure the only thing we changed in 3.x is to migrate re to the PEP 3118 buffer API, and the behavioural change Guido is seeing is actually the one between the 2.x buffer (which returns 8-bit strings when sliced) and other types (including memoryview) which return instances of themselves. Getting the old buffer behaviour in 3.x without an extra copy operation should just be a matter of wrapping the input with memoryview (to avoid copying the group elements in the match object) and the output with bytes (to avoid keeping the entire original object alive just to reference a few small pieces of it that were matched by the regex):

    >>> import re
    >>> data = bytearray(b"aaabbbcccddd")
    >>> re.match(b"(a*)b*c*(d*)", data).group(2)
    bytearray(b'ddd')
    >>> bytes(re.match(b"(a*)b*c*(d*)", memoryview(data)).group(2))
    b'ddd'

Given that, I'm inclined to keep the existing behaviour on backwards compatibility grounds. To make the above code work on both 2.x *and* 3.x without making an extra copy, it's possible to keep the bytes call (it should be a no-op on 2.x) and dynamically switch the type used to wrap the input between buffer in 2.x and memoryview in 3.x (unfortunately, the 2.x memoryview doesn't work for this case, as the 2.x re API doesn't accept it as valid input). Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
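For illustration only: the wrap-the-input, wrap-the-output idea from the message above as a small Python 3 helper. The name group_as_bytes is hypothetical. The raw type returned by group() for a buffer target is exactly what this thread debates, so the sketch normalizes the result with bytes() rather than relying on it.

```python
import re


def group_as_bytes(pattern, target, index=0):
    """Match against a bytes-like target and return the group as bytes.

    Hypothetical helper: memoryview() wraps the input, and bytes()
    makes the result immutable and independent of the target buffer.
    """
    match = pattern.match(memoryview(target))
    if match is None:
        return None
    return bytes(match.group(index))


data = bytearray(b"aaabbbcccddd")
pattern = re.compile(b"(a*)b*c*(d*)")
tail = group_as_bytes(pattern, data, 2)
print(tail, type(tail).__name__)
```

The returned object is hashable, so it can be used as a dict key even though the target was a bytearray.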
Re: [Python-Dev] Misc re.match() complaint
On Tue, Jul 16, 2013 at 12:55 AM, Antoine Pitrou solip...@pitrou.net wrote: Is there a strong enough use case to change it? I can't say the current behaviour seems very useful either, but some people may depend on it. This is the crucial question. I personally see the current behavior as an artifact of the (lack of) design process, not as a conscious decision. Given that we also have m.string, m.start(grp) and m.end(grp), those who need something matching the original type (or even something that is known to be a reference into the original object) can use that API; for most use cases, all you care about is the selected group as a string, and it is more useful if that is always an immutable string (bytes or str). The situation is most egregious if the target string is a bytearray, where there is currently no way to get the result as an immutable bytes object without an extra copy. (There's no API that lets you create a bytes object directly from a slice of a bytearray.) In terms of backwards compatibility, I wouldn't want to do this in a bugfix release, but for a feature release I think it's fine -- the number of applications that could be bitten by this must be extremely small (and the work-around is backward-compatible: just use m.string[m.start() : m.end()]). I already find it a bit weird that you're passing a bytearray or memoryview to re.match(), to be honest :-) Yes, this is somewhat of an odd corner, but actually most built-in APIs taking bytes also take anything else that can be coerced to bytes (io.open() seems to be the exception, and it feels like an accident -- os.open() *does* accept bytearray and friends). This is quite useful for code that interacts with C code or system calls -- often you have a large buffer shared between C and Python code for efficiency, and being able to do pretty much anything to the buffer that you can do to a bytes object (apart from using it as a dict key) helps a lot.
-- --Guido van Rossum (python.org/~guido)
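For illustration only: a short Python 3 sketch of the work-around mentioned in the message above, slicing m.string with m.start() and m.end() to get a result of the original type. The variable names and sample data are assumptions.

```python
import re

target = bytearray(b"xyzzy")
m = re.search(b"yz+", target)

# m.string is the object that was passed to search(), so slicing it
# yields the target's own type (a bytearray here).
same_type = m.string[m.start():m.end()]

# An immutable copy, for callers that want bytes regardless of the
# target type.
immutable = bytes(same_type)

print(type(same_type).__name__, type(immutable).__name__)
```

Passing a group number or name to start() and end() selects a subgroup in the same way.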
[Python-Dev] Misc re.match() complaint
In a discussion about mypy I discovered that the Python 3 version of the re module's Match object behaves subtly differently from the Python 2 version when the target string (i.e. the haystack, not the needle) is a buffer object. In Python 2, the type of the return value of group() is always either a Unicode string or an 8-bit string, and the type is determined by looking at the target string -- if the target is unicode, group() returns a unicode string, otherwise, group() returns an 8-bit string. In particular, if the target is a buffer object, group() returns an 8-bit string. I think this is the appropriate behavior: otherwise using regular expression matching to extract a small substring from a large target string would unnecessarily keep the large target string alive as long as the substring is alive. But in Python 3, the behavior of group() has changed so that its return type always matches that of the target string. I think this is bad -- apart from the lifetime concern, it means that if your target happens to be a bytearray, the return value isn't even hashable! Does anyone remember whether this was a conscious decision? Is it too late to fix? -- --Guido van Rossum (python.org/~guido)
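For illustration only: a minimal sketch of the hashability point above. It does not call re at all; it only shows that a bytearray cannot be used as a dict key while the equivalent bytes object can. The variable names are assumptions.

```python
counts = {}
key = bytearray(b"ddd")

try:
    counts[key] = 1          # bytearray is mutable, hence unhashable
except TypeError:
    unhashable = True
else:
    unhashable = False

# Converting to bytes gives an immutable, hashable key.
counts[bytes(key)] = 1

print(unhashable, counts)
```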
Re: [Python-Dev] Misc re.match() complaint
On Mon, Jul 15, 2013 at 4:14 PM, Guido van Rossum gu...@python.org wrote: In a discussion about mypy I discovered that the Python 3 version of the re module's Match object behaves subtly different from the Python 2 version when the target string (i.e. the haystack, not the needle) is a buffer object. In Python 2, the type of the return value of group() is always either a Unicode string or an 8-bit string, and the type is determined by looking at the target string -- if the target is unicode, group() returns a unicode string, otherwise, group() returns an 8-bit string. In particular, if the target is a buffer object, group() returns an 8-bit string. I think this is the appropriate behavior: otherwise using regular expression matching to extract a small substring from a large target string would unnecessarily keep the large target string alive as long as the substring is alive. But in Python 3, the behavior of group() has changed so that its return type always matches that of the target string. I think this is bad -- apart from the lifetime concern, it means that if your target happens to be a bytearray, the return value isn't even hashable! Does anyone remember whether this was a conscious decision? Is it too late to fix? Hmm, that is not what I'd expect either. I would never expect it to return a bytearray; I'd normally assume that .group() returned a bytes object if the input was binary data and a str object if the input was unicode data (str) regardless of specific types containing the input target data. I'm going to hazard a guess that not much, if anything, would be depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and earlier users are stuck with an extra bytes() call and data copy in these cases I guess. -gps ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Misc re.match() complaint
On 16/07/2013 00:30, Gregory P. Smith wrote: On Mon, Jul 15, 2013 at 4:14 PM, Guido van Rossum gu...@python.org mailto:gu...@python.org wrote: In a discussion about mypy I discovered that the Python 3 version of the re module's Match object behaves subtly different from the Python 2 version when the target string (i.e. the haystack, not the needle) is a buffer object. In Python 2, the type of the return value of group() is always either a Unicode string or an 8-bit string, and the type is determined by looking at the target string -- if the target is unicode, group() returns a unicode string, otherwise, group() returns an 8-bit string. In particular, if the target is a buffer object, group() returns an 8-bit string. I think this is the appropriate behavior: otherwise using regular expression matching to extract a small substring from a large target string would unnecessarily keep the large target string alive as long as the substring is alive. But in Python 3, the behavior of group() has changed so that its return type always matches that of the target string. I think this is bad -- apart from the lifetime concern, it means that if your target happens to be a bytearray, the return value isn't even hashable! Does anyone remember whether this was a conscious decision? Is it too late to fix? Hmm, that is not what I'd expect either. I would never expect it to return a bytearray; I'd normally assume that .group() returned a bytes object if the input was binary data and a str object if the input was unicode data (str) regardless of specific types containing the input target data. I'm going to hazard a guess that not much, if anything, would be depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and earlier users are stuck with an extra bytes() call and data copy in these cases I guess. I'm not sure I understand the complaint. 
I get this for Python 2.7:

    Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import array
    >>> import re
    >>> re.match(r"a", array.array("b", "a")).group()
    array('b', [97])

It's the same even in Python 2.4.
Re: [Python-Dev] Misc re.match() complaint
On 16 Jul 2013 09:17, Guido van Rossum gu...@python.org wrote: Does anyone remember whether this was a conscious decision? I doubt it was a conscious decision - an unfortunate amount of the standard library's handling of the text model change falls into the category of implementation accident :( Is it too late to fix? Like Greg, I'm comfortable with the idea of calling bug on this one, fixing it in 3.4 and making a note in the Porting to Python 3.4 section of the What's New guide. Cheers, Nick.
Re: [Python-Dev] Misc re.match() complaint
On Mon, Jul 15, 2013 at 5:10 PM, MRAB pyt...@mrabarnett.plus.com wrote: On 16/07/2013 00:30, Gregory P. Smith wrote: On Mon, Jul 15, 2013 at 4:14 PM, Guido van Rossum gu...@python.org mailto:gu...@python.org wrote: In a discussion about mypy I discovered that the Python 3 version of the re module's Match object behaves subtly different from the Python 2 version when the target string (i.e. the haystack, not the needle) is a buffer object. In Python 2, the type of the return value of group() is always either a Unicode string or an 8-bit string, and the type is determined by looking at the target string -- if the target is unicode, group() returns a unicode string, otherwise, group() returns an 8-bit string. In particular, if the target is a buffer object, group() returns an 8-bit string. I think this is the appropriate behavior: otherwise using regular expression matching to extract a small substring from a large target string would unnecessarily keep the large target string alive as long as the substring is alive. But in Python 3, the behavior of group() has changed so that its return type always matches that of the target string. I think this is bad -- apart from the lifetime concern, it means that if your target happens to be a bytearray, the return value isn't even hashable! Does anyone remember whether this was a conscious decision? Is it too late to fix? Hmm, that is not what I'd expect either. I would never expect it to return a bytearray; I'd normally assume that .group() returned a bytes object if the input was binary data and a str object if the input was unicode data (str) regardless of specific types containing the input target data. I'm going to hazard a guess that not much, if anything, would be depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and earlier users are stuck with an extra bytes() call and data copy in these cases I guess. I'm not sure I understand the complaint. 
I get this for Python 2.7:

    Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import array
    >>> import re
    >>> re.match(r"a", array.array("b", "a")).group()
    array('b', [97])

It's the same even in Python 2.4. Ah, but now try it with buffer():

    >>> re.search('yz+', buffer('xyzzy')).group()
    'yzz'

The equivalent in Python 3 (using memoryview) returns a memoryview:

    >>> re.search(b'yz+', memoryview(b'xyzzy')).group()
    <memory at 0x10d03a688>

And I still think that any return type for group() except bytes or str is wrong. (Except possibly a subclass of these.) -- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Misc re.match() complaint
Guido van Rossum writes: And I still think that any return type for group() except bytes or str is wrong. (Except possibly a subclass of these.) I'm not sure I understand. Do you mean in the context of the match object API, where constructing (target, match.start(), match.end()) to get a group-like object that refers to the target rather than copying the text is simple? (Such objects are very useful in the restricted application of constructing a programmable text editor.) Or is this something deeper, that a group *is* a new object in principle?
Re: [Python-Dev] Misc re.match() complaint
On 16/07/2013 01:25, Guido van Rossum wrote: On Mon, Jul 15, 2013 at 5:10 PM, MRAB pyt...@mrabarnett.plus.com wrote: On 16/07/2013 00:30, Gregory P. Smith wrote: On Mon, Jul 15, 2013 at 4:14 PM, Guido van Rossum gu...@python.org mailto:gu...@python.org wrote: In a discussion about mypy I discovered that the Python 3 version of the re module's Match object behaves subtly different from the Python 2 version when the target string (i.e. the haystack, not the needle) is a buffer object. In Python 2, the type of the return value of group() is always either a Unicode string or an 8-bit string, and the type is determined by looking at the target string -- if the target is unicode, group() returns a unicode string, otherwise, group() returns an 8-bit string. In particular, if the target is a buffer object, group() returns an 8-bit string. I think this is the appropriate behavior: otherwise using regular expression matching to extract a small substring from a large target string would unnecessarily keep the large target string alive as long as the substring is alive. But in Python 3, the behavior of group() has changed so that its return type always matches that of the target string. I think this is bad -- apart from the lifetime concern, it means that if your target happens to be a bytearray, the return value isn't even hashable! Does anyone remember whether this was a conscious decision? Is it too late to fix? Hmm, that is not what I'd expect either. I would never expect it to return a bytearray; I'd normally assume that .group() returned a bytes object if the input was binary data and a str object if the input was unicode data (str) regardless of specific types containing the input target data. I'm going to hazard a guess that not much, if anything, would be depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and earlier users are stuck with an extra bytes() call and data copy in these cases I guess. I'm not sure I understand the complaint. 
I get this for Python 2.7:

    Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import array
    >>> import re
    >>> re.match(r"a", array.array("b", "a")).group()
    array('b', [97])

It's the same even in Python 2.4. Ah, but now try it with buffer():

    >>> re.search('yz+', buffer('xyzzy')).group()
    'yzz'

The equivalent in Python 3 (using memoryview) returns a memoryview:

    >>> re.search(b'yz+', memoryview(b'xyzzy')).group()
    <memory at 0x10d03a688>

And I still think that any return type for group() except bytes or str is wrong. (Except possibly a subclass of these.) On the other hand, I think that it's not unreasonable that the output is the same type as the input. You could reason that what it's doing is returning a slice of the input, and that slice should be the same type as its source. Incidentally, the regex module does what Python 3's re module currently does, even in Python 2. Nobody's complained!
Re: [Python-Dev] Misc re.match() complaint
On Mon, Jul 15, 2013 at 7:18 PM, MRAB pyt...@mrabarnett.plus.com wrote: On the other hand, I think that it's not unreasonable that the output is the same type as the input. You could reason that what it's doing is returning a slice of the input, and that slice should be the same type as its source. By now I'm pretty sure that is why it changed. But I am challenging how useful that is, compared to always returning something immutable. Incidentally, the regex module does what Python 3's re module currently does, even in Python 2. Nobody's complained! Well, you'd only see complaints from folks who (a) use the regex module, (b) use it with a buffer object as the target string, and (c) try to use the group() return value as a dict key. Each of these is probably a small minority of all users. -- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Misc re.match() complaint
On 16 July 2013 12:20, Guido van Rossum gu...@python.org wrote: On Mon, Jul 15, 2013 at 7:03 PM, Stephen J. Turnbull step...@xemacs.org wrote: Or is this something deeper, that a group *is* a new object in principle? No, I just think of it as returning a string and I think it's most useful if that is always an immutable object, even if the target string is some other bytes buffer. FWIW, it feels as if the change in behavior is probably just due to how slices work. I took a look at the way the 2.7 re code works, and the change does indeed appear to be due to the difference in the way slices work for buffer and memoryview objects: Slicing a buffer creates an 8-bit string:

    >>> buffer(b"abc")[0:1]
    'a'

Slicing a memoryview creates another memoryview:

    >>> memoryview(b"abc")[0:1]
    <memory at 0x7f3320541b98>

Unfortunately, memoryview doesn't currently allow subclasses, so it isn't easy to create a derivative that coerces to bytes on slicing :( Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
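For illustration only: a Python 3 sketch of the slicing behaviour described in the message above (the 2.x buffer type no longer exists, so only the memoryview half is shown). The class name BytesSlicingView and the variable names are hypothetical.

```python
data = b"abc"
view = memoryview(data)

sliced_bytes = data[0:1]   # slicing bytes copies into a new bytes object
sliced_view = view[0:1]    # slicing a memoryview gives another memoryview

# The sliced view still refers to the whole original object, which is
# the lifetime concern raised earlier in the thread.
keeps_original_alive = sliced_view.obj is data

# memoryview cannot be used as a base class, so a "coerce to bytes on
# slicing" subclass is not possible.
try:
    class BytesSlicingView(memoryview):
        pass
except TypeError:
    can_subclass = False
else:
    can_subclass = True

print(type(sliced_bytes).__name__, type(sliced_view).__name__, can_subclass)
```

Calling bytes(sliced_view) copies just the sliced region and drops the dependency on the original buffer.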