[issue18468] re.group() should never return a bytearray
Changes by Serhiy Storchaka storch...@gmail.com: -- assignee: - serhiy.storchaka ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue18468 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue18468] re.group() should never return a bytearray
Roundup Robot added the comment: New changeset add40e9f7cbe by Serhiy Storchaka in branch 'default': Issue #18468: The re.split, re.findall, and re.sub functions and the group() http://hg.python.org/cpython/rev/add40e9f7cbe -- nosy: +python-dev
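A minimal sketch of what the changeset summary above describes, assuming an interpreter that includes the fix (Python 3.4 or later). The sample data and variable names are illustrative only and are not taken from the patch or its tests.

```python
import re

# A mutable bytes-like target, as discussed in the issue.
target = bytearray(b"key=value;other=1")

match = re.match(rb"(\w+)=(\w+)", target)
group_result = match.group(1)
findall_result = re.findall(rb"\w+=", target)
split_result = re.split(rb";", target)
sub_result = re.sub(rb"=", b":", target)

# With the fix, every result is bytes even though the target is a bytearray.
for name, value in [
    ("group", group_result),
    ("findall", findall_result),
    ("split", split_result),
    ("sub", sub_result),
]:
    print(name, value)
```

On Python 3.3 and earlier the same calls may hand back bytearray slices instead, which is the behavior the issue title objects to.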
[issue18468] re.group() should never return a bytearray
Serhiy Storchaka added the comment: Thank you Antoine for your review. -- resolution: -> fixed stage: patch review -> committed/rejected status: open -> closed
[issue18468] re.group() should never return a bytearray
Changes by Serhiy Storchaka storch...@gmail.com: Removed file: http://bugs.python.org/file31737/re_group_type.patch
[issue18468] re.group() should never return a bytearray
Serhiy Storchaka added the comment: Fixed a typo. Could anyone please review the patch?
[issue18468] re.group() should never return a bytearray
Changes by Serhiy Storchaka storch...@gmail.com: Added file: http://bugs.python.org/file31939/re_group_type.patch
[issue18468] re.group() should never return a bytearray
Serhiy Storchaka added the comment: The updated patch addresses Antoine's comments. -- Added file: http://bugs.python.org/file31941/re_group_type_2.patch
[issue18468] re.group() should never return a bytearray
Serhiy Storchaka added the comment: Oh, it seems I again did not attach the patch. Now I understand why there was no feedback for so long. -- keywords: +needs review, patch Added file: http://bugs.python.org/file31737/re_group_type.patch
[issue18468] re.group() should never return a bytearray
Serhiy Storchaka added the comment: Here is a patch with an implementation and tests. Feel free to add documentation changes if needed. -- stage: needs patch -> patch review
[issue18468] re.group() should never return a bytearray
Changes by Serhiy Storchaka storch...@gmail.com: -- nosy: +serhiy.storchaka
[issue18468] re.group() should never return a bytearray
Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com: -- nosy: +Arfrever
[issue18468] re.group() should never return a bytearray
Ezio Melotti added the comment:

I'm not sure it's worth changing it. As I see it, match/search are supposed to work with str or bytes and they return str/bytes accordingly. The fact that they work with other bytes-like objects seems to me an undocumented implementation detail people should not rely on. If they are passing a bytes-like object, either the current behavior (return the same type) or the newly proposed behavior (always return bytes) seems a reasonable expectation.

IIUC the advantage of changing the behavior is that it won't keep the target string alive anymore, but on the other hand it is not backward compatible and makes things more difficult for people who want the same type back. If people always want bytes back regardless of the input, they can convert the input or output to bytes explicitly.

-- components: +Regular Expressions nosy: +ezio.melotti, mrabarnett
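The two explicit conversions mentioned above could look like the following sketch. The request line and the names `buf`, `method`, and `path` are made up for illustration; the calls themselves are standard `re` and `bytes` usage and should behave the same before and after the change.

```python
import re

buf = bytearray(b"GET /index.html HTTP/1.1")
pattern = re.compile(rb"(\w+) (\S+)")

# Option 1: convert the output. Only the matched slice is copied.
m1 = pattern.match(buf)
method = bytes(m1.group(1))

# Option 2: convert the input. The whole buffer is copied up front,
# so group() returns bytes because the target itself is bytes.
m2 = pattern.match(bytes(buf))
path = m2.group(2)

print(method, path)
```

Option 2 is the variant that costs a copy of the entire target when the input is a large bytearray.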
[issue18468] re.group() should never return a bytearray
Matthew Barnett added the comment:

There's also the fact that the match object keeps a reference to the target string anyway:

    >>> import re
    >>> t = memoryview(b"a")
    >>> t
    <memory at 0x0100F110>
    >>> m = re.match(b"a", t)
    >>> m.string
    <memory at 0x0100F110>

On that subject, buried in the source code (_sre.c) is the comment:

    /* FIXME: implement setattr("string", None) as a special case (to detach the associated string, if any */

In the regex module I added a method detach_string to perform that function.
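The point about the retained reference can be checked without comparing memory addresses by hand. This is a small sketch using only the standard `re` module; the 1 KiB payload and the variable names are arbitrary choices for the example.

```python
import re

payload = bytearray(b"a" * 1024)
view = memoryview(payload)

m = re.match(b"a", view)

# The match object holds on to the exact object that was searched.
keeps_reference = m.string is view

# Copying the matched slice to bytes gives a result that does not depend
# on the match object or the view staying alive.
first = bytes(m.group())

# Dropping the match object removes its reference to the target.
del m
view.release()
print(keeps_reference, first)
```

The standard library match object offers no way to drop `m.string` while keeping the match itself, which is the gap the FIXME comment above refers to.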
[issue18468] re.group() should never return a bytearray
Ezio Melotti added the comment:

> match/search are supposed to work with str or bytes and they return str/bytes accordingly.

s/they return/calling m.group() returns/
[issue18468] re.group() should never return a bytearray
Guido van Rossum added the comment:

Ezio Melotti added the comment:
> [...] IIUC the advantage of changing the behavior is that it won't keep the target string alive anymore, but on the other hand is not backward compatible and makes things more difficult for people who want the same type back.

Everyone seems to be afraid of backward compatibility here. I will take full responsibility, so let's just discuss what's the better API, regardless of what we did (and in 99% of the cases it's the same anyway).

People who want the same type back -- there is no evidence that anyone wants this. People who want a bytes object -- this is definitely a valid use case.

> If people always want bytes back regardless of the input, they can convert the input or output to bytes explicitly.

But this requires an extra copy if the input is a bytearray. I suspect this might be the most commonly used non-bytes non-str target in Python 3 programs, and we are striving to support bytearray as input in as many places as possible where plain bytes is accepted. But generally getting bytearray as output requires a different API, e.g. recv_into().

I think a very reasonable general rule is that for functions that take either str or bytes and adjust their output to the input type, if their input is one of the bytes alternatives (bytearray, memoryview, array.array('b'), maybe others) the output is always a bytes object. The reason is that while the buffer API makes it easy to access the underlying bytes from C, it doesn't give you a way to create a new object of the same type (except by slicing, which doesn't always apply, e.g. os.listdir()). So for creating return values that match a memoryview (or bytearray, etc.) input, the only reasonable thing is to return a bytes object.

(FWIW os.listdir() violates this too -- os.listdir(b'.') returns a list of bytes objects, while os.listdir(bytearray(b'.')) returns a list of str objects. This seems to be caused by reversed logic -- it probably tests if the type is bytes rather than if the type isn't str for the output type, even though it does the right thing with the input...)
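The general rule proposed above can be sketched in pure Python. The helper `head` is hypothetical and exists only for this illustration; it tests for "is not str" rather than "is bytes", which is the ordering the os.listdir() remark suggests.

```python
import array


def head(data, n):
    """Return the first n items: str in gives str out, bytes-like in gives bytes out."""
    if isinstance(data, str):
        return data[:n]
    # Any object exposing the buffer protocol is read through a memoryview
    # and copied into a new bytes object, whatever its concrete type.
    return bytes(memoryview(data)[:n])


results = [
    head("hello", 2),
    head(b"hello", 2),
    head(bytearray(b"hello"), 2),
    head(memoryview(b"hello"), 2),
    head(array.array("b", b"hello"), 2),
]
print(results)
```

Only the one-byte-per-item inputs named in the comment are exercised here; buffers with larger item sizes would be sliced by item rather than by byte.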
[issue18468] re.group() should never return a bytearray
New submission from Guido van Rossum:

I discovered that the Python 3 version of the re module's Match object behaves subtly differently from the Python 2 version when the target string (i.e. the haystack, not the needle) is a buffer object.

In Python 2, the type of the return value of group() is always either a Unicode string or an 8-bit string, and the type is determined by looking at the target string -- if the target is unicode, group() returns a unicode string, otherwise, group() returns an 8-bit string. In particular, if the target is a buffer object, group() returns an 8-bit string. I think this is the appropriate behavior: otherwise using regular expression matching to extract a small substring from a large target string would unnecessarily keep the large target string alive as long as the substring is alive.

But in Python 3, the behavior of group() has changed so that its return type always matches that of the target string. I think this is bad -- apart from the lifetime concern, it means that if your target happens to be a bytearray, the return value isn't even hashable!

Proper behavior should be that .group() returns a bytes object if the input was binary data and a str object if the input was unicode data (str), regardless of the specific type that holds the target data.

Probably not much, if anything, would be depending on getting a bytearray out of that.

Fix this in 3.4? 3.3 and earlier users are stuck with an extra bytes() call and data copy in these cases.

[Further discussion at http://mail.python.org/pipermail/python-dev/2013-July/127332.html]

--
components: Library (Lib)
messages: 193136
nosy: gvanrossum
priority: normal
severity: normal
stage: needs patch
status: open
title: re.group() should never return a bytearray
type: behavior
versions: Python 3.4
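The hashability complaint in the report can be reproduced with a few lines. This is a sketch with made-up data; the `bytes()` call is the extra step the report mentions for 3.3 and earlier, and it should be harmless where group() already returns bytes.

```python
import re

# A bytearray cannot be hashed, so it cannot be a set member or a dict key.
try:
    hash(bytearray(b"abc"))
    bytearray_hashable = True
except TypeError:
    bytearray_hashable = False

target = bytearray(b"token=abc123")
m = re.search(rb"token=(\w+)", target)

# Normalising to bytes makes the result usable as a key on any version.
token = bytes(m.group(1))
seen = {token}

# The extracted token is an independent copy of the matched slice,
# so emptying the target afterwards does not affect it.
del m
target.clear()
print(bytearray_hashable, token, seen)
```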