
Re: [Python-Dev] Misc re.match() complaint

2013-07-18 Thread Ezio Melotti

Hi,

On Wed, Jul 17, 2013 at 6:15 AM, Stephen J. Turnbull step...@xemacs.org wrote:

 BTW, I suggest that Terry's usage of "string" (to mean str or bytes
 in 3.x, unicode or str in 2.x) be adopted, and Guido's "stringish"
 be given expanded meaning, including buffer objects.

"string" means str, "bytes" means bytes, "bytes-like object" means
any object that supports the buffer protocol [0] (including bytes).
"string and bytes-like object" includes all of them.
I don't think we need to introduce new terms.

Best Regards,
Ezio Melotti

[0]: http://docs.python.org/3/glossary.html#term-bytes-like-object
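
To make the terminology above concrete, here is a minimal sketch (not from the thread) of one rough way to test whether an object is bytes-like: ask memoryview() to wrap it, which raises TypeError for objects that do not export the buffer protocol. The helper name is_bytes_like is hypothetical, chosen only for illustration.

```python
import array


def is_bytes_like(obj):
    """Return True if obj exports the buffer protocol (hypothetical helper)."""
    try:
        # memoryview() only accepts objects that support the buffer protocol.
        memoryview(obj).release()
    except TypeError:
        return False
    return True


samples = [b"abc", bytearray(b"abc"), memoryview(b"abc"),
           array.array("b", [97]), "abc", 42]
results = [is_bytes_like(s) for s in samples]
print(results)
```

Under this check str is not bytes-like, which matches the split between "string" and "bytes-like object" described above.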

  Then we can say
 informally that in searching and matching a target is a stringish, the
 pattern is a stringish (?) or compiled re, but the group method
 returns a string.

 Steve
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Misc re.match() complaint

2013-07-18 Thread Guido van Rossum

On Thu, Jul 18, 2013 at 6:15 AM, Ezio Melotti ezio.melo...@gmail.com wrote:
 I don't think we need to introduce new terms.
+1

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] Misc re.match() complaint

2013-07-18 Thread Terry Reedy



On 7/18/2013 9:15 AM, Ezio Melotti wrote:

In 3.x


"string" means str, "bytes" means bytes, "bytes-like object" means
any object that supports the buffer protocol [0] (including bytes).
"string and bytes-like object" includes all of them.
I don't think we need to introduce new terms.


I agree. We just need to use them consistently, and update docs carried 
over without change from 2.x (like re doc), where 'string' meant 
'unicode or str (bytes)' or even 'unicode and bytes-like'.


Re: [Python-Dev] Misc re.match() complaint

2013-07-17 Thread Terry Reedy


On 7/17/2013 12:15 AM, Stephen J. Turnbull wrote:

Terry Reedy writes:
   On 7/15/2013 10:20 PM, Guido van Rossum wrote:
  
Or is this something deeper, that a group *is* a new object in
principle?
   
No, I just think of it as returning a string
  
   That is exactly what the doc says it does. See my other post.

The problem is that IIUC 'a string' is intentionally *not* referring
to the usual str or bytes objects (at least that's one of the
standard uses for scare quotes, to indicate an unusual usage).


There are no 'scare quotes' in the doc. I put quote marks on things to 
indicate that I was quoting. I do not know how Guido regarded his marks.


 Either

the docstring is using string in a similarly ambiguous way, or else
it's incorrect under the interpretation that buffer objects are *not*
strings, so they should be inadmissible as targets.


Saying that input arguments can be 'Unicode strings as well as 8-bit 
strings' (the wording is from 2.x, carried over to 3.x) does not 
necessarily exclude other inputs. CPython is sometimes more 
permissive than the doc requires. If the doc said str, bytes, bytearray, 
or memoryview, then other implementations would have to do the same to 
be conforming. I do not know if that is intended or not.


The question is whether CPython should be just as permissive as to the 
output types of .group(). (And what, if any requirement should be 
imposed on other implementations.)


 Something

should be fixed, and I suppose it should be the return type of group().

BTW, I suggest that Terry's usage of string (to mean str or bytes
in 3.x, unicode or str in 2.x) be adopted, and Guido's stringish


This word is an adjective, not a noun.


be given expanded meaning, including buffer objects.  Then we can say
informally that in searching and matching a target is a stringish, the
pattern is a stringish (?) or compiled re, but the group method
returns a string.


Guido's idea to fix (tighten up) the output in 3.4 is fine with me.

--
Terry Jan Reedy


Re: [Python-Dev] Misc re.match() complaint

2013-07-17 Thread Stephen J. Turnbull

Terry Reedy writes:

   stringish
  
  This word is an adjective, not a noun.

Ah, a strict grammarian.  That trumps any cards I could possibly play.





Re: [Python-Dev] Misc re.match() complaint

2013-07-17 Thread Serhiy Storchaka


16.07.13 20:21, Guido van Rossum wrote:

The situation is most egregious if the target string is a bytearray,
where there is currently no way to get the result as an immutable
bytes object without an extra copy. (There's no API that lets you
create a bytes object directly from a slice of a bytearray.)


def bytes_from_slice(data, low, high):
    # Copy data[low:high] straight into a new bytes object.
    m = memoryview(data)
    if m:
        return m.cast('B')[low:high].tobytes()
    else:
        # cast() doesn't work for empty memoryview
        return b''



Re: [Python-Dev] Misc re.match() complaint

2013-07-17 Thread MRAB


On 17/07/2013 05:15, Stephen J. Turnbull wrote:

Terry Reedy writes:
   On 7/15/2013 10:20 PM, Guido van Rossum wrote:
  
Or is this something deeper, that a group *is* a new object in
principle?
   
No, I just think of it as returning a string
  
   That is exactly what the doc says it does. See my other post.

The problem is that IIUC 'a string' is intentionally *not* referring
to the usual str or bytes objects (at least that's one of the
standard uses for scare quotes, to indicate an unusual usage).  Either
the docstring is using string in a similarly ambiguous way, or else
it's incorrect under the interpretation that buffer objects are *not*
strings, so they should be inadmissible as targets.  Something
should be fixed, and I suppose it should be the return type of group().

BTW, I suggest that Terry's usage of string (to mean str or bytes
in 3.x, unicode or str in 2.x) be adopted, and Guido's stringish
be given expanded meaning, including buffer objects.  Then we can say
informally that in searching and matching a target is a stringish, the
pattern is a stringish (?) or compiled re, but the group method
returns a string.


Instead of "stringish", how about "stringoid"? To me, "stringish" is an
adjective, but "stringoid" can be a noun or an adjective.

According to http://dictionary.reference.com:


-oid

—suffix forming adjectives, —suffix forming nouns
indicating likeness, resemblance, or similarity: anthropoid



Re: [Python-Dev] Misc re.match() complaint

2013-07-16 Thread Stephen J. Turnbull

Guido van Rossum writes:

  I'm not sure I understand you. :-(

My apologies.  This:

   Or is this something deeper, that a group *is* a new object in
   principle?
  
  No, I just think of it as returning a string and I think it's most
  useful if that is always an immutable object, even if the target
  string is some other bytes buffer.

is exactly the kind of answer I was looking for.


Re: [Python-Dev] Misc re.match() complaint

2013-07-16 Thread Nick Coghlan

On 16 July 2013 14:53, Guido van Rossum gu...@python.org wrote:
 Hm. I'd still like to change this, but I understand it's debatable...
 Is the group() method written in C or Python? If it's in C it should
 be simple enough to let it just do a little bit of pointer math and
 construct a bytes object from the given area of memory -- after all,
 it must have a pointer to that memory area in order to do the matching
 in the first place (although I realize the code may be separated by a
 gulf of abstraction :-).

It shouldn't be too bad - I tracked it down through sre_compile, and
everything seems to funnel into match_getslice_by_index [1], so it
should be possible to detect the non-bytes, non-strings there and
coerce them to bytes.

OTOH, you can already get the same effect by explicitly wrapping the
input in memoryview before passing it to re, and then converting the
output to bytes to release the reference to the underlying data, and
doing that doesn't raise ugly backwards compatibility concerns

Cheers,
Nick.

[1] http://hg.python.org/cpython/file/daf9ea42b610/Modules/_sre.c#l3198

--
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] Misc re.match() complaint

2013-07-16 Thread Antoine Pitrou

On Mon, 15 Jul 2013 21:53:42 -0700,
Guido van Rossum gu...@python.org wrote:
 Hm. I'd still like to change this, but I understand it's debatable...
 Is the group() method written in C or Python?

Is there a strong enough use case to change it? I can't say the current
behaviour seems very useful either, but some people may depend on it.
I already find it a bit weird that you're passing a bytearray or
memoryview to re.match(), to be honest :-)

Regards

Antoine.


 If it's in C it should
 be simple enough to let it just do a little bit of pointer math and
 construct a bytes object from the given area of memory -- after all,
 it must have a pointer to that memory area in order to do the matching
 in the first place (although I realize the code may be separated by a
 gulf of abstraction :-).
 
 --Guido
 
 On Mon, Jul 15, 2013 at 8:03 PM, Nick Coghlan ncogh...@gmail.com
 wrote:
  On 16 July 2013 12:20, Guido van Rossum gu...@python.org wrote:
  On Mon, Jul 15, 2013 at 7:03 PM, Stephen J. Turnbull
  step...@xemacs.org wrote:
  Or is this something deeper, that a group *is* a new object in
  principle?
 
  No, I just think of it as returning a string and I think it's
  most useful if that is always an immutable object, even if the
  target string is some other bytes buffer.
 
  FWIW, it feels as if the change in behavior is probably just due to
  how slices work.
 
  I took a look at the way the 2.7 re code works, and the change does
  indeed appear to be due to the difference in the way slices work for
  buffer and memoryview objects:
 
  Slicing a buffer creates an 8-bit string:
 
  >>> buffer(b"abc")[0:1]
  'a'
 
  Slicing a memoryview creates another memoryview:
 
  >>> memoryview(b"abc")[0:1]
  <memory at 0x7f3320541b98>
 
  Unfortunately, memoryview doesn't currently allow subclasses, so it
  isn't easy to create a derivative that coerces to bytes on
  slicing :(
 
  Cheers,
  Nick.
 
  --
  Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
 
 
 




Re: [Python-Dev] Misc re.match() complaint

2013-07-16 Thread Terry Reedy


On 7/15/2013 7:14 PM, Guido van Rossum wrote:

In a discussion about mypy I discovered that the Python 3 version of
the re module's Match object behaves subtly different from the Python
2 version when the target string (i.e. the haystack, not the needle)
is a buffer object.

In Python 2, the type of the return value of group() is always either
a Unicode string or an 8-bit string, and the type is determined by
looking at the target string -- if the target is unicode, group()
returns a unicode string, otherwise, group() returns an 8-bit string.
In particular, if the target is a buffer object, group() returns an
8-bit string. I think this is the appropriate behavior: otherwise
using regular expression matching to extract a small substring from a
large target string would unnecessarily keep the large target string
alive as long as the substring is alive.

But in Python 3, the behavior of group() has changed so that its
return type always matches that of the target string. I think this is
bad -- apart from the lifetime concern, it means that if your target
happens to be a bytearray, the return value isn't even hashable!

Does anyone remember whether this was a conscious decision? Is it too
late to fix?


In both Python 2 and Python 3, the second sentence of the docs is 'Both 
patterns and strings to be searched can be Unicode strings as well as 
8-bit strings.' The Python 3 version goes on to say that patterns and 
targets must match: 'However, Unicode strings and 8-bit strings cannot 
be mixed.' I normally consider '8-bit string' to mean 'bytes'. It 
certainly meant that in Python 2. We use 'buffer object' or 'object 
satisfying the buffer protocol' to mean 'bytes, bytearrays, or 
memoryviews'.
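
As a small illustration of the "cannot be mixed" rule quoted above, the following sketch (not from the thread) tries all four pattern/target combinations in Python 3. The helper name try_match is hypothetical; it only records whether re raised TypeError.

```python
import re


def try_match(pattern, target):
    """Return the matched text, or the name of the exception re raised."""
    try:
        m = re.match(pattern, target)
    except TypeError as exc:
        return type(exc).__name__
    return m.group() if m else None


str_on_str = try_match("a+", "aab")        # str pattern, str target
bytes_on_bytes = try_match(b"a+", b"aab")  # bytes pattern, bytes target
str_on_bytes = try_match("a+", b"aab")     # mixed: rejected
bytes_on_str = try_match(b"a+", "aab")     # mixed: rejected
print(str_on_str, bytes_on_bytes, str_on_bytes, bytes_on_str)
```

Only the two same-kind combinations produce a match; both mixed ones are rejected before any matching happens.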


I wonder if the change was an artifact of changing the code to prohibit 
mixing Unicode and bytes.


Going on

match.group([group1, ...])
Returns one or more subgroups of the match. If there is a single 
argument, the result is a single string;


In both 2.x and 3.x docs, I usually understand generic 'string' to mean 
'Unicode or bytes'. In any case, the sentence and a half from 'Returns' 
to 'string' is *exactly the same* as in the 2.x docs. As near as I could 
tell, the rest of the entry for match.group is unchanged 
from 2.x to 3.x. So it is easy to think that the behavior change is an 
unintended regression.


--
Terry Jan Reedy



Re: [Python-Dev] Misc re.match() complaint

2013-07-16 Thread Nick Coghlan

On 16 July 2013 19:18, Terry Reedy tjre...@udel.edu wrote:
 I wonder if the change was an artifact of changing the code to prohibit
 mixing Unicode and bytes.

I'm pretty sure the only thing we changed in 3.x is to migrate re
to the PEP 3118 buffer API, and the behavioural change Guido is seeing
is actually the one between the 2.x buffer (which returns 8-bit
strings when sliced) and other types (including memoryview) which
return instances of themselves.

Getting the old buffer behaviour in 3.x without an extra copy
operation should just be a matter of wrapping the input with
memoryview (to avoid copying the group elements in the match object)
and the output with bytes (to avoid keeping the entire original object
alive just to reference a few small pieces of it that were matched by
the regex):

>>> import re
>>> data = bytearray(b"aaabbbcccddd")
>>> re.match(b"(a*)b*c*(d*)", data).group(2)
bytearray(b'ddd')
>>> bytes(re.match(b"(a*)b*c*(d*)", memoryview(data)).group(2))
b'ddd'

Given that, I'm inclined to keep the existing behaviour on backwards
compatibility grounds. To make the above code work on both 2.x *and*
3.x without making an extra copy, it's possible to keep the bytes call
(it should be a no-op on 2.x) and dynamically switch the type used to
wrap the input between buffer in 2.x and memoryview in 3.x
(unfortunately, the 2.x memoryview doesn't work for this case, as the
2.x re API doesn't accept it as valid input).
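
The 2.x/3.x switch described above could be sketched roughly as follows. This is an illustration, not code from the thread: the names wrap and first_group_copy are hypothetical, and the 2.x branch is an assumption based on the description (buffer on 2.x, memoryview on 3.x).

```python
import re
import sys

# Python 2 has buffer(); Python 3 has only memoryview. The name `buffer` is
# evaluated only when running on 2.x, so this line does not fail on 3.x.
wrap = buffer if sys.version_info[0] == 2 else memoryview


def first_group_copy(pattern, data, group=1):
    """Match against a zero-copy view of data, return an independent bytes copy."""
    m = re.match(pattern, wrap(data))
    if m is None:
        return None
    # bytes() drops any reference to the large underlying buffer.
    return bytes(m.group(group))


data = bytearray(b"aaabbbcccddd")
result = first_group_copy(b"(a*)b*c*(d*)", data, group=2)
print(result)
```

The bytes() call is kept unconditionally so the caller gets the same immutable type on either major version.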

Cheers,
Nick.

--
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] Misc re.match() complaint

2013-07-16 Thread Guido van Rossum

On Tue, Jul 16, 2013 at 12:55 AM, Antoine Pitrou solip...@pitrou.net wrote:
 Is there a strong enough use case to change it? I can't say the current
 behaviour seems very useful either, but some people may depend on it.

This is the crucial question. I personally see the current behavior as
an artifact of the (lack of) design process, not as a conscious
decision. Given that we also have m.string, m.start(grp) and
m.end(grp), those who need something matching the original type (or
even something that is known to be a reference into the original
object) can use that API; for most use cases, all you care about is
the selected group as a string, and it is more useful if that is
always an immutable string (bytes or str).

The situation is most egregious if the target string is a bytearray,
where there is currently no way to get the result as an immutable
bytes object without an extra copy. (There's no API that lets you
create a bytes object directly from a slice of a bytearray.)

In terms of backwards compatibility, I wouldn't want to do this in a
bugfix release, but for a feature release I think it's fine -- the
number of applications that could be bitten by this must be extremely
small (and the work-around is backward-compatible: just use
m.string[m.start() : m.end()]).
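
A minimal sketch of that work-around (not from the thread), using the match object's string, start() and end() members: slicing m.string keeps the type of the original target, while an explicit bytes() call yields an immutable copy.

```python
import re

data = bytearray(b"aaabbbcccddd")
m = re.match(b"(a*)b*c*(d*)", data)

# m.string is the object that was passed in, so slicing it keeps its type.
same_type = m.string[m.start(2):m.end(2)]

# An explicit bytes() copy gives an immutable, hashable value instead.
immutable = bytes(same_type)
print(type(same_type).__name__, immutable)
```

This does not depend on what group() itself returns, so it behaves the same before and after any change to group().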

 I already find it a bit weird that you're passing a bytearray or
 memoryview to re.match(), to be honest :-)

Yes, this is somewhat of an odd corner, but actually most built-in
APIs taking bytes also take anything else that can be coerced to bytes
(io.open() seems to be the exception, and it feels like an accident --
os.open() *does* accept bytearray and friends). This is quite useful
for code that interacts with C code or system calls -- often you have
a large buffer shared between C and Python code for efficiency, and
being able to do pretty much anything to the buffer that you can do to
a bytes object (apart from using it as a dict key) helps a lot.
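
A few examples of that permissiveness, as a sketch (not from the thread) using only standard-library calls that accept bytes-like arguments. The sample request string is made up for illustration.

```python
import hashlib

buf = bytearray(b"GET /index.html")
view = memoryview(buf)

joined = b" ".join([buf, view[:3]])        # bytes.join accepts bytes-like items
found = b"GET /index.html".find(view[4:])  # the search argument may be bytes-like
digest_same = (hashlib.sha256(buf).digest()
               == hashlib.sha256(bytes(buf)).digest())

# The one thing a mutable buffer cannot do is serve as a dict key.
try:
    {buf: 1}
    hashable = True
except TypeError:
    hashable = False

print(joined, found, digest_same, hashable)
```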

-- 
--Guido van Rossum (python.org/~guido)

[Python-Dev] Misc re.match() complaint

2013-07-15 Thread Guido van Rossum

In a discussion about mypy I discovered that the Python 3 version of
the re module's Match object behaves subtly different from the Python
2 version when the target string (i.e. the haystack, not the needle)
is a buffer object.

In Python 2, the type of the return value of group() is always either
a Unicode string or an 8-bit string, and the type is determined by
looking at the target string -- if the target is unicode, group()
returns a unicode string, otherwise, group() returns an 8-bit string.
In particular, if the target is a buffer object, group() returns an
8-bit string. I think this is the appropriate behavior: otherwise
using regular expression matching to extract a small substring from a
large target string would unnecessarily keep the large target string
alive as long as the substring is alive.

But in Python 3, the behavior of group() has changed so that its
return type always matches that of the target string. I think this is
bad -- apart from the lifetime concern, it means that if your target
happens to be a bytearray, the return value isn't even hashable!

Does anyone remember whether this was a conscious decision? Is it too
late to fix?
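
For code that has to cope with either behaviour, a small normalising helper is one option. The sketch below is not from the thread and the name group_immutable is hypothetical; it copies only when group() hands back something other than str or bytes. (As far as I know, Python 3.4 and later return bytes for any bytes-like target, in which case the helper never copies.)

```python
import re


def group_immutable(match, group=0):
    """Return a match group as str or bytes, copying only when needed."""
    value = match.group(group)
    if value is None or isinstance(value, (str, bytes)):
        return value
    return bytes(value)  # bytearray, memoryview, ... -> immutable copy


target = bytearray(b"key=value")
m = re.match(rb"(\w+)=(\w+)", target)
key = group_immutable(m, 1)
cache = {key: group_immutable(m, 2)}  # works: bytes is hashable
print(cache)
```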

--
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] Misc re.match() complaint

2013-07-15 Thread Gregory P. Smith

On Mon, Jul 15, 2013 at 4:14 PM, Guido van Rossum gu...@python.org wrote:

 In a discussion about mypy I discovered that the Python 3 version of
 the re module's Match object behaves subtly different from the Python
 2 version when the target string (i.e. the haystack, not the needle)
 is a buffer object.

 In Python 2, the type of the return value of group() is always either
 a Unicode string or an 8-bit string, and the type is determined by
 looking at the target string -- if the target is unicode, group()
 returns a unicode string, otherwise, group() returns an 8-bit string.
 In particular, if the target is a buffer object, group() returns an
 8-bit string. I think this is the appropriate behavior: otherwise
 using regular expression matching to extract a small substring from a
 large target string would unnecessarily keep the large target string
 alive as long as the substring is alive.

 But in Python 3, the behavior of group() has changed so that its
 return type always matches that of the target string. I think this is
 bad -- apart from the lifetime concern, it means that if your target
 happens to be a bytearray, the return value isn't even hashable!

 Does anyone remember whether this was a conscious decision? Is it too
 late to fix?


Hmm, that is not what I'd expect either. I would never expect it to return
a bytearray; I'd normally assume that .group() returned a bytes object if
the input was binary data and a str object if the input was unicode data
(str) regardless of specific types containing the input target data.

I'm going to hazard a guess that not much, if anything, would be depending
on getting a bytearray out of that. Fix this in 3.4? 3.3 and earlier users
are stuck with an extra bytes() call and data copy in these cases I guess.

-gps

Re: [Python-Dev] Misc re.match() complaint

2013-07-15 Thread MRAB


On 16/07/2013 00:30, Gregory P. Smith wrote:


On Mon, Jul 15, 2013 at 4:14 PM, Guido van Rossum gu...@python.org wrote:

In a discussion about mypy I discovered that the Python 3 version of
the re module's Match object behaves subtly different from the Python
2 version when the target string (i.e. the haystack, not the needle)
is a buffer object.

In Python 2, the type of the return value of group() is always either
a Unicode string or an 8-bit string, and the type is determined by
looking at the target string -- if the target is unicode, group()
returns a unicode string, otherwise, group() returns an 8-bit string.
In particular, if the target is a buffer object, group() returns an
8-bit string. I think this is the appropriate behavior: otherwise
using regular expression matching to extract a small substring from a
large target string would unnecessarily keep the large target string
alive as long as the substring is alive.

But in Python 3, the behavior of group() has changed so that its
return type always matches that of the target string. I think this is
bad -- apart from the lifetime concern, it means that if your target
happens to be a bytearray, the return value isn't even hashable!

Does anyone remember whether this was a conscious decision? Is it too
late to fix?


Hmm, that is not what I'd expect either. I would never expect it to
return a bytearray; I'd normally assume that .group() returned a bytes
object if the input was binary data and a str object if the input was
unicode data (str) regardless of specific types containing the input
target data.

I'm going to hazard a guess that not much, if anything, would be
depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and
earlier users are stuck with an extra bytes() call and data copy in
these cases I guess.


I'm not sure I understand the complaint.

I get this for Python 2.7:

Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import array
>>> import re
>>> re.match(r"a", array.array("b", "a")).group()
array('b', [97])

It's the same even in Python 2.4.


Re: [Python-Dev] Misc re.match() complaint

2013-07-15 Thread Nick Coghlan

On 16 Jul 2013 09:17, Guido van Rossum gu...@python.org wrote:

 Does anyone remember whether this was a conscious decision?

I doubt it was a conscious decision - an unfortunate amount of the standard
library's handling of the text model change falls into the category of
implementation accident :(

 Is it too
 late to fix?

Like Greg, I'm comfortable with the idea of calling "bug" on this one,
fixing it in 3.4 and making a note in the "Porting to Python 3.4" section
of the What's New guide.

Cheers,
Nick.



Re: [Python-Dev] Misc re.match() complaint

2013-07-15 Thread Guido van Rossum

On Mon, Jul 15, 2013 at 5:10 PM, MRAB pyt...@mrabarnett.plus.com wrote:
 On 16/07/2013 00:30, Gregory P. Smith wrote:


 On Mon, Jul 15, 2013 at 4:14 PM, Guido van Rossum gu...@python.org wrote:

 In a discussion about mypy I discovered that the Python 3 version of
 the re module's Match object behaves subtly different from the Python
 2 version when the target string (i.e. the haystack, not the needle)
 is a buffer object.

 In Python 2, the type of the return value of group() is always either
 a Unicode string or an 8-bit string, and the type is determined by
 looking at the target string -- if the target is unicode, group()
 returns a unicode string, otherwise, group() returns an 8-bit string.
 In particular, if the target is a buffer object, group() returns an
 8-bit string. I think this is the appropriate behavior: otherwise
 using regular expression matching to extract a small substring from a
 large target string would unnecessarily keep the large target string
 alive as long as the substring is alive.

 But in Python 3, the behavior of group() has changed so that its
 return type always matches that of the target string. I think this is
 bad -- apart from the lifetime concern, it means that if your target
 happens to be a bytearray, the return value isn't even hashable!

 Does anyone remember whether this was a conscious decision? Is it too
 late to fix?


 Hmm, that is not what I'd expect either. I would never expect it to
 return a bytearray; I'd normally assume that .group() returned a bytes
 object if the input was binary data and a str object if the input was
 unicode data (str) regardless of specific types containing the input
 target data.

 I'm going to hazard a guess that not much, if anything, would be
 depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and
 earlier users are stuck with an extra bytes() call and data copy in
 these cases I guess.

 I'm not sure I understand the complaint.

 I get this for Python 2.7:

 Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import array
 >>> import re
 >>> re.match(r"a", array.array("b", "a")).group()
 array('b', [97])

 It's the same even in Python 2.4.

Ah, but now try it with buffer():

>>> re.search('yz+', buffer('xyzzy')).group()
'yzz'


The equivalent in Python 3 (using memoryview) returns a memoryview:

>>> re.search(b'yz+', memoryview(b'xyzzy')).group()
<memory at 0x10d03a688>


And I still think that any return type for group() except bytes or str
is wrong. (Except possibly a subclass of these.)

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] Misc re.match() complaint

2013-07-15 Thread Stephen J. Turnbull

Guido van Rossum writes:

  And I still think that any return type for group() except bytes or str
  is wrong. (Except possibly a subclass of these.)

I'm not sure I understand.  Do you mean in the context of the match
object API, where constructing (target, match.start(), match.end())
to get a group-like object that refers to the target rather than
copying the text is simple?  (Such objects are very useful in the
restricted application of constructing a programmable text editor.)

Or is this something deeper, that a group *is* a new object in
principle?


Re: [Python-Dev] Misc re.match() complaint

2013-07-15 Thread MRAB


On 16/07/2013 01:25, Guido van Rossum wrote:

On Mon, Jul 15, 2013 at 5:10 PM, MRAB pyt...@mrabarnett.plus.com wrote:

On 16/07/2013 00:30, Gregory P. Smith wrote:



On Mon, Jul 15, 2013 at 4:14 PM, Guido van Rossum gu...@python.org wrote:

In a discussion about mypy I discovered that the Python 3 version of
the re module's Match object behaves subtly different from the Python
2 version when the target string (i.e. the haystack, not the needle)
is a buffer object.

In Python 2, the type of the return value of group() is always either
a Unicode string or an 8-bit string, and the type is determined by
looking at the target string -- if the target is unicode, group()
returns a unicode string, otherwise, group() returns an 8-bit string.
In particular, if the target is a buffer object, group() returns an
8-bit string. I think this is the appropriate behavior: otherwise
using regular expression matching to extract a small substring from a
large target string would unnecessarily keep the large target string
alive as long as the substring is alive.

But in Python 3, the behavior of group() has changed so that its
return type always matches that of the target string. I think this is
bad -- apart from the lifetime concern, it means that if your target
happens to be a bytearray, the return value isn't even hashable!

Does anyone remember whether this was a conscious decision? Is it too
late to fix?


Hmm, that is not what I'd expect either. I would never expect it to
return a bytearray; I'd normally assume that .group() returned a bytes
object if the input was binary data and a str object if the input was
unicode data (str) regardless of specific types containing the input
target data.

I'm going to hazard a guess that not much, if anything, would be
depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and
earlier users are stuck with an extra bytes() call and data copy in
these cases I guess.


I'm not sure I understand the complaint.

I get this for Python 2.7:

Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import array
>>> import re
>>> re.match(r"a", array.array("b", "a")).group()
array('b', [97])

It's the same even in Python 2.4.


Ah, but now try it with buffer():


>>> re.search('yz+', buffer('xyzzy')).group()
'yzz'




The equivalent in Python 3 (using memoryview) returns a memoryview:


>>> re.search(b'yz+', memoryview(b'xyzzy')).group()
<memory at 0x10d03a688>




And I still think that any return type for group() except bytes or str
is wrong. (Except possibly a subclass of these.)


On the other hand, I think that it's not unreasonable that the output
is the same type as the input. You could reason that what it's doing is
returning a slice of the input, and that slice should be the same type
as its source.

Incidentally, the regex module does what Python 3's re module currently
does, even in Python 2. Nobody's complained!


Re: [Python-Dev] Misc re.match() complaint

2013-07-15 Thread Guido van Rossum

On Mon, Jul 15, 2013 at 7:18 PM, MRAB pyt...@mrabarnett.plus.com wrote:
 On the other hand, I think that it's not unreasonable that the output
 is the same type as the input. You could reason that what it's doing is
 returning a slice of the input, and that slice should be the same type
 as its source.

By now I'm pretty sure that is why it changed.

But I am challenging how useful that is, compared to always returning
something immutable.

 Incidentally, the regex module does what Python 3's re module currently
 does, even in Python 2. Nobody's complained!

Well, you'd only see complaints from folks who (a) use the regex
module, (b) use it with a buffer object as the target string, and (c)
try to use the group() return value as a dict key. Each of these is
probably a small minority of all users.

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] Misc re.match() complaint

2013-07-15 Thread Nick Coghlan

On 16 July 2013 12:20, Guido van Rossum gu...@python.org wrote:
 On Mon, Jul 15, 2013 at 7:03 PM, Stephen J. Turnbull step...@xemacs.org 
 wrote:
 Or is this something deeper, that a group *is* a new object in
 principle?

 No, I just think of it as returning a string and I think it's most
 useful if that is always an immutable object, even if the target
 string is some other bytes buffer.

 FWIW, it feels as if the change in behavior is probably just due to
 how slices work.

I took a look at the way the 2.7 re code works, and the change does
indeed appear to be due to the difference in the way slices work for
buffer and memoryview objects:

Slicing a buffer creates an 8-bit string:

>>> buffer(b"abc")[0:1]
'a'

Slicing a memoryview creates another memoryview:

>>> memoryview(b"abc")[0:1]
<memory at 0x7f3320541b98>

Unfortunately, memoryview doesn't currently allow subclasses, so it
isn't easy to create a derivative that coerces to bytes on slicing :(
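
The Python 3 side of this can be checked with a short sketch (not from the thread): a memoryview slice stays a view onto the same memory until it is explicitly copied, and memoryview refuses to act as a base class. The class name BytesSlicingView is hypothetical and exists only to trigger the error.

```python
data = bytearray(b"abc")
view = memoryview(data)

piece = view[0:1]         # still a memoryview; shares memory with data
copied = piece.tobytes()  # independent, immutable bytes copy

data[0] = ord("z")        # mutate the underlying buffer
after_mutation = (piece.tobytes(), copied)

# memoryview cannot be used as a base class.
try:
    class BytesSlicingView(memoryview):  # hypothetical name
        pass
    subclassable = True
except TypeError:
    subclassable = False

print(type(piece).__name__, after_mutation, subclassable)
```

The mutation shows up through the slice but not through the earlier copy, which is the lifetime and aliasing concern raised in the thread.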

Cheers,
Nick.

--
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
