Re: [Python-Dev] thoughts on the bytes/string discussion

2010-07-07 Thread Stefan Behnel

Ronald Oussoren, 06.07.2010 16:51:

On 27 Jun, 2010, at 11:48, Greg Ewing wrote:


Stefan Behnel wrote:

Greg Ewing, 26.06.2010 09:58:

Would there be any sanity in having an option to compile Python
with UTF-8 as the internal string representation?

It would break Py_UNICODE, because the internal size of a unicode
character would no longer be fixed.


It's not fixed anyway with the 2-char build -- some characters are
represented using a pair of surrogates.


It is for practical purposes not even fixed in 4-char builds. In 4-char
builds every Unicode code point corresponds to one item in a Python
unicode string, but a base character with combining characters is still
a sequence of characters and should IMHO almost always be treated as a
single object. As an example, given s = u"be\N{COMBINING DIAERESIS}",
s[:2] or s[2:] is almost certainly semantically invalid.


Sure. However, this is not a problem for the purpose of the C-API, 
especially for Cython (which is the angle from which I brought this up). 
All Cython cares about is that it mimics CPython's semantics exactly when 
transforming code, and a CPython runtime will ignore surrogate pairs and 
combining characters during iteration and indexing, and when determining 
the string length. So a single character unicode string can currently be 
safely aliased by Py_UNICODE with correct Python semantics. That would no 
longer be the case if the internal representation switched to UTF-8 and/or 
if CPython started to take surrogates and combining characters into account 
when considering the string length.
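
As a hedged illustration of the semantics described above, the following sketch shows that length and slicing take no account of combining characters. It assumes a current Python 3, where str is indexed by code point (the narrow builds discussed in this thread indexed by UTF-16 code unit instead); the variable names are illustrative only.

```python
import unicodedata

# "e" followed by a combining diaeresis: two code points, one visible character.
s = "be\N{COMBINING DIAERESIS}"

length = len(s)            # counts code points, not user-perceived characters
head, tail = s[:2], s[2:]  # the slice separates the base letter from its mark

# NFC normalization composes the pair into a single precomposed code point.
composed = unicodedata.normalize("NFC", s)

print(length, repr(head), ascii(tail), len(composed))
```

Normalizing first only helps where a precomposed form exists; it is not a general way to make slicing safe around combining sequences.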


Note that it's impossible to determine if a unicode string contains 
surrogate pairs because it's running on a narrow unicode build or because 
the user entered them into the string. But the user would likely expect the 
second case to treat them as separate code points, whereas the first is an 
implementation detail that should normally be invisible. Combining 
characters are a lot clearer here, as they can only be entered by users, so 
keeping them separate as provided is IMHO the expected behaviour.


I think the main theme here is that the interpretation of code points and 
their transformation for user interfaces and backends is left to the user 
code. Py_UNICODE represents a code point in the current system, including 
surrogate pair 'escapes'. And that would change if the underlying encoding 
switched to something other than UTF-16/UCS-4.
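
To make the surrogate pair 'escapes' concrete, here is a minimal sketch that lists the UTF-16 code units a narrow build would have stored for a string. The helper name utf16_code_units is an assumption made for this example, not an existing API, and the code targets a current Python 3.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units that represent `text`."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane
units = utf16_code_units(clef)

# One code point, but two 16-bit code units (a high and a low surrogate).
print(len(clef), [hex(unit) for unit in units])
```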


Stefan

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-07-07 Thread M.-A. Lemburg
Ronald Oussoren wrote:
 
 On 27 Jun, 2010, at 11:48, Greg Ewing wrote:
 
 Stefan Behnel wrote:
 Greg Ewing, 26.06.2010 09:58:
 Would there be any sanity in having an option to compile
 Python with UTF-8 as the internal string representation?
 It would break Py_UNICODE, because the internal size of a unicode character 
 would no longer be fixed.

 It's not fixed anyway with the 2-char build -- some
 characters are represented using a pair of surrogates.
 
 It is for practical purposes not even fixed in 4-char builds. In 4-char 
 builds every Unicode code point corresponds to one item in a Python unicode 
 string, but a base character with combining characters is still a sequence 
 of characters and should IMHO almost always be treated as a single object. As 
 an example, given s = u"be\N{COMBINING DIAERESIS}", s[:2] or s[2:] is almost 
 certainly semantically invalid.

Just to clarify: Python uses code units for Unicode storage.

Whether those code units map to code points or glyphs depends
on the Python build in use and the code points in question.

See
http://www.egenix.com/library/presentations/#PythonAndUnicode
for more background information (esp. page 8).

Note that using UTF-8 as internal storage format would not work
in Python, since Python is a Unicode producer, i.e. it needs to
be able to generate and work with code points that are not allowed
in UTF-8, e.g. lone surrogates.
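
The point about lone surrogates can be checked in a current Python 3. This is a small sketch rather than anything from the original message: a str may hold a lone surrogate, but the strict UTF-8 codec refuses to encode or decode it.

```python
lone = "\ud800"  # a lone high surrogate: a legal str item, but not encodable as UTF-8

try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" error handler emits the generalized three-byte form instead.
relaxed = lone.encode("utf-8", "surrogatepass")

try:
    relaxed.decode("utf-8")
    decodes_strictly = True
except UnicodeDecodeError:
    decodes_strictly = False

print(strict_ok, relaxed, decodes_strictly)
```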

Another reason not to use UTF-8 encoded code units is that slicing
based on code units could easily create invalid UTF-8 which would
then render the data unusable. This is a lot less likely to happen
with UCS2 or UCS4.

And finally: RAM is cheap and today's CPUs work better with 16- or
32-bit values than 8-bit characters.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] thoughts on the bytes/string discussion

2010-07-07 Thread Greg Ewing

M.-A. Lemburg wrote:


Note that using UTF-8 as internal storage format would not work
in Python, since Python is a Unicode producer, i.e. it needs to
be able to generate and work with code points that are not allowed
in UTF-8, e.g. lone surrogates.


Well, it wouldn't strictly be UTF-8, any more than the
2-byte build is strictly UTF-16, in the sense that lone
surrogates can be produced.


Another reason not to use UTF-8 encoded code units is that slicing
based on code units could easily create invalid UTF-8 which would
then render the data unusable. This is a lot less likely to happen
with UCS2 or UCS4.


The use cases I had in mind for a 1-byte build are those for
which the alternative would be keeping everything in bytes.
Applications using a 1-byte build would need to be aware of
the fact and take care to slice strings at valid places. If
they were using bytes, they would have to face exactly the
same issues.
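
What "slicing at valid places" involves can be sketched as follows. The helper safe_utf8_prefix is hypothetical (it is not part of Python or of any proposal in this thread); it backs up over UTF-8 continuation bytes so that a cut never lands inside a multi-byte sequence.

```python
def safe_utf8_prefix(data, limit):
    """Return at most `limit` bytes of `data`, cut on a code point boundary."""
    if limit >= len(data):
        return data
    end = max(limit, 0)
    # Continuation bytes look like 0b10xxxxxx; back up until the cut falls
    # just before a lead byte (or an ASCII byte).
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]

encoded = "na\u00efve".encode("utf-8")   # b'na\xc3\xafve'
naive_cut = encoded[:3]                   # splits the two-byte sequence
safe_cut = safe_utf8_prefix(encoded, 3)   # backs up to b'na'

print(naive_cut, safe_cut)
```

The same care is needed whether the data lives in a bytes object or in a hypothetical 1-byte str build, which is the equivalence being argued here.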


And finally: RAM is cheap and today's CPUs work better with 16- or
32-bit values than 8-bit characters.


Yet some people have reported significant performance benefits
for some applications from using a 2-byte build instead of a
4-byte build. I was just speculating whether a 1-byte build
might be of further advantage in a few specialised cases.

No matter how much RAM or processing speed you have, it's always
possible to find an application that stresses the limits.
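
For a rough sense of the sizes involved, the sketch below compares the payload size of the same text in 1-, 2- and 4-byte-based encodings. It measures encoded bytes only, not the memory of any particular interpreter build, and the helper name encoded_sizes is an assumption for this example.

```python
def encoded_sizes(text):
    """Map encoding name to the number of bytes needed for `text`."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}

ascii_sizes = encoded_sizes("identifier-12345")     # pure ASCII
mixed_sizes = encoded_sizes("Gr\u00fc\u00dfe, \u4e16\u754c")  # Latin-1 and CJK characters

print(ascii_sizes)
print(mixed_sizes)
```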

--
Greg



Re: [Python-Dev] thoughts on the bytes/string discussion

2010-07-07 Thread Antoine Pitrou
On Wed, 07 Jul 2010 11:13:09 +0200
M.-A. Lemburg m...@egenix.com wrote:
 
 And finally: RAM is cheap and today's CPUs work better with 16- or
 32-bit values than 8-bit characters.

The latter is wrong. There is no cost in accessing bytes
rather than words on modern CPUs.
(actually, bytes are cheaper overall since they cost less cache)

Regards

Antoine.




Re: [Python-Dev] thoughts on the bytes/string discussion

2010-07-07 Thread Stephen J. Turnbull
Greg Ewing writes:

  The use cases I had in mind for a 1-byte build are those for
  which the alternative would be keeping everything in bytes.
  Applications using a 1-byte build would need to be aware of
  the fact and take care to slice strings at valid places. If
  they were using bytes, they would have to face exactly the
  same issues.

In other words, the people who want to use bytes have no less pain,
and the people who want to use characters suffer much greater pain.
How can this be a win?  If you live in an ASCII-only world, there are
a few APIs where bytes aren't allowed, and indeed it would be a win to
use those APIs on ASCII-encoded bytestrings.  And I don't mean
ISO-8859-1-only, either; UTF-8 is not compatible with ISO-8859-1 at
the byte level.
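
The byte-level incompatibility is easy to demonstrate in a current Python 3. A minimal sketch, with illustrative variable names:

```python
text = "caf\u00e9"

latin1 = text.encode("iso-8859-1")   # one byte for the accented letter
utf8 = text.encode("utf-8")          # two bytes for the same letter

# For pure ASCII text the two encodings agree byte for byte.
ascii_same = "cafe".encode("iso-8859-1") == "cafe".encode("utf-8")

try:
    latin1.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

# Reading UTF-8 bytes as ISO-8859-1 never fails, but yields mojibake.
misread = utf8.decode("iso-8859-1")

print(latin1, utf8, ascii_same, latin1_is_valid_utf8, misread)
```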

But the proposal Guido supports would address that by making those
APIs polymorphic.

   And finally: RAM is cheap and today's CPUs work better with 16- or
   32-bit values than 8-bit characters.
  
  Yet some people have reported significant performance benefits
  for some applications from using a 2-byte build instead of a
  4-byte build. I was just speculating whether a 1-byte build
  might be of further advantage in a few specialised cases.

Of course it would be.  But as soon as you want to do *any* I/O in
text mode with non-ASCII characters, you're in real pain.  What do you
do if a user cut/pastes some text containing proper quotation marks or
an en-dash at a prompt in a terminal?  So polymorphism is a far better
way to optimize those special cases, as it allows a byte string in any
encoding to be treated as text, not just UTF-8.


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-07-06 Thread Ronald Oussoren

On 27 Jun, 2010, at 11:48, Greg Ewing wrote:

 Stefan Behnel wrote:
 Greg Ewing, 26.06.2010 09:58:
 Would there be any sanity in having an option to compile
 Python with UTF-8 as the internal string representation?
 It would break Py_UNICODE, because the internal size of a unicode character 
 would no longer be fixed.
 
 It's not fixed anyway with the 2-char build -- some
 characters are represented using a pair of surrogates.

It is for practical purposes not even fixed in 4-char builds. In 4-char builds 
every Unicode code point corresponds to one item in a Python unicode string, 
but a base character with combining characters is still a sequence of 
characters and should IMHO almost always be treated as a single object. As an 
example, given s = u"be\N{COMBINING DIAERESIS}", s[:2] or s[2:] is almost 
certainly semantically invalid.

Ronald





Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-27 Thread Greg Ewing

Stefan Behnel wrote:

Greg Ewing, 26.06.2010 09:58:


Would there be any sanity in having an option to compile
Python with UTF-8 as the internal string representation?


It would break Py_UNICODE, because the internal size of a unicode 
character would no longer be fixed.


It's not fixed anyway with the 2-char build -- some
characters are represented using a pair of surrogates.

--
Greg


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-27 Thread Eric Smith

On 6/27/2010 5:48 AM, Greg Ewing wrote:

Stefan Behnel wrote:

Greg Ewing, 26.06.2010 09:58:


Would there be any sanity in having an option to compile
Python with UTF-8 as the internal string representation?


It would break Py_UNICODE, because the internal size of a unicode
character would no longer be fixed.


It's not fixed anyway with the 2-char build -- some
characters are represented using a pair of surrogates.



But isn't this currently ignored everywhere in python's code?

Eric.



Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-27 Thread Greg Ewing

Eric Smith wrote:


But isn't this currently ignored everywhere in python's code?


It's true that code using a utf-8 build would have to be
aware of the fact much more often. But I'm thinking of
applications that would otherwise want to keep all their
strings encoded to save memory. If they do that, they
also need to deal with sequence items not corresponding
to characters. If they can handle that, they may be able
to handle utf-8 just as well.

--
Greg


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-27 Thread R. David Murray
On Fri, 25 Jun 2010 15:40:52 -0700, Bill Janssen jans...@parc.com wrote:
 Guido van Rossum gu...@python.org wrote:
  So you're really just worried about space consumption. I'd like to see
  a lot of hard memory profiling data before I got overly worried about
  that.
 
 While I've seen some big Web pages, I think the email folks, who often
 have to process messages with attachments measuring in the tens of
 megabytes, have the stronger problems here, and I think speed may be
 more important than memory.  I've built both a Web server and an IMAP
 server in Python, and the IMAP server is where the issues of storage
 management really prevail.  If you have to convert a 20 MB encoded
 string into a Unicode string just to look at the headers as strings, you
 have issues.  (The Python email package doesn't do that, by the way.)

Unfortunately in the current Python3 email package (email5), this is no
longer true.  You have to decode everything *first* in order to pass it
through email (which presents a few problems when dealing with 8bit data,
as has been mentioned here before).

email6 intends to fix this, and once again allow you to decode to text
only the bits you actually need to access and manipulate.
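
As a hedged sketch of "decode only the bits you need", the example below uses the email package as it exists in current Python 3 (not the email5 or email6 code discussed here). It splits the header block off at the bytes level and hands only that part to the parser, so the body is never decoded. The message content is made up for the example.

```python
from email.parser import BytesParser

raw = (b"From: sender@example.com\r\n"
       b"Subject: Quarterly report\r\n"
       b"\r\n" + b"\xff" * 1024)        # stand-in for a large 8-bit body

# Split once at the blank line; the body stays as untouched bytes.
header_block, separator, body = raw.partition(b"\r\n\r\n")

# Only the (small) header block is parsed and turned into text.
headers = BytesParser().parsebytes(header_block + separator, headersonly=True)
subject = headers["Subject"]

print(subject, type(body).__name__, len(body))
```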

--
R. David Murray  www.bitdance.com


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-26 Thread Greg Ewing

Tres Seaver wrote:


I do know for a fact that using a UCS2-compiled Python instead of the
system's UCS4-compiled Python leads to a measurable, noticeable drop in
memory consumption of long-running webserver processes using Unicode


Would there be any sanity in having an option to compile
Python with UTF-8 as the internal string representation?

--
Greg



Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-26 Thread Stefan Behnel

Ian Bicking, 26.06.2010 00:26:

On Fri, Jun 25, 2010 at 4:02 PM, Guido van Rossum wrote:

On Fri, Jun 25, 2010 at 1:43 PM, Glyph Lefkowitz

I'd like a version of 'decode' which would give me a type that was, in every
respect, unicode, and responded to all protocols exactly as other
unicode objects (or str objects, if you prefer py3 nomenclature ;-)) do,
but wouldn't actually copy any of that memory unless it really needed to
(for example, to pass to a C API that expected native wide characters), and
that would hold on to the original bytes so that it could produce them on
demand if encoded to the same encoding again. So, as others in this thread
have mentioned, the 'ABC' really implies some stuff about C APIs as well.


Well, there's the buffer API, so you can already create something that 
refers to an existing C buffer. However, with respect to a string, you will 
have to make sure the underlying buffer doesn't get freed while the string 
is still in use. That will be hard and sometimes impossible to do at the 
C-API level, even if the string is allowed to keep a reference to something 
that holds the buffer.
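
A Python-level analogue of this lifetime problem can be sketched with memoryview and bytearray. This is only an illustration of the buffer protocol in a current CPython 3, not of lxml or of the C API: while a view is exported, the owner refuses to resize, because resizing could move the memory the view points at.

```python
buffer = bytearray(b"<root>text</root>")
view = memoryview(buffer)[6:10]      # no copy: a window onto the same memory

buffer[6:10] = b"TEXT"               # a same-size write is visible through the view
seen = bytes(view)

try:
    buffer.extend(b"<!-- more -->")  # resizing could move the memory
    resized = True
except BufferError:
    resized = False                  # refused while the view is still exported

view.release()
buffer.extend(b"!")                  # allowed again once the view is released

print(seen, resized, bytes(buffer))
```

A C library that frees or reallocates its own buffers offers no such guard, which is the difficulty described above.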


At least in lxml, such a feature would be completely worthless, as text is 
never held by any ref-counted Python wrapper object. It's only part of the 
XML tree, which is allowed to change at (more or less) any time, so the 
underlying char* buffer could just get freed without further notice. Adding 
a guard against that would likely have a larger impact on the performance 
than the decoding operations.




I'm not sure about the exact performance impact of such a class, which is
why I'd like the ability to implement it *outside* of the stdlib and see how
it works on a project, and return with a proposal along with some data.
There are also different ways to implement this, and other optimizations
(like ropes) which might be better.

You can almost do this today, but the lack of things like the hypothetical
__rcontains__ does make it impossible to be totally transparent about it.

But you'd still have to validate it, right? You wouldn't want to go on
using what you thought was wrapped UTF-8 if it wasn't actually valid
UTF-8 (or you'd be worse off than in Python 2). So you're really just
worried about space consumption. I'd like to see a lot of hard memory
profiling data before I got overly worried about that.


It wasn't my profiling, but I seem to recall that Fredrik Lundh specifically
benchmarked ElementTree with all-unicode and sometimes-ascii-bytes, and
found that using Python 2 strs in some cases provided notable advantages.  I
know Stefan copied ElementTree in this regard in lxml, maybe he also did a
benchmark or knows of one?


Actually, bytes vs. unicode doesn't make that big a difference in Py2 for 
lxml. ElementTree is a lot older, so I guess it made a larger difference 
when its code was written (and I even think I recall seeing numbers for 
lxml where it seemed to make a notable difference).


In lxml, text content is stored in the C tree of libxml2 as UTF-8 encoded 
char* text. On request, lxml creates a string object from it and returns 
it. In Py2, it checks for plain ASCII content first and returns a byte 
string for that. Only non-ASCII strings are returned as decoded unicode 
strings. In Py3, it always returns unicode strings.
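
The Py2 behaviour described above can be paraphrased in Python 3 terms as follows. This is a sketch under stated assumptions, not lxml's actual code: the function name py2_style_text is made up, and bytes stands in for the Py2 byte string.

```python
def py2_style_text(raw):
    """Return `raw` unchanged if it is plain ASCII, else decode it as UTF-8."""
    try:
        raw.decode("ascii")          # cheap check: does it contain only ASCII?
    except UnicodeDecodeError:
        return raw.decode("utf-8")   # non-ASCII content becomes a text string
    return raw                       # ASCII content is handed back as bytes

ascii_result = py2_style_text(b"12345")
unicode_result = py2_style_text("\u00fcber".encode("utf-8"))

print(type(ascii_result).__name__, type(unicode_result).__name__)
```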


When I run a little benchmark on lxml in Py2.6.5 that just reads some short 
text content from an Element object, I only see a tiny difference between 
unicode strings and byte strings. The gap obviously increases when the text 
gets longer, e.g. when I serialise the complete text content of an XML 
document to either a byte string or a unicode string. But even for 
documents in the megabyte range we are still talking about single 
milliseconds here, and the difference stays well below 10%. It's seriously 
hard to make that the performance bottleneck in an XML application.


Also, since the string objects are only instantiated on request, memory 
isn't an issue either. That's different for (c)ElementTree again, where 
string content is stored as Python objects. Four times the size even for 
plain ASCII strings (e.g. numbers, IDs or even trailing whitespace!) can 
well become a problem there, and can easily dominate the overall size of 
the in-memory tree. Plain ASCII content is surprisingly common in XML 
documents.


Stefan



Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-26 Thread Stefan Behnel

Greg Ewing, 26.06.2010 09:58:

Tres Seaver wrote:


I do know for a fact that using a UCS2-compiled Python instead of the
system's UCS4-compiled Python leads to a measurable, noticeable drop in
memory consumption of long-running webserver processes using Unicode


Would there be any sanity in having an option to compile
Python with UTF-8 as the internal string representation?


It would break Py_UNICODE, because the internal size of a unicode character 
would no longer be fixed.


Stefan



Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-26 Thread Stephen J. Turnbull
Greg Ewing writes:

  Would there be any sanity in having an option to compile
  Python with UTF-8 as the internal string representation?

Losing Py_UNICODE as mentioned by Stefan Behnel (IIRC) is just the
beginning of the pain.

If Emacs's experience is any guide, the cost in speed and complexity
of a variable-width internal representation is high.  There are a
number of tricks you can use, but basically everything becomes O(n)
for the natural implementation of most operations (such as indexing by
character).  You can get around that with a position cache, of course,
but that adds complexity, and really cuts into the space saving (and
worse, adds another chunk that may or may not be paged in when you
need it).
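
To see why indexing by character becomes O(n), consider this minimal sketch of looking up the n-th code point in UTF-8 encoded bytes without a position cache. The function utf8_char_at is hypothetical and written for this example only.

```python
def utf8_char_at(data, index):
    """Return the code point at `index` in UTF-8 `data` by scanning from the start."""
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:        # not a continuation byte: a new code point
            count += 1
            if count == index:
                start = pos              # found where the wanted code point begins
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is not None:                # wanted code point was the last one
        return data[start:].decode("utf-8")
    raise IndexError("code point index out of range")

data = "a\u00e9\u20ac\U0001F600".encode("utf-8")   # 1-, 2-, 3- and 4-byte sequences
third = utf8_char_at(data, 2)
last = utf8_char_at(data, 3)
```

Every lookup walks the bytes before the target, whereas a fixed-width representation can jump straight to the item.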

What we're considering is a system where buffers come in 1-, 2-, and
4-octet widechars, with automatic translation depending on content.
But the buffer is the primary random-access structure in Emacsen, so
optimizing it is probably worth our effort.  I doubt it would be worth
it for Python, but my intuitions here are not reliable.
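
For what it is worth, CPython 3.3 and later adopted a scheme along these lines (PEP 393): each str is stored with 1, 2 or 4 bytes per code point depending on its content. The sketch below estimates the per-item width with sys.getsizeof; it is specific to CPython, and the helper name bytes_per_item is an assumption for this example.

```python
import sys

def bytes_per_item(char, count=1000):
    """Estimate storage per code point by growing a string made of `char`."""
    return (sys.getsizeof(char * (count + 1)) - sys.getsizeof(char)) // count

ascii_width = bytes_per_item("a")            # fits in 1 byte per item
bmp_width = bytes_per_item("\u20ac")         # EURO SIGN needs 2 bytes per item
astral_width = bytes_per_item("\U0001F600")  # non-BMP needs 4 bytes per item

print(ascii_width, bmp_width, astral_width)
```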


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-26 Thread Terry Reedy
The several posts in this and other threads got me thinking about text 
versus number computing (which I am more familiar with).


For numbers, we have in Python three builtins, the general purpose ints 
and floats and the more specialized complex. Two other rational types 
can be imported for specialized uses. And then there are 3rd-party 
libraries like mpz and numpy with more number and array of number types.


What makes these all potentially work together is the special method 
system, including, in particular, the rather complete set of __rxxx__ 
number methods. The latter allow non-commutative operations to be mixed 
either way and ease mixed commutative operations.


For text, we have general purpose str and encoded bytes (and bytearray). 
I think these are sufficient for general use and I am not sure there 
should even be anything else in the stdlib. But I think it should be 
possible to experiment with and use specialized 3rd-party text classes 
just as one can with number classes.


I can imagine that inter-operation, when appropriate, might work better 
with the addition of a couple of missing __rxxx__ methods, such as the 
mentioned __rcontains__. Although adding such would affect the 
implementation of a core syntax feature, it would not affect syntax as 
such as seen by the user.
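
The asymmetry can be shown with a small sketch. Wrapped is a hypothetical third-party text-like class invented for this example: the reflected __radd__ lets it take part in "+" with a str on the left, but "in" has no reflected counterpart, so only str.__contains__ is consulted and it rejects the foreign operand.

```python
class Wrapped:
    """Hypothetical third-party text-like class wrapping a str."""

    def __init__(self, text):
        self.text = text

    def __radd__(self, other):
        # Called for `other + self` when `other` cannot handle the addition.
        if isinstance(other, str):
            return other + self.text
        return NotImplemented

joined = "ab" + Wrapped("c")        # str cannot add a Wrapped, so __radd__ runs

try:
    found = Wrapped("b") in "abc"   # no __rcontains__ exists to fall back on
except TypeError:
    found = None

print(joined, found)
```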


--
Terry Jan Reedy



Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-26 Thread Nick Coghlan
On Sun, Jun 27, 2010 at 8:11 AM, Terry Reedy tjre...@udel.edu wrote:
 I can imagine that inter-operation, when appropriate, might work better with
 addition of a couple of  missing __rxxx__ methods, such as the mentioned
 __rcontains__. Although adding such would affect the implementation of a
 core syntax feature, it would not affect syntax as such as seen by the user.

The problem with strings isn't really the binary operations like
__contains__ - adding __rcontains__ would be a fairly simple
extrapolation of the existing approaches.

Where it gets really messy for strings is the fact that whereas
invoking named methods directly on numbers is rare, invoking them on
strings is very common, and some of those methods (e.g. split(),
join(), __mod__()) allow or require an iterable rather than a single
object. This extends the range of use cases to be covered beyond those
with syntactic support to potentially include all string methods that
take arguments. Creating minimally surprising semantics for the
methods which accept iterables is also rather challenging.
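
A short sketch of the iterable problem, using a made-up TextLike class: str.join and bytes.join insist on items of their own type, so a text-like object inside the iterable is rejected even though it knows how to render itself.

```python
class TextLike:
    """Hypothetical text-like object that is not a str subclass."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

parts = ["usr", TextLike("local"), "bin"]

try:
    path = "/".join(parts)          # join() requires every item to be a str
except TypeError:
    path = None

explicit = "/".join(str(part) for part in parts)   # works after explicit conversion

try:
    mixed = b"/".join([b"usr", "local"])   # bytes.join() rejects str items as well
except TypeError:
    mixed = None

print(path, explicit, mixed)
```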

It's an interesting idea, but I think it's overkill for the specific
problem of making it easier to perform more text-like manipulations in
a bytes-only domain.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-25 Thread Stephen J. Turnbull
Ian Bicking writes:

  We've set up a system where we think of text as natively unicode, with
  encodings to put that unicode into a byte form.  This is certainly
  appropriate in a lot of cases.  But there's a significant class of problems
  where bytes are the native structure.  Network protocols are what we've been
  discussing, and are a notable case of that.  That is, b'/' is the most
  native sense of a path separator in a URL, or b':' is the most native sense
  of what separates a header name from a header value in HTTP.

IMHO, URIs don't have a native language in this sense.  Network
programmers do, however, and it is bytes.  Text-handling programmers
also do, and it is str.

  So with this idea in mind it makes more sense to me that *specific pieces of
  text* can be reasonably treated as both bytes and text.  All the string
  literals in urllib.parse.urlunsplit() for example.
  
  The semantics I imagine are that special('/')+b'x'==b'/x' (i.e., it does not
  become special('/x')) and special('/')+'x'=='/x' (again it becomes str).  This
  avoids some of the cases of unicode or str infecting a system as they did in
  Python 2 (where you might pass in unicode and everything works fine until
  some non-ASCII is introduced).
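
One way the proposed semantics could look is sketched below. Nothing like this exists in the stdlib; the class name special follows the message, and everything else (the ASCII check, the helper method) is an assumption made for this illustration.

```python
class special:
    """Hypothetical ASCII-only literal that adopts the type of its neighbour."""

    def __init__(self, text):
        text.encode("ascii")   # raises UnicodeEncodeError for non-ASCII input
        self._text = text

    def _as_type_of(self, other):
        # Pick the representation matching the other operand, if supported.
        if isinstance(other, bytes):
            return self._text.encode("ascii")
        if isinstance(other, str):
            return self._text
        return None

    def __add__(self, other):
        mine = self._as_type_of(other)
        return NotImplemented if mine is None else mine + other

    def __radd__(self, other):
        mine = self._as_type_of(other)
        return NotImplemented if mine is None else other + mine

as_bytes = special("/") + b"x"   # the result is plain bytes, not a special
as_text = special("/") + "x"     # the result is plain str
reflected = b"x" + special("/")  # also works with the special on the right
```

Because the result is always the neighbour's own type, the special never spreads further through the program.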

I think you need to give explicit examples where this actually helps
in terms of type contagion.  I expect that it doesn't help at all,
especially not for the people whose native language for URIs is bytes.
These specials are still going to flip to unicode as soon as it comes
in, and that will be incompatible with the bytes they'll need later.
So they're still going to need to filter out unicode on input.

It looks like it would be useful for programmers of polymorphic
functions, though.


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-25 Thread Ian Bicking
On Fri, Jun 25, 2010 at 5:06 AM, Stephen J. Turnbull step...@xemacs.org wrote:

   So with this idea in mind it makes more sense to me that *specific pieces
   of text* can be reasonably treated as both bytes and text.  All the string
   literals in urllib.parse.urlunsplit() for example.

   The semantics I imagine are that special('/')+b'x'==b'/x' (i.e., it does
   not become special('/x')) and special('/')+'x'=='/x' (again it becomes
   str).  This avoids some of the cases of unicode or str infecting a system
   as they did in Python 2 (where you might pass in unicode and everything
   works fine until some non-ASCII is introduced).

 I think you need to give explicit examples where this actually helps
 in terms of type contagion.  I expect that it doesn't help at all,
 especially not for the people whose native language for URIs is bytes.
 These specials are still going to flip to unicode as soon as it comes
 in, and that will be incompatible with the bytes they'll need later.
 So they're still going to need to filter out unicode on input.

 It looks like it would be useful for programmers of polymorphic
 functions, though.


I'm proposing these specials would be used in polymorphic functions, like
the functions in urllib.parse.  I would not personally use them in my own
code (unless of course I was writing my own polymorphic functions).

This also makes it less important that the objects be a full stand-in for
text, as their use should be isolated to specific functions, they aren't
objects that should be passed around much.  So you can easily identify and
quickly detect if you use unsupported operations on those text-like
objects.  (This is all a very different use case from bytes+encoding, I
think)

-- 
Ian Bicking  |  http://blog.ianbicking.org


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-25 Thread Stephen J. Turnbull
Ian Bicking writes:

  I'm proposing these specials would be used in polymorphic functions, like
  the functions in urllib.parse.  I would not personally use them in my own
  code (unless of course I was writing my own polymorphic functions).
  
  This also makes it less important that the objects be a full stand-in for
  text, as their use should be isolated to specific functions, they aren't
  objects that should be passed around much.  So you can easily identify and
  quickly detect if you use unsupported operations on those text-like
  objects.

OK.  That sounds reasonable to me, but I don't see any need for
a builtin type for it.  Inclusion in the stdlib is not quite a
no-brainer, but given Guido's endorsement of polymorphism, I can't
bring myself to go lower than +0.9 <wink>.

  (This is all a very different use case from bytes+encoding, I think)

Very much so.


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-25 Thread Ian Bicking
On Fri, Jun 25, 2010 at 11:30 AM, Stephen J. Turnbull step...@xemacs.org wrote:

 Ian Bicking writes:

   I'm proposing these specials would be used in polymorphic functions, like
   the functions in urllib.parse.  I would not personally use them in my own
   code (unless of course I was writing my own polymorphic functions).

   This also makes it less important that the objects be a full stand-in for
   text, as their use should be isolated to specific functions, they aren't
   objects that should be passed around much.  So you can easily identify and
   quickly detect if you use unsupported operations on those text-like
   objects.

 OK.  That sounds reasonable to me, but I don't see any need for
 a builtin type for it.  Inclusion in the stdlib is not quite a
 no-brainer, but given Guido's endorsement of polymorphism, I can't
 bring myself to go lower than +0.9 <wink>.


Agreed on a builtin; I think it would be fine to put something in the
strings module, and then in these examples code that used '/' would instead
use strings.ascii('/') (not so sure of what the name should be, though).


-- 
Ian Bicking  |  http://blog.ianbicking.org


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-25 Thread Glyph Lefkowitz

On Jun 24, 2010, at 4:59 PM, Guido van Rossum wrote:

 Regarding the proposal of a String ABC, I hope this isn't going to
 become a backdoor to reintroduce the Python 2 madness of allowing
 equivalency between text and bytes for *some* strings of bytes and not
 others.

For my part, what I want out of a string ABC is simply the ability to do 
application-specific optimizations.

There are many applications where all input and output is text, but _must_ be 
UTF-8.  Even GTK uses UTF-8 as its native text representation, so output 
could just be display.

Right now, in Python 3, the only way to be correct about this is to copy 
every byte of input into 4 bytes of output, then copy each code point *back* 
into a single byte of output.  If all your application does is rewrite the 
occasional XML attribute, for example, this cost can be significant, if not 
overwhelming.
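
The size arithmetic behind the "4 bytes of output" figure can be checked directly. This is only a rough sketch: it uses explicit encodings as a stand-in for the interpreter's internal storage, the sample markup is made up, and the real in-memory layout depends on the CPython build.

```python
# Compare the size of the same ASCII-only text in three encodings.
text = "<a href='x'>" * 1000                 # hypothetical ASCII-only markup

utf8_size = len(text.encode("utf-8"))        # 1 byte per ASCII character
ucs2_size = len(text.encode("utf-16-le"))    # 2 bytes per BMP character
ucs4_size = len(text.encode("utf-32-le"))    # 4 bytes per code point

print(utf8_size, ucs2_size, ucs4_size)
```

For ASCII-heavy input such as XML, the UTF-32 form is four times the size of the UTF-8 form, which is the expansion being described for a wide build.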

I'd like a version of 'decode' which would give me a type that was, in every 
respect, unicode, and responded to all protocols exactly as other unicode 
objects (or str objects, if you prefer py3 nomenclature ;-)) do, but wouldn't 
actually copy any of that memory unless it really needed to (for example, to 
pass to a C API that expected native wide characters), and that would hold on 
to the original bytes so that it could produce them on demand if encoded to the 
same encoding again. So, as others in this thread have mentioned, the 'ABC' 
really implies some stuff about C APIs as well.

I'm not sure about the exact performance impact of such a class, which is why 
I'd like the ability to implement it *outside* of the stdlib and see how it 
works on a project, and return with a proposal along with some data.  There are 
also different ways to implement this, and other optimizations (like ropes) 
which might be better.

You can almost do this today, but the lack of things like the hypothetical 
__rcontains__ does make it impossible to be totally transparent about it.
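
A minimal sketch of the kind of type described above, assuming a made-up class name (`LazyText`) and only a handful of the str protocols. It keeps the original bytes, decodes only when text is actually needed, and returns the stored bytes unchanged when asked for the same encoding again. It does not validate the bytes up front, and it is not a drop-in str replacement.

```python
import codecs


class LazyText:
    """Hypothetical text wrapper that keeps the original bytes and decodes lazily."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = bytes(raw)
        self._encoding = codecs.lookup(encoding).name  # normalised codec name
        self._text = None  # decoded copy, created only on first use

    def __str__(self):
        if self._text is None:
            self._text = self._raw.decode(self._encoding)
        return self._text

    def __len__(self):
        return len(str(self))

    def __contains__(self, item):
        # Supports `"x" in lazy`.
        return str(item) in str(self)

    def encode(self, encoding="utf-8", errors="strict"):
        # Same encoding requested: hand back the stored bytes without decoding.
        if codecs.lookup(encoding).name == self._encoding:
            return self._raw
        return str(self).encode(encoding, errors)


lazy = LazyText(b"caf\xc3\xa9")
same = lazy.encode("UTF8")        # no decoding happens here

try:
    lazy in "un caf\u00e9"        # str.__contains__ rejects non-str operands
    transparent = True
except TypeError:
    transparent = False
```

The last few lines show the transparency gap: `"x" in lazy` can be supported through `__contains__`, but `lazy in "some str"` is handled by str itself, and there is no reflected hook that the wrapper could implement.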


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-25 Thread Guido van Rossum
On Fri, Jun 25, 2010 at 1:43 PM, Glyph Lefkowitz
gl...@twistedmatrix.com wrote:

 On Jun 24, 2010, at 4:59 PM, Guido van Rossum wrote:

 Regarding the proposal of a String ABC, I hope this isn't going to
 become a backdoor to reintroduce the Python 2 madness of allowing
 equivalency between text and bytes for *some* strings of bytes and not
 others.

 For my part, what I want out of a string ABC is simply the ability to do
 application-specific optimizations.
 There are many applications where all input and output is text, but _must_
 be UTF-8.  Even GTK uses UTF-8 as its native text representation, so
 output could just be display.
 Right now, in Python 3, the only way to be correct about this is to copy
 every byte of input into 4 bytes of output, then copy each code point *back*
 into a single byte of output.  If all your application does is rewrite the
 occasional XML attribute, for example, this cost can be significant, if not
 overwhelming.
 I'd like a version of 'decode' which would give me a type that was, in every
 respect, unicode, and responded to all protocols exactly as other
 unicode objects (or str objects, if you prefer py3 nomenclature ;-)) do,
 but wouldn't actually copy any of that memory unless it really needed to
 (for example, to pass to a C API that expected native wide characters), and
 that would hold on to the original bytes so that it could produce them on
 demand if encoded to the same encoding again. So, as others in this thread
 have mentioned, the 'ABC' really implies some stuff about C APIs as well.
 I'm not sure about the exact performance impact of such a class, which is
 why I'd like the ability to implement it *outside* of the stdlib and see how
 it works on a project, and return with a proposal along with some data.
  There are also different ways to implement this, and other optimizations
 (like ropes) which might be better.
 You can almost do this today, but the lack of things like the hypothetical
 __rcontains__ does make it impossible to be totally transparent about it.

But you'd still have to validate it, right? You wouldn't want to go on
using what you thought was wrapped UTF-8 if it wasn't actually valid
UTF-8 (or you'd be worse off than in Python 2). So you're really just
worried about space consumption. I'd like to see a lot of hard memory
profiling data before I got overly worried about that.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-25 Thread Bill Janssen
Guido van Rossum gu...@python.org wrote:

 On Fri, Jun 25, 2010 at 1:43 PM, Glyph Lefkowitz
 gl...@twistedmatrix.com wrote:
 
  On Jun 24, 2010, at 4:59 PM, Guido van Rossum wrote:
 
  Regarding the proposal of a String ABC, I hope this isn't going to
  become a backdoor to reintroduce the Python 2 madness of allowing
  equivalency between text and bytes for *some* strings of bytes and not
  others.

I never actually replied to this...  Absolutely right, which is why you
might really want another kind of string, rather than a way to treat
some bytes values as strings in some places.  Both Python 2 and Python 3
are missing one of the three types.  Python 1 and 2 didn't have bytes,
and this caused problems because str was pressed into use to hold
arbitrary byte sequences.  (Python 2 str has other problems as well,
like losing track of the encoding.)  Python 3 doesn't have Python 2's
str (encoded string), and bytes are being pressed into use for that.
Each of these uses is an ad hoc hijack of an inappropriate type, and
additional frameworks not directly supported by the Python language are
being jury-rigged to try to support the uses.

On the other hand, this is all in the eye of the beholder.  Both byte
sequences and strings are horrible formless things; they remind me of
BLISS.  You seldom really have a byte sequence; what you have is an XDR
float or an encoded string or an IP header or an email message.
Similarly for strings; they are really file names or city names or
English sentences or URIs or other things with significant semantic
constraints not captured by the typical type system.  So, yes, there
*is* an inescapable equivalency between text and bytes for *some*
sequences of bytes (those that represent encoded strings) and not others
(those that contain the XDR float, for instance).  Creating a separate
encoded string type would be one way to keep that straight.

  For my part, what I want out of a string ABC is simply the ability to do
  application-specific optimizations.
  There are many applications where all input and output is text, but _must_
  be UTF-8.  Even GTK uses UTF-8 as its native text representation, so
  output could just be display.
  Right now, in Python 3, the only way to be correct about this is to copy
  every byte of input into 4 bytes of output, then copy each code point *back*
  into a single byte of output.  If all your application does is rewrite the
  occasional XML attribute, for example, this cost can be significant, if not
  overwhelming.
  I'd like a version of 'decode' which would give me a type that was, in every
  respect, unicode, and responded to all protocols exactly as other
  unicode objects (or str objects, if you prefer py3 nomenclature ;-)) do,
  but wouldn't actually copy any of that memory unless it really needed to
  (for example, to pass to a C API that expected native wide characters), and
  that would hold on to the original bytes so that it could produce them on
  demand if encoded to the same encoding again. So, as others in this thread
  have mentioned, the 'ABC' really implies some stuff about C APIs as well.

Seems like it.

  I'm not sure about the exact performance impact of such a class, which is
  why I'd like the ability to implement it *outside* of the stdlib and see how
  it works on a project, and return with a proposal along with some data.

Yes, exactly.

   There are also different ways to implement this, and other optimizations
  (like ropes) which might be better.
  You can almost do this today, but the lack of things like the hypothetical
  __rcontains__ does make it impossible to be totally transparent about it.
 
 But you'd still have to validate it, right? You wouldn't want to go on
 using what you thought was wrapped UTF-8 if it wasn't actually valid
 UTF-8 (or you'd be worse off than in Python 2).

Yes, but there are different ways to validate it that have different
performance impacts.  Simply trusting the source of the string, for
example, would be appropriate in some cases.
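
As one illustration of a cheaper validation strategy, the sketch below checks that a buffer is well-formed UTF-8 chunk by chunk, so a full decoded copy is never held in memory. The function name and the chunk size are assumptions made for this example.

```python
import codecs


def is_valid_utf8(data, chunk_size=65536):
    """Return True if `data` is well-formed UTF-8, checking it chunk by chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")("strict")
    view = memoryview(data)
    try:
        for start in range(0, len(view), chunk_size):
            # Each decoded chunk is discarded immediately after the check.
            decoder.decode(bytes(view[start:start + chunk_size]))
        decoder.decode(b"", final=True)  # rejects a sequence cut off at the end
    except UnicodeDecodeError:
        return False
    return True
```

This still costs a full scan of the data; trusting the source, as mentioned above, skips even that.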

 So you're really just worried about space consumption. I'd like to see
 a lot of hard memory profiling data before I got overly worried about
 that.

While I've seen some big Web pages, I think the email folks, who often
have to process messages with attachments measuring in the tens of
megabytes, have the stronger problems here, and I think speed may be
more important than memory.  I've built both a Web server and an IMAP
server in Python, and the IMAP server is where the issues of storage
management really prevail.  If you have to convert a 20 MB encoded
string into a Unicode string just to look at the headers as strings, you
have issues.  (The Python email package doesn't do that, by the way.)
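
A rough sketch of looking at the headers without converting the whole message, assuming a message whose header section ends at the first blank line (CRLF CRLF). The function name is made up, and this is not a description of how the email package does it.

```python
def header_block(raw_message: bytes) -> str:
    """Decode only the header section of a message, leaving the body as bytes."""
    end = raw_message.find(b"\r\n\r\n")
    if end == -1:
        end = len(raw_message)
    # Only the (small) header slice is decoded; the body is never touched.
    return raw_message[:end].decode("ascii", "replace")


message = b"Subject: hi\r\nFrom: a@example.com\r\n\r\n" + b"\xff" * 1000
headers = header_block(message)
```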

Bill


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-25 Thread Steve Holden
Glyph Lefkowitz wrote:
 
 On Jun 25, 2010, at 5:02 PM, Guido van Rossum wrote:
 
 But you'd still have to validate it, right? You wouldn't want to go on
 using what you thought was wrapped UTF-8 if it wasn't actually valid
 UTF-8 (or you'd be worse off than in Python 2). So you're really just
 worried about space consumption.
 
 So, yes, I am mainly worried about memory consumption, but don't
 underestimate the pure CPU cost of doing all the copying.  It's quite a
 bit faster to simply scan through a string than to scan and while you're
 scanning, keep faulting out the L2 cache while you're accessing some
 other area of memory to store the copy.
 
Yes, but you are already talking about optimizations that might be
significant for large-ish strings (where large-ish depends on exactly
where Moore's Law is currently delivering computational performance) -
the amount of cache consumed by a ten-byte string will slip by
unnoticed, but at L2 levels megabytes would effectively flush the cache.

 Plus, If I am decoding with the surrogateescape error handler (or its
 effective equivalent), then no, I don't need to validate it in advance;
 interpretation can be done lazily as necessary.  I realize that this is
 just GIGO, but I wouldn't be doing this on data that didn't have an
 explicitly declared or required encoding in the first place.
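
For reference, a short sketch of what the surrogateescape error handler does in Python 3: undecodable bytes are mapped to lone surrogates instead of raising, and the same handler restores the original bytes on encoding. The sample bytes are made up.

```python
raw = b"caf\xc3\xa9 \xff"                        # valid UTF-8 followed by a stray byte

text = raw.decode("utf-8", "surrogateescape")    # does not raise on the stray byte
restored = text.encode("utf-8", "surrogateescape")

try:
    text.encode("utf-8")                         # strict encoding refuses the escaped byte
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The round trip is lossless, but the invalid input is only detected later, when the text is encoded strictly; that is the GIGO trade-off mentioned above.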
 
 I'd like to see a lot of hard memory profiling data before I got
 overly worried about that.
 
 I know of several Python applications that are already constrained by
 memory.  I don't have a lot of hard memory profiling data, but in an
 environment where you're spawning as many processes as you can in order
 to consume _all_ the physically available RAM for string processing, it
 stands to reason that properly decoding everything and thereby exploding
 everything out into 4x as much data (or 2x, if you're lucky) would
 result in a commensurate decrease in throughput.
 
Yes, UCS-4's impact does seem like it could be horrible for these use
cases. But knowing of several Python applications that are already
constrained by memory doesn't mean that it's a bad general decision.
Most users will never notice the difference, so we should try to
accommodate those who do notice a difference without inconveniencing the
rest too much.

 I don't think I could even reasonably _propose_ that such a project stop
 treating textual data as bytes, because there's no optimization strategy
 once that sort of architecture has been put into place. If your function
 says "this takes unicode", then you just have to bite the bullet and
 decode it, or rewrite it again to have a different requirement.
 
That has always been my understanding. I regard it as a sort of
intellectual tax on the United States (and its Western collaborators)
for being too dim to realise that eventually they would end up selling
computers to people with more than 256 characters in their alphabet.
Sorry guys, but your computers are only as fast as you think they are
when you only talk to each other.

 So, right now, I don't know where I'd get the data with which to make the
 argument in the first place :).  If there were some abstraction in the
 core's treatment of strings, though, and I could decode things and note
 their encoding without immediately paying this cost (or alternately,
 paying the cost to see if it's so bad, but with the option of managing
 it or optimizing it separately).  This is why I'm asking for a way for
 me to implement my own string type, and not for a change of behavior or
 an optimization in the stdlib itself: I could be wrong, I don't have a
 particularly high level of certainty in my performance estimates, but I
 think that my concerns are realistic enough that I don't want to embark
 on a big re-architecture of text-handling only to have it become a
 performance nightmare that needs to be reverted.
 
Recent experience with the thoroughness of the Python 3 release
preparations leads me to believe that *anything* new needs to prove its
worth outside the stdlib for a while.

 As Robert Collins pointed out, they already have performance issues
 related to encoding in Bazaar.  I know they've done a lot of profiling
 in that area, so I hope eventually someone from that project will show
 up with some data to demonstrate it :).  And I've definitely heard many,
 many anecdotes (some of them in this thread) about people distorting
 their data structures in various ways to avoid paying decoding cost in
 the ASCII/latin1 case, whether it's *actually* a significant performance
 issue or not.  I would very much like to tell those people "Just call
 .decode(), and if it turns out to actually be a performance issue, you
 can always deal with it later, with a custom string type."  I'm
 confident that in *most* cases, it would not be.
 
Well that would be a nice win.

 Anyway, this may be a serious issue, but I increasingly feel like I'm
 veering into python-ideas territory, so perhaps I'll just have to burn
 this bridge when I 

[Python-Dev] thoughts on the bytes/string discussion

2010-06-24 Thread Bill Janssen
Here are a couple of ideas I'm taking away from the bytes/string
discussion.

First, it would probably be a good idea to have a String ABC.

Secondly, maybe the string situation in 2.x wasn't as broken as we
thought it was.  In particular, those who deal with lots of encoded
strings seemed to find it handy, and miss it in 3.x.  Perhaps strings
are more like numbers than we think.  We have separate types for int,
float, Decimal, etc.  But they're all numbers, and they all
cross-operate.  In 2.x, it seems there were two missing features: no
encoding attribute on str, which should have been there and should have
been required, and the default encoding being ASCII (I can't tell you
how many times I've had to fix that issue when a non-ASCII encoded str
was passed to some output function).

So maybe having a second string type in 3.x that consists of an encoded
sequence of bytes plus the encoding, call it estr, wouldn't have been
a bad idea.  It would probably have made sense to have estr cooperate
with the str type, in the same way that two different kinds of numbers
cooperate, promoting the result of an operation only when necessary.
This would automatically achieve the kind of polymorphic functionality
that Guido is suggesting, but without losing the ability to do

  x = e(ASCII)"bar"
  a = ''.join(["foo", x])

(or whatever the syntax for such an encoded string literal would be --
I'm not claiming this is a good one) which I presume would bind a to a
Unicode string "foobar" -- have to work out what gets promoted to what.
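
A minimal sketch of how such an estr might cooperate with str, written as an ordinary class because the literal syntax above is hypothetical. The promotion rules here are assumptions (same encoding stays encoded, anything else becomes str), and joining the raw bytes is only safe for stateless encodings without a byte-order mark.

```python
import codecs


class estr:
    """Hypothetical 'encoded string': bytes plus the name of their encoding."""

    def __init__(self, data, encoding):
        self.data = bytes(data)
        self.encoding = codecs.lookup(encoding).name  # normalised codec name

    def __str__(self):
        return self.data.decode(self.encoding)

    def __add__(self, other):
        if isinstance(other, estr):
            if other.encoding == self.encoding:
                # Same encoding: stay in the encoded domain.
                return estr(self.data + other.data, self.encoding)
            return str(self) + str(other)   # different encodings: promote to str
        if isinstance(other, str):
            return str(self) + other        # promote to str
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, str):
            return other + str(self)
        return NotImplemented


x = estr(b"bar", "ASCII")
a = "foo" + x          # promoted to an ordinary str

try:
    "".join(["foo", x])
    join_works = True
except TypeError:
    join_works = False
```

Operators can be made to cooperate from a library, but `str.join` accepts only real str instances, so the join spelling used above would need support from str itself.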

The language moratorium kind of makes this all theoretical, but building
a String ABC still would be a good start, and presumably isn't forbidden
by the moratorium.

Bill



Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-24 Thread Michael Foord

On 24/06/2010 19:11, Brett Cannon wrote:

On Thu, Jun 24, 2010 at 10:38, Bill Janssen jans...@parc.com wrote:
[SNIP]
   

The language moratorium kind of makes this all theoretical, but building
a String ABC still would be a good start, and presumably isn't forbidden
by the moratorium.
 

Because a new ABC would go into the stdlib (I assume in collections or
string) the moratorium does not apply.
   


Although it would require changes for builtin types like file to work 
with a new string ABC, right?


Michael


   



--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog





Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-24 Thread Brett Cannon
On Thu, Jun 24, 2010 at 12:07, Michael Foord fuzzy...@voidspace.org.uk wrote:
 On 24/06/2010 19:11, Brett Cannon wrote:

 On Thu, Jun 24, 2010 at 10:38, Bill Janssen jans...@parc.com wrote:
 [SNIP]


 The language moratorium kind of makes this all theoretical, but building
 a String ABC still would be a good start, and presumably isn't forbidden
 by the moratorium.


 Because a new ABC would go into the stdlib (I assume in collections or
 string) the moratorium does not apply.


 Although it would require changes for builtin types like file to work with a
 new string ABC, right?

Only if they wanted to rely on some concrete implementation of a
method contained within the ABC. Otherwise that's what the ABC's
register() method (ABCMeta.register) exists for.
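
A small sketch of the registration mechanism, assuming a made-up `String` ABC with a single illustrative abstract method; no such ABC exists in the stdlib, and `WrappedText` is a stand-in for a third-party type.

```python
from abc import ABC, abstractmethod


class String(ABC):
    """Hypothetical String ABC; the method set here is illustrative only."""

    @abstractmethod
    def encode(self, encoding="utf-8", errors="strict"):
        ...


class WrappedText:
    """Stand-in for a third-party text type that does not inherit from String."""

    def __init__(self, text):
        self._text = text

    def encode(self, encoding="utf-8", errors="strict"):
        return self._text.encode(encoding, errors)


# Virtual subclasses: no inheritance, no changes to the registered types.
String.register(str)
String.register(WrappedText)
```

After registration, `isinstance("abc", String)` is true even though str knows nothing about the ABC, which is why builtin types would not need to change just to be recognised.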


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-24 Thread Ian Bicking
On Thu, Jun 24, 2010 at 12:38 PM, Bill Janssen jans...@parc.com wrote:

 Here are a couple of ideas I'm taking away from the bytes/string
 discussion.

 First, it would probably be a good idea to have a String ABC.

 Secondly, maybe the string situation in 2.x wasn't as broken as we
 thought it was.  In particular, those who deal with lots of encoded
 strings seemed to find it handy, and miss it in 3.x.  Perhaps strings
 are more like numbers than we think.  We have separate types for int,
 float, Decimal, etc.  But they're all numbers, and they all
 cross-operate.  In 2.x, it seems there were two missing features: no
 encoding attribute on str, which should have been there and should have
 been required, and the default encoding being ASCII (I can't tell you
 how many times I've had to fix that issue when a non-ASCII encoded str
 was passed to some output function).


I've started to form a conceptual notion that I think fits these cases.

We've set up a system where we think of text as natively unicode, with
encodings to put that unicode into a byte form.  This is certainly
appropriate in a lot of cases.  But there's a significant class of problems
where bytes are the native structure.  Network protocols are what we've been
discussing, and are a notable case of that.  That is, b'/' is the most
native sense of a path separator in a URL, or b':' is the most native sense
of what separates a header name from a header value in HTTP.  To disallow
unicode URLs or unicode HTTP headers would be rather anti-social, especially
because unicode is now the native string type in Python 3 (as an aside for
the WSGI spec we've been talking about using native strings in some
positions like dictionary keys, meaning Python 2 str and Python 3 str, while
being more exacting in other areas such as a response body which would
always be bytes).

The HTTP spec and other network protocols seem a little fuzzy on this,
because they were written before unicode even existed, and even later activity
happened at a point when unicode and text weren't widely considered the
same thing like they are now.  But I think the original intention is
revealed in a more modern specification like WebSockets, where they are very
explicit that ':' is just shorthand for a particular byte, it is not text
in our new modern notion of the term.

So with this idea in mind it makes more sense to me that *specific pieces of
text* can be reasonably treated as both bytes and text.  All the string
literals in urllib.parse.urlunsplit() for example.

The semantics I imagine are that special('/')+b'x'==b'/x' (i.e., it does not
become special('/x')) and special('/')+'x'=='/x' (again it becomes str).  This
avoids some of the cases of unicode or str infecting a system as they did in
Python 2 (where you might pass in unicode and everything works fine until
some non-ASCII is introduced).
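
The semantics above can be sketched with an ordinary class. This is an assumption-laden illustration, not a proposed implementation: the name `special` comes from the text, but the helper method and the choice to support only `+` are made up here.

```python
class special:
    """Hypothetical ASCII-only literal that takes on the type of its neighbour."""

    def __init__(self, text):
        self._text = text
        self._raw = text.encode("ascii")  # UnicodeEncodeError if not pure ASCII

    def _like(self, other):
        # Pick the representation that matches the other operand.
        if isinstance(other, bytes):
            return self._raw
        if isinstance(other, str):
            return self._text
        return None

    def __add__(self, other):
        mine = self._like(other)
        return NotImplemented if mine is None else mine + other

    def __radd__(self, other):
        mine = self._like(other)
        return NotImplemented if mine is None else other + mine
```

The result of every operation is a plain bytes or str object, never another `special`, so the special-ness does not spread through the rest of the program.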

The one place where this might be tricky is if you have an encoding that is
not ASCII compatible.  But we can't guard against every possibility.  So it
would be entirely wrong to take a string encoded with UTF-16 and start to
use b'/' with it.  But there are other nonsensical combinations already
possible, especially with polymorphic functions; we can't guard against all
of them.  Also I'm unsure if something like UTF-16 is in any way compatible
with the kind of legacy systems that use bytes.  Can you encode your
filesystem with UTF-16?  I don't think you could encode a cookie with it.
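
A quick illustration of why b'/' is unsafe with a non-ASCII-compatible encoding; the sample path and the choice of U+2F00 are picked for this example only.

```python
path = "a/b"

utf8_parts = path.encode("utf-8").split(b"/")        # clean split
utf16_parts = path.encode("utf-16-le").split(b"/")   # cuts a 2-byte unit in half

# U+2F00 contains no slash at all, yet its UTF-16 form includes the byte 0x2F.
false_hit = b"/" in "\u2f00".encode("utf-16-le")
no_hit = b"/" in "\u2f00".encode("utf-8")
```

In UTF-8 every byte of a non-ASCII character is 0x80 or above, so a search for an ASCII byte can never match inside one; UTF-16 gives no such guarantee.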

 So maybe having a second string type in 3.x that consists of an encoded
 sequence of bytes plus the encoding, call it estr, wouldn't have been
 a bad idea.  It would probably have made sense to have estr cooperate
 with the str type, in the same way that two different kinds of numbers
 cooperate, promoting the result of an operation only when necessary.
 This would automatically achieve the kind of polymorphic functionality
 that Guido is suggesting, but without losing the ability to do

   x = e(ASCII)"bar"
   a = ''.join(["foo", x])

 (or whatever the syntax for such an encoded string literal would be --
 I'm not claiming this is a good one) which I presume would bind a to a
 Unicode string "foobar" -- have to work out what gets promoted to what.


I would be entirely happy without a literal syntax.  But as Phillip has
noted, this can't be implemented *entirely* in a library as there are some
constraints with the current str/bytes implementations.  Reading PEP 3003
I'm not clear if such changes are part of the moratorium?  They seem like
they would be (sadly), but it doesn't seem clearly noted.

I think there's a *different* use case for things like
bytes-in-a-utf8-encoding (e.g., to allow XML data to be decoded lazily), but
that could be yet another class, and maybe shouldn't be polymorphically usable
as bytes (i.e., treat it as an optimized str representation that is
otherwise semantically equivalent).  A String ABC would formalize these
things.

-- 
Ian Bicking  |  http://blog.ianbicking.org

Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-24 Thread Guido van Rossum
I see it a little differently (though there is probably a common
concept lurking in here).

The protocols you mention are intentionally designed to be
encoding-neutral as long as the encoding is an ASCII superset. This
covers ASCII itself, Latin-1, Latin-N for other values of N, MacRoman,
Microsoft's code pages (most of them anyways), UTF-8, presumably at
least some of the Japanese encodings, and probably a host of others.
But it does not cover UTF-16, EBCDIC, and others. (Encodings that have
shift bytes that change the meaning of some or all ordinary ASCII
characters also aren't covered, unless such an encoding happens to
exclude the special characters that the protocol spec cares about).

The protocol specs typically go out of their way to specify what byte
values they use for syntactically significant positions (e.g. ':' in
headers, or '/' in URLs), while hand-waving about the meaning of what
goes in between since it is all typically treated as not of
syntactic significance. So you can write a parser that looks at bytes
exclusively, and looks for a bunch of ASCII punctuation characters
(e.g. '<', '>', '/', '&'), and doesn't know or care whether the stuff
in between is encoded in Latin-15, MacRoman or UTF-8 -- it never looks
inside stretches of characters between the special characters and
just copies them. (Sometimes there may be *some* sections that are
required to be ASCII and there the equivalence of a-z and A-Z is well
defined.)
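
A minimal sketch of such a bytes-only parser for a single header line. The function name is made up; the point is that only the ASCII delimiter and the ASCII-defined name are interpreted, while the value is copied through without any knowledge of its encoding.

```python
def split_header(line: bytes):
    """Split a header line at the first b':' without decoding anything."""
    name, sep, value = line.partition(b":")
    if not sep:
        raise ValueError("not a header line")
    # bytes.lower() only folds ASCII A-Z, which is the well-defined part.
    return name.strip().lower(), value.strip()


latin1 = split_header(b"Subject: caf\xe9")        # value is Latin-1 encoded
utf8 = split_header(b"SUBJECT: caf\xc3\xa9")      # value is UTF-8 encoded
```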

But I wouldn't go so far as to claim that interpreting the protocols
as text is wrong. After all we're talking exclusively about protocols
that are designed intentionally to be directly human readable
(albeit as a fall-back option) -- the only tool you need to debug the
traffic on the wire or socket is something that knows which subset of
ASCII is considered printable and which renders everything else
safely as a hex escape or even a special unknown character (like
Unicode's ? inside a black diamond).
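
The debugging tool described above fits in a few lines. This is a sketch with a made-up function name; it treats 0x20 through 0x7E as printable and also escapes the backslash so the output stays unambiguous.

```python
def render_wire_data(data: bytes) -> str:
    """Show printable ASCII as-is and everything else as a hex escape."""
    out = []
    for byte in data:
        if 0x20 <= byte <= 0x7E and byte != 0x5C:   # printable, not a backslash
            out.append(chr(byte))
        else:
            out.append("\\x%02x" % byte)
    return "".join(out)


shown = render_wire_data(b"Host: caf\xc3\xa9\r\n")
```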

Depending on the requirements of a specific app (or framework) it may
be entirely reasonable to convert everything to Unicode and process
the resulting text; in other contexts it makes more sense to keep
everything as bytes. It also makes sense to have an interface library
to deal with a specific protocol that treats the protocol side as
bytes but interacts with the application using text, since that is
often how the application programmer wants to treat it anyway.

Of course, some protocols require the application programmer to be
aware of bytes as well in *some* cases -- examples are email and HTTP
which can be used to transfer text as well as binary data (e.g.
images). There is also the bootstrap problem where the wire data must
be partially parsed in order to find out the encoding to be used to
convert it to text. But that doesn't mean it's invalid to think about
it as text in many application contexts.

Regarding the proposal of a String ABC, I hope this isn't going to
become a backdoor to reintroduce the Python 2 madness of allowing
equivalency between text and bytes for *some* strings of bytes and not
others.

Finally, I do think that we should not introduce changes to the
fundamental behavior of text and bytes while the moratorium is in
place. Changes to specific stdlib APIs are fine however.

--Guido


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-24 Thread Ian Bicking
On Thu, Jun 24, 2010 at 3:59 PM, Guido van Rossum gu...@python.org wrote:

 The protocol specs typically go out of their way to specify what byte
 values they use for syntactically significant positions (e.g. ':' in
 headers, or '/' in URLs), while hand-waving about the meaning of what
 goes in between since it is all typically treated as not of
 syntactic significance. So you can write a parser that looks at bytes
 exclusively, and looks for a bunch of ASCII punctuation characters
 (e.g. '<', '>', '/', '&'), and doesn't know or care whether the stuff
 in between is encoded in Latin-15, MacRoman or UTF-8 -- it never looks
 inside stretches of characters between the special characters and
 just copies them. (Sometimes there may be *some* sections that are
 required to be ASCII and there the equivalence of a-z and A-Z is well
 defined.)


Yes, these are the specific characters that I think we can handle
specially.  For instance, the list of all string literals used by urlsplit
and urlunsplit:
'//'
'/'
':'
'?'
'#'
''
'http'
A list of all valid scheme characters (a-z etc)
Some lists for scheme-specific parsing (which all contain valid scheme
characters)

All of these are constrained to ASCII, and must be constrained to ASCII, and
everything else in a URL is treated as basically opaque.

So if we turned these characters into byte-or-str objects I think we'd
basically be true to the intent of the specs, and in a practical sense we'd
be able to make these functions polymorphic.  I suspect this same pattern
will be present most places where people want polymorphic behavior.

For now we could do something incomplete and just avoid using operators we
can't overload (is it possible to at least make them produce a readable
exception?)

I think we'll avoid a lot of the confusion that was present with Python 2 by
not making the coercions transitive.  For instance, here's something that
would work in Python 2:

  urlunsplit(('http', 'example.com', '/foo', u'bar=baz', ''))

And you'd get out a unicode string, except that would break the first time
that query string (u'bar=baz') was not ASCII (but not until then!)

Here's the urlunsplit code:

def urlunsplit(components):
    scheme, netloc, url, query, fragment = components
    if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
        if url and url[:1] != '/': url = '/' + url
        url = '//' + (netloc or '') + url
    if scheme:
        url = scheme + ':' + url
    if query:
        url = url + '?' + query
    if fragment:
        url = url + '#' + fragment
    return url

If all those literals were this new special kind of string, if you call:

  urlunsplit((b'http', b'example.com', b'/foo', 'bar=baz', b''))

You'd end up constructing the URL b'http://example.com/foo' and then
running:

  url = url + special('?') + query

And that would fail because b'http://example.com/foo' + special('?') would
be b'http://example.com/foo?' and you cannot add that to the str 'bar=baz'.
So we'd be avoiding the Python 2 craziness.
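
One way such a literal could behave, as a rough Python 3 sketch. The class
name `special` comes from the example above; everything about its
implementation here is an assumption made for illustration, not an existing
or proposed API. The literal adopts the type of whatever it is added to,
and the result is an ordinary bytes or str object, so the coercion does not
carry over to the next operand:

```python
class special:
    """Hypothetical ASCII-only literal that adopts its neighbour's type."""

    def __init__(self, text):
        text.encode("ascii")  # raises UnicodeEncodeError if not pure ASCII
        self._text = text

    def _as_type_of(self, other):
        if isinstance(other, bytes):
            return self._text.encode("ascii")
        if isinstance(other, str):
            return self._text
        return None

    def __add__(self, other):
        mine = self._as_type_of(other)
        return NotImplemented if mine is None else mine + other

    def __radd__(self, other):
        mine = self._as_type_of(other)
        return NotImplemented if mine is None else other + mine


url = b"http://example.com/foo" + special("?")  # plain bytes from here on
try:
    url + "bar=baz"  # bytes + str: the usual Python 3 TypeError
    mixed_allowed = True
except TypeError as exc:
    mixed_allowed = False
    print("refused:", exc)
```

Because the result of each addition is a plain bytes or str, a later
operand of the other type fails immediately rather than only when
non-ASCII data shows up.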

-- 
Ian Bicking  |  http://blog.ianbicking.org
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-24 Thread Antoine Pitrou
On Thu, 24 Jun 2010 20:07:41 +0100
Michael Foord fuzzy...@voidspace.org.uk wrote:
 
 Although it would require changes for builtin types like file to work 
 with a new string ABC, right?

There is no builtin file type in 3.x.
Besides, it is not an ABC-level problem; the IO layer is written in C
(although there's still the Python implementation to play with), which
would mandate an abstract C API to access unicode-like objects
(similarly as there's already the buffer API to access bytes-like
objects).
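
For comparison, a small sketch of what the existing buffer API already
provides on the bytes side, as seen from Python code. The variable names
are illustrative only. C-implemented operations such as `bytes.join`
accept any object exporting a buffer, not just `bytes` itself:

```python
# Three different bytes-like types, all consumed through the buffer API.
chunks = [b"GET ", bytearray(b"/index"), memoryview(b".html")]
request = b"".join(chunks)
print(request)
```

There is no comparable protocol that lets a third-party type stand in for
`str` in C code, which is the gap described above.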

Regards

Antoine.




Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-24 Thread Terry Reedy

On 6/24/2010 1:38 PM, Bill Janssen wrote:


Secondly, maybe the string situation in 2.x wasn't as broken as we
thought it was.  In particular, those who deal with lots of encoded
strings seemed to find it handy, and miss it in 3.x.  Perhaps strings
are more like numbers than we think.  We have separate types for int,
float, Decimal, etc.  But they're all numbers, and they all
cross-operate.


No they do not. Decimal only mixes properly with ints, but not with 
anything else, sometimes in surprising and havoc-creating ways:

>>> Decimal(0) == float(0)
False

I believe that and other comparisons may be fixed in 3.2, but I know 
there was lots of discussion of whether float + decimal should return a 
float or decimal, with good arguments both ways. To put it another way, 
there are potential problems with either choice. Automatic mixed-mode 
arithmetic is not always a slam-dunk, no-problem choice.
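
A short sketch of how this plays out on a current Python 3 interpreter
(assumed here to be 3.2 or later, where the comparison was changed).
Decimal still combines with int, the equality test compares exact values,
and mixed Decimal/float arithmetic is refused outright rather than picking
a result type:

```python
from decimal import Decimal

int_sum = Decimal(1) + 1                 # Decimal mixes with int
zero_equal = Decimal(0) == float(0)      # exact comparison on 3.2+
tenth_equal = Decimal("0.1") == 0.1      # 0.1 as a float is not exactly 1/10

try:
    Decimal(1) + 1.0
    float_mix_error = None
except TypeError as exc:
    float_mix_error = exc

print(int_sum, zero_equal, tenth_equal)
print("Decimal + float:", float_mix_error)
```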


That aside, there are a couple of places where I think the comparison 
breaks down. If one adds a thousand ints and then a float, there is only 
the final number to convert. If one adds a thousand bytes and then a 
unicode, there is the concatenation of the thousand bytes to convert. 
Or should the result be the concatenation of a thousand unicode 
conversions? This brings up the distributivity (or not) of conversion 
over summation. In general, float(i) + float(j) = float(i+j), for i,j 
ints. I am not sure the same is true if i,j are bytes with some encoding 
and the conversion is unicode. Does it depend on the encoding?
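
A small experiment suggests that it does depend on the encoding. This is
only a sketch; `distributes` is a hypothetical helper name. It checks
whether decoding the concatenation equals concatenating the decoded
pieces:

```python
def distributes(parts, encoding):
    """True if decode(a + b + ...) == decode(a) + decode(b) + ..."""
    joined = b"".join(parts).decode(encoding)
    piecewise = "".join(p.decode(encoding) for p in parts)
    return joined == piecewise


# Complete UTF-8 strings: decoding distributes over concatenation.
utf8_ok = distributes(["na\u00efve ".encode("utf-8"),
                       "caf\u00e9".encode("utf-8")], "utf-8")

# The 'utf-16' codec prepends a byte order mark to every encoded piece;
# only the first one is consumed when the joined bytes are decoded.
utf16_ok = distributes(["x".encode("utf-16"), "y".encode("utf-16")], "utf-16")

# Pieces split in the middle of a multi-byte character cannot even be
# decoded one at a time, although their concatenation is valid UTF-8.
try:
    distributes([b"caf\xc3", b"\xa9"], "utf-8")
    split_error = None
except UnicodeDecodeError as exc:
    split_error = exc

print(utf8_ok, utf16_ok, split_error is not None)
```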


--
Terry Jan Reedy



Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-24 Thread Guido van Rossum
On Thu, Jun 24, 2010 at 2:44 PM, Ian Bicking i...@colorstudy.com wrote:
 I think we'll avoid a lot of the confusion that was present with Python 2 by
 not making the coercions transitive.  For instance, here's something that
 would work in Python 2:

   urlunsplit(('http', 'example.com', '/foo', u'bar=baz', ''))

 And you'd get out a unicode string, except that would break the first time
 that query string (u'bar=baz') was not ASCII (but not until then!)

Actually, that wouldn't be a problem. The problem would be this:

   urlunsplit(('http', 'example.com', u'/foo', 'bar=baz', ''))

(I moved the u prefix from bar=baz to /foo.) And this would break
when instead of baz there was some non-ASCII UTF-8, e.g.


   urlunsplit(('http', 'example.com', u'/foo', 'bar=\xe1\x88\xb4', ''))
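
The difference between the two cases can be emulated on Python 3. This is
a sketch only: `py2_add` and `join_query` are hypothetical helpers that
mimic Python 2's implicit ASCII decoding, with Python 3 `bytes` standing in
for the Python 2 `str` type and Python 3 `str` for `unicode`:

```python
def py2_add(left, right):
    """Emulate Python 2 '+': when bytes meets text, decode bytes as ASCII."""
    if isinstance(left, bytes) and isinstance(right, str):
        left = left.decode("ascii")
    elif isinstance(left, str) and isinstance(right, bytes):
        right = right.decode("ascii")
    return left + right


def join_query(url, query):
    # Mirrors "url = url + '?' + query" with a byte-string '?' literal.
    return py2_add(py2_add(url, b"?"), query)


# Unicode query, byte-string everything else: the bytes are pure ASCII,
# so this works even when the query itself is not ASCII.
ok = join_query(b"http://example.com/foo", "bar=\u1234")
print(ascii(ok))

# Unicode path, byte-string query holding non-ASCII UTF-8: the implicit
# ASCII decode of the query fails.
try:
    join_query("http://example.com/foo", b"bar=\xe1\x88\xb4")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc
    print("UnicodeDecodeError:", exc)
```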
-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-24 Thread Terry Reedy

On 6/24/2010 4:59 PM, Guido van Rossum wrote:


But I wouldn't go so far as to claim that interpreting the protocols
as text is wrong. After all we're talking exclusively about protocols
that are designed intentionally to be directly human readable


I agree that the claim that ':' is just a byte is a bit shortsighted.

If the designers of the protocols had intended to use uninterpreted 
bytes as protocol markers, they could and I suspect would have used 
unused control codes, of which there are several. Then there would have 
been no need for escape mechanisms to put things like : into content text.


I am very sure that the reason for specifying *ascii* byte values was to 
be crystal clear as to what *character* was meant and to *exclude* use on 
the internet of the main incompatible competitor encoding -- IBM's 
EBCDIC -- which IBM used in all of *its* networks. Until the IBM PC came 
out in the early 1980s (and IBM originally saw that as a minor sideline 
and something of a toy), there was a battle over byte encodings between 
IBM and everyone else.
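
To illustrate why naming the encoding matters, here is a sketch using
Python's `cp037` codec as a stand-in for EBCDIC (one of several EBCDIC
code pages, chosen here only as an example). The same character ':' maps
to different byte values in the two encodings:

```python
colon_ascii = ":".encode("ascii")
colon_ebcdic = ":".encode("cp037")

print(colon_ascii.hex(), colon_ebcdic.hex())

# The ASCII byte for ':' read as EBCDIC is some other character entirely.
misread = colon_ascii.decode("cp037")
print(repr(misread))
```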


--
Terry Jan Reedy



Re: [Python-Dev] thoughts on the bytes/string discussion

2010-06-24 Thread Greg Ewing

Terry Reedy wrote:

On 6/24/2010 1:38 PM, Bill Janssen wrote:


We have separate types for int,
float, Decimal, etc.  But they're all numbers, and they all
cross-operate.


No they do not. Decimal only mixes properly with ints, but not with 
anything else


I think there are also some important differences between
numbers and strings concerning how they interact with C code.

In C there are really only two choices for representing a
Python number in a way that C code can directly operate on --
long or double -- and there is a set of functions for coercing a
Python object into one of these that C code almost universally
uses. So a new number type only has to implement the appropriate
conversion methods to be usable by all of that C code.

On the other hand, the existing C code that operates on Python
strings often assumes that it has a particular internal
representation. A new abstract string-access API would have to
be devised, and all existing C code updated to use it. Also,
this new API would not be as easy to use as the number API,
because it would involve asking for the data in some specified
encoding, which would require memory allocation and management.
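
The asymmetry is visible from Python code too. In this sketch `Quarter`
and `Text` are hypothetical classes invented for the illustration. A new
number type only needs `__float__` to be accepted by C code such as
`math.sqrt`, while a string-like type that merely defines `__str__` is
rejected by C code expecting a real `str`:

```python
import math


class Quarter:
    """A number-like object that knows how to become a C double."""

    def __float__(self):
        return 0.25


class Text:
    """A string-like object that is not a str subclass."""

    def __init__(self, value):
        self._value = value

    def __str__(self):
        return self._value


root = math.sqrt(Quarter())  # coerced through __float__
print(root)

try:
    "".join([Text("abc")])   # str.join insists on real str instances
    join_error = None
except TypeError as exc:
    join_error = exc
    print("TypeError:", exc)
```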

--
Greg