Re: [Python-Dev] bytes / unicode

2010-06-28 Thread Nick Coghlan
On Mon, Jun 28, 2010 at 6:28 PM, Greg Ewing wrote: > R. David Murray wrote: > >> Having such a poly_str type would probably make my life easier. > > A thought on this poly_str type: perhaps it could be > called "ascii", since that's what it would have to be > restricted to, and have > >  a'xxx' >

Re: [Python-Dev] bytes / unicode

2010-06-28 Thread R. David Murray
On Mon, 28 Jun 2010 13:55:26 +0530, Senthil Kumaran wrote: > On Mon, Jun 28, 2010 at 08:28:45PM +1200, Greg Ewing wrote: > > Thinking way outside the square, and probably the pale > > as well, maybe @ could be pressed into service as an > > infix operator, with > > > > s@i > > > > being equ

Re: [Python-Dev] bytes / unicode

2010-06-28 Thread Senthil Kumaran
On Mon, Jun 28, 2010 at 08:28:45PM +1200, Greg Ewing wrote: > A thought on this poly_str type: perhaps it could be > called "ascii", since that's what it would have to be > restricted to, and have > > a'xxx' > > as a literal syntax for it, seeing as literals seem to > be one of its main use cas

Re: [Python-Dev] bytes / unicode

2010-06-28 Thread Greg Ewing
R. David Murray wrote: Having such a poly_str type would probably make my life easier. A thought on this poly_str type: perhaps it could be called "ascii", since that's what it would have to be restricted to, and have a'xxx' as a literal syntax for it, seeing as literals seem to be one of

Re: [Python-Dev] bytes / unicode

2010-06-27 Thread R. David Murray
I've been watching this discussion with intense interest, but have been so lagged in following the thread that I haven't replied. I got caught up today On Sun, 27 Jun 2010 15:53:59 +1000, Nick Coghlan wrote: > The difference is that we have three classes of algorithm here: > - those that work

Re: [Python-Dev] bytes / unicode

2010-06-27 Thread P.J. Eby
At 03:53 PM 6/27/2010 +1000, Nick Coghlan wrote: We could talk about this even longer, but the most effective way forward is going to be a patch that improves the URL parsing situation. Certainly, it's the only practical solution for the immediate problems in 3.2. I only mentioned that I "hate

Re: [Python-Dev] bytes / unicode

2010-06-27 Thread Stephen J. Turnbull
P.J. Eby writes: > At 12:42 PM 6/26/2010 +0900, Stephen J. Turnbull wrote: > >What I'm saying here is that if bytes are the signal of validity, and > >the stdlib functions preserve validity, then it's better to have the > >stdlib functions object to unicode data as an argument. Compare the >

Re: [Python-Dev] bytes / unicode

2010-06-27 Thread Antoine Pitrou
On Sat, 26 Jun 2010 23:49:11 -0400 "P.J. Eby" wrote: > > Remember, bytes and strings already have to detect mixed-type > operations. Not in Python 3. They just raise a TypeError on bad ("mixed-type") arguments. Regards Antoine.
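
The behaviour described in this message can be checked directly. The following is a minimal sketch for Python 3; the helper name `try_concat` is made up for illustration and is not from the thread.

```python
def try_concat(left, right):
    """Return left + right, or the name of the exception the operation raises."""
    try:
        return left + right
    except TypeError as exc:
        return type(exc).__name__


same_text = try_concat("x", "y")     # str + str is fine
same_bytes = try_concat(b"x", b"y")  # bytes + bytes is fine
mixed = try_concat("x", b"y")        # str + bytes is rejected, not upcast
```

Nothing is coerced in the mixed case: the operation fails outright instead of producing a result of either type.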

Re: [Python-Dev] bytes / unicode

2010-06-26 Thread Nick Coghlan
On Sun, Jun 27, 2010 at 1:49 PM, P.J. Eby wrote: > I just hate the idea that functions taking strings should have to be > *rewritten* to be explicitly type-agnostic.  It seems *so* un-Pythonic... >  like if all the bitmasking functions you'd ever written using 32-bit int > constants had to be rewr

Re: [Python-Dev] bytes / unicode

2010-06-26 Thread P.J. Eby
At 12:43 PM 6/27/2010 +1000, Nick Coghlan wrote: While full support for third party strings and byte sequence implementations is an interesting idea, I think it's overkill for the specific problem of making it easier to write str/bytes agnostic functions for tasks like URL parsing. OTOH, to wri

Re: [Python-Dev] bytes / unicode

2010-06-26 Thread Nick Coghlan
On Sun, Jun 27, 2010 at 4:17 AM, P.J. Eby wrote: > The idea that I'm proposing is that the basic string and byte types should > defer to "user-defined" string types for mixed type operations, so that > polymorphism of string-manipulation functions is the *default* case, rather > than a *special* c

Re: [Python-Dev] bytes / unicode

2010-06-26 Thread P.J. Eby
At 12:42 PM 6/26/2010 +0900, Stephen J. Turnbull wrote: What I'm saying here is that if bytes are the signal of validity, and the stdlib functions preserve validity, then it's better to have the stdlib functions object to unicode data as an argument. Compare the alternative: it returns a unicode

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread Stephen J. Turnbull
P.J. Eby writes: > it's just that if you already have the bytes, and all you want to > do is tag them (e.g. the WSGI headers case), the extra encoding > step seems pointless. Well, I'll have to concede that unless and until I get involved in the WSGI development effort. > >But with your arch

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread P.J. Eby
At 01:18 AM 6/26/2010 +0900, Stephen J. Turnbull wrote: It seems to me what is wanted here is something like Perl's taint mechanism, for *both* kinds of strings. Am I missing something? You could certainly view it as a kind of tainting. The part where the type would be bytes-based is indeed

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread Stephen J. Turnbull
Ian Bicking writes: > I don't get what you are arguing against. Are you worried that if > we make URL code polymorphic that this will mean some code will > treat URLs as bytes, and that code will be incompatible with URLs > as text? No one is arguing we remove text support from any of > the

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread Stephen J. Turnbull
P.J. Eby writes: > I do know the ultimate target codec -- that's the point. > > IOW, I want to be able to do to all my operations by passing > target-encoded strings to polymorphic functions. IOW, you *do* have text and (ignoring efficiency issues) could just as well use str. But That Other

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread Ian Bicking
On Fri, Jun 25, 2010 at 2:05 AM, Stephen J. Turnbull wrote: > > But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make > > sense to me. > > > > So, actually, I *don't* understand what you mean by needing LBYL. > > Consider docutils. Some folks assert that URIs *are* bytes and should >

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread P.J. Eby
At 04:49 PM 6/25/2010 +0900, Stephen J. Turnbull wrote: P.J. Eby writes: > This doesn't have to be in the functions; it can be in the > *types*. Mixed-type string operations have to do type checking and > upcasting already, but if the protocol were open, you could make an > encoded-bytes ty

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread Stephen J. Turnbull
P.J. Eby writes: > This doesn't have to be in the functions; it can be in the > *types*. Mixed-type string operations have to do type checking and > upcasting already, but if the protocol were open, you could make an > encoded-bytes type that would handle the error checking. Don't you rea

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread Stephen J. Turnbull
Guido van Rossum writes: > On Thu, Jun 24, 2010 at 1:12 AM, Stephen J. Turnbull > wrote: > Understood, but both the majority of str/bytes methods and several > existing APIs (e.g. many in the os module, like os.listdir()) do it > this way. Understood. > Also, IMO a polymorphic function s

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Nick Coghlan
On Fri, Jun 25, 2010 at 1:41 AM, Guido van Rossum wrote: > I don't think we should abuse sum for this. A simple idiom to get the > *empty* string of a particular type is x[:0] so you could write > something like this to concatenate a list or strings or bytes: > xs[:0].join(xs). Note that if xs is
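
The `x[:0]` idiom quoted here can be sketched as below. This is only an illustration: the function name `concat` is hypothetical, and taking the empty value from the first element (`items[0][:0]`) is an assumption of this sketch, since slicing the list itself would yield a list rather than a string.

```python
def concat(items):
    """Concatenate a non-empty list of str, or a non-empty list of bytes."""
    # Slicing [:0] yields the empty value of the same type:
    # '' for str, b'' for bytes.
    empty = items[0][:0]
    return empty.join(items)


text_result = concat(["ab", "cd"])
bytes_result = concat([b"ab", b"cd"])
```

The function never names `str` or `bytes`, so the same body serves both types; a list that mixes them still fails inside `join`.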

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Nick Coghlan
On Fri, Jun 25, 2010 at 3:07 AM, P.J. Eby wrote: > (Btw, in some earlier emails, Stephen, you implied that this could be fixed > with codecs -- but it can't, because the problem isn't with the bytes > containing invalid Unicode, it's with the Unicode containing invalid bytes > -- i.e., characters

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread P.J. Eby
At 05:12 PM 6/24/2010 +0900, Stephen J. Turnbull wrote: Guido van Rossum writes: > For example: how we can make the suite of functions used for URL > processing more polymorphic, so that each developer can choose for > herself how URLs need to be treated in her application. While you have co

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Baptiste Carvello
P.J. Eby wrote: [...] stdlib constants are almost always ASCII, and the main use cases for ebytes would involve ascii-extended encodings.) Then, how about a new "ascii string" literal? This would produce a special kind of string that would coerce to a normal string when mixed with a str, a

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Guido van Rossum
On Thu, Jun 24, 2010 at 8:25 AM, Nick Coghlan wrote: > On Fri, Jun 25, 2010 at 12:33 AM, Guido van Rossum wrote: >> Also, IMO a polymorphic function should *not* accept *mixed* >> bytes/text input -- join('x', b'y') should be rejected. But join('x', >> 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y'

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Nick Coghlan
On Fri, Jun 25, 2010 at 12:33 AM, Guido van Rossum wrote: > Also, IMO a polymorphic function should *not* accept *mixed* > bytes/text input -- join('x', b'y') should be rejected. But join('x', > 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense to me. A policy of allowing arguments to be ei
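
The policy quoted above (same-type input accepted, mixed input rejected) might look like the following. This is a sketch under assumptions, not the stdlib's implementation; the `join` function here is a stand-in for the example in the quoted text.

```python
def join(base, part):
    """Join two path-like pieces with '/', for str or for bytes, never mixed."""
    if not isinstance(base, (str, bytes)) or type(base) is not type(part):
        raise TypeError(
            "join() arguments must both be str or both be bytes, got "
            f"{type(base).__name__} and {type(part).__name__}"
        )
    # Pick the separator literal that matches the input type.
    sep = "/" if isinstance(base, str) else b"/"
    return base + sep + part


text_joined = join("x", "y")
bytes_joined = join(b"x", b"y")
```

The explicit type check only improves the error message; without it the `+` would raise `TypeError` on mixed input anyway.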

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Guido van Rossum
On Thu, Jun 24, 2010 at 1:12 AM, Stephen J. Turnbull wrote: > Guido van Rossum writes: > >  > For example: how we can make the suite of functions used for URL >  > processing more polymorphic, so that each developer can choose for >  > herself how URLs need to be treated in her application. > > Wh

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Michael Foord
On 24/06/2010 11:58, M.-A. Lemburg wrote: Lennart Regebro wrote: On Tue, Jun 22, 2010 at 20:07, James Y Knight wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- Well,

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread M.-A. Lemburg
Lennart Regebro wrote: > On Tue, Jun 22, 2010 at 20:07, James Y Knight wrote: >> Yeah. This is a real issue I have with the direction Python3 went: it pushes >> you into decoding everything to unicode early, even when you don't care -- > > Well, yes, maybe even if *you* don't care. But often the

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Lennart Regebro
On Tue, Jun 22, 2010 at 20:07, James Y Knight wrote: > Yeah. This is a real issue I have with the direction Python3 went: it pushes > you into decoding everything to unicode early, even when you don't care -- Well, yes, maybe even if *you* don't care. But often the functions you need to call must

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Stephen J. Turnbull
Guido van Rossum writes: > For example: how we can make the suite of functions used for URL > processing more polymorphic, so that each developer can choose for > herself how URLs need to be treated in her application. While you have come down on the side of polymorphism (as opposed to separat

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Toshio Kuratomi
On Wed, Jun 23, 2010 at 11:35:12PM +0200, Antoine Pitrou wrote: > On Wed, 23 Jun 2010 17:30:22 -0400 > Toshio Kuratomi wrote: > > Note that this assumption seems optimistic to me. I started talking to > > Graham > > Dumpleton, author of mod_wsgi a couple years back because mod_wsgi and paste > >

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Antoine Pitrou
On Wed, 23 Jun 2010 17:30:22 -0400 Toshio Kuratomi wrote: > Note that this assumption seems optimistic to me. I started talking to Graham > Dumpleton, author of mod_wsgi a couple years back because mod_wsgi and paste > do decoding of bytes to unicode at different layers which caused problems > fo

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Toshio Kuratomi
On Wed, Jun 23, 2010 at 09:36:45PM +0200, Antoine Pitrou wrote: > On Wed, 23 Jun 2010 14:23:33 -0400 > Tres Seaver wrote: > > - - the slow adoption / porting rate of major web frameworks and libraries > > to Python 3. > > Some of the major web frameworks and libraries have a ton of > dependenci

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Antoine Pitrou
On Wed, 23 Jun 2010 14:23:33 -0400 Tres Seaver wrote: > > Perhaps such decisions need revisiting in light of subsequent experience > / pain / learning. E.g: > > - - the repeated inability of the web-sig to converge on appropriate > semantics for a Python3-compatible version of the WSGI spec;

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Tres Seaver
Bill Janssen wrote: > The bigger problem seems to be that we're revisiting the design > discussion about urllib.parse from the summer of 2008. See > http://bugs.python.org/issue3300 if you want to recall how we hashed > this out 2 years ago. I didn'

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Glyph Lefkowitz
On Jun 22, 2010, at 8:57 PM, Robert Collins wrote: > bzr has a cache of decoded strings in it precisely because decode is > slow. We accept slowness encoding to the users locale because thats > typically much less data to examine than we've examined while > generating the commit/diff/whatever. We

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Ian Bicking
Oops, I forgot some important quoting (important for the algorithm, maybe not actually for the discussion)... from urllib.parse import urlsplit, urlunsplit import encodings.idna # urllib.parse.quote both always returns str, and is not as conservative in quoting as required here... def quote_unsaf

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Bill Janssen
Guido van Rossum wrote: > So I propose that we drop the discussion "are URLs text or bytes" and > try to find something more pragmatic to discuss. > > For example: how we can make the suite of functions used for URL > processing more polymorphic, so that each developer can choose for > herself h

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Ian Bicking
On Wed, Jun 23, 2010 at 10:30 AM, Tres Seaver wrote: > Stephen J. Turnbull wrote: > > > We do need str-based implementations of modules like urllib. > > > Why would that be? URLs aren't text, and never will be. The fact that > to the eye they may seem to be text-ish doesn't make them text. Th

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Bill Janssen
Tres Seaver wrote: > Stephen J. Turnbull wrote: > > > We do need str-based implementations of modules like urllib. > > Why would that be? URLs aren't text, and never will be. The fact that > to the eye they may seem to be text-ish doesn't make them text. This URLs are exactly text (strings,

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Barry Warsaw
On Jun 23, 2010, at 08:43 AM, Guido van Rossum wrote: >So I propose that we drop the discussion "are URLs text or bytes" and >try to find something more pragmatic to discuss. email has exactly the same question, and the answer is "yes". >For example: how we can make the suite of functions used

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Guido van Rossum
On Wed, Jun 23, 2010 at 8:30 AM, Tres Seaver wrote: > Stephen J. Turnbull wrote: > >> We do need str-based implementations of modules like urllib. > > Why would that be?  URLs aren't text, and never will be.  The fact that > to the eye they may seem to be text-ish doesn't make them text.  This > *

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Tres Seaver
Stephen J. Turnbull wrote: > We do need str-based implementations of modules like urllib. Why would that be? URLs aren't text, and never will be. The fact that to the eye they may seem to be text-ish doesn't make them text. This *is* a case where

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread P.J. Eby
At 08:34 PM 6/22/2010 -0400, Glyph Lefkowitz wrote: I suspect the practical problem here is that there's no CharacterString ABC That, and the absence of a string coercion protocol so that mixing your custom string with standard strings will do the right thing for your intended use.

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Nick Coghlan
On Wed, Jun 23, 2010 at 7:18 PM, M.-A. Lemburg wrote: > Note that the point of using a builtin method was to get > better performance. Such type adaptions are often needed in > loops, so adding a few extra Python function calls just to > convert a str object to a bytes object or vice-versa is a >

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread M.-A. Lemburg
Nick Coghlan wrote: > On Wed, Jun 23, 2010 at 4:09 AM, M.-A. Lemburg wrote: >> It would be great if we could have something like the above as >> builtin method: >> >> x.split('&'.as(x)) > > As per my other message, another possible (and reasonably intuitive) > spelling would be: > > x.split(x.

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Stephen J. Turnbull
James Y Knight writes: > The surrogateescape method is a nice workaround for this, but I can't > help thinking that it might've been better to just treat stuff as > possibly-invalid-but-probably-utf8 byte-strings from input, through > processing, to output. This is the world we already
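
For readers unfamiliar with the `surrogateescape` error handler mentioned here, a small sketch of the round trip it allows. The sample bytes are invented for illustration.

```python
# Latin-1 encoded bytes: 0xE9 is not valid at this position in UTF-8.
raw = b"caf\xe9"

# Undecodable bytes are smuggled through as lone surrogates (U+DC80..U+DCFF).
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

The intermediate `text` is a `str`, so str-only APIs accept it, but it is not valid Unicode text and a strict encode of it would fail.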

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Stephen J. Turnbull
Ian Bicking writes: > Just for perspective, I don't know if I've ever wanted to deal with a URL > like that. Ditto, I do many times a day for Japanese media sites and Wikipedia. > I know how it is supposed to work, and I know what a browser does > with that, but so many tools will clean that

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Robert Collins
On Wed, Jun 23, 2010 at 12:25 PM, Glyph Lefkowitz wrote: > I can also appreciate what's been said in this thread a bunch of times: to my > knowledge, nobody has actually shown a profile of an application where > encoding is significant overhead.  I believe that encoding _will_ be a > significan

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Mike Klaas
On Tue, Jun 22, 2010 at 4:23 PM, Ian Bicking wrote: > This reminds me of the optimization ElementTree and lxml made in Python 2 > (not sure what they do in Python 3?) where they use str when a string is > ASCII to avoid the memory and performance overhead of unicode. An optimization that forces

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Glyph Lefkowitz
On Jun 22, 2010, at 7:23 PM, Ian Bicking wrote: > This is a place where bytes+encoding might also have some benefit. XML is > someplace where you might load a bunch of data but only touch a little bit of > it, and the amount of data is frequently large enough that the efficiencies > are impor

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Glyph Lefkowitz
On Jun 22, 2010, at 2:07 PM, James Y Knight wrote: > Yeah. This is a real issue I have with the direction Python3 went: it pushes > you into decoding everything to unicode early, even when you don't care -- > all you really wanted to do is pass it from one API to another, with some > well-defi

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Glyph Lefkowitz
On Jun 22, 2010, at 12:53 PM, Guido van Rossum wrote: > On Mon, Jun 21, 2010 at 11:47 PM, Raymond Hettinger > wrote: >> >> On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: >> >> This is a common pain-point for porting software to 3.x - you had a >> string, it kinda worked most of the time

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread P.J. Eby
At 07:41 AM 6/23/2010 +1000, Nick Coghlan wrote: Then my example above could be made polymorphic (for ASCII compatible encodings) by writing: [x for x in seq if x.endswith(x.coerce("b"))] I'm trying to see downsides to this idea, and I'm not really seeing any (well, other than 2.7 being almos

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Ian Bicking
On Tue, Jun 22, 2010 at 11:17 AM, Guido van Rossum wrote: > (2) Data sources. > > These can be functions that produce new data from non-string data, > e.g. str(), read it from a named file, etc. An example is read() > vs. write(): it's easy to create a (hypothetical) polymorphic stream > object t

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Michael Foord
On 22/06/2010 19:07, James Y Knight wrote: On Jun 22, 2010, at 1:03 PM, Ian Bicking wrote: Similarly I'd expect (from experience) that a programmer using Python to want to take the same approach, sticking with unencoded data in nearly all situations. Yeah. This is a real issue I have with th

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Michael Foord
On 22/06/2010 22:40, Robert Collins wrote: On Wed, Jun 23, 2010 at 6:09 AM, M.-A. Lemburg wrote: return constant.encode('utf-8') So now you can write x.split(literal_as('&', x)). This polymorphism is what we used in Python2 a lot to write code that works for both Unico

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Nick Coghlan
On Wed, Jun 23, 2010 at 4:09 AM, M.-A. Lemburg wrote: > It would be great if we could have something like the above as > builtin method: > > x.split('&'.as(x)) As per my other message, another possible (and reasonably intuitive) spelling would be: x.split(x.coerce('&')) Writing it as a helper

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Nick Coghlan
On Wed, Jun 23, 2010 at 2:17 AM, Guido van Rossum wrote: > (1) Literals. > > If you write something like x.split('&') you are implicitly assuming x > is text. I don't see a very clean way to overcome this; you'll have to > implement some kind of type check e.g. > >    x.split('&') if isinstance(x,

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Robert Collins
On Wed, Jun 23, 2010 at 6:09 AM, M.-A. Lemburg wrote: >>           return constant.encode('utf-8') >> >> So now you can write x.split(literal_as('&', x)). > > This polymorphism is what we used in Python2 a lot to write > code that works for both Unicode and 8-bit strings. > > Unfortunately, this
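
The helper quoted in this message is truncated in the archive, so the following is a reconstruction under assumptions rather than the original code: only the name `literal_as`, the call `x.split(literal_as('&', x))`, and the `constant.encode('utf-8')` line come from the quoted text.

```python
def literal_as(constant, example):
    """Return the str literal `constant` in the same type as `example`."""
    if isinstance(example, str):
        return constant
    if isinstance(example, bytes):
        # Assumes the constant is meant to be UTF-8 (in practice, ASCII).
        return constant.encode("utf-8")
    raise TypeError(f"expected str or bytes, got {type(example).__name__}")


def split_query(x):
    """Split on '&' whether x is str or bytes."""
    return x.split(literal_as("&", x))


text_parts = split_query("a=1&b=2")
bytes_parts = split_query(b"a=1&b=2")
```

The cost raised in the thread is visible here: every literal in a polymorphic function needs such a call, which is what the proposed built-in spellings were meant to shorten.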

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Ian Bicking
On Tue, Jun 22, 2010 at 1:07 PM, James Y Knight wrote: > The surrogateescape method is a nice workaround for this, but I can't help > thinking that it might've been better to just treat stuff as > possibly-invalid-but-probably-utf8 byte-strings from input, through > processing, to output. It seem

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Terry Reedy
On 6/22/2010 12:53 PM, Guido van Rossum wrote: On Mon, Jun 21, 2010 at 11:47 PM, Raymond Hettinger wrote: On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: This is a common pain-point for porting software to 3.x - you had a string, it kinda worked most of the time before, but now you n

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Terry Reedy
On 6/22/2010 1:22 AM, Glyph Lefkowitz wrote: The thing that I have heard in passing from a couple of folks with experience in this area is that some older software in asia would present characters differently if they were originally encoded in a "japanese" encoding versus a "chinese" encoding, e

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread M.-A. Lemburg
Guido van Rossum wrote: > [Just addressing one little issue here; generally I'm just happy that > we're discussing this issue in such detail from so many points of > view.] > > On Mon, Jun 21, 2010 at 10:50 PM, Toshio Kuratomi wrote: >> [...] Would urljoin(b_base, b_subdir) => bytes and >> urljoi

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread James Y Knight
On Jun 22, 2010, at 1:03 PM, Ian Bicking wrote: Similarly I'd expect (from experience) that a programmer using Python to want to take the same approach, sticking with unencoded data in nearly all situations. Yeah. This is a real issue I have with the direction Python3 went: it pushes you

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Guido van Rossum
On Mon, Jun 21, 2010 at 11:47 PM, Raymond Hettinger wrote: > > On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: > >   This is a common pain-point for porting software to 3.x - you had a > string, it kinda worked most of the time before, but now you need to keep > track of text too and the func

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Toshio Kuratomi
On Tue, Jun 22, 2010 at 08:31:13PM +0900, Stephen J. Turnbull wrote: > Toshio Kuratomi writes: > > unicode handling redesign. I'm stating my reading of the RFC not to defend > > the use case Philip has, but because I think that the outlook that non-text > > uris (before being percentencoded) ar

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Ian Bicking
On Tue, Jun 22, 2010 at 6:31 AM, Stephen J. Turnbull wrote: > Toshio Kuratomi writes: > > > I'll definitely buy that. Would urljoin(b_base, b_subdir) => bytes and > > urljoin(u_base, u_subdir) => unicode be acceptable though? > > Probably. > > But it doesn't matter what I say, since Guido has d

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Guido van Rossum
[Just addressing one little issue here; generally I'm just happy that we're discussing this issue in such detail from so many points of view.] On Mon, Jun 21, 2010 at 10:50 PM, Toshio Kuratomi wrote: >[...] Would urljoin(b_base, b_subdir) => bytes and > urljoin(u_base, u_subdir) => unicode be acc

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Stephen J. Turnbull
Toshio Kuratomi writes: > I'll definitely buy that. Would urljoin(b_base, b_subdir) => bytes and > urljoin(u_base, u_subdir) => unicode be acceptable though? Probably. But it doesn't matter what I say, since Guido has defined that as "polymorphism" and approved it in principle. > (I think

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Stephen J. Turnbull
Glyph Lefkowitz writes: > On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote: > > Note also that the "complete solution" argument cuts both ways. Eg, a > > "complete" solution should implement UTS 39 "confusables detection"[1] > > and IDNA[2]. Good luck doing that with bytes! > > And

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Raymond Hettinger
On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: > This is a common pain-point for porting software to 3.x - you had a string, > it kinda worked most of the time before, but now you need to keep track of > text too and the functions which seemed to work on bytes no longer do. Thanks Glyph

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Glyph Lefkowitz
On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote: > The RFC says that URIs are text, and therefore they can (and IMO > should) be operated on as text in the stdlib. No, *blue* is the best color for a shed. Oops, wait, let me try that again. While I broadly agree with this statement, it

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Toshio Kuratomi
On Tue, Jun 22, 2010 at 11:58:57AM +0900, Stephen J. Turnbull wrote: > Toshio Kuratomi writes: > > > One comment here -- you can also have uri's that aren't decodable into > their > > true textual meaning using a single encoding. > > > > Apache will happily serve out uris that have utf-8, sh

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Glyph Lefkowitz
On Jun 21, 2010, at 2:17 PM, P.J. Eby wrote: > One issue I remember from my "enterprise" days is some of the Asian-language > developers at NTT/Verio explaining to me that unicode doesn't actually solve > certain issues -- that there are use cases where you really *do* need "bytes > plus encodi

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Stephen J. Turnbull
Robert Collins writes: > Perhaps you mean 3986 ? :) Thank you for the correction. > >    A URI is an identifier consisting of a sequence of characters > >    matching the syntax rule named in Section 3. > > > > (where the phrase "sequence of characters" appears in all ancestors I > > foun

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Stephen J. Turnbull
Toshio Kuratomi writes: > One comment here -- you can also have uri's that aren't decodable into their > true textual meaning using a single encoding. > > Apache will happily serve out uris that have utf-8, shift-jis, and > euc-jp components inside of their path but the textual > representa

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Terry Reedy
On 6/21/2010 1:29 PM, Guido van Rossum wrote: Actually, the big problem with Python 2 is that if you mix str and unicode, things work or crash depending on whether any of the str objects involved contain non-ASCII bytes. If one API decides to upgrade to Unicode, the result, when passed to anoth

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Terry Reedy
On 6/21/2010 1:29 PM, P.J. Eby wrote: At 05:49 PM 6/21/2010 +0100, Michael Foord wrote: Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks? __contains__ doesn't have a converse operation, so you can't code a type that work

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Robert Collins
2010/6/21 Stephen J. Turnbull : > Robert Collins writes: > >  > Also, url's are bytestrings - by definition; > > Eh?  RFC 3896 explicitly says "Definitions of Managed Objects for the DS3/E3 Interface Type" Perhaps you mean 3986 ? :) >    A URI is an identifier consisting of a sequence of characte

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread P.J. Eby
At 10:29 AM 6/21/2010 -0700, Guido van Rossum wrote: Perhaps there are more situations where a polymorphic API would be helpful. Such APIs are not always so easy to implement, because they have to be careful with literals or other constants (and even more so mutable state) used internally -- but

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread P.J. Eby
At 12:56 PM 6/21/2010 -0400, Toshio Kuratomi wrote: One comment here -- you can also have uri's that aren't decodable into their true textual meaning using a single encoding. Apache will happily serve out uris that have utf-8, shift-jis, and euc-jp components inside of their path but the textual

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Terry Reedy
On 6/21/2010 8:51 AM, Nick Coghlan wrote: I don't know that the "all is well" camp actually exists. The camp that I do see existing is the one that says "without a bug report, inconsistencies in the standard library's unicode handling won't get fixed". The issues picked up by the regression te

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread P.J. Eby
At 05:49 PM 6/21/2010 +0100, Michael Foord wrote: Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks? __contains__ doesn't have a converse operation, so you can't code a type that works around this (Python 3.1 shown): >>
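
The asymmetry described here can be demonstrated with a toy wrapper. This is a sketch only: the class name `Tagged` is invented and is not the `bstr` type under discussion.

```python
class Tagged:
    """Toy wrapper around bytes, standing in for a user-defined string type."""

    def __init__(self, data):
        self.data = data

    def __radd__(self, other):
        # Reflected hook: called for `bytes + Tagged`.
        return Tagged(other + self.data)


# Addition can be intercepted from the right-hand side.
combined = b"ab" + Tagged(b"cd")

# Containment cannot: only bytes.__contains__ is consulted, and there is
# no reflected counterpart for the wrapper to implement.
try:
    Tagged(b"b") in b"abc"
    contains_worked = True
except TypeError:
    contains_worked = False
```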

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Guido van Rossum
On Mon, Jun 21, 2010 at 9:46 AM, P.J. Eby wrote: > At 10:51 PM 6/21/2010 +1000, Nick Coghlan wrote: >> >> It may be that there are places where we need to rewrite standard >> library algorithms to be bytes/str neutral (e.g. by using length one >> slices instead of indexing). It may be that there a

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Terry Reedy
On 6/20/2010 11:56 PM, Terry Reedy wrote: The specific example is >>> urllib.parse.parse_qsl('a=b%e0') [('a', 'b�')] where the character after 'b' is white ? in dark diamond, indicating an error. parse_qsl() splits that input on '=' and sends each piece to urllib.parse.unquote unquote() atte
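
The result shown in this message can be reproduced, and changed, with the `encoding` and `errors` parameters that `urllib.parse.parse_qsl` accepts in Python 3.2 and later. A minimal sketch; the choice of Latin-1 as the alternative encoding is an assumption made for illustration.

```python
from urllib.parse import parse_qsl

# Defaults are encoding='utf-8', errors='replace': the lone byte 0xE0 is
# not valid UTF-8, so it becomes U+FFFD (the replacement character).
default_result = parse_qsl("a=b%e0")

# Telling the parser the percent-encoded bytes are Latin-1 decodes them.
latin1_result = parse_qsl("a=b%e0", encoding="latin-1")
```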

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Toshio Kuratomi
On Tue, Jun 22, 2010 at 01:08:53AM +0900, Stephen J. Turnbull wrote: > Lennart Regebro writes: > > > 2010/6/21 Stephen J. Turnbull : > > > IMO, the UI is right.  "Something" like the above "ought" to work. > > > > Right. That said, many times when you want to do urlparse etc they > > might b

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread P.J. Eby
At 01:08 AM 6/22/2010 +0900, Stephen J. Turnbull wrote: But if you need that "everywhere", what's so hard about def urljoin_wrapper (base, subdir): return urljoin(str(base, 'latin-1'), subdir).encode('latin-1') Now, note how that pattern fails as soon as you want to use non-ISO-8859-1 langu
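
A runnable variant of the wrapper quoted above, as a sketch. The name `urljoin_bytes` is made up, and decoding both arguments (the quoted code decodes only `base`) is an assumption of this sketch.

```python
from urllib.parse import urljoin


def urljoin_bytes(base, subdir):
    """Join two bytes URLs by round-tripping them through Latin-1 text."""
    # Latin-1 maps every byte to exactly one code point, so no byte value
    # is lost on the way in or out.
    joined = urljoin(base.decode("latin-1"), subdir.decode("latin-1"))
    return joined.encode("latin-1")


result = urljoin_bytes(b"http://example.com/a/", b"b\xe9")
```

The round trip is lossless only because nothing between the decode and the encode introduces characters outside Latin-1; that is the limitation the reply points to.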

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Michael Foord
On 21/06/2010 17:46, P.J. Eby wrote: At 10:51 PM 6/21/2010 +1000, Nick Coghlan wrote: It may be that there are places where we need to rewrite standard library algorithms to be bytes/str neutral (e.g. by using length one slices instead of indexing). It may be that there are more APIs that need t

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread P.J. Eby
At 10:51 PM 6/21/2010 +1000, Nick Coghlan wrote: It may be that there are places where we need to rewrite standard library algorithms to be bytes/str neutral (e.g. by using length one slices instead of indexing). It may be that there are more APIs that need to grow "encoding" keyword arguments th
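A small sketch of the "length one slices instead of indexing" point. The helper names are hypothetical. Indexing a `bytes` object yields an `int`, while indexing a `str` yields a `str`; slicing returns the same type as its input in both cases, which is what makes it usable in bytes/str neutral code.

```python
def first_unit(data):
    # data[0] would give an int for bytes; data[:1] keeps the input type.
    return data[:1]


def has_leading_slash(path):
    # Pick the separator literal that matches the argument's type.
    slash = "/" if isinstance(path, str) else b"/"
    return path[:1] == slash


indexed = b"/x"[0]       # an int, not b"/"
sliced = first_unit(b"/x")
```

A slice also returns an empty value for empty input rather than raising `IndexError`, so `has_leading_slash(b"")` is simply false.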

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Stephen J. Turnbull
Lennart Regebro writes: > 2010/6/21 Stephen J. Turnbull : > > IMO, the UI is right.  "Something" like the above "ought" to work. > > Right. That said, many times when you want to do urlparse etc they > might be binary, and you might want binary. So maybe the methods > should work with both?

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Nick Coghlan
On Mon, Jun 21, 2010 at 12:30 PM, P.J. Eby wrote: > I also find it weird that there seem to be two camps on this subject, one of > which claims that All Is Well And There Is No Problem -- but I do not recall > seeing anyone who was in the "What do I do; this doesn't seem ready" camp > who switched

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Lennart Regebro
2010/6/21 Stephen J. Turnbull : > IMO, the UI is right.  "Something" like the above "ought" to work. Right. That said, many times when you want to do urlparse etc they might be binary, and you might want binary. So maybe the methods should work with both? -- Lennart Regebro: http://regebro.wordp

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Stephen J. Turnbull
Robert Collins writes: > Also, url's are bytestrings - by definition; Eh? RFC 3986 explicitly says A URI is an identifier consisting of a sequence of characters matching the syntax rule named <URI> in Section 3. (where the phrase "sequence of characters" appears in all ancestors I found ba

Re: [Python-Dev] bytes / unicode

2010-06-20 Thread Lennart Regebro
On Sun, Jun 20, 2010 at 23:55, Benjamin Peterson wrote: > There are not many tools for treating bytes as text. Well, what tools would you need that can be used also on bytes? Bytes objects have a lot of the same methods that strings do, and that will cover 99% of the cases. Most text tools assume
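As a rough illustration of the shared methods mentioned here, the sketch below parses a header line without leaving `bytes`. The header value is an assumption made up for the example. The methods behave as they do on `str`, with the difference that their arguments must also be `bytes`.

```python
header = b"Content-Type: text/html"

# partition, strip and lower exist on bytes with str-like behaviour.
name, sep, value = header.partition(b":")
name = name.lower()
value = value.strip()

is_html = value.startswith(b"text/")
```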

Re: [Python-Dev] bytes / unicode

2010-06-20 Thread Terry Reedy
On 6/20/2010 9:33 PM, P.J. Eby wrote: At 07:33 PM 6/20/2010 -0400, Terry Reedy wrote: Do you have in mind any tools that could and should operate on both, but do not? From http://mail.python.org/pipermail/web-sig/2009-September/004105.html : Thanks for the concrete examples in this and your

Re: [Python-Dev] bytes / unicode

2010-06-20 Thread P.J. Eby
At 11:47 PM 6/20/2010 +0200, Antoine Pitrou wrote: On Sun, 20 Jun 2010 14:40:56 -0400 "P.J. Eby" wrote: > > Actually, I would say that it's more that (in the network protocol > case) we *have* bytes, some of which we would like to *treat* as > text, yet do not wish to constantly convert back and

Re: [Python-Dev] bytes / unicode

2010-06-20 Thread P.J. Eby
At 07:33 PM 6/20/2010 -0400, Terry Reedy wrote: Do you have in mind any tools that could and should operate on both, but do not? From http://mail.python.org/pipermail/web-sig/2009-September/004105.html : """The problem which arises is that unquoting of URLs in Python 3.X stdlib can only be don

Re: [Python-Dev] bytes / unicode

2010-06-20 Thread Terry Reedy
On 6/20/2010 5:55 PM, Benjamin Peterson wrote: 2010/6/20 Antoine Pitrou: On Sun, 20 Jun 2010 14:40:56 -0400 "P.J. Eby" wrote: Actually, I would say that it's more that (in the network protocol case) we *have* bytes, some of which we would like to *treat* as text, yet do not wish to constantly
