Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-31 Thread Steve Holden
Adam Olsen wrote: On 10/30/05, François Pinard wrote: All development is done in house by French people. All documentation, external or internal, comments, identifier and function names, everything is in French. Some of the developers here have had a long programming life,

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-31 Thread Greg Ewing
François Pinard wrote: All development is done in house by French people. All documentation, external or internal, comments, identifier and function names, everything is in French. There's nothing stopping you from creating your own Frenchified version of Python that lets you use all the

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-31 Thread François Pinard
[Greg Ewing] All development is done in house by French people. All documentation, external or internal, comments, identifier and function names, everything is in French. There's nothing stopping you from creating your own Frenchified version of Python that lets you use all the

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-30 Thread François Pinard
[Martin von Löwis] My canonical example is François Pinard, who keeps requesting it, saying that local people were surprised they couldn't use accented characters in Python. Perhaps that's because he actually is Quebecian :-) I presume I should comment a bit on this. People here are

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-30 Thread Adam Olsen
On 10/30/05, François Pinard wrote: All development is done in house by French people. All documentation, external or internal, comments, identifier and function names, everything is in French. Some of the developers here have had a long programming life, while they only

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-26 Thread Bengt Richter
At 11:43 2005-10-24 +0200, M.-A. Lemburg wrote: Bengt Richter wrote: Please bear with me for a few paragraphs ;-) Please note that source code encoding doesn't really have anything to do with the way the interpreter executes the program - it's merely a way to tell the parser how to convert

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread M.-A. Lemburg
Neil Hodgson wrote: M.-A. Lemburg: Unicode has the concept of combining code points, e.g. you can store an é (e with an accent) as e + '. Now if you slice off the accent, you'll break the character that you encoded using combining code points. ... next_indextype(u, index) -> integer
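
The slicing problem described here can be reproduced in a few lines. This is only a sketch, assuming Python 3 (which postdates this thread) and the standard `unicodedata` module; the variable names are illustrative and not part of any proposed API.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible character
decomposed = "e\u0301"

# Slicing by code point keeps the base letter and silently drops the accent
sliced = decomposed[:1]

# NFC normalization folds the pair into the single precomposed code point U+00E9
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), repr(sliced), len(composed))
```

Normalizing first helps only where a precomposed form exists; it does not remove the general problem the message points at.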

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread M.-A. Lemburg
Bengt Richter wrote: At 11:43 2005-10-24 +0200, M.-A. Lemburg wrote: Bengt Richter wrote: Please bear with me for a few paragraphs ;-) Please note that source code encoding doesn't really have anything to do with the way the interpreter executes the program - it's merely a way to tell the

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread Martin v. Löwis
Bill Janssen wrote: I just got mail this morning from a researcher who wants exactly what Martin described, and wondered why the default MacPython 2.4.2 didn't provide it by default. :-) If all he wants is to represent Deseret, he can do so in a 16-bit Unicode type, too: Python supports

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread Bill Janssen
I think he was more interested in the invariant Martin proposed, that len(\U0001) should always be the same and should always be 1. Bill ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread Guido van Rossum
On 10/25/05, Bill Janssen wrote: I think he was more interested in the invariant Martin proposed, that len(\U0001) should always be the same and should always be 1. Yes but why? What does this invariant do for him? -- --Guido van Rossum (home page:

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread Martin v. Löwis
Guido van Rossum wrote: Yes but why? What does this invariant do for him? I don't know about this person, but there are a few things that don't work properly in UTF-16 mode: - the Unicode character database fails to look up things. u"\U0001D670".isupper() gives false, but should give true
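
For comparison, here is how the same character behaves on an interpreter that indexes strings by code point. This is a sketch that assumes CPython 3.3 or later, which postdates this thread; on the UTF-16 builds discussed here the results differed. The variable names are illustrative.

```python
import unicodedata

ch = "\U0001D670"  # MATHEMATICAL MONOSPACE CAPITAL A, outside the Basic Multilingual Plane

code_points = len(ch)                           # length as seen by Python code
utf16_units = len(ch.encode("utf-16-le")) // 2  # a surrogate pair when encoded as UTF-16
category = unicodedata.category(ch)             # general category from the character database
is_upper = ch.isupper()

print(code_points, utf16_units, category, is_upper)
```

The gap between `code_points` and `utf16_units` is the one a UTF-16 representation exposes when strings are indexed by 16-bit unit.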

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-25 Thread Neil Hodgson
M.-A. Lemburg: You mean a slice that slices out the next indextype ? Yes. This sounds a lot like you'd want iterators for the various index types. Should be possible to implement on top of the proposed APIs, e.g. itergraphemes(u), itercodepoints(u), etc. Iterators may be helpful, but
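
An iterator along the lines of the `itergraphemes(u)` idea mentioned here might look like the sketch below. It is a deliberate simplification: it only attaches combining marks (as reported by `unicodedata.combining`) to the preceding character and does not implement the full Unicode grapheme cluster rules. The function body is an assumption for illustration, not the API that was proposed.

```python
import unicodedata

def itergraphemes(text):
    """Yield each base character together with the combining marks that follow it.

    Simplified sketch: this is not a full grapheme cluster segmentation.
    """
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            # A new base character starts: emit the finished cluster
            yield cluster
            cluster = ch
        else:
            # First character, or a combining mark that belongs to the current cluster
            cluster += ch
    if cluster:
        yield cluster

print(list(itergraphemes("e\u0301a")))
```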

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Martin v. Löwis
Neil Hodgson wrote: I'd like to more tightly define Unicode strings for Python 3000. Currently, Unicode strings may be implemented with either 2 byte (UCS-2) or 4 byte (UTF-32) elements. Python should allow strings to contain any Unicode character and should be indexable yielding

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Martin v. Löwis
Phillip J. Eby wrote: I'm tempted to say it would be even better if there was a command line option that could be used to force all binary opens to result in bytes, and require all text opens to specify an encoding. For Python 3000? -1. There shouldn't be command line switches that have that

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Neil Hodgson
Martin v. Löwis: That's very tricky. If you have multiple implementations, you make usage at the C API difficult. If you make it either UTF-8 or UTF-32, you make PythonWin difficult. If you make it UTF-16, you make indexing difficult. For Windows, the code will get a little uglier,

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread M.-A. Lemburg
Neil Hodgson wrote: Guido van Rossum: Folks, please focus on what Python 3000 should do. I'm thinking about making all character strings Unicode (possibly with different internal representations a la NSString in Apple's Objective C) and introduce a separate mutable bytes array data type. But

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread M.-A. Lemburg
Bengt Richter wrote: Please bear with me for a few paragraphs ;-) Please note that source code encoding doesn't really have anything to do with the way the interpreter executes the program - it's merely a way to tell the parser how to convert string literals (currently only the Unicode ones) into

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Bill Janssen
I'm thinking about making all character strings Unicode (possibly with different internal representations a la NSString in Apple's Objective C) and introduce a separate mutable bytes array data type. But I could use some validation or feedback on this idea from actual practitioners. +1 from

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Bill Janssen
Python should allow strings to contain any Unicode character and should be indexable yielding characters rather than half characters. Therefore Python strings should appear to be UTF-32. +1. Bill

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Martin v. Löwis
Neil Hodgson wrote: For Windows, the code will get a little uglier, needing to perform an allocation/encoding and deallocation more often than at present but I don't think there will be a speed degradation as Windows is currently performing a conversion from 8 bit to UTF-16 inside many

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Antoine Pitrou
There are many design alternatives: one option would be to support *three* internal representations in a single type, generating the others from the one operation existing as needed. The default, initial representation might be UTF-8, with UCS-4 only being generated when indexing occurs, and

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Guido van Rossum
On 10/24/05, Martin v. Löwis wrote: Indeed. My guess is that indexing is more common than you think, especially when iterating over the string. Of course, iteration could also operate on UTF-8, if you introduced string iterator objects. Python's slice-and-dice model pretty

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Guido van Rossum
On 10/24/05, Martin v. Löwis wrote: Guido van Rossum wrote: Changing the APIs would be much work, although perhaps not impossible for Python 3000. For example, Raymond Hettinger's partition() API doesn't refer to indices at all, and can replace many uses of find() or
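
To illustrate the point about index-free APIs, the sketch below splits a header line both ways. `str.partition()` is the method as it later shipped in Python; the sample string and variable names are made up for the example.

```python
line = "Content-Type: text/plain"

# Index-based approach: find() returns a position that the caller must slice around
pos = line.find(":")
name_by_index = line[:pos]
value_by_index = line[pos + 1:]

# Index-free approach: partition() returns the three pieces directly
name, sep, value = line.partition(":")

# When the separator is absent, partition() returns the whole string and two empty strings
missing = "no separator here".partition(":")

print(name, repr(value), missing)
```

Because no index crosses the API boundary, the caller does not depend on how the string is stored internally.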

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Neil Hodgson
M.-A. Lemburg: Unicode has the concept of combining code points, e.g. you can store an é (e with an accent) as e + '. Now if you slice off the accent, you'll break the character that you encoded using combining code points. ... next_indextype(u, index) -> integer Returns the

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Bill Janssen
- yet others think: I want all of Unicode, with proper, efficient indexing, so I want four bytes per char. I doubt the last one though. Probably they really don't want efficient indexing, they want to perform higher-level operations that currently are only possible using efficient

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Guido van Rossum
On 10/24/05, Bill Janssen wrote: - yet others think: I want all of Unicode, with proper, efficient indexing, so I want four bytes per char. I doubt the last one though. Probably they really don't want efficient indexing, they want to perform higher-level operations

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Greg Ewing
Guido van Rossum wrote: I think the API should reflect the representation *to some extent*, namely it shouldn't claim to have operations that are typically thought of as O(1) that can only be implemented as O(n). Maybe a compromise could be reached by using a btree of chunks or something, so

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Greg Ewing
Guido van Rossum wrote: Python's slice-and-dice model pretty much ensures that indexing is common. Almost everything is ultimately represented as indices: regex search results have the index in the API, find()/index() return indices, many operations take a start and/or end index. Maybe the

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-24 Thread Bill Janssen
Guido writes: Oh, I don't doubt that they want it. But often they don't *need* it, and the higher-level goal they are trying to accomplish can be dealt with better in a different way. (Sort of my response to people asking for static typing in Python as well. :-) I suppose that's true. But

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-23 Thread Jason Orendorff
-1 on keeping the source encoding of string literals. Python should definitely decode them at compile time. -1 on decoding implicitly as needed. This causes decoding to happen late, in unpredictable places. Decodes can fail; they should happen as early and as close to the data source as
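
The "decode early, close to the data source" argument can be sketched as follows. The helper `read_text` and the sample byte strings are hypothetical, and the example assumes Python 3 semantics, where bytes and text are separate types.

```python
import io

def read_text(stream, encoding="utf-8"):
    """Decode once, at the boundary, so bad input fails here and not deep inside the program."""
    return stream.read().decode(encoding)

good = io.BytesIO("na\u00efve".encode("utf-8"))
bad = io.BytesIO(b"na\xefve")  # Latin-1 bytes, which are not valid UTF-8

text = read_text(good)

try:
    read_text(bad)
    failed_early = False
except UnicodeDecodeError:
    # The failure surfaces at the point where the data enters the program
    failed_early = True

print(text, failed_early)
```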

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-23 Thread Bob Ippolito
On Oct 23, 2005, at 3:10 PM, Jason Orendorff wrote: -1 on decoding implicitly as needed. This causes decoding to happen late, in unpredictable places. Decodes can fail; they should happen as early and as close to the data source as possible. That's not necessarily true... Some codecs can't

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-23 Thread Stephan Richter
On Sunday 23 October 2005 18:10, Jason Orendorff wrote: -1 on keeping the source encoding of string literals.  Python should definitely decode them at compile time. -1 on decoding implicitly as needed.  This causes decoding to happen late, in unpredictable places.  Decodes can fail; they

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-23 Thread Guido van Rossum
Folks, please focus on what Python 3000 should do. I'm thinking about making all character strings Unicode (possibly with different internal representations a la NSString in Apple's Objective C) and introduce a separate mutable bytes array data type. But I could use some validation or feedback on

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-23 Thread Bob Ippolito
On Oct 23, 2005, at 6:06 PM, Guido van Rossum wrote: Folks, please focus on what Python 3000 should do. I'm thinking about making all character strings Unicode (possibly with different internal representations a la NSString in Apple's Objective C) and introduce a separate mutable bytes array

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-23 Thread Phillip J. Eby
At 06:06 PM 10/23/2005 -0700, Guido van Rossum wrote: Folks, please focus on what Python 3000 should do. I'm thinking about making all character strings Unicode (possibly with different internal representations a la NSString in Apple's Objective C) and introduce a separate mutable bytes array

[Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-22 Thread Bengt Richter
Please bear with me for a few paragraphs ;-) One aspect of str-type strings is the efficiency afforded when all the encoding really is ascii. If the internal encoding were e.g. fixed utf-16le for strings, maybe with today's computers it would still be efficient enough for most actual string

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-17 Thread Martin v. Löwis
Martin Blais wrote: Yes. setdefaultencoding() is removed from sys by site.py. To get it again you must reload sys. Thanks. Actually, I should take the opportunity to advise people that setdefaultencoding doesn't really work. With the default default encoding, strings and Unicode objects hash

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-16 Thread Martin Blais
On 10/15/05, Reinhold Birkenfeld wrote: Martin Blais wrote: On 10/3/05, Michael Hudson wrote: Martin Blais writes: How hard would that be to implement? import sys; reload(sys); sys.setdefaultencoding('undefined') Hmmm any

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-15 Thread Martin Blais
On 10/3/05, Michael Hudson wrote: Martin Blais writes: How hard would that be to implement? import sys; reload(sys); sys.setdefaultencoding('undefined') Hmmm any particular reason for the call to reload() here?

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-15 Thread Reinhold Birkenfeld
Martin Blais wrote: On 10/3/05, Michael Hudson wrote: Martin Blais writes: How hard would that be to implement? import sys; reload(sys); sys.setdefaultencoding('undefined') Hmmm any particular reason for the call to reload() here? Yes.

[Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Martin Blais
Hi. Like a lot of people (or so I hear in the blogosphere...), I've been experiencing some friction in my code with unicode conversion problems. Even when being super extra careful with the types of str's or unicode objects that my variables can contain, there is always some case or oversight

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Michael Hudson
Martin Blais writes: What if we could completely disable the implicit conversions between unicode and str? In other words, if you would ALWAYS be forced to call either .encode() or .decode() to convert between one and the other... wouldn't that help a lot deal with that
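
For readers who want to see what "always forced to call .encode() or .decode()" looks like in practice, here is a sketch using Python 3, which postdates this thread and keeps `bytes` and `str` strictly apart. The sample data and variable names are illustrative.

```python
raw = "caf\u00e9".encode("utf-8")  # bytes, as they might arrive from a file or socket

try:
    mixed = raw + " au lait"       # bytes + str: there is no implicit conversion
except TypeError:
    mixed = None

# The conversion has to be spelled out, with the encoding named explicitly
text = raw.decode("utf-8") + " au lait"
data = text.encode("utf-8")

print(mixed, text, data)
```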

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Antoine Pitrou
On Monday 03 October 2005 at 02:09 -0400, Martin Blais wrote: What if we could completely disable the implicit conversions between unicode and str? This would be very annoying when dealing with some modules or libraries where the type (str / unicode) returned by a function depends on the

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Fredrik Lundh
Antoine Pitrou wrote: A good rule of thumb is to convert to unicode everything that is semantically textual and isn't pure ASCII. (anyone who is tempted to argue otherwise should benchmark their applications, both speed- and memorywise, and be prepared to come up with very strong arguments

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Antoine Pitrou
On Monday 03 October 2005 at 14:59 +0200, Fredrik Lundh wrote: Antoine Pitrou wrote: A good rule of thumb is to convert to unicode everything that is semantically textual and isn't pure ASCII. How can you be sure that something that is /semantically textual/ will always remain pure

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Martin Blais
On 10/3/05, M.-A. Lemburg wrote: I'm not sure it's a sensible default. Me neither, especially since this would make it impossible to write polymorphic code - e.g. ', '.join(list) wouldn't work anymore if list contains Unicode; ditto for u', '.join(list) with list
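
The join() objection can be made concrete with a small sketch. It assumes Python 3 semantics (no implicit conversion between bytes and text); `to_text` is a hypothetical helper written for this example, not a standard function.

```python
def to_text(item, encoding="utf-8"):
    """Return item as text, decoding it first if it is bytes (hypothetical helper)."""
    return item.decode(encoding) if isinstance(item, bytes) else item

items = ["alpha", b"beta", "gamma"]

try:
    ", ".join(items)               # mixed list: join() refuses the bytes element
    joined_directly = True
except TypeError:
    joined_directly = False

# Normalizing every element at one place restores the polymorphic call
joined = ", ".join(to_text(item) for item in items)

print(joined_directly, joined)
```

The cost of strictness is the explicit normalization step; the benefit is that the chosen encoding is visible at exactly one place.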

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Jim Fulton
Martin Blais wrote: Hi. Like a lot of people (or so I hear in the blogosphere...), I've been experiencing some friction in my code with unicode conversion problems. Even when being super extra careful with the types of str's or unicode objects that my variables can contain, there is always

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Jim Fulton
M.-A. Lemburg wrote: Michael Hudson wrote: Martin Blais writes: What if we could completely disable the implicit conversions between unicode and str? In other words, if you would ALWAYS be forced to call either .encode() or .decode() to convert between one and the other...

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Fredrik Lundh
Jim Fulton wrote: I would argue that it's evil to change the default encoding in the first place, except in this case to disable implicit encoding or decoding. absolutely. unfortunately, all attempts to add such information to the sys module documentation seem to have failed... (last time I

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Josiah Carlson
Antoine Pitrou wrote: On Monday 03 October 2005 at 14:59 +0200, Fredrik Lundh wrote: Antoine Pitrou wrote: A good rule of thumb is to convert to unicode everything that is semantically textual and isn't pure ASCII. How can you be sure that something that

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-03 Thread Fredrik Lundh
Josiah Carlson wrote: and isn't pure ASCII. How can you be sure that something that is /semantically textual/ will always remain pure ASCII ? That's contradictory, unless your software never goes out of the anglo-saxon world (and even...). Non-unicode text input widgets. Works