Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Adam Olsen wrote:
> On 10/30/05, François Pinard [EMAIL PROTECTED] wrote:
>> All development is done in house by French people. All documentation,
>> external or internal, comments, identifier and function names, everything
>> is in French. Some of the developers here have had a long programming
>> life, while they only barely read English. It is surely a constant
>> frustration, for some of us, having to mangle identifiers by ravelling
>> out their necessary diacritics. It does not look good, it does not smell
>> good, and in many cases, mangling identifiers significantly decreases
>> program legibility.
>
> Hear, hear! Not all the world uses English, and restricting them to Latin
> characters simply means it's not readable in any language. It doesn't make
> it any more readable for those of us who only understand English. +1 on
> internationalized identifiers.

While I agree with the sentiments expressed, I think we should not underestimate the practical problems that moving away from ASCII would involve. Therefore, if such steps are really going to be considered, I would really like to see them introduced in such a way that no breakage occurs for existing users, even the parochial ones who feel they (and their programs) don't need to understand anything but ASCII. If this means starting out with the features conditionally compiled, despite the added cost of the #ifdefs that would thereby be engendered, I think that would be a good idea.

We can fix their programs by making Unicode the default string type, but it will take much longer to fix their thinking.

regards
Steve

--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
François Pinard wrote:
> All development is done in house by French people. All documentation,
> external or internal, comments, identifier and function names, everything
> is in French.

There's nothing stopping you from creating your own Frenchified version of Python that lets you use all the characters you want, for your own in-house use.

--
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
[EMAIL PROTECTED]                  +--------------------------------------+
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
[Greg Ewing]
>> All development is done in house by French people. All documentation,
>> external or internal, comments, identifier and function names,
>> everything is in French.
>
> There's nothing stopping you from creating your own Frenchified version
> of Python that lets you use all the characters you want, for your own
> in-house use.

No doubt that we, you and me and everybody, could all have our own little version of Python. :-) To tell the whole truth, the very topic of your suggestion has already been discussed in-house, and the decision was to stick to mainstream Python. We could not justify to our administration that we start modifying our sources in such a way that we would have to invest in maintenance each time a new Python version appears, forever.

On the other hand, we may reasonably guess that many people in this world would love being as comfortable as possible using Python while naming identifiers naturally. It is not so unreasonable that we keep some _hope_ that Guido will soon choose to help us all, not only me.

--
François Pinard http://pinard.progiciels-bpi.ca
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
[Martin von Löwis]
> My canonical example is François Pinard, who keeps requesting it, saying
> that local people were surprised they couldn't use accented characters in
> Python. Perhaps that's because he actually is Quebecian :-)

I presume I should comment a bit on this. People here are not surprised they cannot use accented characters; they are rather saddened, and some hoped that Python would offer that possibility one of these days. Also, every production program or system here has been progressively rewritten in Python, slowly at first, and more aggressively as confidence was building up, to the point that not much non-Python code remains by now. So all our hopes are concentrated in a single language.

All development is done in house by French people. All documentation, external or internal, comments, identifier and function names, everything is in French. Some of the developers here have had a long programming life, while they only barely read English. It is surely a constant frustration, for some of us, having to mangle identifiers by ravelling out their necessary diacritics. It does not look good, it does not smell good, and in many cases, mangling identifiers significantly decreases program legibility.

Now, I keep reading strange arguments from people opposing our use of national letters in identifiers, disturbed by the fact that they would have a hard time reading our code or publishing it. Even worse, some want to protect us (and the world) against ourselves, using made-up, irrational arguments, producing false logic out of their own emotions and feelings. They would like us to think, write, and publish in English. Is it some anachronistic colonialism? Quite possibly. It surely has some success, as you may find French people who will only swear in English! :-)

For one, in my programming life, I surely chose to write a lot of English code, and I still think English is a good vehicle for planetary communication. However, I like it to be my choice. I have always felt open and collaborative with similarly minded people, and for them happily rewrote my things from French to English for sharing, whenever I saw some mutual advantage in it. I resent it when people want to force me into English when I have no real reason to use it. Let me choose to use my own language, as nicely as I can, when working in-shop with people who share this language with me, on programs that will likely never be published outside anyway.

Internationalisation is already granted in our overall view of today's programming, as a way of letting people be comfortable with computers, each in his or her own language. This comfort should extend widely to naming the main programming objects (functions, classes, variables, modules) as legibly as possible. Here, I mean legible in an ideal way for the team or the local community, and not necessarily legible to the whole planet. It does not always have to be planetary, you know.

For keywords, the need is less stringent, as syntactic constructs are part of a language. When English is opaque to programmers, they can easily learn the small set of words making up the syntax, understanding their effect even while not necessarily understanding the real English meaning of those keywords. This is not a real obstacle in practice.

It is true that many Python tools are not prepared to handle internationalised identifiers, and it is very unlikely that these tools will get ready before Python opens itself to internationalised identifiers. Let's open Python first; tools will undoubtedly follow. There will be some adaptation period, but after a while everything will fall into place, things will become smooth again and just natural to everybody, to the point that many of us may look back on the current times and wonder what all that fuss was about. :-)

--
François Pinard http://pinard.progiciels-bpi.ca
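[For the record, what Pinard asks for did eventually become legal: PEP 3131 added non-ASCII identifiers in Python 3.0. A minimal sketch of the style in question; the French names here are made up for illustration.]

```python
# Non-ASCII identifiers, legal since Python 3.0 (PEP 3131).
def moitié_gauche(largeur):
    """Retourne la moitié gauche d'une largeur donnée."""
    côté = largeur / 2
    return côté

print(moitié_gauche(10))  # 5.0
```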
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On 10/30/05, François Pinard [EMAIL PROTECTED] wrote:
> All development is done in house by French people. All documentation,
> external or internal, comments, identifier and function names, everything
> is in French. Some of the developers here have had a long programming
> life, while they only barely read English. It is surely a constant
> frustration, for some of us, having to mangle identifiers by ravelling
> out their necessary diacritics. It does not look good, it does not smell
> good, and in many cases, mangling identifiers significantly decreases
> program legibility.

Hear, hear! Not all the world uses English, and restricting them to Latin characters simply means it's not readable in any language. It doesn't make it any more readable for those of us who only understand English. +1 on internationalized identifiers.

--
Adam Olsen, aka Rhamphoryncus
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
At 11:43 2005-10-24 +0200, M.-A. Lemburg wrote:
> Bengt Richter wrote:
>> Please bear with me for a few paragraphs ;-)
>
> Please note that source code encoding doesn't really have anything to do
> with the way the interpreter executes the program - it's merely a way to
> tell the parser how to convert string literals (currently only the
> Unicode ones) into constant Unicode objects within the program text. It's
> also a nice way to let other people know what kind of encoding you used
> to write your comments ;-) Nothing more.

I think somehow I didn't make things clear, sorry ;-) As I tried to show in the example of module_a.cs vs module_b.cs, the source encoding currently results in two different str-type strings representing the source _character_ sequence, which is the _same_ in both cases.

To make it more clear, try the following little program (untested except on NT4 with Python 2.4b1 (#56, Nov 3 2004, 01:47:27) [GCC 3.2.3 (mingw special 20030504-1)] on win32 ;-):

---- t_srcenc.py ----
import os

def test():
    open('module_a.py', 'wb').write(
        "# -*- coding: latin-1 -*-" + os.linesep +
        "cs = '\xfcber-cool'" + os.linesep)
    open('module_b.py', 'wb').write(
        "# -*- coding: utf-8 -*-" + os.linesep +
        "cs = '\xc3\xbcber-cool'" + os.linesep)
    # show that we have two modules differing only in encoding:
    print ''.join(line.decode('latin-1') for line in open('module_a.py'))
    print ''.join(line.decode('utf-8') for line in open('module_b.py'))
    # see how results are affected:
    import module_a, module_b
    print module_a.cs + ' =?= ' + module_b.cs
    print module_a.cs.decode('latin-1') + ' =?= ' + module_b.cs.decode('utf-8')

if __name__ == '__main__':
    test()
---------------------

The result copied from the NT4 console to the clipboard and pasted into Eudora:
__
[17:39] C:\pywk\python-dev> py24 t_srcenc.py
# -*- coding: latin-1 -*-
cs = 'über-cool'

# -*- coding: utf-8 -*-
cs = 'über-cool'

nber-cool =?= ++ber-cool
über-cool =?= über-cool
__

(I'd say NT did the best it could, rendering the copied cp437 superscript n as the 'n' above, and the '++' coming from the cp437 box characters corresponding to '\xc3\xbc'. Not sure how it will show on your screen, but try the program to see ;-)

> Once a module is compiled, there's no distinction between a module using
> the latin-1 source code encoding or one using the utf-8 encoding.

ISTM module_a.cs and module_b.cs can readily be distinguished after compilation, whereas the sources displayed according to their declared encodings as above (or as e.g. different editors using different native encodings might display them) cannot (other than by the encoding cookie itself) ;-) Perhaps you meant something else?

> Thanks,

You're welcome.

Regards,
Bengt Richter
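[For contrast, a sketch of the same experiment in today's Python 3, where the declared source encoding is applied to the whole file, so both modules yield the identical str. Assumes a writable temp directory; the module names are the same illustrative ones as above.]

```python
import os
import sys
import tempfile

tmp = tempfile.mkdtemp()
# Two modules with identical characters but different source encodings.
with open(os.path.join(tmp, 'module_a.py'), 'w', encoding='latin-1') as f:
    f.write("# -*- coding: latin-1 -*-\ncs = '\xfcber-cool'\n")
with open(os.path.join(tmp, 'module_b.py'), 'w', encoding='utf-8') as f:
    f.write("# -*- coding: utf-8 -*-\ncs = '\xfcber-cool'\n")

sys.path.insert(0, tmp)
import module_a, module_b
print(module_a.cs == module_b.cs)  # True: both decode to the same str
```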
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Neil Hodgson wrote:
> M.-A. Lemburg:
>> Unicode has the concept of combining code points, e.g. you can store an
>> é (e with an accent) as "e" plus a combining accent. Now if you slice
>> off the accent, you'll break the character that you encoded using
>> combining code points. ...
>> next_indextype(u, index) -> integer
>>     Returns the Unicode object index for the start of the next indextype
>>     found after u[index] or -1 in case no next element of this type
>>     exists.
>
> Should entity breakage be further discouraged by returning a slice here
> rather than an object index?

You mean a slice that slices out the next indextype? Something like:

    i = first_grapheme(u)
    x = 0
    while x < width and u[i] != "\n":
        x, _ = draw(u[i], (x, y))
        i = next_grapheme(u, i)

This sounds a lot like you'd want iterators for the various index types. Should be possible to implement on top of the proposed APIs, e.g. itergraphemes(u), itercodepoints(u), etc.

Note that what most people refer to as a character is a grapheme in Unicode speak. Given that interpretation, breaking Unicode characters is something you won't ever work around by using larger code units such as UCS4-compatible ones.

Furthermore, you should also note that surrogates (two code units encoding one code point) are part of Unicode life. While you don't need them when storing Unicode in UCS4 code units, they can still be part of the Unicode data and the programmer has to be aware of them.

Personally, I don't think that slicing Unicode is such a big issue. If you know what you are doing, things tend not to break - which is true for pretty much everything you do in programming ;-)

--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Oct 25 2005)
Python/Zope Consulting and Support ...    http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...          http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...       http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
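[The combining-code-point hazard described above is easy to see directly in a modern Python 3; a small illustrative sketch, not the proposed API.]

```python
import unicodedata

s = 'e\u0301'  # "é" stored as 'e' + COMBINING ACUTE ACCENT: two code points
print(len(s))        # 2
print(ascii(s[:1]))  # 'e' -- slicing has stripped the accent

# Normalizing to NFC first composes the pair into one code point:
nfc = unicodedata.normalize('NFC', s)
print(len(nfc))      # 1
```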
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Bengt Richter wrote:
> At 11:43 2005-10-24 +0200, M.-A. Lemburg wrote:
>> Please note that source code encoding doesn't really have anything to do
>> with the way the interpreter executes the program - it's merely a way to
>> tell the parser how to convert string literals (currently only the
>> Unicode ones) into constant Unicode objects within the program text.
>> It's also a nice way to let other people know what kind of encoding you
>> used to write your comments ;-) Nothing more.
>
> I think somehow I didn't make things clear, sorry ;-) As I tried to show
> in the example of module_a.cs vs module_b.cs, the source encoding
> currently results in two different str-type strings representing the
> source _character_ sequence, which is the _same_ in both cases.

I don't follow you here. The source code encoding is only applied to Unicode literals (you are using string literals in your example). String literals are passed through as-is.

Whether or not your editor will use the source code encoding marker is really up to your editor and not within the scope of Python. If you open the two module files in Emacs, you'll see identical renderings of the string literals. With other editors, you may have to explicitly tell the editor which encoding to assume. Ditto for shell printouts.

> To make it more clear, try the following little program (untested except
> on NT4 with Python 2.4b1 (#56, Nov 3 2004, 01:47:27) [GCC 3.2.3 (mingw
> special 20030504-1)] on win32 ;-):
>
> ---- t_srcenc.py ----
> import os
>
> def test():
>     open('module_a.py', 'wb').write(
>         "# -*- coding: latin-1 -*-" + os.linesep +
>         "cs = '\xfcber-cool'" + os.linesep)
>     open('module_b.py', 'wb').write(
>         "# -*- coding: utf-8 -*-" + os.linesep +
>         "cs = '\xc3\xbcber-cool'" + os.linesep)
>     # show that we have two modules differing only in encoding:
>     print ''.join(line.decode('latin-1') for line in open('module_a.py'))
>     print ''.join(line.decode('utf-8') for line in open('module_b.py'))
>     # see how results are affected:
>     import module_a, module_b
>     print module_a.cs + ' =?= ' + module_b.cs
>     print module_a.cs.decode('latin-1') + ' =?= ' + module_b.cs.decode('utf-8')
>
> if __name__ == '__main__':
>     test()
> ---------------------
>
> The result copied from the NT4 console to the clipboard and pasted into
> Eudora:
> __
> [17:39] C:\pywk\python-dev> py24 t_srcenc.py
> # -*- coding: latin-1 -*-
> cs = 'über-cool'
>
> # -*- coding: utf-8 -*-
> cs = 'über-cool'
>
> nber-cool =?= ++ber-cool
> über-cool =?= über-cool
> __
>
> (I'd say NT did the best it could, rendering the copied cp437 superscript
> n as the 'n' above, and the '++' coming from the cp437 box characters
> corresponding to '\xc3\xbc'. Not sure how it will show on your screen,
> but try the program to see ;-)
>
>> Once a module is compiled, there's no distinction between a module using
>> the latin-1 source code encoding or one using the utf-8 encoding.
>
> ISTM module_a.cs and module_b.cs can readily be distinguished after
> compilation, whereas the sources displayed according to their declared
> encodings as above (or as e.g. different editors using different native
> encodings might display them) cannot (other than by the encoding cookie
> itself) ;-) Perhaps you meant something else?

What your editor displays to you is not within the scope of Python, e.g. if you open the files in Emacs you'll see something different than in Notepad.

I guess that's the price you have to pay for being able to write programs that can include Unicode literals using the complete range of possible Unicode characters without having to revert to escapes.

--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Oct 25 2005)
Python/Zope Consulting and Support ...    http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...          http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...       http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Bill Janssen wrote:
> I just got mail this morning from a researcher who wants exactly what
> Martin described, and wondered why the default MacPython 2.4.2 didn't
> provide it by default. :-)

If all he wants is to represent Deseret, he can do so in a 16-bit Unicode type, too: Python supports UTF-16.

Regards,
Martin
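[Deseret is indeed representable via UTF-16 surrogate pairs; a quick sketch in a modern Python 3, where the escape denotes one code point regardless of internal storage.]

```python
# DESERET CAPITAL LETTER LONG I (U+10400) lies outside the BMP.
ch = '\U00010400'
print(len(ch))                  # 1 code point
utf16 = ch.encode('utf-16-be')  # stored as a surrogate pair in UTF-16
print(utf16.hex())              # d801dc00 -- two 16-bit code units
```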
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
I think he was more interested in the invariant Martin proposed, that len(\U0001) should always be the same and should always be 1.

Bill
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On 10/25/05, Bill Janssen [EMAIL PROTECTED] wrote:
> I think he was more interested in the invariant Martin proposed, that
> len(\U0001) should always be the same and should always be 1.

Yes, but why? What does this invariant do for him?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Guido van Rossum wrote:
> Yes, but why? What does this invariant do for him?

I don't know about this person, but there are a few things that don't work properly in UTF-16 mode:

- The Unicode character database fails to look things up. u"\U0001D670".isupper() gives false, but should give true (since it denotes MATHEMATICAL MONOSPACE CAPITAL A). It gives true in UCS-4 mode.

- As a result, normalization on these doesn't work, either. It should normalize to LATIN CAPITAL LETTER A under NFKC, but doesn't.

- Regular expressions only have limited support. In particular, adding non-BMP characters to character classes is not possible. [\U0001D670] will match any character that is either \uD835 or \uDE70, whereas it matches only MATHEMATICAL MONOSPACE CAPITAL A in UCS-4 mode.

There might be more limitations, but those are the ones that come to mind easily. While I could imagine fixing the first two with some effort, the third one is really tricky (unless you would accept a wide representation of a character class even if the Unicode representation is only narrow).

Regards,
Martin
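[For reference, all three narrow-build limitations listed above were eventually resolved when CPython moved to a flexible internal representation (PEP 393, Python 3.3); checking them in a modern interpreter:]

```python
import re
import unicodedata

c = '\U0001D670'  # MATHEMATICAL MONOSPACE CAPITAL A
print(len(c))                             # 1
print(c.isupper())                        # True: the database lookup works
print(unicodedata.normalize('NFKC', c))   # 'A' under NFKC
print(bool(re.match('[\U0001D670]', c)))  # True: non-BMP character classes work
```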
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
M.-A. Lemburg:
> You mean a slice that slices out the next indextype ?

Yes.

> This sounds a lot like you'd want iterators for the various index types.
> Should be possible to implement on top of the proposed APIs, e.g.
> itergraphemes(u), itercodepoints(u), etc.

Iterators may be helpful, but can also be too restrictive when the processing is not completely iterative, such as peeking ahead or looking behind to wrap at a word boundary in the display example. It was more that there may be less scope for error with a move away from indexes to slices. The PEP provides ways to specify what you want to examine or modify, but it looks to me like returning indexes will lead to code repetition or additional variables, with an increase in fragility.

> Note that what most people refer to as character is a grapheme in
> Unicode speak.

A grapheme-oriented string type may be worthwhile, although you'd probably have to choose a particular normalisation form to ease processing.

> Given that interpretation, breaking Unicode characters is something you
> won't ever work around by using larger code units such as
> UCS4-compatible ones.

I still think we can reduce the scope for errors.

> Furthermore, you should also note that surrogates (two code units
> encoding one code point) are part of Unicode life. While you don't need
> them when storing Unicode in UCS4 code units, they can still be part of
> the Unicode data and the programmer has to be aware of them.

Many programmers can and will ignore surrogates. One day that may bite them, but we can't close off text processing to those who have no idea what surrogates are, or directional marks, or that sorting is locale dependent, or who have no understanding of the difference between the NFC and NFKD normalization forms.

Neil
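[As a footnote to the NFC/NFKD point: the two forms really do disagree on common text, which is easy to check. An illustrative sketch:]

```python
import unicodedata

s = '\u2460'  # CIRCLED DIGIT ONE
print(unicodedata.normalize('NFC', s) == s)  # True: canonical form, unchanged
print(unicodedata.normalize('NFKD', s))      # '1': compatibility form folds it
```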
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Neil Hodgson wrote:
> I'd like to more tightly define Unicode strings for Python 3000.
> Currently, Unicode strings may be implemented with either 2 byte (UCS-2)
> or 4 byte (UTF-32) elements. Python should allow strings to contain any
> Unicode character and should be indexable yielding characters rather than
> half characters. Therefore Python strings should appear to be UTF-32.
> There could still be multiple implementations (using UTF-16 or UTF-8) to
> preserve space but all implementations should appear to be the same apart
> from speed and memory use.

That's very tricky. If you have multiple implementations, you make usage at the C API difficult. If you make it either UTF-8 or UTF-32, you make PythonWin difficult. If you make it UTF-16, you make indexing difficult.

Regards,
Martin
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Phillip J. Eby wrote:
> I'm tempted to say it would be even better if there was a command line
> option that could be used to force all binary opens to result in bytes,
> and require all text opens to specify an encoding.

For Python 3000? -1. There shouldn't be command line switches that have that much importance. For Python 2.x? Well, we are not supposed to discuss this.

Regards,
Martin
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Martin v. Löwis:
> That's very tricky. If you have multiple implementations, you make usage
> at the C API difficult. If you make it either UTF-8 or UTF-32, you make
> PythonWin difficult. If you make it UTF-16, you make indexing difficult.

For Windows, the code will get a little uglier, needing to perform an allocation/encoding and deallocation more often than at present, but I don't think there will be a speed degradation, as Windows is currently performing a conversion from 8 bit to UTF-16 inside many system calls. To minimize the cost of allocation, Python could copy Windows in keeping a small number of commonly sized preallocated buffers handy.

For indexing UTF-16, a flag could be set to show whether the string is all in the base plane and, if not, an index could be constructed when and if needed.

It'd be good to get some feel for what proportion of string operations performed require indexing. Many, such as startswith, split, and concatenation, don't require indexing. The proportion of operations that use indexing to scan strings would also be interesting, as adding a (currentIndex, currentOffset) cursor to string objects would be another approach.

Neil
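[The constructed index Neil mentions could look roughly like this. A hypothetical sketch in Python for clarity; a real implementation would live in C inside the string type.]

```python
def build_codepoint_index(units):
    """Map code-point index to offset in a list of 16-bit UTF-16 code units.

    Returns (all_bmp, offsets): all_bmp is the flag described above;
    offsets[n] is the code-unit offset of the n-th code point.
    """
    offsets = []
    all_bmp = True
    i = 0
    while i < len(units):
        offsets.append(i)
        if 0xD800 <= units[i] <= 0xDBFF:  # lead surrogate: two-unit pair
            all_bmp = False
            i += 2
        else:
            i += 1
    return all_bmp, offsets

# 'A', then U+10400 (surrogate pair D801 DC00), then 'B'
all_bmp, offsets = build_codepoint_index([0x0041, 0xD801, 0xDC00, 0x0042])
print(all_bmp, offsets)  # False [0, 1, 3]
```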
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Neil Hodgson wrote:
> Guido van Rossum:
>> Folks, please focus on what Python 3000 should do. I'm thinking about
>> making all character strings Unicode (possibly with different internal
>> representations a la NSString in Apple's Objective C) and introduce a
>> separate mutable bytes array data type. But I could use some validation
>> or feedback on this idea from actual practitioners.
>
> I'd like to more tightly define Unicode strings for Python 3000.
> Currently, Unicode strings may be implemented with either 2 byte (UCS-2)
> or 4 byte (UTF-32) elements. Python should allow strings to contain any
> Unicode character and should be indexable yielding characters rather than
> half characters. Therefore Python strings should appear to be UTF-32.
> There could still be multiple implementations (using UTF-16 or UTF-8) to
> preserve space but all implementations should appear to be the same apart
> from speed and memory use.

There seems to be a general misunderstanding here: even if you have UCS4 storage, it is still possible to slice a Unicode string in a way which breaks rendering. Unicode has the concept of combining code points, e.g. you can store an é (e with an accent) as "e" plus a combining accent. Now if you slice off the accent, you'll break the character that you encoded using combining code points. Note that combining code points are rather common in encodings of Asian scripts, so this is not an artificial example.

Some time ago I proposed a new module called unicodeindex to help with indexing. It would solve most of the indexing issues you run into when dealing with Unicode. I've attached it to this email for reference.

More on the terms used:

http://www.egenix.com/files/python/EuroPython2002-Python-and-Unicode.pdf
http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf

--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Oct 24 2005)
Python/Zope Consulting and Support ...    http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...          http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...       http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !


PEP: 0XXX
Title: Unicode Indexing Helper Module
Version: $Revision: 1.0 $
Author: [EMAIL PROTECTED] (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Post-History:

Abstract

    This PEP proposes a new module "unicodeindex" which provides means to
    index Unicode objects in various higher level abstractions of
    characters.

Problem and Terminology

    Unicode objects can be indexed just like string objects, using what in
    Unicode terms is called a code unit as index basis. Code units are the
    storage entities used by the Unicode implementation to store a single
    Unicode information unit and do not necessarily map 1-1 to code points,
    which are the smallest entities encoded by the Unicode standard. Python
    exposes code units to the programmer via the Unicode object indexing
    and slicing API, e.g. u[10] or u[12:15] refer to the code units at
    index 10 and indices 12 to 14.

    Code points can sometimes be composed to form graphemes, which are then
    displayed by the Unicode output device as one character. A word is a
    sequence of characters separated by space characters or punctuation; a
    line is a sequence of code points separated by line breaking code point
    sequences.

    For addressing Unicode, there are basically five different methods by
    which you can reference the data:

    1. per code unit (codeunit)
    2. per code point (codepoint)
    3. per grapheme (grapheme)
    4. per word (word)
    5. per line (line)

    The indexing type name is given in parentheses and used in the module
    interface.

Proposed Solution

    I propose to add a new module to the standard Python library which
    provides interfaces implementing the above indexing methods.

Module Interface

    The module should provide the following interfaces for all five
    indexing styles:

    next_indextype(u, index) -> integer

        Returns the Unicode object index for the start of the next
        indextype found after u[index] or -1 in case no next element of
        this type exists.

    prev_indextype(u, index) -> integer

        Returns the Unicode object index for the start of the previous
        indextype found before u[index] or -1 in case no previous element
        of this type exists.

    indextype_index(u, n) -> integer

        Returns the Unicode object index for the start of the n-th
        indextype element in u. Raises an IndexError in case no n-th
        element can be found.

    indextype_count(u, index) -> integer

        Counts the number of complete indextype elements found in u[:index]
        and returns the count
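[A rough feel for what grapheme-level iteration over such an interface could do, sketched with a hypothetical helper that merely attaches combining marks to their base code point; the proposed module would be more complete.]

```python
import unicodedata

def itergraphemes(u):
    """Yield grapheme-like clusters: a base code point plus any trailing
    combining marks. A simplification of full Unicode cluster rules."""
    cluster = ''
    for ch in u:
        if cluster and unicodedata.combining(ch):
            cluster += ch  # combining mark joins the current cluster
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

print(list(itergraphemes('e\u0301tude')))  # ['e\u0301', 't', 'u', 'd', 'e']
```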
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Bengt Richter wrote:
> Please bear with me for a few paragraphs ;-)

Please note that source code encoding doesn't really have anything to do with the way the interpreter executes the program - it's merely a way to tell the parser how to convert string literals (currently only the Unicode ones) into constant Unicode objects within the program text. It's also a nice way to let other people know what kind of encoding you used to write your comments ;-) Nothing more.

Once a module is compiled, there's no distinction between a module using the latin-1 source code encoding or one using the utf-8 encoding.

Thanks,
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Oct 24 2005)
Python/Zope Consulting and Support ...    http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...          http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...       http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
I'm thinking about making all character strings Unicode (possibly with different internal representations a la NSString in Apple's Objective C) and introduce a separate mutable bytes array data type. But I could use some validation or feedback on this idea from actual practitioners. +1 from me, too. I'm tempted to say it would be even better if there was a command line option that could be used to force all binary opens to result in bytes, and require all text opens to specify an encoding. I like this idea, too. Presumably plain open(FILENAME, MODE) would then result in a binary open (no encoding specified), which I've wanted for a long time (and which makes sense). But it is a change. Bill
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Python should allow strings to contain any Unicode character and should be indexable yielding characters rather than half characters. Therefore Python strings should appear to be UTF-32. +1. Bill
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Neil Hodgson wrote: For Windows, the code will get a little uglier, needing to perform an allocation/encoding and deallocation more often than at present, but I don't think there will be a speed degradation, as Windows is currently performing a conversion from 8 bit to UTF-16 inside many system calls. [...] For indexing UTF-16, a flag could be set to show if the string is all in the base plane and, if not, an index could be constructed when and if needed. There are many design alternatives: one option would be to support *three* internal representations in a single type, generating the others from the one existing as needed. The default, initial representation might be UTF-8, with UCS-4 only being generated when indexing occurs, and UCS-2 only being generated when the API requires it. On concatenation, always concatenate just one representation: either one that is already present in both operands, else UTF-8. It'd be good to get some feel for what proportion of string operations performed require indexing. Many, such as startswith, split, and concatenation, don't require indexing. The proportion of operations that use indexing to scan strings would also be interesting, as adding a (currentIndex, currentOffset) cursor to string objects would be another approach. Indeed. My guess is that indexing is more common than you think, especially when iterating over the string. Of course, iteration could also operate on UTF-8, if you introduced string iterator objects. Regards, Martin
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
There are many design alternatives: one option would be to support *three* internal representations in a single type, generating the others from the one existing as needed. The default, initial representation might be UTF-8, with UCS-4 only being generated when indexing occurs, and UCS-2 only being generated when the API requires it. On concatenation, always concatenate just one representation: either one that is already present in both operands, else UTF-8. Wouldn't it be simpler to use: - a one-byte representation if every character is <= 0xFF - a two-byte representation if every character is <= 0xFFFF - a four-byte representation otherwise Then combining several strings means using the larger representation as the result (*). In practice, most use cases will not involve the four-byte representation. (*) a heuristic can be invented so that, when producing a smaller string (by stripping/slicing/etc.), it will sometimes check whether a narrower representation is possible. For example: store the length of the string when the last check occurred, and do a new check when the length falls below half that value. Regards Antoine.
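The selection rule Antoine describes can be sketched in a few lines. A minimal illustration (the helper name is mine, not part of any proposal) that picks the narrowest per-character width able to hold every code point:

```python
def narrowest_kind(s):
    """Return the smallest per-character width in bytes (1, 2, or 4)
    that can hold every code point of s; empty strings fit in 1 byte."""
    widest = max(map(ord, s), default=0)
    if widest <= 0xFF:
        return 1
    if widest <= 0xFFFF:
        return 2
    return 4
```

Combining two strings then simply takes the larger of the two widths: `max(narrowest_kind(a), narrowest_kind(b))`.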
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On 10/24/05, Martin v. Löwis [EMAIL PROTECTED] wrote: Indeed. My guess is that indexing is more common than you think, especially when iterating over the string. Of course, iteration could also operate on UTF-8, if you introduced string iterator objects. Python's slice-and-dice model pretty much ensures that indexing is common. Almost everything is ultimately represented as indices: regex search results have the index in the API, find()/index() return indices, many operations take a start and/or end index. As long as that's the case, indexing had better be fast. Changing the APIs would be much work, although perhaps not impossible for Python 3000. For example, Raymond Hettinger's partition() API doesn't refer to indices at all, and can replace many uses of find() or index(). Still, the mere existence of __getitem__ and __getslice__ on strings makes it necessary to implement them efficiently. How realistic would it be to drop them? What should replace them? Some kind of abstract pointers-into-strings perhaps, but that seems much more complex. The trick seems to be to support both simple programs manipulating short strings (where indexing is probably the easiest API to understand, and the additional copying is unlikely to cause performance problems), as well as programs manipulating very large buffers containing text and doing sophisticated string processing on them. Perhaps we could provide a different kind of API to support the latter, perhaps based on a mutable character buffer data type without direct indexing? -- --Guido van Rossum (home page: http://www.python.org/~guido/)
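The point about partition() (then a proposal; it was added in Python 2.5) is that it removes indices from the caller's code entirely. A small comparison of the two styles:

```python
s = "key=value=more"

# Index-free style: partition() returns the three pieces directly.
head, sep, tail = s.partition("=")

# The equivalent index-based style the thread is worried about:
i = s.find("=")
if i >= 0:
    head2, tail2 = s[:i], s[i + 1:]
```

With partition(), the representation never needs to support O(1) integer indexing for this very common splitting pattern.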
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On 10/24/05, Martin v. Löwis [EMAIL PROTECTED] wrote: Guido van Rossum wrote: Changing the APIs would be much work, although perhaps not impossible for Python 3000. For example, Raymond Hettinger's partition() API doesn't refer to indices at all, and can replace many uses of find() or index(). I think Neil's proposal is not to make them go away, but to implement them less efficiently. For example, if the internal representation is UTF-8, indexing requires linear time, as opposed to constant time. If the internal representation is UTF-16, and you have a flag to indicate whether there are any surrogates in the string, indexing is constant if the flag is false, else linear. I understand all that. My point is that it's a bad idea to offer an indexing operation that isn't O(1). Perhaps we could provide a different kind of API to support the latter, perhaps based on a mutable character buffer data type without direct indexing? There are different design goals conflicting here: - some think: all my data is ASCII, so I want to use only one byte per character. - others think: all my data goes to the Windows API, so I want to use 2 bytes per character. - yet others think: I want all of Unicode, with proper, efficient indexing, so I want four bytes per char. I doubt the last one, though. Probably they really don't want efficient indexing; they want to perform higher-level operations that currently are only possible using efficient indexing or slicing. With the right API, perhaps they could work just as efficiently with an internal representation of UTF-8. It's not so much a matter of API as a matter of internal representation. The API doesn't have to change (except for the very low-level C API that directly exposes Py_UNICODE*, perhaps). I think the API should reflect the representation *to some extent*, namely it shouldn't claim to have operations that are typically thought of as O(1) but can only be implemented as O(n). An internal representation of UTF-8 might make everyone happy except heavy Windows users; but it requires changes to the API so people won't be writing Python 2.x-style string slinging code. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
M.-A. Lemburg: Unicode has the concept of combining code points, e.g. you can store an é (e with an accent) as e + a combining accent. Now if you slice off the accent, you'll break the character that you encoded using combining code points. ... next_indextype(u, index) -> integer Returns the Unicode object index for the start of the next indextype found after u[index] or -1 in case no next element of this type exists. Should entity breakage be further discouraged by returning a slice here rather than an object index? Something like:

    i = first_grapheme(u)
    x = 0
    while x < width and u[i] != '\n':
        x, _ = draw(u[i], (x, y))
        i = next_grapheme(u, i)

Neil
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
- yet others think: I want all of Unicode, with proper, efficient indexing, so I want four bytes per char. I doubt the last one though. Probably they really don't want efficient indexing, they want to perform higher-level operations that currently are only possible using efficient indexing or slicing. With the right API, perhaps they could work just as efficiently with an internal representation of UTF-8. I just got mail this morning from a researcher who wants exactly what Martin described, and wondered why the default MacPython 2.4.2 didn't provide it by default. :-) Bill
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On 10/24/05, Bill Janssen [EMAIL PROTECTED] wrote: - yet others think: I want all of Unicode, with proper, efficient indexing, so I want four bytes per char. I doubt the last one though. Probably they really don't want efficient indexing, they want to perform higher-level operations that currently are only possible using efficient indexing or slicing. With the right API, perhaps they could work just as efficiently with an internal representation of UTF-8. I just got mail this morning from a researcher who wants exactly what Martin described, and wondered why the default MacPython 2.4.2 didn't provide it by default. :-) Oh, I don't doubt that they want it. But often they don't *need* it, and the higher-level goal they are trying to accomplish can be dealt with better in a different way. (Sort of my response to people asking for static typing in Python as well. :-) Did they tell you what they were trying to do that MacPython 2.4.2 wouldn't let them, beyond represent a large Unicode string as an array of 4-byte integers? -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Guido van Rossum wrote: I think the API should reflect the representation *to some extent*, namely it shouldn't claim to have operations that are typically thought of as O(1) that can only be implemented as O(n). Maybe a compromise could be reached by using a btree of chunks or something, so indexing is O(log n). Not as good as O(1) but a lot better than O(n). -- Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand
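The chunked idea can be sketched with a flat cumulative-length table searched by bisection; a real implementation would use a balanced tree so that edits stay cheap, but lookup is already O(log n) here. This is an illustrative sketch, not anyone's proposed design:

```python
import bisect

class ChunkedString:
    """A string stored as chunks, with O(log n) character indexing."""

    def __init__(self, text, chunk=4):
        self.chunks = [text[i:i + chunk] for i in range(0, len(text), chunk)]
        # cum[k] = total number of characters in chunks[0..k]
        self.cum = []
        total = 0
        for c in self.chunks:
            total += len(c)
            self.cum.append(total)

    def __getitem__(self, i):
        # find the chunk containing character i by binary search
        k = bisect.bisect_right(self.cum, i)
        prev = self.cum[k - 1] if k else 0
        return self.chunks[k][i - prev]
```

Because chunk lengths are stored rather than assumed, the chunks could hold variable-width encodings (e.g. UTF-8) internally without changing the lookup logic.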
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Guido van Rossum wrote: Python's slice-and-dice model pretty much ensures that indexing is common. Almost everything is ultimately represented as indices: regex search results have the index in the API, find()/index() return indices, many operations take a start and/or end index. Maybe the idea of string views should be reconsidered in light of this. It's been criticised on the grounds that its use could keep large strings alive longer than needed, but if operations that currently return indices instead returned string views, this wouldn't be any more of a concern than it is now, especially if there is an easy way to explicitly materialise the view as an independent string when wanted. -- Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand
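A string view along these lines is easy to sketch: a (base, start, stop) triple that search operations return instead of bare indices, plus an explicit way to materialise it. The class and method names here are hypothetical, purely to illustrate the shape of the API:

```python
class StringView:
    """A lightweight view into a larger string."""
    __slots__ = ("base", "start", "stop")

    def __init__(self, base, start, stop):
        self.base, self.start, self.stop = base, start, stop

    def __str__(self):
        # explicit materialisation: copy the characters out so the
        # (possibly huge) base string no longer needs to be kept alive
        return self.base[self.start:self.stop]

    def find(self, sub):
        # return a view of the match instead of an integer index
        i = self.base.find(sub, self.start, self.stop)
        return StringView(self.base, i, i + len(sub)) if i >= 0 else None
```

For example, `str(StringView(big, 0, len(big)).find("needle"))` yields the matched substring without ever exposing an index to the caller.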
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Guido writes: Oh, I don't doubt that they want it. But often they don't *need* it, and the higher-level goal they are trying to accomplish can be dealt with better in a different way. (Sort of my response to people asking for static typing in Python as well. :-) I suppose that's true. But what if they're not smart enough to figure out that better, different, way? I doubt you intend Python to be sort of the Rubik's cube of programming... And no, he didn't say why he wanted the ability to represent a Unicode string as an array of 4-byte integers. Though I know he's doing something with the Deseret Alphabet, translating some early work on American Indian culture that was transcribed in that character set. Bill
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
-1 on keeping the source encoding of string literals. Python should definitely decode them at compile time. -1 on decoding implicitly as needed. This causes decoding to happen late, in unpredictable places. Decodes can fail; they should happen as early and as close to the data source as possible. -j
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On Oct 23, 2005, at 3:10 PM, Jason Orendorff wrote: -1 on decoding implicitly as needed. This causes decoding to happen late, in unpredictable places. Decodes can fail; they should happen as early and as close to the data source as possible. That's not necessarily true... Some codecs can't fail, like latin1. I think the main use case for this is to speed up usage of text in these sorts of formats anyway. -bob
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On Sunday 23 October 2005 18:10, Jason Orendorff wrote: -1 on keeping the source encoding of string literals. Python should definitely decode them at compile time. -1 on decoding implicitly as needed. This causes decoding to happen late, in unpredictable places. Decodes can fail; they should happen as early and as close to the data source as possible. +1. We have followed this last practice throughout Zope 3 successfully. In our case, the publisher framework (in other words the output-protocol-specific layer) is responsible for the decoding and encoding of input and output streams, respectively. We have been pretty much free of any encoding/decoding troubles since. Having our application only use unicode internally was one of the best decisions we have made. Regards, Stephan -- Stephan Richter
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Folks, please focus on what Python 3000 should do. I'm thinking about making all character strings Unicode (possibly with different internal representations a la NSString in Apple's Objective C) and introducing a separate mutable bytes array data type. But I could use some validation or feedback on this idea from actual practitioners. I don't want to see proposals to mess with the str/unicode semantics in Python 2.x. Let's leave the Python 2.x str/unicode semantics alone until Python 3000 -- we don't need multiple transitions. (Although we could add the mutable bytes array type sooner.) -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On Oct 23, 2005, at 6:06 PM, Guido van Rossum wrote: Folks, please focus on what Python 3000 should do. [...] (Although we could add the mutable bytes array type sooner.) +1, this is precisely what I'd like to see. -bob
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
At 06:06 PM 10/23/2005 -0700, Guido van Rossum wrote: Folks, please focus on what Python 3000 should do. I'm thinking about making all character strings Unicode (possibly with different internal representations a la NSString in Apple's Objective C) and introduce a separate mutable bytes array data type. But I could use some validation or feedback on this idea from actual practitioners. +1. Chandler has been going through quite an upheaval to get its unicode handling together. Having a bytes type would be great, as long as there was support for files and sockets to produce bytes instead of strings (unless an encoding was specified). I'm tempted to say it would be even better if there was a command line option that could be used to force all binary opens to result in bytes, and require all text opens to specify an encoding. The Chandler i18n project lead would jump for joy if we had a way to keep legacy strings out of the system, apart from ASCII string constants found in code. It would then be okay not to drop support for the implicit conversions; if you can't get strings on input, then conversion's not really an issue. Anyway, I think all of the things I'd like to see can be done without breakage in 2.5. For Chandler at least, we'd be willing to go with a command-line option that's more strict, in order to be able to ensure that plugin developers can't accidentally put 8-bit strings in somewhere, just by opening a file.
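The file-opening model being asked for here -- binary opens yield bytes, text opens take an explicit encoding -- is essentially what Python 3 eventually shipped. A small sketch of the two modes:

```python
import os
import tempfile

# write some non-ASCII text as UTF-8 bytes
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write("über".encode("utf-8"))
    path = f.name

with open(path, "rb") as f:                    # binary open: raw bytes, no decoding
    raw = f.read()

with open(path, "r", encoding="utf-8") as f:   # text open: explicit encoding
    text = f.read()

os.remove(path)
```

Here `raw` is a bytes object containing the encoded form, while `text` is the decoded character string; no implicit conversion ever happens between the two.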
[Python-Dev] Divorcing str and unicode (no more implicit conversions).
Please bear with me for a few paragraphs ;-)

One aspect of str-type strings is the efficiency afforded when all the encoding really is ascii. If the internal encoding were e.g. fixed utf-16le for strings, maybe with today's computers it would still be efficient enough for most actual string purposes (excluding the current use of str-strings as byte sequences). I.e., you'd still have to identify what was strings (of characters) and what was really byte sequences with no implied or explicit encoding or character semantics.

Ok, let's make that distinction explicit: call one kind of string a byte sequence and the other a character sequence (representation being a separate issue). A unicode object is of course the prime _general_ representation of a character sequence in Python, but all the names in python source code (that become NAME tokens) are UIAM also character sequences, and representable by a byte sequence interpreted according to ascii encoding.

For the sake of discussion, suppose we had another _character_ sequence type that was the moral equivalent of unicode except for internal representation, namely a str subclass with an encoding attribute specifying the encoding that you _could_ use to decode the str bytes part to get unicode (which you wouldn't do except when necessary). We could call it class charstr(str): ... and have charstr().bytes be the str part and charstr().encoding specify the encoding part. In all the contexts where we have obvious encoding information, we can then generate a charstr instead of a str. E.g., if the source of module_a has

    # -*- coding: latin1 -*-
    cs = 'über-cool'

then

    type(cs)     # => <type 'charstr'>
    cs.bytes     # => '\xfcber-cool'
    cs.encoding  # => 'latin-1'

and print cs would act like

    print cs.bytes.decode(cs.encoding)

-- or I guess

    sys.stdout.write(cs.bytes.decode(cs.encoding).encode(sys.stdout.encoding))

followed by

    sys.stdout.write('\n'.decode('ascii').encode(sys.stdout.encoding))

for the newline of the print.
Now if module_b has

    # -*- coding: utf8 -*-
    cs = 'über-cool'

and we interactively import module_a, module_b and then

    print module_a.cs + ' =?= ' + module_b.cs

what could happen ideally vs. what we have currently? UIAM, currently we would just get the three str byte sequences concatenated to make '\xfcber-cool =?= \xc3\xbcber-cool', and that would be printed as whatever that comes out as, without conversion, when seen by the output according to sys.stdout.encoding.

But if those cs instances had been charstr instances, the coding cookie encoding information would have been preserved, and the interactive print could have evaluated the string expression -- given cs.decode() as sugar for

    cs.bytes.decode(cs.encoding
                    or globals().get('__encoding__')
                    or __import__('sys').getdefaultencoding())

-- as

    module_a.cs.decode() + ' =?= '.decode() + module_b.cs.decode()

if pairwise terms differ in encoding, as they might all here. If the interactive session source were e.g. latin-1, like module_a, then module_a.cs + ' =?= ' would not require an encoding change, because the ' =?= ' would be a charstr instance with encoding == 'latin-1', and so the result would still be latin-1 that far. But with module_b.cs being utf8, the next addition would cause the .decode() promotions to unicode. In a console window, the ' =?= '.encoding might be 'cp437' or such, and the first addition would then cause promotion (since module_a.cs.encoding != 'cp437').

I have sneaked in run-time access to individual modules' encodings by assuming that the encoding cookie could be compiled in as an explicit global __encoding__ variable for any given module (what to have as __encoding__ for built-in modules could vary for various purposes).
ISTM this could have use in situations where an encoding assumption is necessary and currently 'ascii' is not as good a guess as one could make, though I suspect if string literals became charstr strings instead of str strings, many if not most of those situations would disappear (I'm saying this because ATM I can't think of an 'ascii'-guess situation that wouldn't go away ;-)

If there were a charchr() version of chr() that would result in a charstr instead of a str, IWT one would want an easy-sugar default encoding assumption, probably based on the same as one would assume for '%c' % num in a given module source -- which presumably would be '%c'.encoding, where '%c' assumes the encoding of the module source, normally recorded in __encoding__. So charchr(n) would act like chr(n).decode().encode(''.encoding) -- or more reasonably charstr(chr(n)), which would be short for

    charstr(chr(n), globals().get('__encoding__')
                    or __import__('sys').getdefaultencoding())

Or some efficient equivalent ;-)

Using strings in dicts requires hashing to find key comparison candidates and comparison to check for key equivalence. This would seem to point to some kind of
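The core of the charstr idea -- bytes that remember their encoding, promoting to unicode only when encodings disagree -- can be sketched in modern Python, where bytes plays the role 2005's str did. Everything here (the class, the decode_chars name, the promotion rule in __add__) is my illustrative reading of the post, not a proposed implementation:

```python
class charstr(bytes):
    """A byte string that carries the encoding needed to decode it."""

    def __new__(cls, data, encoding):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def decode_chars(self):
        """Decode to a character string using the remembered encoding."""
        return bytes(self).decode(self.encoding)

    def __add__(self, other):
        if isinstance(other, charstr) and other.encoding != self.encoding:
            # encodings differ: promote both sides to unicode, as the post suggests
            return self.decode_chars() + other.decode_chars()
        # same encoding: stay in the byte representation
        return charstr(bytes(self) + bytes(other), self.encoding)
```

So a latin-1 charstr plus a utf-8 charstr yields a decoded character string, while concatenating two latin-1 charstrs stays a latin-1 charstr with no decoding work done.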
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Martin Blais wrote: Yes. setdefaultencoding() is removed from sys by site.py. To get it again you must reload sys. Thanks. Actually, I should take the opportunity to advise people that setdefaultencoding doesn't really work. With the default default encoding, strings and Unicode objects hash equal when they are equal. If you change the default encoding, this property goes away (perhaps unless you change it to Latin-1). As a result, dictionaries where you mix string and Unicode keys won't work: you might not find a value for a string key when looking up with a Unicode object, and vice versa. Regards, Martin
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On 10/15/05, Reinhold Birkenfeld [EMAIL PROTECTED] wrote: Martin Blais wrote: On 10/3/05, Michael Hudson [EMAIL PROTECTED] wrote: Martin Blais [EMAIL PROTECTED] writes: How hard would that be to implement? import sys reload(sys) sys.setdefaultencoding('undefined') Hmmm any particular reason for the call to reload() here? Yes. setdefaultencoding() is removed from sys by site.py. To get it again you must reload sys. Thanks. cheers,
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On 10/3/05, Michael Hudson [EMAIL PROTECTED] wrote: Martin Blais [EMAIL PROTECTED] writes: How hard would that be to implement? import sys reload(sys) sys.setdefaultencoding('undefined') Hmmm any particular reason for the call to reload() here?
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Martin Blais wrote: On 10/3/05, Michael Hudson [EMAIL PROTECTED] wrote: Martin Blais [EMAIL PROTECTED] writes: How hard would that be to implement? import sys reload(sys) sys.setdefaultencoding('undefined') Hmmm any particular reason for the call to reload() here? Yes. setdefaultencoding() is removed from sys by site.py. To get it again you must reload sys. Reinhold
[Python-Dev] Divorcing str and unicode (no more implicit conversions).
Hi. Like a lot of people (or so I hear in the blogosphere...), I've been experiencing some friction in my code with unicode conversion problems. Even when being super extra careful with the types of strs or unicode objects that my variables can contain, there is always some case or oversight where something unexpected happens which results in a conversion which triggers a decode error. str.join() of a list of strs, where one unicode object appears unexpectedly, and voilà! exceptions galore. Sometimes the problem shows up late because your test code doesn't always contain accented characters. I'm sure many of you experienced that or some variant at some point. I came to realize recently that this problem shares a strong similarity with the problem of implicit type conversions in C++, or at least it feels the same: stuff just happens implicitly, and it's hard to track down where and when it happens by just looking at the code. Part of the problem is that the unicode object acts a lot like a str, which is convenient, but... What if we could completely disable the implicit conversions between unicode and str? In other words, you would ALWAYS be forced to call either .encode() or .decode() to convert between one and the other... wouldn't that help a lot to deal with that issue? How hard would that be to implement? Would it break a lot of code? Would some people want that? (I know I would, at least for some of my code.) It seems to me that this would make the code more explicit and force the programmer to become more aware of those conversions. Any opinions welcome. cheers,
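The divorce asked for here is, in fact, what Python 3 later implemented: str and bytes never convert implicitly, so the data-dependent decode error described above becomes an immediate, deterministic TypeError. A small sketch of the difference:

```python
# Joining character strings works as expected:
parts_joined = ", ".join(["hello", "wörld"])

# A stray byte string in the list no longer decodes implicitly
# (and perhaps blows up only on non-ASCII data); it fails at once:
caught = False
try:
    ", ".join(["hello", "wörld".encode("utf-8")])
except TypeError:
    caught = True
```

The failure no longer depends on whether the test data happened to contain accented characters, which is exactly the late-surprise problem the post complains about.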
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Martin Blais [EMAIL PROTECTED] writes: What if we could completely disable the implicit conversions between unicode and str? In other words, if you would ALWAYS be forced to call either .encode() or .decode() to convert between one and the other... wouldn't that help a lot deal with that issue? I don't know. I've made one or two apps safe against this and it's mostly just annoying. How hard would that be to implement? import sys reload(sys) sys.setdefaultencoding('undefined') Would it break a lot of code? Would some people want that? (I know I would, at least for some of my code.) It seems to me that this would make the code more explicit and force the programmer to become more aware of those conversions. Any opinions welcome. I'm not sure it's a sensible default. Cheers, mwh
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On Monday 03 October 2005 at 02:09 -0400, Martin Blais wrote: What if we could completely disable the implicit conversions between unicode and str? This would be very annoying when dealing with some modules or libraries where the type (str / unicode) returned by a function depends on the context, build, or platform. A good rule of thumb is to convert to unicode everything that is semantically textual, and to only use str for what is to be semantically treated as a string of bytes (network packets, identifiers...). This is also, AFAIU, the semantic model which is favoured for a hypothetical future version of Python. This is what I'm using to do safe conversion to a given type without worrying about the type of the argument:

    DEFAULT_CHARSET = 'utf-8'

    def safe_unicode(s, charset=None):
        """Forced conversion of a string to unicode; does nothing if the
        argument is already a unicode object. This function is useful
        because the .decode method on a unicode object, instead of being
        a no-op, tries to do a double conversion back and forth (which
        often fails because 'ascii' is the default codec)."""
        if isinstance(s, str):
            return s.decode(charset or DEFAULT_CHARSET)
        else:
            return s

    def safe_str(s, charset=None):
        """Forced conversion of a unicode to str; does nothing if the
        argument is already a plain str object. This function is useful
        because the .encode method on a str object, instead of being a
        no-op, tries to do a double conversion back and forth (which
        often fails because 'ascii' is the default codec)."""
        if isinstance(s, unicode):
            return s.encode(charset or DEFAULT_CHARSET)
        else:
            return s

Good luck, Antoine.
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Antoine Pitrou wrote:
> A good rule of thumb is to convert to unicode everything that is
> semantically textual

and isn't pure ASCII.

(anyone who is tempted to argue otherwise should benchmark their
applications, both speed- and memory-wise, and be prepared to come up
with very strong arguments for why Python programs shouldn't be allowed
to be fast and memory-efficient whenever they can...)

/F
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On Monday 03 October 2005 at 14:59 +0200, Fredrik Lundh wrote:
> Antoine Pitrou wrote:
> > A good rule of thumb is to convert to unicode everything that is
> > semantically textual
>
> and isn't pure ASCII.

How can you be sure that something that is /semantically textual/ will
always remain pure ASCII?  That's contradictory, unless your software
never goes out of the anglo-saxon world (and even...).

> (anyone who are tempted to argue otherwise should benchmark their
> applications, both speed- and memorywise, and be prepared to come up
> with very strong arguments for why python programs shouldn't be
> allowed to be fast and memory-efficient whenever they can...)

I think most applications don't critically depend on text processing
performance.  OTOH, international adaptability is the kind of thing
that /will/ bite you one day if you don't prepare for it at the
beginning.

Also, if necessary, the distinction could be an implementation detail
and the conversion be transparent (like int vs. long): the text would be
coded in an 8-bit charset as long as possible and converted to a wide
encoding only when necessary.  The important thing is that these
optimisations, if they are necessary, should be transparently handled by
the Python runtime.

(it seems to me - I may be mistaken - that modern Windows versions treat
every string as 16-bit unicode internally.  Why are they doing it if it
is that inefficient?)

Regards

Antoine.
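An editorial note: Antoine's suggestion - keep an 8-bit representation as long as possible and widen transparently only when needed - is essentially what CPython 3.3 and later implement (PEP 393, the flexible string representation): each str is stored with 1, 2, or 4 bytes per code point depending on its widest character. The helper below is ours, and the measured sizes are CPython-specific:

```python
import sys

def string_costs(n=1000):
    """Compare the storage of an all-ASCII string against one whose
    characters need two bytes each, under CPython's PEP 393 layout."""
    ascii_cost = sys.getsizeof('A' * n)       # stored 1 byte per code point
    wide_cost = sys.getsizeof('\u0100' * n)   # stored 2 bytes per code point
    return ascii_cost, wide_cost
```

So /F's efficiency objection and Antoine's transparency requirement were, in the end, both satisfied inside the runtime rather than in the type system.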
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On 10/3/05, M.-A. Lemburg [EMAIL PROTECTED] wrote:
> > I'm not sure it's a sensible default.
>
> Me neither, especially since this would make it impossible to write
> polymorphic code - e.g. ', '.join(list) wouldn't work anymore if list
> contains Unicode; ditto for u', '.join(list) with list containing a
> string.

Sounds like what you want is exactly what I want to avoid (for those two
types anyway).

cheers,
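For a present-day reader: MAL's polymorphic-join example is precisely where Python 3 drew the line. `str.join` accepts an all-text sequence and `bytes.join` an all-bytes one, but a mixed sequence raises TypeError rather than converting. A small sketch (the helper name is ours):

```python
def join_or_error(parts, sep=', '):
    """Join text parts with `sep`; report the failure a mixed
    str/bytes sequence provokes instead of an implicit conversion."""
    try:
        return sep.join(parts)
    except TypeError:
        return 'TypeError'
```

This is exactly the "impossible to write polymorphic code" outcome MAL predicts; Python 3 judged the early, explicit error to be worth that cost.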
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Martin Blais wrote:
> Hi.  Like a lot of people (or so I hear in the blogosphere...), I've
> been experiencing some friction in my code with unicode conversion
> problems.  Even when being super extra careful with the types of str's
> or unicode objects that my variables can contain, there is always some
> case or oversight where something unexpected happens which results in
> a conversion which triggers a decode error.  str.join() of a list of
> strs, where one unicode object appears unexpectedly, and voila!
> exception galore.  Sometimes the problem shows up late because your
> test code doesn't always contain accented characters.  I'm sure many
> of you experienced that or some variant at some point.
>
> I came to realize recently that this problem shares strong similarity
> with the problem of implicit type conversions in C++, or at least it
> feels the same: Stuff just happens implicitly, and it's hard to track
> down where and when it happens by just looking at the code.  Part of
> the problem is that the unicode object acts a lot like a str, which is
> convenient, but...

I agree.  I think it was a mistake to implicitly convert mixed string
expressions to unicode.

> What if we could completely disable the implicit conversions between
> unicode and str?  In other words, if you would ALWAYS be forced to
> call either .encode() or .decode() to convert between one and the
> other... wouldn't that help a lot deal with that issue?

Perhaps.

> How hard would that be to implement?

Not hard.  We considered doing it for Zope 3, but ...

> Would it break a lot of code?

Yes.

> Would some people want that?

No, I wouldn't want lots of code to break. ;)

> (I know I would, at least for some of my code.)  It seems to me that
> this would make the code more explicit and force the programmer to
> become more aware of those conversions.  Any opinions welcome.

I think it's too late to change this.  I wish it had been done
differently.  (OTOH, I'm very happy we have Unicode support, so I'm not
really complaining.
:)

I'll note that this hasn't been that much of a problem for us in Zope.
We follow the strategy:

Antoine Pitrou wrote:
> ...
> A good rule of thumb is to convert to unicode everything that is
> semantically textual, and to only use str for what is to be
> semantically treated as a string of bytes (network packets,
> identifiers...).  This is also, AFAIU, the semantic model which is
> favoured for a hypothetical future version of Python.

This approach has worked pretty well for us.  Still, when there is a
problem, it's a real pain to debug because the error occurs too late, as
you point out.

Jim

-- 
Jim Fulton   mailto:[EMAIL PROTECTED]   Python Powered!
CTO          (540) 361-1714             http://www.python.org
Zope Corporation  http://www.zope.com   http://www.zope.org
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
M.-A. Lemburg wrote:
> Michael Hudson wrote:
> > Martin Blais [EMAIL PROTECTED] writes:
> > > What if we could completely disable the implicit conversions
> > > between unicode and str?  In other words, if you would ALWAYS be
> > > forced to call either .encode() or .decode() to convert between
> > > one and the other... wouldn't that help a lot deal with that
> > > issue?
> >
> > I don't know.  I've made one or two apps safe against this and it's
> > mostly just annoying.
> >
> > > How hard would that be to implement?
> >
> >     import sys
> >     reload(sys)
> >     sys.setdefaultencoding('undefined')
>
> You shouldn't post tricks like these :-)
>
> The correct way to change the default encoding is by providing a
> sitecustomize.py module which then calls
> sys.setdefaultencoding('undefined').

This is a much more evil trick IMO, as it affects all Python code,
rather than a single program.

I would argue that it's evil to change the default encoding in the
first place, except in this case to disable implicit encoding or
decoding.

Jim

-- 
Jim Fulton   mailto:[EMAIL PROTECTED]   Python Powered!
CTO          (540) 361-1714             http://www.python.org
Zope Corporation  http://www.zope.com   http://www.zope.org
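A postscript from the editor, not part of the thread: Python 3 settled this particular argument by removing the knob. `sys.setdefaultencoding` existed only transiently in Python 2 - site.py deleted it at startup, which is why the `reload(sys)` trick was needed to resurrect it - and Python 3 never defines it at all, so neither the reload trick nor a sitecustomize.py hack applies anymore:

```python
import sys

# On Python 3 there is no default-encoding hook to fight over;
# implicit str/bytes conversion is simply gone from the language.
has_hook = hasattr(sys, 'setdefaultencoding')
```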
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Jim Fulton wrote:
> I would argue that it's evil to change the default encoding in the
> first place, except in this case to disable implicit encoding or
> decoding.

absolutely.  unfortunately, all attempts to add such information to the
sys module documentation seem to have failed...

(last time I tried, I seem to remember that someone argued that it's
there, so it should be documented in a neutral fashion)

/F
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Antoine Pitrou [EMAIL PROTECTED] wrote:
> On Monday 03 October 2005 at 14:59 +0200, Fredrik Lundh wrote:
> > Antoine Pitrou wrote:
> > > A good rule of thumb is to convert to unicode everything that is
> > > semantically textual
> > and isn't pure ASCII.
>
> How can you be sure that something that is /semantically textual/ will
> always remain pure ASCII ?  That's contradictory, unless your software
> never goes out of the anglo-saxon world (and even...).

Non-unicode text input widgets.  Works great.  Can be had with the ANSI
wxPython installation.

> (it seems to me - I may be mistaken - that modern Windows versions
> treat every string as 16-bit unicode internally.  Why are they doing
> it if it is that inefficient?)

Because modern Windows supports all sorts of symbols which are necessary
for certain special English uses (greek symbols for math, etc.), and
trying to have all of them without just using the unicode backend that
is used for all of the international builds (isn't it just a language
definition?) anyway would be a waste of time/effort.

 - Josiah
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Josiah Carlson wrote:
> > > and isn't pure ASCII.
> >
> > How can you be sure that something that is /semantically textual/
> > will always remain pure ASCII ?  That's contradictory, unless your
> > software never goes out of the anglo-saxon world (and even...).
>
> Non-unicode text input widgets.  Works great.  Can be had with the
> ANSI wxPython installation.

You're both missing that Python is dynamically typed.  A single string
source doesn't have to return the same type of strings, as long as the
objects it returns are compatible with Python's string model and with
each other.  Under the default encoding (and quite a few other
encodings), that's true for plain ascii strings and Unicode strings.

This is a good thing.

/F