[issue4153] Unicode HOWTO up to date?
Serhiy Storchaka added the comment: Most of the changes are applicable to Python 2 too. Do you want to backport part of your patch to 2.7? -- nosy: +serhiy.storchaka ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4153 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4153] Unicode HOWTO up to date?
Roundup Robot added the comment: New changeset 1dbbed06a163 by Andrew Kuchling in branch '3.3': #4153: update Unicode howto for Python 3.3 http://hg.python.org/cpython/rev/1dbbed06a163
[issue4153] Unicode HOWTO up to date?
A.M. Kuchling added the comment: As far as I can tell, there are no other outstanding suggestions for howto updates, so I'll now close this item. Feel free to re-open or file a new item if there are further improvements that can be made.
--
resolution:  -> fixed
stage: commit review -> committed/rejected
status: open -> closed
[issue4153] Unicode HOWTO up to date?
A.M. Kuchling added the comment:

Continuing my tour of the howtos, here's a patch making many of the changes discussed here and on issue13997.

Changes made:
* state that python3 source encoding is UTF-8, and give examples
* mention surrogateescape in the 'tips and tricks' section, and backslashreplace in the "Python's Unicode Support" section.
* default filesystem encoding is now UTF-8, not ascii.
* link to Nick Coghlan's and Ned Batchelder's notes/presentations.
* remove revision history
* remove usage of "I think", "I'm not going to", etc.
* update acks section

Things I did *not* do, though they were suggested:
* Move tip "Software should only work with Unicode strings internally" from the last section to somewhere earlier and more prominent. Perhaps it could go somewhere in the "Python's Unicode Support" section.
* mention codecs.StreamRecoder and StreamReaderWriter (I could put this in 'tips and tricks').
* Examples should be properly marked up to allow sphinx to run them and check the output. (May not be possible.)
* mention unicode support in re module
* clarify some more terms (e.g. codepoints, code units, characters, possibly scalar values etc.) -- I don't see why they matter, since we don't use them.

--
Added file: http://bugs.python.org/file30508/unicode-howto.txt
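For readers unfamiliar with the two error handlers named in the list above, here is a minimal sketch of what they do. The sample byte string and text are made up for illustration and are not taken from the patch.

```python
# Hypothetical sample data: the byte 0xE9 is not valid UTF-8 on its own.
raw = b"caf\xe9"

# surrogateescape (decoding): the undecodable byte is carried through
# as a lone surrogate code point instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original byte exactly.
restored = text.encode("utf-8", errors="surrogateescape")

# backslashreplace (encoding): characters the target codec cannot
# represent are replaced by backslash escape sequences.
escaped = "caf\u00e9 \u20ac".encode("ascii", errors="backslashreplace")

print(ascii(text))      # 'caf\udce9'
print(restored == raw)  # True
print(escaped)          # b'caf\\xe9 \\u20ac'
```

The round trip is the reason surrogateescape is useful for data such as file names whose encoding is not known in advance.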
[issue4153] Unicode HOWTO up to date?
A.M. Kuchling added the comment:

Updated version of my patch, which adds two more todo items and handles Ezio's review comments:
* Switch from Greek examples to French, and remove non-Latin-1 characters.
* Change language for bytes.decode to "but supports a few more possible handlers".
* Describe Unicode support in the re module.
* Describe StreamRecoder. I don't see why StreamReaderWriter would need to be mentioned.

I do not intend to do the remaining items on the todo list (clarify some more terms; make it work with doctest).

--
Added file: http://bugs.python.org/file30509/unicode-howto.txt
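As a rough illustration of the kind of re module behaviour the item above refers to, the sketch below shows how `\d` treats non-ASCII digits in a str pattern. The sample text is an assumption made for this example and does not come from the patch.

```python
import re

# Hypothetical sample text: ASCII digits plus two Arabic-Indic digits.
sample = "Room 42 / \u0664\u0662"

# With a str pattern, \d matches any Unicode decimal digit by default.
unicode_digits = re.findall(r"\d+", sample)

# re.ASCII restricts \d (and \w, \s) to their ASCII-only meaning.
ascii_digits = re.findall(r"\d+", sample, flags=re.ASCII)

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['42']
```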
[issue4153] Unicode HOWTO up to date?
Nick Coghlan added the comment: amk's latest patch looks like a very nice improvement to me. One suggested wording tweak for the aside about the simplified history: s/The average Python programmer doesn't need to know the historical details/The precise historical details aren't relevant to understanding how to use Unicode effectively/ (and then continue with "; if you're curious ..." as it does now)
[issue4153] Unicode HOWTO up to date?
Ezio Melotti added the comment: As discussed in #13997, the HOWTO should be reorganized to start with a basic introduction and then expand on more advanced topics. See also msg180743 for a couple of essays that could be linked as "see also" or integrated in the HOWTO.
[issue4153] Unicode HOWTO up to date?
Changes by Chris Rebert pyb...@rebertia.com: -- nosy: +cvrebert
[issue4153] Unicode HOWTO up to date?
Nick Coghlan added the comment: The section in the HOWTO on Python's unicode support also misses the fact that the easiest way to include a Unicode character in a string literal in Python 3 is to *include that character in the string literal* (since source code is now treated as UTF-8 by default).
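A short sketch of the point above, assuming the source file is saved as UTF-8 (the Python 3 default). The euro sign is an arbitrary choice for illustration.

```python
# The character typed directly into the literal...
direct = "€"
# ...is the same string as the escaped spellings.
escaped = "\u20ac"
named = "\N{EURO SIGN}"

print(direct == escaped == named)  # True
print(hex(ord(direct)))            # 0x20ac
```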
[issue4153] Unicode HOWTO up to date?
Roundup Robot added the comment:

New changeset 260a9afd999a by Ezio Melotti in branch '3.2':
#4153: update the Unicode howto.
http://hg.python.org/cpython/rev/260a9afd999a

New changeset 572ca3d35c2f by Ezio Melotti in branch '3.3':
#4153: merge with 3.2.
http://hg.python.org/cpython/rev/572ca3d35c2f

New changeset 034e1e076c77 by Ezio Melotti in branch 'default':
#4153: merge with 3.3.
http://hg.python.org/cpython/rev/034e1e076c77

--
nosy: +python-dev
[issue4153] Unicode HOWTO up to date?
Ezio Melotti added the comment: I committed the attached patch with some minor modifications, but there are still comments that should be addressed on Rietveld.
[issue4153] Unicode HOWTO up to date?
Changes by Ezio Melotti ezio.melo...@gmail.com: -- assignee:  -> ezio.melotti
[issue4153] Unicode HOWTO up to date?
Changes by Ezio Melotti ezio.melo...@gmail.com: -- nosy: +ncoghlan
[issue4153] Unicode HOWTO up to date?
Éric Araujo mer...@netwok.org added the comment:

> something about Unicode supports in the re module (this probably can wait after the 'regex' inclusion).

I’d prefer documentation for the re module now.
[issue4153] Unicode HOWTO up to date?
Éric Araujo mer...@netwok.org added the comment:

> it also removes the usage of 'byte string'.

I see you’ve replaced it with “byte object”. I’m -0, as “byte[s] string” is not ambiguous IMO.
[issue4153] Unicode HOWTO up to date?
Ezio Melotti ezio.melo...@gmail.com added the comment: There was some discussion a while ago on python-dev about it. AFAIR the outcome was that using bytes *strings* should be avoided because bytes are bytes, and not strings (until they get decoded at least). Using 'string' for both might lead people to think that there are two kinds of strings, bytes and Unicode (like in python 2) while they should think that there are only Unicode strings and they can be converted to a bytes object (or simply to 'bytes').
[issue4153] Unicode HOWTO up to date?
Éric Araujo mer...@netwok.org added the comment:

Ah, I see: you’re equating “string” with “text string” or “character string”, whereas I read “bytes string” as “finite sequence of bytes”. With this definition, there *are* two string types in Python 3, it’s just that they’re much more divorced than in 2.x.

> they should think that there are only Unicode strings

I’d say they should think that text processing should only happen with the one type dedicated to text, i.e. str.

> they can be converted to a bytes object (or simply to 'bytes')

Okay, +0 to use only “bytes object” (or “bytes” when it sounds better).
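The distinction being discussed can be seen in a few lines. This is only an illustrative sketch; the sample word is arbitrary.

```python
text = "na\u00efve"          # str: a sequence of Unicode code points
data = text.encode("utf-8")  # bytes: a sequence of integers in range(256)

print(type(text).__name__, len(text))  # str 5
print(type(data).__name__, len(data))  # bytes 6
print(data.decode("utf-8") == text)    # True

# Python 3 refuses to mix the two types implicitly.
mixing_error = None
try:
    text + data
except TypeError as exc:
    mixing_error = type(exc).__name__
print(mixing_error)                    # TypeError
```

The lengths differ because the one code point U+00EF takes two bytes in UTF-8.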
[issue4153] Unicode HOWTO up to date?
Ezio Melotti ezio.melo...@gmail.com added the comment:

After the recent discussions on python-dev I went through the Unicode howto and fixed a few things, then I found this issue so I'm attaching the patch here. The patch addresses mostly markup issues, but it also removes the usage of 'byte string'.

A few more things that should be done:
* clarify some more terms (e.g. codepoints, code units, characters, possibly scalar values etc.);
* mention the differences between narrow and wide builds, including:
  - a discussion about the UCS-2/UTF-16 implementation of narrow builds;
  - something about surrogates and surrogate pairs;
  - effects of slicing and indexing on narrow builds;
  - functions/methods that (don't) accept non-BMP chars on narrow builds;
* something about Unicode supports in the re module (this probably can wait after the 'regex' inclusion).

Also the codecs doc has a section about Unicode and encodings that might be moved to the howto.

--
assignee: georg.brandl -> 
resolution: fixed -> 
stage:  -> commit review
versions: +Python 3.3
Added file: http://bugs.python.org/file23081/issue4153-2.diff
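To make the surrogate-pair item above concrete, here is a small sketch of how one non-BMP code point maps to two UTF-16 code units. The code point U+1F600 is an arbitrary choice, and the `len()` result assumes Python 3.3 or later, where the narrow/wide build distinction no longer exists.

```python
ch = "\U0001F600"  # arbitrary code point outside the BMP

# Split the UTF-16 encoding into 16-bit code units.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

# The same pair computed directly from the code point.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)    # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate

print([hex(u) for u in units])    # ['0xd83d', '0xde00']
print(units == [high, low])       # True
print(len(ch))                    # 1
```

On the old narrow builds the string itself was stored as these two code units, which is why indexing and slicing could split a character.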
[issue4153] Unicode HOWTO up to date?
Ezio Melotti ezio.melo...@gmail.com added the comment: I also left a few comments on rietveld about other things that can be improved. Please reply and comment there.
[issue4153] Unicode HOWTO up to date?
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Committed in revision 86530. Thanks Terry and Raymond for your comments.

I would like to keep this issue open (at a low priority) because the question in the title is still relevant. There are many new 3.x features that are not covered, such as the surrogateescape error handler. Such topics may or may not be appropriate for a HOWTO.

There are also some stylistic changes that I would like to consider:

1. Replace verbatim URLs with properly formatted hyperlinked titles of the referenced resources.

2. I couldn't figure out who the original author was. With first person passages, such as "I remember looking at Apple ][ BASIC programs, ..", it may be appropriate to list the original author at the top even if the text has been changed by others over the years. At the very least the Acknowledgements section should start with "This article was originally written by X [on an occasion Y.]"

3. Examples should be properly marked up to allow sphinx to run them and check the output.

--
priority: normal -> low
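Item 3 is about having the documentation build run the examples. As a rough analogy only, the sketch below uses the standard doctest module rather than Sphinx, and the function `code_point` is a hypothetical helper invented for this example. It shows what "run them and check the output" means in practice.

```python
import doctest


def code_point(ch):
    """Return the code point of a one-character string.

    >>> code_point("A")
    65
    >>> chr(code_point("A"))
    'A'
    """
    return ord(ch)


# Parse the examples out of the docstring and run them; each ">>>" line
# is executed and its output compared with the text that follows it.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    code_point.__doc__, {"code_point": code_point}, "code_point", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

print(results.failed, results.attempted)  # 0 2
```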
[issue4153] Unicode HOWTO up to date?
Éric Araujo mer...@netwok.org added the comment: Agreed on 1 and 3. Regarding 2, looking at the early history of the file makes me suspect that amk is the author.
[issue4153] Unicode HOWTO up to date?
Changes by Alexander Belopolsky belopol...@users.sourceforge.net: Added file: http://bugs.python.org/file19632/issue4153.diff
[issue4153] Unicode HOWTO up to date?
Changes by Alexander Belopolsky belopol...@users.sourceforge.net: Removed file: http://bugs.python.org/file19631/issue4153.diff
[issue4153] Unicode HOWTO up to date?
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: r82301 appears to be a blind merge of r82120 from the trunk. It is fairly obvious that it was not intentional.
[issue4153] Unicode HOWTO up to date?
Changes by Alexander Belopolsky belopol...@users.sourceforge.net: -- nosy: +akuchling
[issue4153] Unicode HOWTO up to date?
Changes by Éric Araujo mer...@netwok.org: -- nosy: +eric.araujo
[issue4153] Unicode HOWTO up to date?
Terry J. Reedy tjre...@udel.edu added the comment:

Thanks for persisting with this. Looking at the patch:

@@ -65,7 +63,7 @@
 goal was to have Unicode contain the alphabets for every single human language.
 It turns out that even 16 bits isn't enough to meet that goal, and the modern
 Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
-base-16).
+base 16).

I visually parse 0-1,114,111 as 0-1, 114, 111. So I think either the commas should be removed or extra spaces are needed: 0-1114111 or 0 - 1,114,111.

In your recent (and excellent) chr/ord doc patch, you used (or stayed with) 'hexadecimal' versus 'base 16'. Do we have a standard? I *think* I prefer the former.

-character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
+character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot

I prefer without the added comma.

 b'\x80abc'.decode("utf-8", "replace")
-'\ufffdabc'
+'ï¿½abc'

Three replacements (i with diaeresis, upside-down ?, 1/2) for one bad char looks wrong. With IDLE I get '�abc' (? in hexagon, codepoint 65533). Perhaps something just went wrong to patch from your file to my browser window.

@@ -281,10 +279,10 @@
 built-in :func:`ord` function that takes a one-character Unicode string and
 returns the code point value::

You fixed chr/ord doc, need to fix references thereto in this doc.

-point. The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4::
+point. The ``\U`` escape sequence is similar, but expects eight base 16
+digits, not four::

I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.

s = a\xac\u1234\u20ac\U8000 two-digit hex escape
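The three-character display described above can be reproduced. The sketch below decodes the same bytes as the patch example and then deliberately misreads the UTF-8 encoding of the result as Latin-1. That second step is an assumption about what the browser did, not something stated in the message.

```python
# One invalid byte becomes one U+FFFD REPLACEMENT CHARACTER.
decoded = b"\x80abc".decode("utf-8", "replace")

# U+FFFD is three bytes in UTF-8 (EF BF BD). Read as Latin-1, those
# bytes show up as three separate characters.
mojibake = decoded.encode("utf-8").decode("latin-1")

print(ascii(decoded))    # '\ufffdabc'
print(ord(decoded[0]))   # 65533
print(ascii(mojibake))   # '\xef\xbf\xbdabc'
```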
[issue4153] Unicode HOWTO up to date?
Changes by Ezio Melotti ezio.melo...@gmail.com: -- nosy: +ezio.melotti
[issue4153] Unicode HOWTO up to date?
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Nov 18, 2010 at 2:41 PM, Terry J. Reedy rep...@bugs.python.org wrote:
..
> I visually parse 0-1,114,111 as 0-1, 114, 111. So I think either the commas should be removed or extra spaces are needed: 0-1114111 or 0 - 1,114,111.

What about 0 through 1,114,111?

> you used (or stayed with) 'hexadecimal' versus 'base 16'. Do we have a standard? I *think* I prefer the former.

I prefer 'base 16'. I thought about changing 'hexadecimal' to 'base 16' in chr/ord docs, but decided to leave it because the term 'hexadecimal' is used elsewhere on the same page, notably in the hex() function description where it is quite appropriate. No, we don't have a standard. I've also seen base-16 used elsewhere, which I really don't like.

> + 'ï¿½abc'
> Three replacements (i with diaeresis, upside-down ?, 1/2) for one bad char looks wrong.

That must be UTF-8 misinterpreted as Latin-1. Won't affect the commit.

> With IDLE I get '�abc' (? in hexagon, codepoint 65533). Perhaps something just went wrong to patch from your file to my browser window.

Yes. I get the same on the terminal window and that's what it should look like.

> built-in :func:`ord` function that takes a one-character Unicode string and returns the code point value::
> You fixed chr/ord doc, need to fix references thereto in this doc.

I don't understand. I think one-character Unicode string is fine here because Unicode string means an abstract Unicode string, not :class:`str`.

> -point. The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4::
> +point. The ``\U`` escape sequence is similar, but expects eight base 16
> +digits, not four::
> I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.

I am fine with hexadecimal here. I did not like hex.
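The numbers under discussion are easy to check interactively. This sketch only restates facts from the quoted HOWTO text: the code point range, the value 0x12ca, and the digit counts for the `\u` and `\U` escapes.

```python
import sys

# The code point range is 0 through 1,114,111, i.e. 0x10ffff.
print(sys.maxunicode, hex(sys.maxunicode))  # 1114111 0x10ffff

# 0x12ca is 4810 in decimal; ord() and chr() are inverses.
print(0x12CA)                               # 4810
print(ord(chr(0x12CA)) == 0x12CA)           # True

# \u takes exactly four hexadecimal digits, \U exactly eight.
four = "\u20ac"
eight = "\U000020ac"
print(four == eight)                        # True

# Values past the end of the range are rejected.
out_of_range = None
try:
    chr(0x110000)
except ValueError as exc:
    out_of_range = type(exc).__name__
print(out_of_range)                         # ValueError
```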
[issue4153] Unicode HOWTO up to date?
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Nov 18, 2010 at 3:00 PM, Alexander Belopolsky rep...@bugs.python.org wrote:
..
>> I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.
>
> I am fine with hexadecimal here. I did not like hex.

If you think about it, hexadecimal digit is a twice oxymoron because both decimal and digit imply base 10. :-) It does look like the most widely used term, nevertheless.
[issue4153] Unicode HOWTO up to date?
Terry J. Reedy tjre...@udel.edu added the comment: 0 through ... is fine with me. Yes, hex numeral would be more accurate than hex digit.
[issue4153] Unicode HOWTO up to date?
Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

> Yes, hex numeral would be more accurate than hex digit.

Stick with hex digit. We've used that phraseology for a long time. See string.hexdigits for example. And hex numeral just sounds weird -- it makes me do a double-take to see if there was some special implied meaning.

--
nosy: +rhettinger
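For reference, this is the constant mentioned above. The sample string "12CA" is an arbitrary choice for illustration.

```python
import string

# The standard library's own name for these characters.
print(string.hexdigits)  # 0123456789abcdefABCDEF

# Check a candidate string against it, then parse it as base 16.
candidate = "12CA"
is_hex = all(c in string.hexdigits for c in candidate)
value = int(candidate, 16)
print(is_hex, value)     # True 4810
```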
[issue4153] Unicode HOWTO up to date?
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Reopening because it looks like the fix was reverted in r82301.

"This HOWTO discusses Python 2.x’s support for Unicode, and explains various problems that people commonly encounter when trying to work with Unicode. (This HOWTO has not yet been updated to cover the 3.x versions of Python.)"

http://docs.python.org/dev/howto/unicode.html

--
nosy: +belopolsky
status: closed -> open
versions: +Python 3.2 -Python 3.0
[issue4153] Unicode HOWTO up to date?
Georg Brandl [EMAIL PROTECTED] added the comment: Thanks for noting this! The most basic changes had been done, but I had to revise some sections for changes. Done in r67338.
--
resolution:  -> fixed
status: open -> closed