On Sun, Feb 05, 2017 at 04:31:43PM -0600, boB Stepp wrote:
> On Sat, Feb 4, 2017 at 10:50 PM, Random832 <random...@fastmail.com> wrote:
> > On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
> >> Does the list sort() method (and other sort methods in Python) just go
> >> by the hex value assigned to each symbol to determine sort order in
> >> whichever Unicode encoding chart is being implemented?
> >
> > By default. You need key=locale.strxfrm to make it do anything more
> > sophisticated.
> >
> > I'm not sure what you mean by "whichever unicode encoding chart". Python
> > 3 strings are unicode-unicode, not UTF-8.
>
> As I said in my response to Steve just now: I was looking at
> http://unicode.org/charts/ Because they called them charts, so did I.

Ah, that makes sense! They're just reference tables ("charts") for the
convenience of people wishing to find particular characters.

> I'm assuming that despite this organization into charts, each and
> every character in each chart has its own unique hexadecimal code to
> designate each character.

Correct, although strictly speaking the codes are only conventionally
given in hexadecimal. They are numbered from 0 to 1114111 in decimal
(although not all codes are currently used).

The terminology used is that a "code point" is what I've been calling a
"character", although not all code points are characters. Code points
are usually written either as the character itself, e.g. 'A', or using
the notation U+0041 where there are at least four and no more than six
hexadecimal digits following the "U+".

Bringing this back to Python, if you know the code point (as a number),
you can use the chr() function to return it as a string:

py> chr(960)
'π'

Don't forget that Python understands hex too!

py> chr(0x03C0)  # better than chr(int('03C0', 16))
'π'

Alternatively, you can embed it right in the string. For code points
between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
escapes:

py> 'pi = \u03C0'  # requires exactly four hex digits
'pi = π'

py> 'pi = \U000003C0'  # requires exactly eight hex digits
'pi = π'

Lastly, you can use the code point's name:

py> 'pi = \N{GREEK SMALL LETTER PI}'
'pi = π'

One last comment: Random832 said:

"Python 3 strings are unicode-unicode, not UTF-8."

To be pedantic, Unicode strings are sequences of abstract code points
("characters"). UTF-8 is a particular concrete implementation that is
used to store or transmit such code strings.

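
As a rough sketch of that distinction, assuming Python 3: the snippet
below lists the code points in a short string, shows that encoding it as
UTF-8 gives a different number of bytes than code points, and shows that
a plain sorted() orders strings by code point. The variable names and
the word list are made up purely for illustration.

```python
import unicodedata

s = 'πz'

# A str is a sequence of code points; ord() gives each one's number.
code_points = [ord(ch) for ch in s]
labels = ['U+%04X' % cp for cp in code_points]
names = [unicodedata.name(ch) for ch in s]
print(labels, names)

# Encoding turns the abstract code points into concrete bytes.
utf8 = s.encode('utf-8')
print(len(s), 'code points ->', len(utf8), 'bytes:', utf8.hex())

# Decoding with the same encoding gets the original string back.
round_trip = utf8.decode('utf-8')

# With no key function, sorting compares strings code point by code point,
# so uppercase ASCII sorts before lowercase, and Greek after both.
words = ['zebra', 'Apple', 'πr', 'apple']
by_code_point = sorted(words)
print(by_code_point)
```

For a locale-aware order instead, the key=locale.strxfrm suggestion
quoted above is the place to start; its result depends on the locale
that is set, so it is left out of the sketch.
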
Here are examples of three possible encoding forms for the string 'πz':

UTF-16: either two or four bytes per character: 03C0 007A

UTF-32: exactly four bytes per character: 000003C0 0000007A

UTF-8: between one and four bytes per character: CF80 7A

(UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
reversed, e.g. C003 7A00 for UTF-16. UTF-8 is not.)

Prior to Python 3.3, there was a build-time option to select either
"narrow" or "wide" Unicode strings. A narrow build used a fixed two
bytes per code point, together with an incomplete and not quite correct
scheme for using two code points together to represent the supplementary
Unicode characters U+10000 through U+10FFFF. (This is sometimes called
UCS-2, sometimes UTF-16, but strictly speaking it is neither, or at
least an incomplete and "buggy" implementation of UTF-16.)

-- 
Steve