Ray Jones wrote:

> I have directory names that contain Russian characters, Romanian
> characters, French characters, et al. When I search for a file using
> glob.glob(), I end up with stuff like \x93\x8c\xd1 in place of the
> directory names. I thought simply identifying them as Unicode would
> clear that up. Nope. Now I have stuff like \u0456\u0439\u043e.
That's the representation, which is guaranteed to be all-ASCII. Python
automatically applies repr() to a unicode string when it is part of a list:

>>> files = [u"\u0456\u0439\u043e"]  # files = glob.glob(unicode_pattern)
>>> print files
[u'\u0456\u0439\u043e']

To see the actual characters, print the unicode strings individually:

>>> for file in files:
...     print file
...
ійо

> These representations of directory names are eventually going to be
> passed to Dolphin (my file manager). Will they pass to Dolphin properly?

How exactly do you "pass" these names?

> Do I need to run a conversion?

When you write them to a file you need to pick an encoding.

> Can that happen automatically within the script considering that the
> various types of characters are all mixed together in the same directory
> (i.e. # coding: Latin-1 at the top of the script is not going to address
> all the different types of characters).

The coding cookie tells Python how to interpret the bytes in the source
file, so

# -*- coding: utf-8 -*-
s = u"äöü"

and

# -*- coding: latin1 -*-
s = u"äöü"

contain different byte sequences on disk, but once imported the two strings
are equal (and have the same in-memory layout):

>>> import codecs
>>> for encoding in "latin-1", "utf-8":
...     with codecs.open("tmp_%s.py" % encoding.replace("-", ""), "w",
...                      encoding=encoding) as f:
...         f.write(u'# -*- coding: %s\ns = u"äöü"' % encoding)
...
>>> for encoding in "latin1", "utf8":
...     open("tmp_%s.py" % encoding).read()
...
'# -*- coding: latin-1\ns = u"\xe4\xf6\xfc"'
'# -*- coding: utf-8\ns = u"\xc3\xa4\xc3\xb6\xc3\xbc"'
>>> from tmp_latin1 import s
>>> from tmp_utf8 import s as t
>>> s == t
True

> While on the subject, I just read through the Unicode info for Python
> 2.7.3. The history was interesting, but the implementation portion was
> beyond me. I was looking for a way for a Russian 'backward R' to look
> like a Russian 'backward R' - not for a bunch of \xxx and \uxxxxx stuff.
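The same point can be checked without writing any files: the byte values
below are the ones from the session above, and decoding each sequence with
its own codec yields equal unicode strings. A minimal sketch (not from the
original post; it runs under both Python 2 and 3):

```python
# The three characters "äöü" as raw bytes in the two encodings.
latin1_bytes = b"\xe4\xf6\xfc"            # latin-1: one byte per character
utf8_bytes = b"\xc3\xa4\xc3\xb6\xc3\xbc"  # utf-8: two bytes per character

# Different byte sequences on disk ...
assert latin1_bytes != utf8_bytes

# ... but decoding each with its own codec gives equal unicode strings.
assert latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8") == u"\xe4\xf6\xfc"
```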
>>> ya = u"\N{CYRILLIC CAPITAL LETTER YA}"
>>> ya
u'\u042f'
>>> print ya
Я

This only works because Python correctly guesses the terminal encoding. If
you are piping the output to another program it will assume ASCII and you
will see an encoding error:

$ cat tmp.py
# -*- coding: utf-8 -*-
print u"Я"
$ python tmp.py
Я
$ python tmp.py | cat
Traceback (most recent call last):
  File "tmp.py", line 2, in <module>
    print u"Я"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u042f' in position 0: ordinal not in range(128)

You can work around that by specifying the appropriate encoding explicitly:

$ python tmp2.py iso-8859-5 | cat
�
$ python tmp2.py latin1 | cat
Traceback (most recent call last):
  File "tmp2.py", line 4, in <module>
    print u"Я".encode(encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u042f' in position 0: ordinal not in range(256)

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
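P.S. A quick way to see why the iso-8859-5 run succeeds while the latin1
run raises: encode the character directly. A small sketch (the byte values
are my own check, not from the transcripts above; runs under both Python 2
and 3):

```python
ya = u"\u042f"  # CYRILLIC CAPITAL LETTER YA, the "backward R"

# iso-8859-5 has a Cyrillic block, so YA encodes to a single byte ...
assert ya.encode("iso-8859-5") == b"\xcf"

# ... utf-8 can encode any character, here using two bytes ...
assert ya.encode("utf-8") == b"\xd0\xaf"

# ... while latin-1 (like ascii) has no slot for Cyrillic letters.
try:
    ya.encode("latin-1")
except UnicodeEncodeError:
    pass  # the same error the piped tmp2.py run shows
else:
    raise AssertionError("latin-1 unexpectedly encoded a Cyrillic letter")
```

The lone 0xcf byte is also why the iso-8859-5 run above shows up as `�`:
the terminal is expecting UTF-8 and cannot decode a bare 0xcf.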