On Wed, Jul 04, 2007 at 11:28:53AM -0400, Kent Johnson wrote:
>FWIW, I'm pretty sure you are confusing Unicode strings and UTF-8
>strings, they are not the same thing. A Unicode string uses 16 bits to
>represent each character. It is a distinct data type from a 'regular'
>string. Regular Python strings are byte strings with an implicit
>encoding. One possible encoding is UTF-8 which uses one or more bytes to
>represent each character.
>
>Some good reading on Unicode and utf-8:
>http://www.joelonsoftware.com/articles/Unicode.html
>http://effbot.org/zone/unicode-objects.htm

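To make the quoted distinction concrete, here is a minimal sketch of converting between text and UTF-8 bytes. It assumes Python 3 naming, where the Unicode type is str and the byte-string type is bytes (in the Python 2 of this thread the same pair was called unicode and str). The variable names are only for illustration.

```python
# Python 3 names: str holds Unicode text, bytes holds encoded bytes.
# (Assumption: in Python 2 the equivalent types were unicode and str.)

text = "caf\u00e9"                   # four characters of text
utf8_bytes = text.encode("utf-8")    # text -> UTF-8 bytes
assert isinstance(utf8_bytes, bytes)

# The accented character needs two bytes in UTF-8, so the lengths differ.
assert len(text) == 4
assert len(utf8_bytes) == 5

# decode() is the "UTF-8 to Unicode" direction.
round_trip = utf8_bytes.decode("utf-8")
assert round_trip == text

# Pure-ASCII text encodes to the same bytes under ASCII and UTF-8,
# so ASCII data needs no special handling before decoding it as UTF-8.
ascii_text = "readme.txt"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")
```

So encode() goes from text to bytes, and decode() goes from bytes back to text.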
The problem is that the Windows filesystem uses UTF-8 as the encoding for
filenames, but os doesn't seem to have a UTF-8 mode, just an ascii mode
and a Unicode mode.

>If you pass a unicode string (not utf-8) to os.walk(), the resulting
>lists will also be unicode.
>
>Again, it would be helpful to see the code that is getting the error.

The code is quite complex for not-relevant-to-this-problem reasons. The
gist is that I walk the FS, get filenames, some of which get written to
an XML file. If I leave the output alone I get errors on reading the XML
file. If I try to change the output so that it is all Unicode, I get
errors because my UTF-8 data sometimes looks like ascii, and I don't see
a UTF-8-to-Unicode converter in the docs.

>>I suspect that my program will have to make sure to recast all
>>equivalent-to-ascii strings as UTF-8 while leaving the ones that are
>>already extended alone.
>
>It is nonsense to talk about 'recasting' an ascii string as UTF-8; an
>ascii string is *already* UTF-8 because the representation of the
>characters is identical. OTOH it makes sense to talk about converting an
>ascii string to a unicode string.

Then what does mystring.encode("UTF-8") do?

-- 
yours,

William
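
The walk-then-write-XML workflow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this thread: it uses Python 3, the helper name list_files_as_xml and the element names files and file are made up, and a temporary directory stands in for the real tree. The idea is to keep filenames as text the whole way through and encode to UTF-8 only once, when the XML is serialized.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def list_files_as_xml(root_dir):
    """Walk root_dir and return XML bytes listing every file name found.

    Hypothetical helper: the element names are invented for this sketch.
    """
    root = ET.Element("files")
    # Passing a text path to os.walk() yields text file names.
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            ET.SubElement(root, "file").text = name
    # Encoding happens once, here, for every name alike.
    return ET.tostring(root, encoding="utf-8")


# Stand-in directory with one ASCII name and one non-ASCII name.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("plain.txt", "caf\u00e9.txt"):
        with open(os.path.join(tmp, name), "w"):
            pass
    xml_bytes = list_files_as_xml(tmp)

# Reading the XML back gives text names again.
parsed = ET.fromstring(xml_bytes)
names = [el.text for el in parsed.findall("file")]
```

With this shape there is no need to treat the ASCII-looking names differently from the extended ones, since both go through the same single encoding step.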