Michael Lange wrote:
I *thought* I would have to convert the user input which might be any encoding back into
byte string first

How are you getting the user input? Is it from the console or from a GUI?

I think the best strategy is to keep all your strings as Unicode. Unicode is the only character set that can represent characters from any locale. (That's the point of Unicode, actually.) So I would convert the user input to unicode, not to a byte string.

(remember, I got heavily confused, because user input was sometimes unicode and
sometimes byte string), so I can convert it to "standard" unicode (utf-8) later 
on.

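Careful! I wouldn't call utf-8 "standard unicode". UTF-8 is a standard *encoding* of Unicode. Unicode itself is a character set that assigns a number (a code point) to every character; the code points run up to U+10FFFF, so it is not a 16-bit code. An encoding such as UTF-8 or UTF-16 defines how those numbers are stored as bytes.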
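
To make the difference between a code point and its encoded bytes concrete, here is a small sketch. The sample character is my own choice, and the snippet is written so that it should run on Python 2.6+ as well as Python 3:

```python
# One character, one code point, several possible byte representations.
euro = u'\u20ac'                        # EURO SIGN, code point U+20AC

code_point = ord(euro)                  # the Unicode number, independent of any encoding
utf8_bytes = euro.encode('utf-8')       # the same character stored as UTF-8
utf16_bytes = euro.encode('utf-16-be')  # the same character stored as big-endian UTF-16

print(hex(code_point))
print(repr(utf8_bytes))
print(repr(utf16_bytes))

# Decoding with the matching encoding gives back the same character.
assert utf8_bytes.decode('utf-8') == euro
assert utf16_bytes.decode('utf-16-be') == euro
```

The unicode object is the same in both cases; only the bytes differ, which is why "a unicode string in utf-8" is not really a meaningful description.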


I've added this test to the file selection method, where "result" holds the 
filename the user chose:

    if isinstance(result, unicode):
        result = result.encode('iso8859-1')
    return result

This will fail if result includes characters that are not in the iso8859-1 repertoire.
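
A minimal sketch of that failure, using a made-up filename (not one from your program) and written to run on Python 2.6+ or Python 3:

```python
# A filename with a character that ISO 8859-1 (Latin-1) cannot represent.
filename = u'price_\u20ac.txt'          # hypothetical name containing the euro sign

try:
    encoded = filename.encode('iso8859-1')
except UnicodeEncodeError as exc:
    encoded = None
    print('cannot encode: %s' % exc)

# A name made only of Latin-1 characters encodes without trouble.
latin1_ok = u'caf\u00e9.txt'.encode('iso8859-1')
```

So the test works only as long as every user picks filenames that fit into Latin-1.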


Later on, self.nextfile is set to "result".

The idea was, if I could catch the user's encoding, I could do something like:

    if isinstance(result, unicode):
        result = result.encode(sys.stdin.encoding)
    result = unicode(result, 'utf-8')

This is broken code that will corrupt your result string. Here is what it does: if result is a unicode string, it converts it to a byte string in the console encoding (sys.stdin.encoding). Then it assumes that the byte string is in utf-8 encoding and converts it back to Unicode. Unless the console encoding happens to be utf-8, the second step misreads the bytes. Do you see why that is unlikely to have a good result?
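
Here is a rough sketch of the two steps. The sample text is my own, and ISO 8859-1 stands in for sys.stdin.encoding, which is an assumption about your console; the snippet should run on Python 2.6+ or Python 3:

```python
original = u'caf\u00e9'                 # sample text, my own choice

# Step 1 of the broken code: encode with the console encoding.
# ISO 8859-1 stands in here for sys.stdin.encoding (an assumption).
as_bytes = original.encode('iso8859-1')

# Step 2: decode those bytes as if they were UTF-8.
try:
    round_trip = as_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    round_trip = None
    print('round trip failed: %s' % exc)

# The round trip only works when the console encoding happens to be UTF-8.
lucky = original.encode('utf-8').decode('utf-8')
```

With a Latin-1 console, the byte for the accented letter is not valid UTF-8, so the decode step raises an error instead of returning the original text.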


If your intent is to create a unicode string, try this:

    import sys
    # Decode byte strings only; unicode objects pass through unchanged.
    if not isinstance(result, unicode):
        result = result.decode(sys.stdin.encoding)
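
If you need this in more than one place, the same idea could be wrapped in a small helper. This is only a sketch: the name to_unicode and the fallback to utf-8 when sys.stdin.encoding is not set are my assumptions, and it uses the bytes name so that it should run on Python 2.6+ as well as Python 3:

```python
import sys

def to_unicode(value, encoding=None):
    """Return value as a unicode string, decoding byte strings only.

    Hypothetical helper; the UTF-8 fallback is an assumption.
    """
    if encoding is None:
        # sys.stdin.encoding can be None, e.g. when input is piped.
        encoding = getattr(sys.stdin, 'encoding', None) or 'utf-8'
    if isinstance(value, bytes):        # bytes is str on Python 2
        return value.decode(encoding)
    return value                        # already unicode: leave untouched
```

You would call it once where the user input comes in, for example result = to_unicode(result), and then work with unicode everywhere else.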


to avoid problems with unicode objects that have different encodings - or isn't this necessary at all ?

I'm sorry if this is a dumb question, but I'm afraid I'm a complete encoding-idiot.

This article gives a lot of good background: http://www.joelonsoftware.com/articles/Unicode.html

I have written an essay about console encoding issues. At the end there is a collection of links to more general Python and Unicode articles.
http://www.pycs.net/users/0000323/stories/14.html


Kent


Thanks and best regards

Michael




_______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor


