Ray Jones wrote:

>> You can work around that by specifying the appropriate encoding
>> explicitly:
>>
>> $ python tmp2.py iso-8859-5 | cat
>> �
>> $ python tmp2.py latin1 | cat
>> Traceback (most recent call last):
>>   File "tmp2.py", line 4, in <module>
>>     print u"Я".encode(encoding)
>> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u042f' in
>> position 0: ordinal not in range(256)
>>
> But doesn't that entail knowing in advance which encoding you will be
> working with? How would you automate the process while reading existing
> files?

If you don't *know* the encoding you *have* to guess. For instance, you could
default to UTF-8 and fall back to Latin-1 if you get an error. Decoding
non-UTF-8 data with a UTF-8 decoder is likely to fail, whereas Latin-1 will
always "succeed" because there is one codepoint associated with every possible
byte. The result, however, may not make sense. Think

    import codecs

    # A JPEG is binary data, yet Latin-1 decodes every byte without complaint.
    for line in codecs.open("lol_cat.jpg", encoding="latin1"):
        print line.rstrip()
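
A rough sketch of the UTF-8-then-Latin-1 fallback described above, assuming you already have the raw bytes in hand (for example from a file opened in binary mode). The name decode_with_fallback is made up for illustration and is not part of the standard library; the default list of encodings is likewise just an assumption.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes data.

    Hypothetical helper: the name and the default encoding order are
    assumptions for illustration only.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This guess was wrong; try the next candidate.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Well-formed UTF-8 is accepted by the first guess.
text, used = decode_with_fallback(u"\u042f".encode("utf-8"))

# The same letter encoded as ISO-8859-5 is not valid UTF-8, so Latin-1
# takes over. It "succeeds", but yields a different character than the
# one that was written.
wrong_text, wrong_used = decode_with_fallback(u"\u042f".encode("iso-8859-5"))
```

Because Latin-1 accepts every byte, it only makes sense as the last entry in the list; anything placed after it would never be tried. The second call also shows why the fallback is a guess and not detection: the decode raises no error, but the text is not what the author wrote.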