This is not a request for help, but a demonstration of what can go wrong with text processing in Python 2.
Following up on the "Special characters" thread, one of the design flaws of Python 2 is that byte strings and text strings offer BOTH decode and encode methods, even though only one is meaningful in each case.[1]

- text strings are ENCODED to bytes;
- bytes are DECODED to text strings.

One of the symptoms of getting it wrong is when you take a Unicode text string and encode/decode it but get an error from the *opposite* operation:

py> u'ä'.decode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

Look at what happens: I try to DECODE a string, but get an ENCODE error. And even though I specified Latin 1 as the codec, Python uses ASCII. What is going on here?

Behind the scenes, the interpreter takes my text u'ä' (a Unicode string) and attempts to *encode* it to bytes first, using the default ASCII codec. That fails. Had it succeeded, it would have then attempted to *decode* those bytes using Latin 1.

Similarly:

py> b = u'ä'.encode('latin1')
py> print repr(b)
'\xe4'
py> b.encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

The error here is that I tried to encode a bunch of bytes, instead of decoding them. This time the interpreter first tries to *decode* the bytes to text using the default ASCII codec, and that implicit step is what fails.

But the insidious thing about this error is that if you are working with pure ASCII, it seems to work:

py> 'ascii'.encode('utf-16')
'\xff\xfea\x00s\x00c\x00i\x00i\x00'

That is, it *seems* to work because there's no error, but the result is pretty much meaningless: I *intended* to get a UTF-16 Unicode string, but instead I ended up with bytes, just like I started with.

Python 3 fixes this bug magnet by removing the decode method from Unicode text strings, and the encode method from byte strings.

[1] Technically this is not so, as there are codecs which can be used to convert bytes to bytes, or text to text.
But in the vast majority of common cases, codecs are used to convert bytes to text and vice versa. For the rare exception, we can use the "codecs" module.

--
Steve
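P.S. For comparison, here is a minimal sketch of how the same mistakes behave under Python 3, assuming Python 3.4 or later. The variable names are only for illustration, and 'hex_codec' and 'rot_13' are just two convenient examples of the bytes-to-bytes and text-to-text codecs mentioned in the footnote.

```python
import codecs

text = '\xe4'                        # a-umlaut (U+00E4) as a text string
data = text.encode('latin1')         # text is ENCODED to bytes: b'\xe4'
roundtrip = data.decode('latin1')    # bytes are DECODED back to text

# The "opposite" methods are gone in Python 3.
text_has_decode = hasattr(text, 'decode')     # False
bytes_has_encode = hasattr(data, 'encode')    # False

# So the pure-ASCII case no longer appears to work; it fails loudly.
error_name = None
try:
    b'ascii'.encode('utf-16')
except AttributeError as exc:
    error_name = type(exc).__name__           # 'AttributeError'

# The rare bytes-to-bytes and text-to-text codecs go through the
# codecs module instead of the string methods.
hex_bytes = codecs.encode(b'ascii', 'hex_codec')   # b'6173636969'
rot13_text = codecs.encode('ascii', 'rot_13')      # 'nfpvv'
```

The point of the hasattr checks is that the mistake is now caught whatever the data contains, rather than only when a non-ASCII character happens to turn up.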