> ########################################################################
> for encoding in ('utf-8', 'utf-16', 'utf-32'):
> for i in range(0x110000):
> aChar = unichr(i)
> try:
> someBytes = aChar.encode(encoding)
> if '\n' in someBytes:
> print("%r contains a newline in its bytes encoded with %s" %
> (aChar, encoding))
> except:
> ## Normally, try/catches with an empty except is a bad idea.
> ## Here, this is toy code, and we're just exploring.
> pass
> ########################################################################

Gaa... Sorry about the bad indenting.  Let me try that again.

####################################
for encoding in ('utf-8', 'utf-16', 'utf-32'):
    for i in range(0x110000):
        aChar = unichr(i)
        try:
            someBytes = aChar.encode(encoding)
            if '\n' in someBytes:
                print("%r contains a newline in its bytes encoded with %s" %
                      (aChar, encoding))
        except:
            ## Normally, a try/except with a bare except is a bad idea.
            ## Here, this is toy code, and we're just exploring.
            pass
####################################

> Hopefully, this makes the point clearer: we must not try to decode
> individual lines.  By that time, the damage has been done: the act of
> trying to break the file into lines by looking naively at newline byte
> characters is invalid when certain characters can themselves have
> newline characters.

Confusing last sentence.  Let me try that again.  Breaking the file into
lines by naively looking for newline bytes is invalid, because the
encoded form of certain characters itself contains a newline byte.  By
the time we try to decode each "line", it has already been cut in the
wrong place.  We've got to open the file with the right encoding in
play, so that the bytes are decoded before the text is split into lines.

Joel Spolsky's article on "The Absolute Minimum Every Software Developer
Absolutely, Positively Must Know About Unicode and Character Sets (No
Excuses!)" needs to be referenced.  :P

    http://www.joelonsoftware.com/articles/Unicode.html
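
For what it's worth, the difference between splitting on newline bytes and decoding first can be sketched roughly like this. This is only an illustration, assuming Python 3 (unlike the Python 2 snippet earlier in this message) and a made-up sample string; naive_lines and decoded_lines are hypothetical helper names, not part of any library.

```python
import io

# Made-up sample text with exactly one real line break.
# U+010A encodes in UTF-16-LE as the bytes 0A 01, so its first byte
# is the same as an ASCII newline byte.
text = "a\u010Ab\nc"
data = text.encode("utf-16-le")


def naive_lines(raw):
    """Hypothetical helper: split the raw bytes on the newline byte (broken)."""
    return raw.split(b"\n")


def decoded_lines(raw, encoding):
    """Hypothetical helper: decode with the right encoding, then read lines."""
    # TextIOWrapper stands in for opening a real file with encoding=...
    with io.TextIOWrapper(io.BytesIO(raw), encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


if __name__ == "__main__":
    # The naive split cuts through the middle of U+010A.
    for chunk in naive_lines(data):
        try:
            print("naive chunk decodes to %r" % chunk.decode("utf-16-le"))
        except UnicodeDecodeError as err:
            print("naive chunk %r cannot be decoded: %s" % (chunk, err))

    # Decoding first leaves the character intact.
    for line in decoded_lines(data, "utf-16-le"):
        print("decoded line: %r" % line)
```

With a real file, the second approach corresponds to passing the encoding when opening it, for example io.open(path, encoding="utf-16"), and only then iterating over lines.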