Jorge De Castro wrote:
> Hi all,
>
> It seems I can't get rid of my continuous issues i18n with Python :(
>
You're not alone :-)

> I've been through:
> http://docs.python.org/lib/module-email.Header.html
> and
> http://www.reportlab.com/i18n/python_unicode_tutorial.html
> to no avail.
>
Try these:
http://www.joelonsoftware.com/articles/Unicode.html
http://jorendorff.com/articles/unicode/index.html

> Basically, I'm receiving and processing mail that comes with content (from
> an utf-8 accepting form) from many locales (France, Germany, etc)
>
> def splitMessage() does what the name indicates, and send message is the
> code below.
>
> def sendMessage(text):
>     to, From, subject, body = splitMessage(text)
>     msg = MIMEText(decodeChars(body), 'plain', 'UTF-8')
>     msg['From'] = From
>     msg['To'] = to
>     msg['Subject'] = Header(decodeChars(subject), 'UTF-8')
>
> def decodeChars(str=""):
>     if not str: return None
>     for characterCode in _characterCodes.keys():
>         str = str.replace(characterCode, _characterCodes[characterCode])
>     return str
>
> Now as you have noticed, this only works, ie, I get an email sent with the
> i18n characters displayed correctly, after I pretty much wrote my own
> 'urldecode' map
>
> _characterCodes ={ "%80" : "�", "%82" : "�", "%83" : "�", "%84" : "�", \
>                    "%85" : "�", "%86" : "�", "%87" : "�", "%88" : "�", \
>                    "%89" : "�", "%8A" : "�", "%8B" : "�", "%8C" : "�", \
>                    "%8E" : "�", "%91" : "�", "%92" : "�", "%93" : "�", \
>                    "%94" : "�", "%95" : "�", "%96" : "�", "%97" : "�", \
>                    ...
>
> Which feels like an horrible kludge.
>
This _characterCodes map replaces chars in the range 80-9F with a Unicode "undefined" marker, so I don't understand how using it gives you a correct result.

> Note that using urlilib.unquote doesn't do it -I get an error saying that it
> is unable to . Replacing my decodeChars
>
> msg = MIMEText(urllib.unquote(body), 'plain', 'UTF-8')
>
> Returns content with i18n characters mangled.
>
From the selection of characters you have chosen to replace, my guess is that your source data is urlencoded Cp1252, not urlencoded UTF-8. So when you unquote it and then call it UTF-8, which is what the above code does, you get incorrect display. What happens if you change UTF-8 to Cp1252 in the call to MIMEText?

> Using unicode(body, 'latin-1').encode('utf-8') doesn't work either. Besides,
> am I the only one to feel that if I want to encode something in UTF-8 it
> doesn't feel intuitive to have to convert to latin-1 first and then encode?
>
It doesn't work because the urlencoded text is still plain ascii (the %XX escapes have not been unquoted yet), not latin-1. I suspect that
    unicode(urllib.unquote(body), 'Cp1252').encode('UTF-8')
would give you what you want: unquote first to get the raw bytes, decode those bytes from Cp1252 to Unicode, then encode the Unicode string as UTF-8.

> Any ideas? I am dry on other option and really don't want to keep my kludge
> (unless I absolutely have to)
>
Post some of your actual data, it will be obvious whether it is encoded from Cp1252 or UTF-8.
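
To make the unquote-then-decode idea concrete, here is a minimal sketch. It is written for Python 3, where the Python 2 calls used in this thread become urllib.parse.unquote_to_bytes and bytes.decode. The function names and the sample strings are made up for illustration, and Cp1252 is still only a guess about what the form actually sends.

```python
from email.header import Header
from email.mime.text import MIMEText
from urllib.parse import unquote_to_bytes


def looks_like_utf8(quoted):
    """Return True if the unquoted bytes are valid UTF-8.

    Pure ASCII also counts as valid UTF-8, so only text containing
    non-ASCII characters tells you anything.
    """
    raw = unquote_to_bytes(quoted)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_form_text(quoted, source_encoding="cp1252"):
    """Turn percent-encoded form data into a Unicode string.

    source_encoding is an assumption about the form; pass "utf-8"
    if the real data turns out to be UTF-8.
    """
    raw = unquote_to_bytes(quoted)        # "%E9" -> b"\xe9"
    return raw.decode(source_encoding)    # bytes -> Unicode text


def build_message(sender, recipient, subject, body):
    """Build a UTF-8 mail from percent-encoded subject and body."""
    msg = MIMEText(decode_form_text(body), "plain", "utf-8")
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = Header(decode_form_text(subject), "utf-8")
    return msg
```

Running looks_like_utf8 over a few real submissions that contain accented characters is a quick way to check the guess: Cp1252 data such as "caf%E9" fails the UTF-8 decode, while UTF-8 data such as "caf%C3%A9" passes.
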

Keep trying, it's worth it to actually understand what is going on. Trying to solve encoding problems when you don't understand the basic issues is unlikely to give a good solution.

Kent

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor