On Sun, Nov 22, 2015 at 11:19:17PM -0500, bruce wrote:
> Hi.
>
> Doing a 'simple' test with linux command line curl, as well as pycurl
> to fetch a page from a server.
>
> The page has a charset of >>AL32UTF8.

I had never heard of that before, so I googled for it. No surprise, it
comes from Oracle, and they have made a complete dog's breakfast out of
it. According to the answers here:

https://community.oracle.com/thread/3514820

(1) Oracle thinks that UTF-8 is a two-byte encoding (it isn't);

(2) AL32UTF8 has "extra characters" that UTF-8 doesn't, but UTF-8 is a
superset of AL32UTF8 (that's a contradiction!);

(3) Oracle's UTF-8 is actually the abomination more properly known as
CESU-8: http://www.unicode.org/reports/tr26/

(4) Oracle's AL32UTF8 might actually be the real UTF-8, not "Oracle
UTF-8", which is rubbish.


> Anyway to conert this to straight ascii. Python is throwing a
> notice/error on the charset in another part of the test..
>
> The target site is US based, so there's no weird chars in it..

I wouldn't be so sure about that.


> I suspect that the page/system is based on legacy oracle
>
> The metadata of the page is
>
> <META HTTP-EQUIV="Content-Type" NAME="META" CONTENT="text/html;
> charset=AL32UTF8">
>
> I tried the usual
>
> foo = foo.decode('utf-8')

And what happened? Did you get an error? Please copy and paste the
complete traceback.

The easy way to hit this problem with a hammer and "fix it" is to do
this:

foo = foo.decode('utf-8', errors='replace')

but that will replace any malformed UTF-8 bytes with U+FFFD replacement
characters (which often display as question marks). Decoding as ASCII
the same way goes further and replaces every non-ASCII byte as well:

py> s = u"abc π def".encode('utf-8')  # non-ASCII string
py> print s.decode('ascii', errors='replace')
abc �� def

which loses data. That should normally be considered a last resort.

It might also help to open the downloaded file in a hex editor and see
if it looks like binary or text. If you see lots of zeroes, e.g.:

...006100340042005600...

then the encoding is probably not UTF-8.


-- 
Steve
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
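
P.S. The checks suggested above (a strict decode to see the real error,
a look for zero bytes, and the lossy decode as a last resort) could be
sketched roughly like this. It assumes Python 3 and that the downloaded
page is held as a bytes object; the names diagnose and decode_lossy and
the one-quarter threshold for zero bytes are made up for illustration,
not part of curl, pycurl or the standard library.

```python
def diagnose(raw):
    """Describe how `raw` (bytes) looks; a hypothetical helper."""
    # Many NUL bytes suggest a 16- or 32-bit encoding rather than UTF-8.
    # The one-quarter threshold is an arbitrary assumption.
    if raw.count(b"\x00") > len(raw) // 4:
        return "many NUL bytes: probably UTF-16 or UTF-32, not UTF-8"
    try:
        raw.decode("utf-8")  # strict mode raises on malformed input
    except UnicodeDecodeError as err:
        bad = raw[err.start:err.start + 1]
        return "not valid UTF-8: byte %r at offset %d (%s)" % (
            bad, err.start, err.reason)
    return "valid UTF-8"


def decode_lossy(raw):
    """Last resort: malformed bytes become U+FFFD, so data is lost."""
    return raw.decode("utf-8", errors="replace")
```

Calling diagnose(foo) on the raw page before any decoding should say
whether plain UTF-8 is enough. Under Python 3 the strict decoder also
rejects CESU-8 style encoded surrogates, so a page that really is
CESU-8 and contains characters outside the BMP would show up as
"not valid UTF-8" rather than decode silently.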