This is probably more of a comp.lang.python question, since the XML content seems low, but anyway ...
On Mon, 2005-03-21 at 23:22 +0000, James King wrote:
> Hi
>
> I am trying to convert CGI data, which arrives encoded with escape
> characters, into unicode data.
>
> t represents the type of character data that I start with (the result
> of fetching cgi-field data from a cgi.FieldStorage object).
>
> >>> t = '\x93quotation marks\x94, and a series of other characters:
> \x91\xe5 \xdf \xa9 \xe6 \xee \x9c\x92'
> >>> import codecs
> >>> t.encode('utf-8')
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: ASCII decoding error: ordinal not in range(128)

Not surprising. t is a byte string, so before it can be encoded Python
first has to decode it to unicode. Given no other information, it
assumes t is ASCII, which is obviously not true (as indicated by the
error message).

The problem is that you need to know the encoding of the input string.
The initial 0x93 byte is a problem: it is not ASCII, it is not a
printable character in any ISO-8859-* encoding (the range 0x80-0x9F
holds only control codes there), and it cannot start a valid UTF-8
sequence. The data does not look like UTF-16 either. It looks like it
might be the Windows-specific code page 1252 encoding, where 0x93 and
0x94 are the curly double quotation marks, since you are hinting that
there are initial quotation marks.

Once you know the encoding, decode the string explicitly:

    >>> s = unicode(t, encoding)

where 'encoding' is a string like 'iso-8859-1' or 'utf-8' or whatever
the input encoding is (for code page 1252, Python's codec name is
'cp1252'). After that, statements like s.encode('utf-8') will make
sense.

Cheers,
Malcolm
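
For reference, here is a rough sketch of the decode-then-encode steps described above. It is written in Python 3 syntax, which is an assumption on my part (there the byte string is a bytes object and unicode(t, encoding) becomes t.decode(encoding)). It also assumes the form data really is code page 1252, which is only a guess based on the 0x93/0x94 bytes.

```python
# Sample bytes from the question, as they might arrive from the CGI form.
raw = (b"\x93quotation marks\x94, and a series of other characters: "
       b"\x91\xe5 \xdf \xa9 \xe6 \xee \x9c\x92")

# Assumption: the input is Windows code page 1252 (codec name 'cp1252').
text = raw.decode("cp1252")

# Once the data is text, encoding it as UTF-8 is well defined.
utf8_bytes = text.encode("utf-8")

# Decoding the same bytes as ASCII fails on the very first byte (0x93),
# which is the failure behind the traceback in the question.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```

If the guess about the encoding is wrong, the decode step may still succeed but produce the wrong characters, so it is worth checking what charset the form page itself declares.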