On Sun, Jan 5, 2014 at 5:26 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:
>>
>> <?xml version="1.0" encoding="ISO-8859-1" ?>
>
> That surprises me. I thought XML was only valid in UTF-8? Or maybe that
> was wishful thinking.

JSON text SHALL be encoded in Unicode:

https://tools.ietf.org/html/rfc4627#section-3

For XML, UTF-8 is recommended by RFC 3023, but not required. Also, the
MIME charset takes precedence over the encoding declared in the XML
itself. Section 8 has examples:

https://tools.ietf.org/html/rfc3023#section-8

So I was technically wrong to rely on the XML encoding (they happen to
be the same in this case). Instead you can create a parser with the
encoding from the header:

    encoding = response.headers.getparam('charset')
    parser = ET.XMLParser(encoding=encoding)
    tree = ET.parse(response, parser)

The expat parser (pyexpat) used by Python is limited to ASCII, Latin-1
and Unicode transport encodings. So it's probably better to transcode
to UTF-8 as Alex is doing, but then use a custom parser to override
the XML encoding:

    encoding = response.headers.getparam('charset')
    info = response.read().decode(encoding).encode('utf-8')
    parser = ET.XMLParser(encoding='utf-8')
    # fromstring returns the root Element, not an ElementTree
    root = ET.fromstring(info, parser)
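
For anyone who wants to try the transcoding approach without a live HTTP response, here is a minimal, self-contained Python 3 sketch. The helper name parse_xml_bytes and the sample document are made up for illustration, and the fallback used when no charset is supplied is my assumption, not something RFC 3023 prescribes. On Python 3 the charset would typically come from response.headers.get_content_charset() rather than getparam('charset').

```python
import xml.etree.ElementTree as ET


def parse_xml_bytes(raw, charset=None):
    """Parse XML bytes, letting an external charset override the XML declaration.

    Hypothetical helper for illustration; not part of ElementTree.
    """
    if charset is None:
        # Assumption: with no MIME charset available, fall back to the
        # encoding declared in the document itself.
        return ET.fromstring(raw)
    # Transcode to UTF-8 using the charset from the transport layer.
    utf8_bytes = raw.decode(charset).encode('utf-8')
    # The XML declaration is now stale, so tell the parser to ignore it
    # and read the bytes as UTF-8.
    parser = ET.XMLParser(encoding='utf-8')
    return ET.fromstring(utf8_bytes, parser)


# Sample data: the declaration claims ISO-8859-1, but the bytes are
# really Windows-1252 (the euro sign is byte 0x80 there).
raw = ('<?xml version="1.0" encoding="ISO-8859-1" ?>'
       '<price>\u20ac5</price>').encode('cp1252')

root = parse_xml_bytes(raw, 'cp1252')
print(root.text == '\u20ac5')
```

Decoding with the header charset first means the parser only ever sees UTF-8, so it does not matter whether expat knows the original encoding or whether the XML declaration agrees with the header.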