Fabian López wrote:
> I am parsing an XML file that includes Chinese characters, like
> 評評啖啖才是眞.細氺長锍才是愛 or ヘアアイロン... The problem is that I get an error like:
> UnicodeEncodeError: 'charmap' codec can't encode characters in position....
> The thing is that I would like to ignore it and parse all the characters
> except these ones. So, could anyone help me? I suppose that I can catch an
> exception that ignores it or maybe use any function that detects these
> Chinese characters and after that ignore them.
If the parser can't handle the characters here, it's because the document is broken and does not declare the correct encoding.

From your last post, I assume you're using lxml to do this (it's always helpful to state what software you use when you describe a problem with it). Since 2.0alpha3(?), you can override the encoding of the parsed file with the "encoding" keyword argument of the XMLParser class.

So, for example, you can try parsing the document as usual and, if that fails, parse it again with a second parser that is configured with a specific encoding override. Or you can determine the encoding from some external source (like what the HTTP protocol tells you) and then either use an override parser right away or use that information as the first fallback.

Stefan
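
A minimal sketch of the "parse as usual, then retry with an override" approach described above. The helper name parse_with_fallback and the sample document are made up for illustration, and the fallback encoding is something you have to know or guess for your own data. If lxml is not installed, the sketch falls back to the standard library's ElementTree, whose XMLParser accepts the same "encoding" keyword, though it may not handle every encoding that lxml does.

```python
try:
    from lxml import etree
except ImportError:
    # Stand-in so the sketch still runs without lxml; its XMLParser
    # accepts the same "encoding" keyword.
    import xml.etree.ElementTree as etree


def parse_with_fallback(data, fallback_encoding):
    """Parse XML bytes; if that fails, retry with an encoding override.

    Hypothetical helper, not part of lxml.
    """
    try:
        # First attempt: trust whatever the document declares (or UTF-8).
        return etree.fromstring(data)
    except etree.ParseError:
        # Second attempt: a parser that ignores the declared encoding.
        parser = etree.XMLParser(encoding=fallback_encoding)
        return etree.fromstring(data, parser)


# Latin-1 bytes without an encoding declaration: not valid UTF-8,
# so the first attempt fails and the override parser is used.
broken = b"<r>caf\xe9</r>"
root = parse_with_fallback(broken, "iso-8859-1")
```

If an external source such as the HTTP Content-Type header already tells you the encoding, you could skip the first attempt and create the override parser directly.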