Albert-Jan Roskam <[email protected]> wrote:

> # CODE:
> for element in doc.getiterator():
>     try:
>         m = re.match(search_text, str(element.text))
>     except UnicodeEncodeError:
>         raise # I want to get rid of this exception.
First, you should separate the two actions done in a single statement to isolate the source of the error:

    for element in doc.getiterator():
        try:
            source = str(element.text)
        except UnicodeEncodeError:
            raise # I want to get rid of this exception.
        else:
            m = re.match(search_text, source)

I guess

    source = unicode(element.text, "utf8")

should do the job if you actually know the elements are utf8 encoded (else try latin1, or better, get proper information on the origin of your doc files).

PS: I just discovered Python's built-in attribute file.encoding, which should give you the proper encoding to pass to unicode(..., encoding).

PPS: You should in fact decode the whole source before parsing it, no? (Meaning: parse a unicode object, not encoded text.)

Denis
________________________________
la vita e estrany
http://spir.wikidot.com/
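To make the "decode first, then parse and match" idea concrete, here is a minimal sketch. It is only an illustration under several assumptions: it targets Python 3 (where str is the Unicode type and Element.iter() replaces getiterator()), the input is assumed to be UTF-8, the sample document is made up, and the helper names to_text and matching_elements are hypothetical. A UnicodeEncodeError from str() usually means the value is already a unicode object, so the helper only decodes when it is handed bytes.

```python
import re
import xml.etree.ElementTree as ET


def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding only when it is bytes."""
    if value is None:
        # Elements without text have .text == None; treat that as empty text
        # (unlike str(None), which would give the string "None").
        return ""
    if isinstance(value, bytes):
        # Explicit decode; the encoding is an assumption about the data.
        return value.decode(encoding)
    return value  # already Unicode text, nothing to convert


def matching_elements(doc, search_text):
    """Yield (element, match) pairs whose text matches search_text at its start."""
    pattern = re.compile(search_text)
    for element in doc.iter():
        m = pattern.match(to_text(element.text))
        if m:
            yield element, m


# Made-up sample: these bytes stand in for the content of a UTF-8 file.
raw = "<root><a>caf\u00e9 au lait</a><b>tea</b><c/></root>".encode("utf-8")

# Decode the whole source first, then parse the Unicode text.
doc = ET.fromstring(raw.decode("utf-8"))

hits = [(el.tag, m.group(0)) for el, m in matching_elements(doc, "caf\u00e9")]
print(hits)
```

With this layout no try/except is needed around the match: the only place an encoding error can occur is the single explicit decode, where it is easy to spot and to fix by choosing the right encoding.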
