Albert-Jan Roskam <[email protected]> wrote:

> # CODE:
> for element in doc.getiterator():
>   try:
>     m = re.match(search_text, str(element.text))
>   except UnicodeEncodeError:
>     raise # I want to get rid of this exception.


First, you should split the two actions that are done in a single statement, 
to isolate the source of the error:
for element in doc.getiterator():
  try:
    source = str(element.text)
  except UnicodeEncodeError:
    raise # I want to get rid of this exception.
  else:
    m = re.match(search_text, source)

I guess
   source = unicode(element.text, "utf8")
should do the job if you actually know the elements are utf8 encoded (else try 
latin1, or better, get proper information on the origin of your doc files).
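
A minimal sketch of that "try utf8, else latin1" idea. The helper name to_text 
and the sample values are hypothetical, not from the original code. It passes 
text that is already decoded through unchanged, since unicode() refuses to 
decode a unicode object, and a UnicodeEncodeError from str() suggests that 
element.text may already be one:

```python
# -*- coding: utf-8 -*-
import re


def to_text(value, encodings=("utf-8", "latin-1")):
    """Return value as decoded text (hypothetical helper, for illustration)."""
    if value is None:
        return u""                      # elements without text give None
    if not isinstance(value, bytes):
        return value                    # already decoded, nothing to do
    for enc in encodings:
        try:
            return value.decode(enc)    # first encoding that fits wins
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, replacing the bytes that cannot be decoded.
    return value.decode(encodings[-1], "replace")


search_text = u"caf"                    # assumed pattern, for the demo only
samples = (b"caf\xc3\xa9", b"caf\xe9", u"caf\xe9", None)
matches = [re.match(search_text, to_text(raw)) is not None for raw in samples]
```

Guessing encodings this way is only a fallback: latin1 accepts any byte 
sequence, so it never fails but can silently give wrong characters.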

PS: I just discovered Python's built-in attribute file.encoding, which should 
give you the proper encoding to pass to unicode(..., encoding).
PPS: You should in fact decode the whole source before parsing it, no? (Meaning: 
parse a unicode object, not encoded text.)
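
A rough sketch of the "decode first, then parse" idea from the PPS, assuming 
Python 3, where xml.etree.ElementTree.fromstring accepts decoded text. The 
function name matching_elements, the sample document and the latin-1 encoding 
are all assumptions made for the demo:

```python
import io
import os
import re
import tempfile
import xml.etree.ElementTree as ET


def matching_elements(path, encoding, search_text):
    """Decode the whole file once, then parse and match on decoded text."""
    with io.open(path, "r", encoding=encoding) as f:
        text = f.read()                 # single decoding step for the source
    doc = ET.fromstring(text)           # parses text, not encoded bytes
    pattern = re.compile(search_text)
    # iter() is the current spelling of getiterator(); text may be None.
    return [el.tag for el in doc.iter() if pattern.match(el.text or "")]


# Demo: write a small latin-1 encoded file, then search it.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(u"<doc><a>caf\xe9</a><b>tea</b></doc>".encode("latin-1"))
try:
    tags = matching_elements(path, "latin-1", u"caf")
finally:
    os.remove(path)
```

With the text decoded up front, no str() call is needed inside the loop, so 
there is no place left for the UnicodeEncodeError to come from.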

Denis
________________________________

la vita e estrany

http://spir.wikidot.com/


_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
