Re: Special chars with HTMLParser

2009-08-07 Thread Stefan Behnel
Fafounet wrote: I am parsing a web page with special chars such as &#xE9; (which stands for é). I know I can get the unicode character é with unicode('\xe9', 'iso-8859-1'), but with those extra characters I don't know how. I tried to implement handle_charref within HTMLParser without success.

Special chars with HTMLParser

2009-08-05 Thread Fafounet
Hello, I am parsing a web page with special chars such as &#xE9; (which stands for é). I know I can get the unicode character é with unicode('\xe9', 'iso-8859-1'), but with those extra characters I don't know how. I tried to implement handle_charref within HTMLParser without success. Furthermore, if I

Re: Special chars with HTMLParser

2009-08-05 Thread Piet van Oostrum
Fafounet fafou...@gmail.com wrote:
> Hello, I am parsing a web page with special chars such as &#xE9; (which stands for é). I know I can get the unicode character é with unicode('\xe9', 'iso-8859-1'), but with those extra characters I don't know how. I tried to implement handle_charref

Re: Special chars with HTMLParser

2009-08-05 Thread Fafounet
Thank you, now I can get the correct character. Now when I have the string ab&#xE9;cd I can get ab, then é thanks to your function, and then cd. But how is it possible to know that cd is still part of the same word? Fabien
> The character references indicate Unicode ordinals, not iso-8859-1 characters.
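The point about Unicode ordinals can be sketched as follows. This is only an illustration, written for Python 3 rather than the Python 2 of this thread, and charref_to_text is a hypothetical helper name, not the function posted earlier in the thread. It takes the string that handle_charref receives (the part between "&#" and ";") and returns the character with that code point.

```python
def charref_to_text(name):
    """Convert the body of a numeric character reference to a character.

    `name` is what HTMLParser passes to handle_charref: 'xE9' for
    '&#xE9;' or '233' for '&#233;'. The number is a Unicode code point,
    not a byte in iso-8859-1. (Hypothetical helper, for illustration.)
    """
    if name[:1] in ("x", "X"):
        # Hexadecimal form: drop the leading 'x' and parse base 16.
        return chr(int(name[1:], 16))
    # Decimal form.
    return chr(int(name, 10))
```

For code points below 256 the result happens to coincide with iso-8859-1, which is probably why decoding as iso-8859-1 looks correct for é but breaks for references such as &#x20AC;.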

Re: Special chars with HTMLParser

2009-08-05 Thread Piet van Oostrum
Fafounet fafou...@gmail.com wrote:
> Thank you, now I can get the correct character. Now when I have the string ab&#xE9;cd I can get ab, then é thanks to your function, and then cd. But how is it possible to know that cd is still part of the same word?
That depends on your definition of `word'.
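One possible way to keep ab, é and cd together, assuming a "word" simply means a whitespace-separated run of text, is to stop treating each callback as a unit: collect every piece in a buffer and split only after parsing is finished. The sketch below uses Python 3's html.parser; the names TextCollector and extract_words are made up for illustration. convert_charrefs=False is passed so that handle_charref is actually called (by default, newer Python 3 versions convert the references themselves and deliver them through handle_data).

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text and numeric character references into one buffer.

    Hypothetical class for illustration. Named entities such as
    &eacute; arrive via handle_entityref and are ignored here.
    """

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_charref.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # 'xE9' (hex) or '233' (decimal): both are Unicode code points.
        if name[:1] in ("x", "X"):
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name)))

    def text(self):
        # Join first, so pieces from adjacent callbacks stay together.
        return "".join(self.parts)


def extract_words(html):
    """Return the whitespace-separated words of the document text."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text().split()
```

With this approach extract_words("<p>ab&#xE9;cd ef</p>") yields the two words abécd and ef. Text on either side of a tag is also joined without a separator, so a different definition of word (for example, one that breaks at tags) would need handle_starttag and handle_endtag to append a space.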