Hello,
I am parsing a web page with special chars such as &#xE9; (which
stands for é).
I know I can get the Unicode character é from
unicode('\xe9', 'iso-8859-1'),
but with those extra characters I don't know.
I tried to implement handle_charref within HTMLParser without success.
Furthermore, if I
Fafounet fafou...@gmail.com (F) wrote:
F Hello,
F I am parsing a web page with special chars such as &#xE9; (which
F stands for é).
F I know I can get the Unicode character é from
F unicode('\xe9', 'iso-8859-1'),
F but with those extra characters I don't know.
F I tried to implement handle_charref
Thank you, now I can get the correct character.
Now when I have the string ab&#xE9;cd I can get ab, then é thanks to
your function, and then cd. But how is it possible to know that cd is
still part of the same word?
Fabien
The character references indicate Unicode ordinals, not iso-8859-1
characters.
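
To illustrate the point, here is a minimal sketch in Python 3 (the thread itself predates it, so this is an assumption about the reader's setup). The helper name charref_to_char is made up for this example; it takes the text that handle_charref receives, i.e. the reference without the leading &# and the trailing semicolon, and treats the number as a Unicode code point.

```python
def charref_to_char(name):
    """Convert the body of a numeric character reference to a character.

    `name` is what HTMLParser.handle_charref receives, for example
    'xE9' (hexadecimal form) or '233' (decimal form).
    """
    if name[:1] in ("x", "X"):
        # Hexadecimal reference: drop the leading x and parse base 16.
        codepoint = int(name[1:], 16)
    else:
        # Decimal reference.
        codepoint = int(name)
    # The number is a Unicode ordinal, so chr() is enough; no
    # iso-8859-1 decoding step is involved.
    return chr(codepoint)
```

Under Python 2 the equivalent call would be unichr() instead of chr().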
Fafounet fafou...@gmail.com (F) wrote:
F Thank you, now I can get the correct character.
F Now when I have the string ab&#xE9;cd I can get ab, then é thanks to
F your function, and then cd. But how is it possible to know that cd is
F still part of the same word?
That depends on your definition of `word'.
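
One possible approach, assuming a word simply means a run of non-whitespace characters: collect every piece of text the parser hands over, join the pieces, and split only at the end. The sketch below uses Python 3's html.parser; the names TextCollector and extract_words are invented for this example, and convert_charrefs=False is passed so that handle_charref is actually called (recent Python 3 versions otherwise convert references before handle_data sees them). Named references such as &eacute; are not handled here.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text and numeric character references in document order."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref in play.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # Plain text between tags and references, e.g. 'ab' and 'cd'.
        self.parts.append(data)

    def handle_charref(self, name):
        # name is 'xE9' for a hexadecimal reference, '233' for a decimal one.
        if name[:1] in ("x", "X"):
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name)))

    def text(self):
        # Joining without a separator keeps 'ab', 'é', 'cd' in one word.
        return "".join(self.parts)


def extract_words(html):
    """Return the whitespace-separated words found in an HTML fragment."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text().split()
```

With this, the three pieces ab, é and cd end up in a single string before any splitting takes place, so they stay one word unless the page itself puts whitespace between them.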