Re: encoding in lxml

2008-11-03 Thread Stefan Behnel
jasiu85 wrote:
> I have a problem with character encoding in LXML. Here's how it goes:
>
> I read an HTML document from a third-party site. It is supposed to be
> in UTF-8, but unfortunately from time to time it's not.

You can instantiate your own HTML parser and pass enco
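A minimal sketch of the "own HTML parser" idea from this reply, assuming the truncated text refers to the `encoding` argument of `lxml.etree.HTMLParser`. The helper name `parse_html_bytes` is made up for illustration. It only shows how to force the declared encoding; what the parser does with bytes that are not valid in that encoding is not covered here.

```python
from lxml import etree


def parse_html_bytes(raw: bytes, encoding: str = "utf-8"):
    """Parse raw HTML bytes with an explicitly chosen encoding.

    `parse_html_bytes` is a hypothetical helper, not part of lxml.
    """
    # Build a dedicated parser instead of relying on the default one,
    # so lxml does not have to guess the document's encoding.
    parser = etree.HTMLParser(encoding=encoding)
    # Returns the root <html> element of the parsed tree.
    return etree.fromstring(raw, parser)


if __name__ == "__main__":
    page = b"<html><body><p>caf\xc3\xa9</p></body></html>"
    root = parse_html_bytes(page)
    print(root.findtext(".//p"))
```

Passing bytes rather than an already decoded string matters here, since the `encoding` override applies to the byte input the parser has to decode.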

Re: encoding in lxml

2008-11-03 Thread pjacobi . de
Hi Mike,

> I read an HTML document from a third-party site. It is supposed to be
> in UTF-8, but unfortunately from time to time it's not.

There will be a host of more lightweight solutions, but you can opt to sanitize incoming HTML with HTML Tidy (Python binding available). It will replace inval
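For comparison, one of the more lightweight options might look like the sketch below. This is not HTML Tidy and is not taken from the reply; it is plain standard-library Python, and `decode_lenient` is a hypothetical name. It decodes the incoming bytes as UTF-8 and substitutes U+FFFD for any byte sequence that is not valid, so the parser later receives clean text.

```python
def decode_lenient(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, replacing undecodable sequences with U+FFFD.

    Hypothetical helper for illustration; it does not repair markup,
    it only guarantees the result is a valid Python string.
    """
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    # \xff can never appear in valid UTF-8, so it gets replaced.
    print(repr(decode_lenient(b"ok \xff end")))
```

The trade-off is that the broken characters are lost rather than recovered, which may or may not be acceptable depending on what is extracted from the page.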

encoding in lxml

2008-11-03 Thread jasiu85
Hey,

I have a problem with character encoding in LXML. Here's how it goes:

I read an HTML document from a third-party site. It is supposed to be in UTF-8, but unfortunately from time to time it's not. I parse the document like this:

html_doc = HTML(string_with_document)

Then I ret