Re: Python HTML parser chokes on UTF-8 input

2008-10-17 Thread John Nagle
Johannes Bauer wrote: Hello group, I'm trying to use an htmllib.HTMLParser-derived class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: as soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The code is something like

Re: Python HTML parser chokes on UTF-8 input

2008-10-10 Thread Marc 'BlackJack' Rintsch
On Fri, 10 Oct 2008 00:13:36 +0200, Johannes Bauer wrote:
> Terry Reedy wrote:
>> I believe you are confusing unicode with unicode encoded into bytes
>> with the UTF-8 encoding. Having a problem feeding a unicode string,
>> not 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte
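
The distinction drawn in the quoted reply, between a unicode string and the same text encoded into UTF-8 bytes, can be illustrated with a minimal sketch. It is written for Python 3, where the two types are called `str` and `bytes`; the thread dates from 2008, when Python 2 called them `unicode` and `str`. The sample word is an assumption chosen only for illustration.

```python
# A text string: a sequence of Unicode code points.
text = "Gr\u00f6\u00dfe"          # "Größe", 5 code points

# The same text encoded into bytes with the UTF-8 encoding.
data = text.encode("utf-8")       # the two non-ASCII characters take 2 bytes each

# Decoding the bytes with the matching encoding restores the text.
round_trip = data.decode("utf-8")

print(len(text), len(data), round_trip == text)
```

The two objects hold the same content but are different types, so string operations should be applied consistently to one or the other, not to a mix of both.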

Re: Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Terry Reedy
Johannes Bauer wrote: Terry Reedy wrote: Johannes Bauer wrote: Hello group, I'm trying to use an htmllib.HTMLParser-derived class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: as soon as I pass the htmllib.HTMLParser UTF-8

Re: Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Jerry Hill
On Thu, Oct 9, 2008 at 4:54 PM, Johannes Bauer wrote:
> Hello group,
>
> Now when I take "website" directly from the parser, everything is fine.
> However I want to do some modifications before I parse it, namely UTF-8
> modifications in the style:
>
> website = website.replace(

Re: Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Johannes Bauer
Terry Reedy wrote:
> Johannes Bauer wrote:
>> Hello group,
>>
>> I'm trying to use an htmllib.HTMLParser-derived class to parse a website
>> which I fetched via
>> httplib.HTTPConnection().request().getresponse().read(). Now the problem
>> is: as soon as I pass the htmllib.HTMLParser UTF-8 code,

Re: Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Terry Reedy
Johannes Bauer wrote: Hello group, I'm trying to use an htmllib.HTMLParser-derived class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: as soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The code is something like this: prs = self.parser

Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Johannes Bauer
Hello group, I'm trying to use an htmllib.HTMLParser-derived class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: as soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The code is something like this: prs = self.parser
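
One way to approach the situation described above is to decode the fetched bytes into text once, make any replacements on the text, and only then feed the parser. The sketch below is not the poster's code, which is cut off in this listing. It assumes Python 3, where `htmllib` no longer exists, and uses the standard library's `html.parser.HTMLParser` in its place. The class `TextCollector`, the function `parse_utf8_page`, the sample page, and the specific replacement are all hypothetical names and values chosen for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical parser subclass that collects the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for text content; data is already a str.
        self.chunks.append(data)


def parse_utf8_page(raw: bytes) -> str:
    """Decode UTF-8 bytes, modify the text, then parse it."""
    # Assumption: the page really is UTF-8. A real fetch should check
    # the Content-Type header or the page's declared charset first.
    text = raw.decode("utf-8")

    # Hypothetical modification, applied to text rather than to bytes.
    text = text.replace("\u00e4", "ae")

    parser = TextCollector()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)


# Stand-in for the bytes returned by an HTTP response's read().
raw = "<html><body><p>K\u00e4se</p></body></html>".encode("utf-8")
result = parse_utf8_page(raw)
print(result)
```

Keeping the decode step at the boundary where the bytes arrive means every later step, including the replacement and the parser callbacks, works with one string type.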