Re: Python HTML parser chokes on UTF-8 input
Johannes Bauer wrote:
> Hello group,
>
> I'm trying to use a htmllib.HTMLParser derived class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The code is something like this:

Try BeautifulSoup. It actually understands how to detect the encoding of an HTML file (there are three different ways that information can be expressed), and will shift modes accordingly.

This is an ugly problem. Sometimes, it's necessary to parse part of the file, discover that the rest of the file has a non-ASCII encoding, and restart the parse from the beginning. BeautifulSoup has the machinery for that.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
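The "parse a bit, discover the encoding, start over" idea can be sketched with the standard library alone. This is not how BeautifulSoup is implemented; it is a rough illustration written for Python 3, and it only looks at one of the possible places an encoding can be declared (a charset declaration near the top of the document). The helper names `sniff_encoding` and `decode_html` and the sample page are made up for the example.

```python
import re

# Hypothetical helper pattern: finds a charset declaration such as
# <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE)


def sniff_encoding(raw, default="utf-8"):
    """Return the encoding declared in the first 1024 bytes, or the default."""
    match = CHARSET_RE.search(raw[:1024])
    if match is None:
        return default
    return match.group(1).decode("ascii")


def decode_html(raw):
    """Decode raw HTML bytes using the declared encoding, starting over from byte 0."""
    encoding = sniff_encoding(raw)
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: Latin-1 maps every byte to a character,
        # so this fallback never fails (it may still produce the wrong text).
        return raw.decode("latin-1")


# Made-up sample page, Latin-1 encoded.
page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=iso-8859-1"></head>'
    b"<body>f\xf6\xf6</body></html>"
)
text = decode_html(page)
```

The first pass treats the head of the file as ASCII-compatible bytes; only once the declaration is found is the whole document decoded from the beginning.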
Re: Python HTML parser chokes on UTF-8 input
On Fri, 10 Oct 2008 00:13:36 +0200, Johannes Bauer wrote:
> Terry Reedy wrote:
>> I believe you are confusing unicode with unicode encoded into bytes with the UTF-8 encoding. You have a problem feeding a unicode string, not 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.
>
> I also believe I am. Could you please elaborate further?
>
> Do I understand correctly when saying that type 'str' has no associated default encoding, but type 'unicode' does?

`str` doesn't know an encoding. The content could be any byte data anyway. And `unicode` doesn't know an encoding either, it is unicode characters. How they are represented internally is not the business of the programmer. If you want to operate with unicode characters you have to decode a byte string (`str`) with the appropriate encoding. If you want to feed `unicode` to something that expects bytes and not unicode characters you have to encode again.

> This is incredibly ugly IMHO, as I would really like the parser to just accept UTF-8 input.

It accepts UTF-8 input but not `unicode` objects.

> However I am sure you will agree that explicit encoding conversions are cumbersome and error-prone.

But implicit conversions are impossible because the interpreter doesn't know which encoding to use and refuses to guess. Implicit and guessed conversions are error prone too.

Ciao,
Marc 'BlackJack' Rintsch
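A minimal sketch of the decode, manipulate, encode cycle described in this message. It is written for Python 3 as an assumption, where `bytes` plays the role of the 2.x `str` and `str` the role of `unicode`; the sample bytes are made up.

```python
# Bytes as they might arrive from the network, here Latin-1 encoded.
raw = b"<p>f\xf6\xf6</p>"

# Decode: bytes -> text. The encoding has to be named, nothing is guessed.
text = raw.decode("latin-1")

# Manipulate as text (unicode characters).
text = text.replace("f\u00f6\u00f6", "b\u00e4r")

# Encode: text -> bytes. The same text yields different bytes per encoding.
out_latin1 = text.encode("latin-1")
out_utf8 = text.encode("utf-8")
```

The text object itself carries no encoding; the choice only appears at the two boundaries where bytes are turned into text and back.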
Python HTML parser chokes on UTF-8 input
Hello group,

I'm trying to use a htmllib.HTMLParser derived class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The code is something like this:

    prs = self.parserclass(formatter.NullFormatter())
    prs.init()
    prs.feed(website)
    self.__result = prs.get()
    prs.close()

Now when I take website directly from the parser, everything is fine. However I want to do some modifications before I parse it, namely UTF-8 modifications in the style:

    website = website.replace(u"föö", u"bär")

Therefore, after fetching the web site content, I have to convert it to UTF-8 first, modify it and convert it back:

    website = website.decode("latin1")
    website = website.replace(u"föö", u"bär")
    website = website.encode("latin1")

This is incredibly ugly IMHO, as I would really like the parser to just accept UTF-8 input. However when I omit the re-encoding to latin1:

      File "CachedWebParser.py", line 13, in __init__
        self.__process(website)
      File "CachedWebParser.py", line 55, in __process
        prs.feed(website)
      File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
        self.goahead(0)
      File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
        self._convert_ref, attrvalue)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)

Annoying, IMHO, that the internal html Parser cannot cope with UTF-8 input - which should (again, IMHO) be the absolute standard for such a new language.

Can I do something about it?

Regards,
Johannes

--
"My countersuit against you will then be for deliberate mendacity, slander of God, the Bible and me, and deliberate blasphemy."
    -- Prophet and visionary Hans Joss aka HJP in de.sci.physik
Re: Python HTML parser chokes on UTF-8 input
Johannes Bauer wrote:
> Hello group,
>
> I'm trying to use a htmllib.HTMLParser derived class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The code is something like this:

I believe you are confusing unicode with unicode encoded into bytes with the UTF-8 encoding. You have a problem feeding a unicode string, not 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.

>     prs = self.parserclass(formatter.NullFormatter())
>     prs.init()
>     prs.feed(website)
>     self.__result = prs.get()
>     prs.close()
>
> Now when I take website directly from the parser, everything is fine. However I want to do some modifications before I parse it, namely UTF-8 modifications in the style:
>
>     website = website.replace(u"föö", u"bär")
>
> Therefore, after fetching the web site content, I have to convert it to UTF-8 first, modify it and convert it back:
>
>     website = website.decode("latin1")
# produces unicode
>     website = website.replace(u"föö", u"bär")
# remains unicode
>     website = website.encode("latin1")
# produces byte string in the latin-1 encoding

> This is incredibly ugly IMHO, as I would really like the parser to just accept UTF-8 input.

To me, code that works is prettier than code that does not. In 3.0, text strings are unicode, and I believe that is what the parser now accepts.

> However when I omit the re-encoding to latin1:
>
>       File "CachedWebParser.py", line 13, in __init__
>         self.__process(website)
>       File "CachedWebParser.py", line 55, in __process
>         prs.feed(website)
>       File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
>         self.goahead(0)
>       File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
>         k = self.parse_starttag(i)
>       File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
>         self._convert_ref, attrvalue)
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)

When you do not bother to specify some other encoding in an encoding operation, sgmllib or something deeper in Python tries the default encoding, which does not work. Stop being annoyed and tell the interpreter what you want. It is not a mind-reader.

> Annoying, IMHO, that the internal html Parser cannot cope with UTF-8 input - which should (again, IMHO) be the absolute standard for such a new language.

The first version of Python came out in 1989, I believe, years before unicode. One of the features of the new 3.0 version is that it uses unicode as the standard for text.

Terry Jan Reedy
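The error in the traceback can be reproduced in isolation. In Python 2, mixing a byte string with a `unicode` object makes the interpreter decode the bytes with the default 'ascii' codec, which fails on any byte above 0x7f. The sketch below assumes Python 3, where that implicit step no longer exists, so the ASCII decode is requested explicitly to show the same failure; the sample bytes are made up.

```python
raw = b"\xfcber"  # Latin-1 bytes for "über"; 0xfc is outside the ASCII range

# Decoding with the 'ascii' codec fails on the first non-ASCII byte.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

# Naming the real encoding is what "telling the interpreter what you want" means.
text = raw.decode("latin-1")
```

Here `message` carries the same wording as the last line of the traceback, and the Latin-1 decode succeeds because that codec maps every byte value to a character.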
Re: Python HTML parser chokes on UTF-8 input
Terry Reedy wrote:
> Johannes Bauer wrote:
>> Hello group,
>>
>> I'm trying to use a htmllib.HTMLParser derived class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The code is something like this:
>
> I believe you are confusing unicode with unicode encoded into bytes with the UTF-8 encoding. You have a problem feeding a unicode string, not 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.

I also believe I am. Could you please elaborate further?

Do I understand correctly when saying that type 'str' has no associated default encoding, but type 'unicode' does? Does this mean that really the only way of coping with that stuff is doing what I've been doing?

>> This is incredibly ugly IMHO, as I would really like the parser to just accept UTF-8 input.
>
> To me, code that works is prettier than code that does not. In 3.0, text strings are unicode, and I believe that is what the parser now accepts.

Well, yes, I suppose working code is nicer than non-working code. However I am sure you will agree that explicit encoding conversions are cumbersome and error-prone.

>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)
>
> When you do not bother to specify some other encoding in an encoding operation, sgmllib or something deeper in Python tries the default encoding, which does not work. Stop being annoyed and tell the interpreter what you want. It is not a mind-reader.

How do I tell the interpreter to parse the strings I pass to it as unicode? The way I did or is there some better way?

>> Annoying, IMHO, that the internal html Parser cannot cope with UTF-8 input - which should (again, IMHO) be the absolute standard for such a new language.
>
> The first version of Python came out in 1989, I believe, years before unicode. One of the features of the new 3.0 version is that it uses unicode as the standard for text.

Hmmm. I suppose you're right there. Python 3.0 really sounds quite nice, do you know approximately when it will be ready?

Regards,
Johannes
Re: Python HTML parser chokes on UTF-8 input
On Thu, Oct 9, 2008 at 4:54 PM, Johannes Bauer wrote:
> Hello group,
>
> Now when I take website directly from the parser, everything is fine. However I want to do some modifications before I parse it, namely UTF-8 modifications in the style:
>
>     website = website.replace(u"föö", u"bär")

That's not utf-8, that's unicode. Even if your file is saved as utf-8, you're telling python to convert those from utf-8 encoded bytes to unicode characters, by prefixing them with 'u'.

> Therefore, after fetching the web site content, I have to convert it to UTF-8 first, modify it and convert it back:

You have to convert it to unicode if and only if you are doing manipulation with unicode strings.

>     website = website.decode("latin1")
>     website = website.replace(u"föö", u"bär")
>     website = website.encode("latin1")
>
> This is incredibly ugly IMHO, as I would really like the parser to just accept UTF-8 input. However when I omit the re-encoding to latin1:

You could just use the precise Latin-1 byte strings you'd like to replace:

    website = website.replace("f\xf6\xf6", "b\xe4r")

Or, you could set the encoding of your source file to Latin-1, by putting the following on the first or second line of your source file:

    # -*- coding: Latin-1 -*-

Then use the appropriate literals in your source code, making sure that you save it as Latin-1 in your editor of choice.

Truthfully, though, I think your current approach really is the right one. Decode to unicode character strings as soon as they come into your program, manipulate them as unicode, then select your preferred encoding when you write them back out. It's explicit, and only takes two lines of code.

--
Jerry
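The two options in this message can be compared side by side. The sketch assumes Python 3 (`bytes` literals instead of 2.x `str`) and made-up sample data; it also shows why the byte-level replacement silently does nothing when the input turns out to be in a different encoding.

```python
raw = b"<p>f\xf6\xf6</p>"  # Latin-1 encoded input

# Option 1: replace the exact Latin-1 byte sequences, no decoding at all.
by_bytes = raw.replace(b"f\xf6\xf6", b"b\xe4r")

# Option 2: decode on the way in, replace as text, encode on the way out.
by_text = (
    raw.decode("latin-1")
    .replace("f\u00f6\u00f6", "b\u00e4r")
    .encode("latin-1")
)

# The byte-level replacement depends on the input encoding: the same text
# encoded as UTF-8 contains different bytes, so nothing is replaced.
utf8_raw = "<p>f\u00f6\u00f6</p>".encode("utf-8")
unchanged = utf8_raw.replace(b"f\xf6\xf6", b"b\xe4r")
```

Both options give the same bytes for Latin-1 input. The decode-first version keeps working for other encodings once the right codec name is passed to `decode`, which is one argument for the "decode early, encode late" approach.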
Re: Python HTML parser chokes on UTF-8 input
Johannes Bauer wrote:
> Terry Reedy wrote:
>> Johannes Bauer wrote:
>>> Hello group,
>>>
>>> I'm trying to use a htmllib.HTMLParser derived class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The code is something like this:
>>
>> I believe you are confusing unicode with unicode encoded into bytes with the UTF-8 encoding. You have a problem feeding a unicode string, not 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.
>
> I also believe I am. Could you please elaborate further?

I am a unicode neophyte. My source of info is the first 3 or so chapters of the unicode specification.
http://www.unicode.org/versions/Unicode5.1.0/
I recommend that or other sites for other questions. It took me more than one reading of the same topics in different texts to pretty well 'get it'.

> Do I understand correctly when saying that type 'str' has no associated default encoding, but type 'unicode' does?

I am not sure what you mean. Unicode strings in Python are internally stored in UCS-2 or UCS-4 format.

> Does this mean that really the only way of coping with that stuff is doing what I've been doing?

Having two text types in 2.x was necessary as a transition strategy but has also been something of a mess. You did it one way. Jerry gave you an alternative that I could not have explained. Your choice. Or use 3.0.

..

> Hmmm. I suppose you're right there. Python 3.0 really sounds quite nice, do you know approximately when it will be ready?

For my current purposes, it is ready enough. Developers *really* hope to get 3.0 final out by mid-December. The schedule was pushed back because a) the outside world has not completely and cleanly switched to unicode text and b) some people who just started with the release candidate have found import bugs that earlier testers did not. It still needs more testing from more different users (hint, hint).

Terry Jan Reedy
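For the "or use 3.0" route, a small sketch of feeding decoded text to a parser. It assumes Python 3 and uses `html.parser.HTMLParser` as a stand-in, since `htmllib` is not part of Python 3; the `TextCollector` class and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical example subclass that collects the text nodes it sees."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


raw = b"<html><body><p>f\xf6\xf6</p></body></html>"  # Latin-1 bytes

# Decode once at the boundary, then work with text only.
text = raw.decode("latin-1").replace("f\u00f6\u00f6", "b\u00e4r")

parser = TextCollector()
parser.feed(text)  # feed() takes text (str), not bytes
parser.close()
collected = "".join(parser.chunks)
```

No re-encoding is needed before parsing, because the parser works on text directly; an encoding is chosen again only when the result is written out as bytes.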