Re: [Tutor] The dreaded UnicodeDecodeError... why, why, why does it still want ascii?

Marc Tompkins Wed, 06 Jun 2012 01:23:16 -0700

On Tue, Jun 5, 2012 at 11:22 PM, Stefan Behnel <[email protected]> wrote:


> You can do this:
>
>    connection = urllib2.urlopen(url)
>    tree = etree.parse(connection, my_html_parser)
>
> Alternatively, use fromstring() to parse from strings:
>
>    page = urllib2.urlopen(url)
>    pagecontents = page.read()
>    html_root = etree.fromstring(pagecontents, my_html_parser)
>
>
Thank you!  fromstring() did the trick for me.

Interestingly, your first suggestion - parsing straight from the connection
without an intermediate read() - appears to create the tree successfully,
but my first strip_tags() fails, with the error "ValueError: Input object
has no document: lxml.etree._ElementTree".  Since fromstring() works just
fine, I will set this aside as a mystery for my copious free time (after
this project is done, for example).
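
For reference, here is a minimal sketch of how the two calls differ, assuming
lxml is installed.  It uses an in-memory io.BytesIO object in place of the
urllib2 connection and a made-up snippet of HTML, and it does not try to
reproduce or explain the ValueError above.  It only shows that parse()
returns an _ElementTree while fromstring() returns the root element, and
that getroot() gets from one to the other.

```python
import io

from lxml import etree

# Made-up sample markup; stands in for the bytes returned by page.read().
html_bytes = b"<html><body><p>Hello <b>bold</b> world</p></body></html>"

# fromstring() takes the markup itself and returns the root element.
root = etree.fromstring(html_bytes, etree.HTMLParser())

# parse() takes a filename or file-like object and returns an _ElementTree.
# io.BytesIO is an assumption here, used instead of a live connection.
tree = etree.parse(io.BytesIO(html_bytes), etree.HTMLParser())
tree_root = tree.getroot()  # the same kind of object fromstring() returns

# strip_tags() removes the <b> tags but keeps their text content.
etree.strip_tags(root, "b")
etree.strip_tags(tree_root, "b")

text_a = etree.tostring(root, method="text", encoding="unicode")
text_b = etree.tostring(tree_root, method="text", encoding="unicode")
print(text_a)
print(text_b)
```

Working on the element from getroot() keeps the rest of the code identical
whichever way the document was loaded.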



> See the lxml tutorial.

I did - I've been consulting it religiously - but I missed the fact that I
was mixing strings with file-like IO, and (as you mentioned) the error
message really wasn't helping me figure out my problem.  Perhaps I should
have figured it out from the fact that the character value and position
change, even though the webpage doesn't... but no.
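
The string versus file-like distinction can be sketched with the standard
library's xml.etree.ElementTree, used here only as a stand-in because it
follows the same parse()/fromstring() split; the sample markup is made up,
and lxml's exact error in this situation may differ.  Handing the document
text to parse() makes it treat that text as a filename.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample document containing a non-ASCII character (UTF-8 encoded).
markup = b"<root><item>caf\xc3\xa9</item></root>"

# fromstring() expects the document contents.
root_from_string = ET.fromstring(markup)

# parse() expects a filename or a file-like object.
root_from_file = ET.parse(io.BytesIO(markup)).getroot()

# Passing the contents to parse() is the mix-up: the text is treated as a
# path to open, so the failure is about a file, not about the markup.
try:
    ET.parse(markup.decode("utf-8"))
    parse_error = None
except OSError as exc:
    parse_error = type(exc).__name__

print(root_from_string[0].text, root_from_file[0].text, parse_error)
```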


> Also note that there's lxml.html, which provides an
> extended tool set for HTML processing.
>

I've been using lxml.etree because I'm used to the syntax, and because
(perhaps mistakenly) I was under the impression that its parser was more
resilient in the face of broken HTML - this page has unclosed tags all over
the place.  I'll try lxml.html, but (again) it'll have to be some time
later.
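
A quick way to check that impression, sketched under the assumption that
lxml is installed: lxml.html's HTMLParser is a subclass of etree.HTMLParser,
so the two should recover from unclosed tags in much the same way.  The
broken markup below is made up for illustration.

```python
from lxml import etree, html

# Made-up markup with an unclosed <p> and an unclosed <b>.
broken = b"<html><body><p>first<p>second <b>bold</body></html>"

# Parse the same bytes with lxml.etree's HTML parser and with lxml.html.
root_etree = etree.fromstring(broken, etree.HTMLParser())
root_html = html.fromstring(broken)

# Compare the element structure each parser recovered.
tags_etree = [el.tag for el in root_etree.iter()]
tags_html = [el.tag for el in root_html.iter()]
print(tags_etree)
print(tags_html)
print(tags_etree == tags_html)
```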

