On Tue, Jun 5, 2012 at 11:22 PM, Stefan Behnel <[email protected]> wrote:
> You can do this:
>
> connection = urllib2.urlopen(url)
> tree = etree.parse(connection, my_html_parser)
>
> Alternatively, use fromstring() to parse from strings:
>
> page = urllib2.urlopen(url)
> pagecontents = page.read()
> html_root = etree.fromstring(pagecontents, my_html_parser)
>

Thank you! fromstring() did the trick for me.

Interestingly, your first suggestion - parsing straight from the connection without an intermediate read() - appears to create the tree successfully, but my first strip_tags() fails, with the error "ValueError: Input object has no document: lxml.etree._ElementTree". Since fromstring() works just fine, I will set this aside as a mystery for my copious free time (after this project is done, for example.)

> See the lxml tutorial.

I did - I've been consulting it religiously - but I missed the fact that I was mixing strings with file-like IO, and (as you mentioned) the error message really wasn't helping me figure out my problem. Perhaps I should have figured it out from the fact that the character value and position change, even though the webpage doesn't... but no.

> Also note that there's lxml.html, which provides an
> extended tool set for HTML processing.

I've been using lxml.etree because I'm used to the syntax, and because (perhaps mistakenly) I was under the impression that its parser was more resilient in the face of broken HTML - this page has unclosed tags all over the place. I'll try lxml.html, but (again) it'll have to be some time later.
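For anyone following along, here is a minimal sketch of the two parsing routes discussed above, side by side. It assumes lxml is installed; PAGE_BYTES and the io.BytesIO object are hypothetical stand-ins for the real page and the urlopen() connection, so nothing is fetched from the network. The one difference it shows is that parse() hands back an _ElementTree while fromstring() hands back the root element. Calling strip_tags() on tree.getroot() is only something to try; I have not confirmed that it explains or avoids the ValueError quoted above.

```python
import io
from lxml import etree

# Hypothetical stand-in for the downloaded page: HTML with unclosed tags.
PAGE_BYTES = b"<html><body><p>Hello <b>bold <i>world</p><p>Second</body></html>"

# HTMLParser recovers from broken markup instead of raising.
my_html_parser = etree.HTMLParser()

# Route 1: parse from a file-like object. parse() returns an _ElementTree,
# so take the root element explicitly with getroot().
connection = io.BytesIO(PAGE_BYTES)  # plays the role of urlopen(url)
tree = etree.parse(connection, my_html_parser)
root_from_tree = tree.getroot()

# Route 2: parse from a string. fromstring() returns the root element itself.
html_root = etree.fromstring(PAGE_BYTES, my_html_parser)

# Remove the <b> and <i> tags but keep their text, on both roots.
etree.strip_tags(root_from_tree, "b", "i")
etree.strip_tags(html_root, "b", "i")

print(etree.tostring(root_from_tree))
print(etree.tostring(html_root))
```

On Python 3, urllib2.urlopen() lives at urllib.request.urlopen(); the lxml calls are the same.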
_______________________________________________
Tutor maillist - [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
