I have a question for the group about the copy of Beautiful Soup that is packaged with Twill.
I'm trying to get away from using a regex to pull all of the images out of an HTML page. I figured I would use Beautiful Soup, since it's included with Twill and it's made for parsing HTML, but I'm getting some seriously weird results. Basically, if I try to do something like:

    >> from twill.commands import *
    >> from twill import get_browser
    >> from BeautifulSoup import BeautifulSoup
    >> u = "http://somedomain.com"
    >> go(u)
    >> p = get_browser().get_html()
    >> soup = BeautifulSoup(p)
    >> soup.findAll('img')
    >> Null

I wasn't sure if I was doing something wrong, so I installed the Beautiful Soup egg and did the following:

    >> from BeautifulSoup import BeautifulSoup
    >> string = """
    ... <html><body><img src="foo.gif"/><img src="bar.jpg"/></body></html>
    ... """
    >> soup = BeautifulSoup(string)
    >> soup.findAll('img')
    >> [<img src="foo.gif" />, <img src="bar.jpg" />]

So I'm not sure if Twill comes with a scaled-back version of BeautifulSoup or if I'm just approaching the problem incorrectly. (If I were a productive member of the OS community I would offer Titus a patch that would just pull all the images in...)

Anyone?

_______________________________________________
twill mailing list
[email protected]
http://lists.idyll.org/listinfo/twill
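
P.S. For comparison, here is a rough regex-free sketch that avoids Beautiful Soup altogether and uses the standard library's HTMLParser instead. It assumes a current Python 3 interpreter rather than the setup shown in the session above, and the names ImageCollector and image_sources are made up for illustration. It does not explain why the Twill-bundled Beautiful Soup finds nothing; it only shows one way to collect image sources from the same test string.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Hypothetical helper: records the src of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value) pairs.
        # Self-closing tags such as <img ... /> are routed here as well by the
        # default handle_startendtag implementation.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def image_sources(html):
    """Return the src values of all <img> tags in an HTML string, in order."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Same markup as the stand-alone Beautiful Soup test above.
page = '<html><body><img src="foo.gif"/><img src="bar.jpg"/></body></html>'
print(image_sources(page))
```

The string passed to image_sources could presumably be whatever get_browser().get_html() returns, though I have not checked that against Twill itself.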
