On Wed, Jul 29, 2009 at 9:59 AM, Raj Medhekar <cosmicsan...@yahoo.com> wrote:
> Does anyone know a good webcrawler that could be used in tandem with the
> Beautiful Soup parser to parse out specific elements from news sites like
> BBC and CNN? Thanks!
> -Raj

I have used httplib2 (http://code.google.com/p/httplib2/) to crawl sites (with auth/cookies) and lxml (HTML XPath) to parse out links, but you could use the built-in urllib2 to request pages if no auth/cookie support is required. Here is a simple example:

import urllib2
from lxml import html

page = urllib2.urlopen("http://this.page.com")
data = html.fromstring(page.read())
all_links = data.xpath("//a")  # all <a> elements on the page
for link in all_links:
    print link.attrib["href"]
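If you would rather avoid third-party packages entirely, the standard library's HTMLParser can pull links out too. A minimal sketch (the try/except import covers both Python 2 and 3, and the HTML string fed in is just a made-up example):

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


parser = LinkExtractor()
parser.feed('<html><body>'
            '<a href="http://example.com/a">A</a>'
            '<a href="http://example.com/b">B</a>'
            '</body></html>')
print(parser.links)
```

You would feed it page.read() instead of a literal string in practice; the trade-off versus lxml is that you get no XPath, just tag-level callbacks.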
_______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor