On Wed, Jul 29, 2009 at 9:59 AM, Raj Medhekar <cosmicsan...@yahoo.com> wrote:
> Does anyone know a good webcrawler that could be used in tandem with the
> Beautiful Soup parser to parse out specific elements from news sites like
> BBC and CNN? Thanks!
> -Raj

I have used httplib2 (http://code.google.com/p/httplib2/) to crawl sites (with auth/cookies) and lxml (HTML XPath) to parse out links, but you could use the built-in urllib2 to request pages if no auth/cookie support is required. Here is a simple example:

import urllib2
from lxml import html

page = urllib2.urlopen("http://this.page.com")
data = html.fromstring(page.read())
all_links = data.xpath("//a")  # all <a> elements on the page
for link in all_links:
    print link.attrib["href"]
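If you would rather avoid third-party packages entirely, the standard library's HTMLParser can pull links out too. A minimal sketch (the try/except import covers both Python 2 and 3, and the HTML string fed in is just a made-up example):

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


parser = LinkExtractor()
parser.feed('<html><body>'
            '<a href="http://example.com/a">A</a>'
            '<a href="http://example.com/b">B</a>'
            '</body></html>')
print(parser.links)
```

You would feed it page.read() instead of a literal string in practice; the trade-off versus lxml is that you get no XPath, just tag-level callbacks.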
_______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor