Re: [Tutor] How to read websites - Web Scraping or Parsing in python

Alan Gauld Wed, 13 Jun 2012 03:11:56 -0700

On 13/06/12 10:06, Surya K wrote:

As my target webpage could be any page on the web, and each website's
designers could have structured it in their own fashion using different
"class names", I am unable to figure out how to read the "article content"
in a webpage.

This is always the problem with scraping webpages: you are dependent on how the individual author structures their pages. And if they change the format it will likely break your scraper. Also, some web sites implement devices to deliberately make it hard to scrape the pages, such as changing the div/class names arbitrarily. This is to encourage you to use their web site and see the beautiful adverts they have on display and that pay for the service.

So, can anyone tell me which libraries I should ultimately use to achieve
this, and which elements and attributes I should read?


The most basic are urllib, urllib2 and httplib in the standard library.
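
As a rough illustration of fetching a page with the standard library alone, here is a minimal sketch. It assumes Python 3, where urllib2 was folded into urllib.request and httplib was renamed http.client. The function name fetch_html, the 10 second timeout and the UTF-8 fallback are my own assumptions, not anything from the original question.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the document at `url` as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset the server declares, if any.
        # Falling back to UTF-8 is an assumption, not a rule.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes rather than failing on a badly encoded page.
    return raw.decode(charset, errors="replace")
```

At the >>> prompt you would call fetch_html with the address of the page you want and then inspect the returned string. This only gets you the raw HTML; picking the article out of it is a separate step.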

I thought of using BeautifulSoup but I really don't know which elements
(div, p or a) I should read.


BS is good but you will need to know which tags you are interested in.

Usually the low level <p> tags are inside a <div>, so you can locate the <div> and only fetch the <p>'s from that section. But you will need to do some digging and probably some trial and error. My advice is to put it in a module/class and use the >>> prompt to experiment.
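
To make the "locate the <div>, then fetch its <p>'s" idea concrete, here is a small sketch. Note that it uses html.parser from the standard library as a stand-in, not BeautifulSoup; with BeautifulSoup the same idea is usually written with find and find_all. The names DivParagraphs and paragraphs_in_div, and the class name you pass in, are hypothetical. You would find the real class name by looking at the source of the page you care about.

```python
from html.parser import HTMLParser


class DivParagraphs(HTMLParser):
    """Collect the text of every <p> found inside one chosen <div>."""

    def __init__(self, div_class):
        super().__init__()
        self.div_class = div_class  # hypothetical: found by inspecting the page
        self.div_depth = 0          # > 0 while we are inside the target <div>
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Enter the target <div>, or count a <div> nested inside it.
            if self.div_depth or self.div_class in classes:
                self.div_depth += 1
        elif tag == "p" and self.div_depth:
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
            if self.div_depth == 0:
                self.in_p = False  # left the target <div>
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def paragraphs_in_div(html, div_class):
    """Return the <p> texts inside the first-level <div> with `div_class`."""
    parser = DivParagraphs(div_class)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace and drop empty paragraphs.
    return [" ".join(p.split()) for p in parser.paragraphs if p.strip()]
```

Paste the class into a module, import it at the >>> prompt and try different class names against a saved copy of the page until the output looks like the article text. Expect to adjust it per site, since nothing forces two sites to use the same markup.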


--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/



