On 13/06/12 10:06, Surya K wrote:

As my target webpage could be any page on the web, and each website's
designers could have designed it in their own fashion using different
"class names", I am unable to figure out how to read the "article
content" in a webpage.

This is always the problem with scraping webpages: you are dependent on how the individual author structures their pages, and if they change the format it will likely break your scraper. Also, some web sites deliberately make their pages hard to scrape, for example by changing the div/class names arbitrarily. This is to encourage you to use their web site and see the beautiful adverts they have on display, which pay for the service.

So, can anyone tell me which libraries I should ultimately use to
achieve this, and which elements and attributes I should read?

The most basic are urllib, urllib2 and httplib, all in the standard library.
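
As a rough illustration of the fetching step, here is a minimal sketch using only the standard library. It assumes Python 3, where urllib2 and httplib were reorganised into urllib.request and http.client. The function name fetch_html, the User-Agent string and the 10 second timeout are made up for this example, not anything the question specified.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download url and return the response body decoded to text."""
    # Hypothetical User-Agent value; some sites refuse the default Python one.
    request = urllib.request.Request(url, headers={"User-Agent": "tutor-example/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8 (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Typical use at the >>> prompt (needs network access):
# html = fetch_html("http://www.example.com/")
# print(html[:200])
```

This only gets you the raw HTML as a string; picking the article text out of it is a separate step.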

I thought of using BeautifulSoup, but I really don't know which
elements (div, p or a) I should read.

BS is good, but you will need to know which tags you are interested in.
Usually the low-level <p> tags are inside a <div>, so you can locate the <div> and fetch only the <p>s from that section. But you will need to do some digging and probably some trial and error. My advice is to put the code in a module or class and use the >>> prompt to experiment.
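
To make the "find the <div>, then take only its <p>s" idea concrete, here is a small sketch. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without installing anything; with BeautifulSoup the same idea is usually written with find() on the <div> followed by find_all("p"). The class name "article-body", the sample HTML and the names DivParagraphs and paragraphs_in_div are all invented for illustration; a real page will use whatever names its designer chose.

```python
from html.parser import HTMLParser


class DivParagraphs(HTMLParser):
    """Collect the text of each <p> found inside a <div> with a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.div_depth = 0   # greater than 0 while inside a matching <div>
        self.in_p = False    # True while inside a <p> we want to keep
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Count nested <div>s too, so we know when the outer one closes.
            if self.div_depth or self.target_class in classes:
                self.div_depth += 1
        elif tag == "p" and self.div_depth:
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
            if self.div_depth == 0:
                self.in_p = False
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def paragraphs_in_div(html, target_class):
    """Return the stripped, non-empty <p> texts inside matching <div>s."""
    parser = DivParagraphs(target_class)
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs if p.strip()]


# Invented sample page, standing in for whatever was downloaded.
SAMPLE = """
<div class="nav"><p>Home</p></div>
<div class="post article-body">
  <p>First paragraph.</p>
  <div class="quote"><p>Nested quote.</p></div>
  <p>Second <b>bold</b> paragraph.</p>
</div>
<p>Footer</p>
"""

for text in paragraphs_in_div(SAMPLE, "article-body"):
    print(text)
```

The hard part is still finding out which class name a given site uses for its article <div>. That is where the digging at the >>> prompt comes in, and the answer will differ from site to site.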

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/


