Hi,

I am trying to write a Python program that reads the content of any webpage.
Taking a blog as an example, I'd like to read everything the author has
written in a post.
Each blog or site uses different HTML/XML, whether it is Blogger, WordPress,
TypePad, or anything else, so I thought of using their RSS/Atom feeds to
extract the content.
I then used Universal Feed Parser to do that:

    import feedparser

    url = "http://knolzone.com/feed"
    parsedFeed = feedparser.parse(url)
    websiteTitle = parsedFeed.feed.title

    # Print the link and summary of every entry in the feed
    for aArticle in parsedFeed.entries:
        print(aArticle.link)
        print(aArticle.summary)

I was able to get the links of all the articles and their summaries, along
with the website's title. But I'd like to read the whole content of a
particular article, not just the summary.
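
One thing that may be worth checking first: some feeds (WordPress ones in
particular) carry the full post body in a content:encoded element next to the
summary. Whether the feed above does so is an assumption. The sketch below
uses only the standard library rather than Universal Feed Parser, and the
function name full_bodies is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, which defines <content:encoded>.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def full_bodies(rss_text):
    """Return (link, body) pairs for every <item> in an RSS 2.0 document.

    The body is the <content:encoded> text when the feed provides it,
    otherwise it falls back to the <description> (the summary).
    """
    root = ET.fromstring(rss_text)
    pairs = []
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        body = item.findtext(CONTENT_NS + "encoded")
        if body is None:
            # No full content in this entry, so keep the summary instead.
            body = item.findtext("description", default="")
        pairs.append((link, body))
    return pairs
```

If the feed only ships summaries, this returns the same text as before, and
the article page itself still has to be parsed.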
For example, take this webpage:
http://knolzone.com/unlock-hidden-themes-in-windows-7-and-other-useful-tips-part-5-of-7/
The author has written an article on it, and I'd like to read just that
portion of the page.
As my target could be any page on the web, and each website's designers may
have built it in their own fashion using different "class names", I am unable
to figure out how to read the "article content" of a webpage.
So, can anyone tell me which libraries I should use to achieve this, and
which elements and attributes I should read?
I thought of using BeautifulSoup, but I really don't know which elements
(div, p, or a) I should read. The webpage above consists of lots of "p"
elements, and fortunately all the article content is in "p" elements.
However, there are a few other "p" elements that don't belong to the article
content. In that case, how should I eliminate them?
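
A rough sketch of one heuristic that might help here: group every "p" by its
nearest enclosing container and keep the group with the most text, on the
assumption that the article body holds more paragraph text than sidebars or
footers. It uses the standard library's html.parser instead of BeautifulSoup
so that it runs without extra installs. The names CONTAINERS,
ParagraphCollector and guess_article_paragraphs are made up for illustration,
and the choice of container tags is a guess, not a standard.

```python
from html.parser import HTMLParser

# Tags treated as possible article containers (an assumption).
CONTAINERS = {"div", "article", "section", "td", "body"}


class ParagraphCollector(HTMLParser):
    """Groups the text of each <p> under its nearest enclosing container."""

    def __init__(self):
        super().__init__()
        self.open_containers = []  # ids of the containers currently open
        self.next_id = 0
        self.in_p = False
        self.buffer = []
        self.groups = {}           # container id -> list of paragraph texts

    def _flush(self):
        # Store the paragraph collected so far, with whitespace normalised.
        if self.in_p:
            text = " ".join("".join(self.buffer).split())
            if text:
                owner = self.open_containers[-1] if self.open_containers else -1
                self.groups.setdefault(owner, []).append(text)
        self.in_p = False
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()          # a new <p> implicitly closes the previous one
            self.in_p = True
        elif tag in CONTAINERS:
            self._flush()
            self.open_containers.append(self.next_id)
            self.next_id += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag in CONTAINERS:
            self._flush()
            if self.open_containers:
                self.open_containers.pop()

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)


def guess_article_paragraphs(html):
    """Return the paragraphs of the container holding the most <p> text."""
    collector = ParagraphCollector()
    collector.feed(html)
    if not collector.groups:
        return []
    return max(collector.groups.values(),
               key=lambda paragraphs: sum(len(p) for p in paragraphs))
```

This will guess wrong on pages where comments or a long sidebar outweigh the
article, so it is only a starting point.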


Thanks for reading. I hope you can help.


                                          
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
