Hi,

I am trying to write a Python program that reads the content of any webpage. Taking a blog as an example, I'd like to read all the content written by the author. Each blog or site has its own kind of HTML/XML, whether it runs on Blogger, WordPress, TypePad or anything else, so I thought of using their RSS/Atom feeds to extract the content. I used Universal Feed Parser for that:

    import feedparser

    url = "http://knolzone.com/feed"
    parsedFeed = feedparser.parse(url)
    websiteTitle = parsedFeed.feed.title

    # Print the link and summary of every article in the feed.
    for aArticle in parsedFeed.entries:
        print(aArticle.link)
        print(aArticle.summary)

With this I was able to find the links of all articles and their summaries, along with the website's title. But I'd like to read the whole content of a particular article, not just the summary.

For example, take the webpage http://knolzone.com/unlock-hidden-themes-in-windows-7-and-other-useful-tips-part-5-of-7/. The author has written an article on it, and I'd like to read just that portion of the page. Since my target could be any page on the web, and each site's designers lay out their pages in their own fashion with different "class names", I am unable to figure out how to read the "article content" of a webpage.

So, can anyone tell me which libraries I should use to achieve this, and which elements and attributes I should read? I thought of using BeautifulSoup, but I really don't know which elements (div, p or a) I should read. The webpage above consists of lots of "p" elements, and fortunately all the article content is in "p" elements. However, there are a few other "p" elements that don't belong to the article content. How should I eliminate those?
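
To make the "p" elements problem concrete, here is a rough sketch of the only idea I have so far: group the "p" elements by the element that directly contains them, and guess that the group with the most text is the article. It is only a heuristic and I have no idea how well it holds up across real sites. It is written for Python 3 and uses the standard library's html.parser instead of BeautifulSoup just to stay self-contained; the names ParagraphGrouper and guess_article_text are made up by me.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ParagraphGrouper(HTMLParser):
    """Collect the text of every <p>, grouped by the element containing it."""

    def __init__(self):
        super().__init__()
        self.stack = []             # (tag, unique id) for each open element
        self.next_id = 0
        self.groups = {}            # parent id -> list of paragraph strings
        self.current = None         # text pieces of the open <p>, or None
        self.current_parent = None  # id of the element holding the open <p>

    def _flush(self):
        # Store the finished paragraph with its whitespace normalised.
        text = " ".join("".join(self.current).split())
        if text:
            self.groups.setdefault(self.current_parent, []).append(text)
        self.current = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if tag == "br" and self.current is not None:
                self.current.append(" ")  # treat a line break as a space
            return
        if tag == "p":
            if self.current is not None:
                self.handle_endtag("p")   # previous <p> was never closed
            self.current = []
            self.current_parent = self.stack[-1][1] if self.stack else None
        self.stack.append((tag, self.next_id))
        self.next_id += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Pop back to the matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        # If no <p> is open any more, the paragraph being read is complete.
        if self.current is not None and all(t != "p" for t, _ in self.stack):
            self._flush()

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)


def guess_article_text(html):
    """Return the paragraphs of the container holding the most <p> text."""
    parser = ParagraphGrouper()
    parser.feed(html)
    parser.close()
    if parser.current is not None:
        parser.handle_endtag("p")  # document ended inside an unclosed <p>
    if not parser.groups:
        return ""
    best = max(parser.groups.values(),
               key=lambda paragraphs: sum(len(p) for p in paragraphs))
    return "\n\n".join(best)
```

The HTML string passed to guess_article_text would be the downloaded page source, for example fetched with urllib.request.urlopen. I suspect this breaks on pages where the comments or the sidebar contain more text than the article itself, which is part of why I am asking whether a better approach exists.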
Thanks for reading. I hope you can help.