Hi,
I am trying to write a Python program that reads the content of any webpage.
Taking a blog as an example, I'd like to read all the content its author has
written in it.
Each blog or site has its own kind of HTML/XML, whether it's Blogger,
WordPress, TypePad or anything else, so I thought of using their RSS/Atom
feeds to extract the content.
I used Universal Feed Parser for that:
    import feedparser

    url = "http://knolzone.com/feed"
    parsedFeed = feedparser.parse(url)
    websiteTitle = parsedFeed.feed.title
    for aArticle in parsedFeed.entries:
        print(aArticle.link)
        print(aArticle.summary)
I was able to get the links of all the articles and their summaries, along
with the website's title. But I'd like to read the whole content of a
particular article, not just the summary.
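One thing I may try first: as far as I understand, when a feed carries the
full post body (Atom <content> or RSS <content:encoded>), feedparser exposes
it as entry.content, a list of dicts with a "value" key. A rough sketch of a
fallback, where entry_body is my own hypothetical helper name:

```python
def entry_body(entry):
    """Return the full post body if the feed carries one, else the summary."""
    # Assumption: full content, when present, is a list of dicts,
    # each holding the markup under the "value" key.
    contents = entry.get("content") or []
    for item in contents:
        value = item.get("value")
        if value:
            return value
    # Fall back to the summary (or an empty string if that is missing too).
    return entry.get("summary", "")
```

In the loop above this would be called as entry_body(aArticle). It does not
help when the feed publishes summaries only, which is the case I am stuck on.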
Take, for example, the webpage
http://knolzone.com/unlock-hidden-themes-in-windows-7-and-other-useful-tips-part-5-of-7/
The author has written an article on it, and that is the portion of the
webpage I'd like to read.
As my target could be any page on the web, and each site's designers have laid
out their pages in their own fashion with different "class names", I am unable
to figure out how to read the "article content" of a webpage.
So, can anyone tell me which libraries I should use to achieve this, and which
elements and attributes I should read?
I thought of using BeautifulSoup, but I really don't know which elements (div,
p or a) I should read. The webpage given above consists of lots of "p"
elements, and fortunately all the article content is in "p" elements. However,
there are a few other "p" elements which don't belong to the article content.
In that case, how should I eliminate them?
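To make the question concrete, here is one rough idea: group the "p" elements
by the element that directly contains them and keep the group with the most
text, on the assumption that the article body is the largest such group. This
is only a minimal sketch using the standard library's html.parser. The names
ParagraphGrouper and guess_article_paragraphs are my own, and the heuristic
will surely fail on some sites (for example, pages with very long comment
threads).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ParagraphGrouper(HTMLParser):
    """Collect the text of every <p>, grouped by its containing element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []      # (tag, unique id) for each open element
        self.next_id = 0
        self.groups = {}     # parent id -> list of paragraph strings
        self.current = None  # text pieces of the <p> being read, if any

    def _close_paragraph(self):
        if self.current is None:
            return
        # Pop everything up to and including the open <p>.
        while self.stack:
            tag, _ = self.stack.pop()
            if tag == "p":
                break
        parent_id = self.stack[-1][1] if self.stack else -1
        text = " ".join("".join(self.current).split())  # collapse whitespace
        if text:
            self.groups.setdefault(parent_id, []).append(text)
        self.current = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "p":
            self._close_paragraph()  # a new <p> implicitly ends an open one
            self.current = []
        self.stack.append((tag, self.next_id))
        self.next_id += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self._close_paragraph()
            return
        open_tags = [t for t, _ in self.stack]
        if tag not in open_tags:
            return  # stray end tag, ignore it
        idx = len(open_tags) - 1 - open_tags[::-1].index(tag)
        if self.current is not None and "p" in open_tags[idx:]:
            self._close_paragraph()  # the <p> was never closed explicitly
        del self.stack[idx:]

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)


def guess_article_paragraphs(html):
    """Return the paragraphs of the container holding the most <p> text."""
    parser = ParagraphGrouper()
    parser.feed(html)
    parser.close()
    parser._close_paragraph()  # flush a trailing, unclosed <p>
    if not parser.groups:
        return []
    return max(parser.groups.values(), key=lambda ps: sum(len(p) for p in ps))
```

The page source would still have to be downloaded first (for instance with
urllib) and passed in as a string. Is something along these lines reasonable,
or is there a library that already does this properly?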
Thanks for reading. I hope you can help.
_______________________________________________
Tutor maillist - [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor