Hi,

(1) A custom parse filter can extract the content of <div class="finalstorytext3"> (or any other HTML element). The filter then has to replace the plain text in the ParseResult it returns with the extracted text.
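To illustrate the extraction step only, here is a minimal sketch using nothing but the JDK's DOM classes. The class and method names (DivTextExtractor, findDivByClass, extractText) are made up for this example and are not part of Nutch. The plugin wiring (the filter interface to implement, plugin.xml, and how the text is written back into the parse) is deliberately left out because it differs between Nutch 1.x and 2.x. The sketch assumes the filter is handed the parsed page as a DOM node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class: finds a <div> by CSS class and returns its text.
public class DivTextExtractor {

    /** Depth-first search for the first div whose class attribute contains cssClass. */
    static Element findDivByClass(Node node, String cssClass) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // HTML parsers may report element names in upper case, so compare case-insensitively.
            if ("div".equalsIgnoreCase(el.getNodeName())) {
                // getAttribute returns "" when the attribute is absent.
                for (String c : el.getAttribute("class").trim().split("\\s+")) {
                    if (c.equals(cssClass)) {
                        return el;
                    }
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            Element hit = findDivByClass(child, cssClass);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    /** Returns the whitespace-normalized text of the matching div, or null if there is none. */
    static String extractText(Node root, String cssClass) {
        Element div = findDivByClass(root, cssClass);
        if (div == null) {
            return null;
        }
        return div.getTextContent().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        // Tiny well-formed sample page; a real filter would receive the DOM from the HTML parser.
        String xhtml = "<html><body><div class=\"nav\">Home | News</div>"
                + "<div class=\"finalstorytext3\"><p>Housing prices rose.</p></div></body></html>";
        Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        System.out.println(extractText(root, "finalstorytext3"));
    }
}
```

Inside a real filter, the string returned by extractText would be what replaces the page's plain text. If no matching div is found (null), keeping the original text is probably the safer default.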
(2) That's a redirect; the target is successfully fetched and parsed. You may have to adapt the URL filters so that links to an "external" host are followed:

% bin/nutch parsechecker "http://timesofindia.indiatimes.com/sports/cricket/domestic-cricket/ranji-trophy/Tendulkars-half-ton-keeps-Mumbai-hopes-alive/articleshow/24880440.cms"
fetching: http://timesofindia.indiatimes.com/sports/cricket/domestic-cricket/ranji-trophy/Tendulkars-half-ton-keeps-Mumbai-hopes-alive/articleshow/24880440.cms
Fetch failed with protocol status: moved(12), lastModified=0: http://articles.timesofindia.indiatimes.com/2013-10-29/ranji-trophy/43494559_1_sachin-tendulkar-west-indies-lahli

% bin/nutch parsechecker "http://articles.timesofindia.indiatimes.com/2013-10-29/ranji-trophy/43494559_1_sachin-tendulkar-west-indies-lahli"
...

Sebastian

2013/10/29 Tej Kumar Ilindra <[email protected]>

> Hi,
>
> I have couple of queries.
>
> *Environment:*
> Nutch 2.2.1, Hbase 0.90.4
>
> 1) Requirement:
> I want to crawl only the content of all the articles in any choosen
> website.
>
> Scenario:
> consider the article below for example,
> http://news.outlookindia.com/items.aspx?artid=815315
>
> When i crawl the above url, I can see the data got updated and storing to
> hbase under one column family.
>
> Problem:
> It is crawling all the text from the webpage including tab names, headings
> and other.
> But,I want to crawl only the content of article starting from "Housing..."
> to "...inflation".
>
>
> 2) How to crawl urls with '.cms' extension.
> sample url:
>
> http://timesofindia.indiatimes.com/sports/cricket/domestic-cricket/ranji-trophy/Tendulkars-half-ton-keeps-Mumbai-hopes-alive/articleshow/24880440.cms
>
>
> 3) As suggested by Talat earlier(
> http://www.mail-archive.com/[email protected]/msg11008.html),
> we can do crawling based on regular intervals of time using cron job.
>
> Requirement Changed:
> Whenever an update is avaialble in the website, need to stream (crawl) the
> data to hbase.
>
> Could anyone please help me with the above requirements.
>
> --
> Regards,
> Tej Ilindra
> +91- 9962569369
> [Always do what you are afraid to do. -Ralph Waldo Emerson]

