Hi, I have a couple of queries.
*Environment:* Nutch 2.2.1, HBase 0.90.4

1) Requirement: I want to crawl only the article content of any chosen website.

Scenario: Consider the article below as an example:
http://news.outlookindia.com/items.aspx?artid=815315
When I crawl the above URL, I can see that the data is fetched and stored in HBase under one column family.

Problem: The crawl picks up all the text on the web page, including tab names, headings and other text. I want to crawl only the content of the article, starting from "Housing..." and ending at "...inflation".

2) How can I crawl URLs with a '.cms' extension?
Sample URL:
http://timesofindia.indiatimes.com/sports/cricket/domestic-cricket/ranji-trophy/Tendulkars-half-ton-keeps-Mumbai-hopes-alive/articleshow/24880440.cms

3) As suggested by Talat earlier (http://www.mail-archive.com/[email protected]/msg11008.html), we can crawl at regular intervals using a cron job.

Requirement changed: Whenever an update is available on the website, the data needs to be streamed (crawled) into HBase.

Could anyone please help me with the above requirements?

--
Regards,
Tej Ilindra
+91- 9962569369
[Always do what you are afraid to do. -Ralph Waldo Emerson]

