Crawling specific content from URL; .cms extension is not supported; crawl website dynamically when there is an update

Tej Kumar Ilindra Tue, 29 Oct 2013 11:39:34 -0700

Hi,

I have a couple of queries.


*Environment:*
Nutch 2.2.1, HBase 0.90.4

1) Requirement:
I want to crawl only the content of the articles on any chosen website.

Scenario:
Consider the article below, for example:
http://news.outlookindia.com/items.aspx?artid=815315

When I crawl the above URL, I can see that the data gets updated and stored in
HBase under one column family.

Problem:
It is crawling all the text from the web page, including tab names, headings
and other page elements.
But I want to crawl only the content of the article, starting from "Housing..."
to "...inflation".
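
To make the requirement concrete, here is a rough sketch of the kind of post-processing I have in mind. It is plain Python, not Nutch code: `extract_between` and `sample_page` are hypothetical names, and the sample text is made up to stand in for what currently ends up in HBase. In Nutch itself this logic would presumably have to live in a custom parsing plugin.

```python
def extract_between(page_text, start_marker, end_marker):
    """Return the text from start_marker through end_marker, or None if either is missing."""
    start = page_text.find(start_marker)
    if start == -1:
        return None
    # Look for the end marker only after the start marker.
    end = page_text.find(end_marker, start + len(start_marker))
    if end == -1:
        return None
    return page_text[start:end + len(end_marker)]


# Made-up page text: navigation, then the article body, then footer links.
sample_page = (
    "Home | News | Sports\n"
    "Housing costs rose this quarter, adding to inflation\n"
    "Related links | Contact"
)

article = extract_between(sample_page, "Housing", "inflation")
print(article)
```

Fixed start and end markers only work for one article, of course; for a whole site the boundaries would more likely come from the page's HTML structure.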


2) How can I crawl URLs with a '.cms' extension?
Sample URL:
http://timesofindia.indiatimes.com/sports/cricket/domestic-cricket/ranji-trophy/Tendulkars-half-ton-keeps-Mumbai-hopes-alive/articleshow/24880440.cms
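
One thing I can check is whether my URL filter rules are rejecting such URLs. Below is a minimal sketch, assuming rules written in the style of Nutch's conf/regex-urlfilter.txt: a '+' or '-' prefix followed by a regular expression, where the first matching rule decides and a URL matching no rule is ignored. The rule list and the `url_accepted` helper are illustrative assumptions, not the rules shipped with Nutch.

```python
import re


def url_accepted(url, rules):
    """Apply '+regex' / '-regex' rules in order; the first matching rule decides."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the URL as ignored.
    return False


# Illustrative rules only (assumed, not copied from a real configuration).
rules = [
    r"-^(file|ftp|mailto):",             # skip non-HTTP schemes
    r"-\.(gif|jpg|png|css|js|zip|exe)$",  # skip some binary/static suffixes
    r"+.",                               # accept everything else
]

cms_url = (
    "http://timesofindia.indiatimes.com/sports/cricket/domestic-cricket/"
    "ranji-trophy/Tendulkars-half-ton-keeps-Mumbai-hopes-alive/"
    "articleshow/24880440.cms"
)
print(url_accepted(cms_url, rules))
```

With rules like these the '.cms' URL is accepted, so if it is still not crawled the cause may lie elsewhere (for example a suffix filter or the site's robots.txt), but that is a guess on my part.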


3) As suggested by Talat earlier
(http://www.mail-archive.com/[email protected]/msg11008.html),
we can crawl at regular intervals using a cron job.

Requirement changed:
Whenever an update is available on the website, I need to stream (crawl) the
data to HBase.
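
As a rough illustration of what I mean by "an update", here is a sketch that compares a digest of freshly fetched content with the digest remembered from the previous fetch, and reports a change only when they differ. The names `content_digest`, `has_changed` and `stored_digests` are hypothetical, and the in-memory dictionary simply stands in for whatever store (HBase in my case) would hold the previous digest.

```python
import hashlib


def content_digest(content: bytes) -> str:
    """Hex digest used to compare two versions of a page."""
    return hashlib.sha256(content).hexdigest()


def has_changed(url, content, stored_digests):
    """Return True if content differs from the last version seen for url.

    stored_digests maps url -> digest of the previously fetched content
    and is updated in place.
    """
    new_digest = content_digest(content)
    changed = stored_digests.get(url) != new_digest
    stored_digests[url] = new_digest
    return changed


stored_digests = {}
url = "http://news.outlookindia.com/items.aspx?artid=815315"
first = has_changed(url, b"version 1 of the article", stored_digests)
second = has_changed(url, b"version 1 of the article", stored_digests)
third = has_changed(url, b"version 2 of the article", stored_digests)
print(first, second, third)
```

This still needs something to trigger the fetch (a cron job or similar), so it only avoids re-storing unchanged pages; it does not by itself notify me when the site changes.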

Could anyone please help me with the above requirements?

-- 
Regards,
Tej Ilindra
+91- 9962569369
[Always do what you are afraid to do. -Ralph Waldo Emerson]
