Re: Crawling specific content from url; .cms extension is not supporting; Crawl website dynamically when there is an update

Sebastian Nagel Wed, 30 Oct 2013 06:27:35 -0700

Hi,

(1) A custom parse filter can be used to extract the content of <div
class="finalstorytext3"> (or any other HTML element). The filter then has to
replace the plain text in the returned ParseResult with the extracted text.
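
The DOM walk such a filter needs could look roughly like the sketch below. It
uses only the JDK's org.w3c.dom API and deliberately leaves out the Nutch
plugin wiring, because the filter interface differs between Nutch versions.
The class name DivTextExtractor and its methods are hypothetical helpers made
up for illustration, not part of Nutch, and the sample markup in main() is
invented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not a Nutch class): finds one element by tag and class. */
public class DivTextExtractor {

    /** Depth-first search for the first element with the given tag and class token. */
    public static Element findByClass(Node node, String tagName, String className) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // HTML parsers may report tag names in upper case, so ignore case here.
            if (el.getTagName().equalsIgnoreCase(tagName)) {
                // The class attribute is a whitespace-separated list of tokens.
                for (String token : el.getAttribute("class").trim().split("\\s+")) {
                    if (token.equals(className)) {
                        return el;
                    }
                }
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Element found = findByClass(children.item(i), tagName, className);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Returns the whitespace-normalized text of the matching element, or null. */
    public static String extractText(Node root, String tagName, String className) {
        Element el = findByClass(root, tagName, className);
        if (el == null) {
            return null;
        }
        return el.getTextContent().trim().replaceAll("\\s+", " ");
    }

    /** Parses well-formed XHTML for the demo; a real filter receives a DOM already. */
    public static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // Invented sample page: a navigation block plus the article body.
        Document doc = parse("<html><body><div class=\"nav\">Home | Sports</div>"
                + "<div class=\"finalstorytext3\">Article <b>text</b> only</div>"
                + "</body></html>");
        System.out.println(extractText(doc, "div", "finalstorytext3"));
    }
}
```

Inside an actual parse filter, the string returned by extractText() would be
the text that replaces the default plain text of the parsed page.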


(2) That's a redirect; the redirect target is fetched and parsed
successfully. The URL filters may have to be adapted so that links to an
"external" host are followed:

% bin/nutch parsechecker "http://timesofindia.indiatimes.com/sports/cricket/domestic-cricket/ranji-trophy/Tendulkars-half-ton-keeps-Mumbai-hopes-alive/articleshow/24880440.cms"
fetching: http://timesofindia.indiatimes.com/sports/cricket/domestic-cricket/ranji-trophy/Tendulkars-half-ton-keeps-Mumbai-hopes-alive/articleshow/24880440.cms
Fetch failed with protocol status: moved(12), lastModified=0: http://articles.timesofindia.indiatimes.com/2013-10-29/ranji-trophy/43494559_1_sachin-tendulkar-west-indies-lahli
% bin/nutch parsechecker "http://articles.timesofindia.indiatimes.com/2013-10-29/ranji-trophy/43494559_1_sachin-tendulkar-west-indies-lahli"
...
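
To check whether a given set of URL filter rules would let the redirect
target through, a small stand-alone sketch like the following can help. It
only imitates the documented behaviour of regex-urlfilter.txt (lines start
with + or -, the first matching pattern decides, a URL matching no pattern is
ignored); it is not Nutch's own filter class. UrlFilterSketch and both sample
rules are assumptions made up for illustration, since the rules actually in
use were not posted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in that imitates first-match-wins +/- regex rules. */
public class UrlFilterSketch {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Each line is "+regex" (accept) or "-regex" (reject); '#' starts a comment. */
    public UrlFilterSketch(String... lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** The first rule whose pattern is found in the URL decides; no match means reject. */
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String target = "http://articles.timesofindia.indiatimes.com/2013-10-29/"
                + "ranji-trophy/43494559_1_sachin-tendulkar-west-indies-lahli";

        // Assumed rule that admits only the originally seeded host.
        UrlFilterSketch narrow =
                new UrlFilterSketch("+^http://timesofindia\\.indiatimes\\.com/");
        // Assumed rule widened to any host under indiatimes.com.
        UrlFilterSketch wide =
                new UrlFilterSketch("+^http://([a-z0-9-]+\\.)*indiatimes\\.com/");

        System.out.println("narrow rule accepts target: " + narrow.accepts(target));
        System.out.println("wide rule accepts target:   " + wide.accepts(target));
    }
}
```

With the narrow rule the "articles." host is filtered out, so the redirect
target would never be fetched; widening the host part of the pattern is one
way to adapt the filter.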

Sebastian


2013/10/29 Tej Kumar Ilindra <[email protected]>

> Hi,
>
> I have a couple of queries.
>
> *Environment:*
> Nutch 2.2.1,  Hbase 0.90.4
>
> 1) Requirement:
> I want to crawl only the content of all the articles in any chosen
> website.
>
> Scenario:
> consider the article below for example,
> http://news.outlookindia.com/items.aspx?artid=815315
>
> When I crawl the above URL, I can see that the data is updated and stored
> in HBase under one column family.
>
> Problem:
> It crawls all the text from the web page, including tab names, headings
> and other elements.
> But I want to crawl only the content of the article, from "Housing..."
> to "...inflation".
>
>
> 2) How can I crawl URLs with a '.cms' extension?
> Sample URL:
>
> http://timesofindia.indiatimes.com/sports/cricket/domestic-cricket/ranji-trophy/Tendulkars-half-ton-keeps-Mumbai-hopes-alive/articleshow/24880440.cms
>
>
> 3) As suggested by Talat earlier
> (http://www.mail-archive.com/[email protected]/msg11008.html),
> we can crawl at regular intervals using a cron job.
>
> Requirement Changed:
> Whenever an update is available on the website, the data needs to be
> streamed (crawled) to HBase.
>
> Could anyone please help me with the above requirements.
>
> --
> Regards,
> Tej Ilindra
> +91- 9962569369
> [Always do what you are afraid to do. -Ralph Waldo Emerson]
>
