Re: Crawling entire website using Nutch 2.2.1 for every 2 hours

Talat UYARER Wed, 23 Oct 2013 23:02:41 -0700

Hi Tej,

You can do this in several ways. Your question has two parts.

The first part of your question is about the fetch interval and re-fetching pages when they have changed. You should set the interval with db.fetch.interval.default (a value in seconds) in your nutch-site.xml. By default, Nutch checks whether pages have been modified using the HTTP protocol. I think that covers your first requirement.
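
As a rough sketch of that setting, assuming a 2-hour interval is wanted (7200 seconds; this value is my assumption based on the question, not a recommendation), the property would go inside the <configuration> element of nutch-site.xml:

```xml
<!-- nutch-site.xml fragment: default re-fetch interval, in seconds -->
<!-- 7200 = 2 hours; value assumed from the question, adjust as needed -->
<property>
  <name>db.fetch.interval.default</name>
  <value>7200</value>
</property>
```

Pages become due for fetching again only once this interval has elapsed, so the crawl still has to be run on a schedule for the setting to have any effect.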

The second part of your question is about running Nutch on a schedule. You can do this with crontab or an Oozie workflow. For now I will explain the crontab way. You can put your Nutch crawl shell script in your crontab like this:

0 */2 * * * $NUTCH_HOME/runtime/deploy/bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

The disadvantage of the crontab way is that crontab doesn't check the status of your previous job. Sometimes a job may take longer than the scheduled interval, and crontab doesn't give you any information about your job's status.
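
One possible workaround for the overlap problem is a small wrapper script that refuses to start while the previous run still holds a lock, and that logs each run's exit status. This is only a sketch: the script, the function name run_exclusive, and the LOCK_DIR and LOG_FILE paths are hypothetical and are not part of Nutch.

```shell
#!/bin/sh
# Hypothetical wrapper (not shipped with Nutch): run a command only if the
# previous run has finished, and record its exit status in a log file.

# Assumed locations; override them in the environment or edit as needed.
LOCK_DIR="${LOCK_DIR:-/tmp/nutch-crawl.lock}"
LOG_FILE="${LOG_FILE:-/tmp/nutch-crawl.log}"

run_exclusive() {
    # mkdir is atomic: it fails if the directory already exists,
    # which means an earlier run still holds the lock.
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "$(date) previous run still active, skipping" >> "$LOG_FILE"
        return 75
    fi

    # Run the given command, sending its output to the log.
    status=0
    "$@" >> "$LOG_FILE" 2>&1 || status=$?

    # Release the lock and record how the job ended.
    rmdir "$LOCK_DIR"
    echo "$(date) finished with exit status $status" >> "$LOG_FILE"
    return "$status"
}

# When called with arguments (for example from crontab), run them under the lock.
if [ "$#" -gt 0 ]; then
    run_exclusive "$@"
fi
```

If this were saved as, say, /path/to/run_exclusive.sh (a made-up path), the crontab entry would call the wrapper and pass the crawl command as its arguments, with the placeholders replaced by real values. Note that if the wrapper is killed before it reaches rmdir, the lock directory stays behind and has to be removed by hand.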

I think the better way to schedule this is with Oozie, but I can't explain it now. I will write a document about that.


Talat




On 24-10-2013 05:35, Tej Kumar Ilindra wrote:

Hi,

I am using Nutch 2.2.1 with HBase 0.90.4 to crawl and store the data in
HBase.

As of now, data is crawled from the website based on the URLs provided
in seed.txt.

*To Do:*
I would like to write a program to crawl the entire website and, every 2
hours, check the website for updates. If anything is new, it
should be crawled.

Can anyone suggest how to do this?
