Hi,

I need some help with Nutch re-crawling, especially because I am using
Nutch 2.2.1 with HBase and I could not find much information anywhere on
how a re-crawl can be performed when the urls are stored in HBase.

*Background*: I have crawled various domains, and each time I crawled an
individual domain I assigned it a specific HBase table name in the crawl
command, as shown below.

E.g.

Crawl 1: Url: www.*abc*.com

Crawl command used when it was first crawled:
bin/crawl urls *abc_webpage* http://localhost:8983/solr/ 6

-------------------------------------------------------------------------------------------------------------
 Crawl 2: Url: www.*def*.com


Crawl command used when it was first crawled:
bin/crawl urls *def_webpage* http://localhost:8983/solr/ 10

------------------------------------------------------------------------------------------------------------
 Crawl 3: Url: www.*ghi*.com

Crawl command used when it was first crawled:
bin/crawl urls *ghi_webpage* http://localhost:8983/solr/ 3
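Just to make my setup concrete, the three invocations above can be driven from one small loop (the table ids and round counts are taken straight from the examples; `echo` keeps it a dry run that only prints the commands):

```shell
# Pairs of "<hbase-table-id> <rounds>" from the three crawls above
for spec in "abc_webpage 6" "def_webpage 10" "ghi_webpage 3"; do
  set -- $spec                      # $1 = table id, $2 = rounds
  # echo keeps this a dry run; drop it to actually launch the crawls
  echo bin/crawl urls "$1" http://localhost:8983/solr/ "$2"
done
```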

*Question:*
It is now time to refetch/re-crawl all of those urls, since several
updates have been made to them since they were last crawled. I will be
using DefaultFetchSchedule, and I have updated the fetch interval
(db.fetch.interval.default) in nutch-site.xml accordingly.
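For reference, and assuming the standard db.fetch.interval.default property is the right one here, the relevant part of my nutch-site.xml looks roughly like this (value in seconds; 7 days is just an example, the stock default is 30 days):

```xml
<!-- Sketch: default re-fetch interval; 604800 seconds = 7 days -->
<property>
  <name>db.fetch.interval.default</name>
  <value>604800</value>
</property>
```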

My question is: which Nutch command should I use to *re-crawl* the three
URLs from the example above?

Thanks for any help!
