Id based crawling with nutch2.x/hbase and multiple webpage tables

Tony Mullins Mon, 24 Jun 2013 10:47:54 -0700

Hi,

(I am starting a new thread because the previous thread's topic was a
little misleading.)


I am crawling with Nutch 2.x and HBase 0.90.6. If I run my Nutch jobs
with a crawlId, HBase creates two tables: 'webpage', which always has
0 rows, and a second table, 'crawlId_webpage'. In this scenario my
SolrIndexJob does not insert any documents into Solr.

If I run my Nutch jobs without a crawlId, only the 'webpage' table is
created, it holds all the crawl data, and SolrIndexJob inserts documents
into Solr successfully.

So my question is: how do I do Id-based crawling with HBase?

Also, the bin/crawl script always expects a crawlId, so when I run it,
will HBase create a new webpage table for every new crawlId?

Is there any configuration I need to tweak to make HBase always create
or insert into a single pre-created 'webpage' table?

Thanks,
Tony
