I have seen two threads with a similar issue:
http://www.mail-archive.com/[email protected]/msg09639.html
http://www.mail-archive.com/[email protected]/msg09251.html
But I already have all of the patches mentioned in those threads, and my crawl jobs [ Inject -> Generate -> Fetch -> Parse -> DBUpdate ] all work fine against the crawlId_webpage table. The issue is with my SolrIndexJob: it is not finding any documents to send to Solr.

I think either there should be some way to tell the indexing job which webpage table (crawlId_webpage) to read, or there should always be a single 'webpage' table even when the user supplies a crawlId (as with the Cassandra backend).

So what should I do now to run the complete cycle of Nutch 2.x jobs and get my documents into Solr? The command sequence I am running is sketched below the quoted message.

Thanks,
Tony

On Mon, Jun 24, 2013 at 10:46 PM, Tony Mullins <[email protected]> wrote:

> Hi,
>
> (I am starting a new thread as the previous thread's topic was a little
> misleading.)
>
> I am crawling with Nutch 2.x and HBase 0.90.6. If I run my Nutch jobs
> with a crawlId, HBase creates two tables: 'webpage', which always has
> 0 rows, and a second 'crawlId_webpage' table. In this scenario no
> document is inserted into Solr by my SolrIndexJob.
>
> If I run my Nutch jobs without a crawlId, only the 'webpage' table is
> created, it holds all the crawl data, and the SolrIndexJob inserts
> documents into Solr successfully.
>
> So my question is: how do I do id-based crawling with HBase?
>
> Also, the bin/crawl script always expects a crawlId, so in that case
> won't HBase create a new webpage table every time for each new crawlId?
>
> Is there any configuration I need to tweak so that HBase always
> creates or inserts into only one pre-created 'webpage' table?
>
> Thanks,
> Tony
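For reference, this is roughly the sequence I run. It is only a sketch: the exact option names and arguments can differ between Nutch 2.x releases (check each job's usage output), and "TestCrawl", urls/, and the Solr URL are just placeholders for my real values.

# Sketch of my crawl cycle; the -crawlId / -all flags assume a
# Nutch 2.x build whose jobs accept them.
bin/nutch inject urls/ -crawlId TestCrawl
bin/nutch generate -topN 1000 -crawlId TestCrawl
bin/nutch fetch -all -crawlId TestCrawl
bin/nutch parse -all -crawlId TestCrawl
bin/nutch updatedb -crawlId TestCrawl

# My understanding is that the indexing job would need the same id,
# otherwise it reads the default (and in my case empty) 'webpage'
# table and finds nothing to send to Solr:
bin/nutch solrindex http://localhost:8983/solr/ -all -crawlId TestCrawl

Is passing the crawlId to the indexing step like this the intended way to do it, or is there some other configuration I am missing?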

