I have seen two threads with a similar issue:
http://www.mail-archive.com/[email protected]/msg09639.html
http://www.mail-archive.com/[email protected]/msg09251.html
But I already have all of the patches mentioned in those threads, and my crawl jobs [ Inject -> Generate -> Fetch -> Parse -> DBUpdate ] all work fine against the crawlId_webpage table. The issue is with my SolrIndexJob: it is not finding any documents to send to Solr.

I think either there should be some way to tell the indexing job which webpage table (crawlId_webpage) to read, or there should always be a single 'webpage' table even when the user supplies a crawlId (as with the Cassandra backend).

So what should I do now to run the complete cycle of Nutch 2.x jobs and get my documents into Solr? The command sequence I am running is sketched below the quoted message.

Thanks,
Tony

On Mon, Jun 24, 2013 at 10:46 PM, Tony Mullins <[email protected]> wrote:

> Hi,
>
> (I am starting a new thread as the previous thread's topic was a little
> misleading.)
>
> I am crawling with Nutch 2.x and HBase 0.90.6. If I run my Nutch jobs
> with a crawlId, HBase creates two tables: 'webpage', which always has
> 0 rows, and a second 'crawlId_webpage' table. In this scenario no
> document is inserted into Solr by my SolrIndexJob.
>
> If I run my Nutch jobs without a crawlId, only the 'webpage' table is
> created, it holds all the crawl data, and the SolrIndexJob inserts
> documents into Solr successfully.
>
> So my question is: how do I do id-based crawling with HBase?
>
> Also, the bin/crawl script always expects a crawlId, so in that case
> won't HBase create a new webpage table every time for each new crawlId?
>
> Is there any configuration I need to tweak so that HBase always
> creates or inserts into only one pre-created 'webpage' table?
>
> Thanks,
> Tony
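For reference, this is roughly the sequence I run. It is only a sketch: the exact option names and arguments can differ between Nutch 2.x releases (check each job's usage output), and "TestCrawl", urls/, and the Solr URL are just placeholders for my real values.

# Sketch of my crawl cycle; the -crawlId / -all flags assume a
# Nutch 2.x build whose jobs accept them.
bin/nutch inject urls/ -crawlId TestCrawl
bin/nutch generate -topN 1000 -crawlId TestCrawl
bin/nutch fetch -all -crawlId TestCrawl
bin/nutch parse -all -crawlId TestCrawl
bin/nutch updatedb -crawlId TestCrawl

# My understanding is that the indexing job would need the same id,
# otherwise it reads the default (and in my case empty) 'webpage'
# table and finds nothing to send to Solr:
bin/nutch solrindex http://localhost:8983/solr/ -all -crawlId TestCrawl

Is passing the crawlId to the indexing step like this the intended way to do it, or is there some other configuration I am missing?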

