If I don't give any crawlId to my Nutch jobs, then they create and populate the 'webpage' table, and my solrIndexJob also executes successfully.
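For reference, my plain (no crawlId) sequence looks roughly like this; the urls/ seed directory, the topN value and the Solr URL are just placeholders from my local setup:

    # default run: all jobs read/write the plain 'webpage' table
    bin/nutch inject urls/            # urls/ holds my seed list
    bin/nutch generate -topN 1000
    bin/nutch fetch -all              # fetch every generated batch
    bin/nutch parse -all
    bin/nutch updatedb
    bin/nutch solrindex http://localhost:8983/solr -all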
But what if I want to crawl only a specific site? For that I need to give a new crawlId to my Nutch jobs, and when I give a crawlId it creates a new 'crawlId_webpage' table and the solrIndexJob fails, as it doesn't know where to get its documents from. So how do you guys do id-based crawling with HBase?

Thanks,
Tony

On Mon, Jun 24, 2013 at 4:36 PM, Tony Mullins <[email protected]> wrote:

> Ok, I have crawled again. And on checking my HBase with "scan
> 'C23_webpage', {COLUMNS => ['p:c']}" I can see the parsed text, but in my
> ParseFilter plugin I still get null from page.getText().
>
> But the disturbing thing for me is: why is there a blank 'webpage' table,
> and why do I get a new 'crawlId_webpage' table for every new crawlId?
> And my SolrIndex job was working fine with Cassandra, but it's not working
> with HBase now. Is this due to the different table structure in HBase, and
> how should I solve it?
>
> Thanks,
> Tony
>
>
> On Mon, Jun 24, 2013 at 2:39 PM, Tony Mullins <[email protected]> wrote:
>
>> Hi,
>>
>> I have successfully set up Nutch 2.x with hbase-0.90.6 and my jobs are
>> running fine. But there is one issue for which I need your help.
>>
>> Earlier I was using Cassandra with Nutch 2.x and the data from all my
>> jobs went into the 'webpage' keyspace. But with hbase-0.90.6 I can see
>> there are 2 tables created: one is 'webpage', which always has 0 rows,
>> and the other is 'crawlId_webpage', which has some data.
>>
>> But when I run my solrIndexJob, no documents are added, and I think this
>> is due to the fact that there is no parsed text present in the
>> 'crawlId_webpage' table for my crawled pages.
>>
>> I can also verify this in my ParseFilter plugin: when I do Utf8 text =
>> page.getText(); my text is always null, and that's why I think the
>> solrIndexJob is not inserting any docs into Solr.
>>
>> So what should I do here? Why am I not getting any text in the HBase
>> table? And why are there two tables created, 'webpage' & 'crawlId_webpage'?
>>
>> Thanks guys for the help & support.
>>
>> Tony.
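P.S. From the usage strings, every 2.x job (inject, generate, fetch, parse, updatedb, solrindex) seems to accept a -crawlId option, so my guess is that the whole sequence has to carry the same id so that the solrindex job knows to read from 'crawlId_webpage' instead of 'webpage'. Roughly like this (exact arguments vary a bit between 2.x releases, and 'C23' is just my id):

    # every job reads/writes 'C23_webpage' when -crawlId C23 is passed
    bin/nutch inject urls/ -crawlId C23
    bin/nutch generate -topN 1000 -crawlId C23
    bin/nutch fetch -all -crawlId C23
    bin/nutch parse -all -crawlId C23
    bin/nutch updatedb -crawlId C23
    # presumably the indexer also needs the id, otherwise it looks in the
    # empty default 'webpage' table
    bin/nutch solrindex http://localhost:8983/solr -all -crawlId C23

Is that the right way to do it, or am I missing something?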

