If I don't give a crawlId to my Nutch jobs, they create and populate the
'webpage' table, and my solrIndexJob also executes successfully.

But what if I want to crawl only a specific site? For that I need to give
a new crawlId to my Nutch jobs, and when I do, a new 'crawlId_webpage'
table is created and the solrIndexJob fails, as it doesn't know how and
from where to get its documents.

So how do you guys do id-based crawling with HBase?
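
For reference, this is roughly the sequence I run ('C23' and the Solr URL
are placeholders, and I am only guessing that solrindex wants the same
-crawlId flag, or maybe the storage.crawl.id property in nutch-site.xml):

    bin/nutch inject urls -crawlId C23
    bin/nutch generate -topN 1000 -crawlId C23
    bin/nutch fetch -all -crawlId C23
    bin/nutch parse -all -crawlId C23
    bin/nutch updatedb -crawlId C23
    # the jobs above write to 'C23_webpage', but this reads 'webpage'
    # and adds no documents:
    bin/nutch solrindex http://localhost:8983/solr -all
    # presumably it needs the same id?
    bin/nutch solrindex http://localhost:8983/solr -all -crawlId C23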

Thanks,
Tony

On Mon, Jun 24, 2013 at 4:36 PM, Tony Mullins <[email protected]> wrote:

> Ok, I have crawled again. And on checking my HBase with "scan 'C23_webpage',
> {COLUMNS => ['p:c']}" I can see the parsed text, but in my ParseFilter
> plugin I still get null from page.getText().
>
> But the disturbing thing for me is: why is there a blank 'webpage' table,
> and why do I get a new 'crawlId_webpage' table for every new crawlId?
> And my SolrIndex job was working fine with Cassandra, but it's not working
> with HBase now. Is this due to the different table structure in HBase, and
> how should I solve it?
>
> Thanks,
> Tony
>
>
> On Mon, Jun 24, 2013 at 2:39 PM, Tony Mullins <[email protected]> wrote:
>
>> Hi,
>>
>> I have successfully set up Nutch 2.x with hbase-0.90.6 and my jobs are
>> running fine. But there is one issue for which I need your help.
>>
>> Earlier I was using Cassandra with Nutch 2.x, and the data from all my
>> jobs used to go to the 'webpage' keyspace. But in the case of hbase-0.90.6
>> I can see that two tables are created: one is 'webpage', which always has
>> 0 rows, and the other is 'crawlId_webpage', which has some data.
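>>
>> To make it concrete, this is roughly what I see in the hbase shell (with
>> 'C23' as an example crawlId):
>>
>>   hbase(main)> list
>>   TABLE
>>   webpage
>>   C23_webpage
>>   hbase(main)> count 'webpage'        # always 0 row(s)
>>   hbase(main)> count 'C23_webpage'    # returns some rows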
>>
>> But when I run my solrIndexJob, no documents are added, and I think this
>> is due to the fact that there is no parsed text present in the
>> 'crawlId_webpage' table for my crawled pages.
>>
>> I can also verify this in my ParseFilter plugin: when I do Utf8 text =
>> page.getText(); my text is always null, and that's why I think the
>> solrIndexJob is not inserting any docs into Solr.
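>>
>> To illustrate, here is a trimmed-down sketch of that check (the class name
>> is my own, not from Nutch, and the logging is simplified):
>>
>>   import java.util.Collection;
>>
>>   import org.apache.avro.util.Utf8;
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.nutch.parse.HTMLMetaTags;
>>   import org.apache.nutch.parse.Parse;
>>   import org.apache.nutch.parse.ParseFilter;
>>   import org.apache.nutch.storage.WebPage;
>>   import org.w3c.dom.DocumentFragment;
>>
>>   public class MyParseFilter implements ParseFilter {
>>
>>     private Configuration conf;
>>
>>     @Override
>>     public Parse filter(String url, WebPage page, Parse parse,
>>         HTMLMetaTags metaTags, DocumentFragment doc) {
>>       // Check whether the parsed text has been stored on the page yet.
>>       Utf8 text = page.getText();
>>       if (text == null) {
>>         System.out.println("page.getText() is null for " + url); // always hit
>>       }
>>       return parse;
>>     }
>>
>>     @Override
>>     public Collection<WebPage.Field> getFields() {
>>       return null; // this check needs no extra storage fields
>>     }
>>
>>     @Override
>>     public void setConf(Configuration conf) { this.conf = conf; }
>>
>>     @Override
>>     public Configuration getConf() { return conf; }
>>   }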
>>
>> So what should I do here? Why am I not getting any text in the HBase
>> table? And why are two tables created, 'webpage' & 'crawlId_webpage'?
>>
>> Thanks, guys, for the help & support.
>>
>> Tony.
>>
>
>
