Lewis,

Is it possible to crawl with a crawlId but have HBase create only the
'webpage' table, without the crawlId prefix, just like Cassandra does?

My other problems, the DBUpdateJob exception on some random URLs and the
repeating/mixed HTML of all URLs present in seed.txt, have also disappeared
with the HBase backend.

But in my ParseFilter plugin I still get null from page.getText() and '0'
for some other properties, like:

fetchTime:    0
prevFetchTime:    0
fetchInterval:    0
retriesSinceFetch:    0
modifiedTime:    0
prevModifiedTime:    0
protocolStatus:    (null)

Am I supposed to get proper values here, or is this the expected output in a
ParseFilter plugin?
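
For reference, here is a simplified sketch of the kind of filter I am testing
with (illustrative only: the package/class names are made up, and it assumes
the Nutch 2.x ParseFilter interface with filter(url, page, parse, metaTags,
doc) plus getFields()):

package com.example.parse;  // hypothetical package name

import java.util.Collection;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseFilter;
import org.apache.nutch.storage.WebPage;
import org.w3c.dom.DocumentFragment;

public class MyParseFilter implements ParseFilter {

  private Configuration conf;

  @Override
  public Parse filter(String url, WebPage page, Parse parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // These are the calls that currently give me null / 0 with the HBase backend:
    CharSequence text = page.getText();              // null
    Long fetchTime = page.getFetchTime();            // 0
    Long prevFetchTime = page.getPrevFetchTime();    // 0
    Integer fetchInterval = page.getFetchInterval(); // 0
    // parse.getText() does return the extracted text, so I use that for now.
    System.out.println(url + ": text=" + text + ", fetchTime=" + fetchTime
        + ", prevFetchTime=" + prevFetchTime + ", fetchInterval=" + fetchInterval
        + ", parse text length="
        + (parse.getText() == null ? 0 : parse.getText().length()));
    return parse;
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    // Fields this filter asks the job to load from the storage backend.
    // (I am not sure whether missing entries here could explain the 0 / null
    // values above.)
    return EnumSet.of(WebPage.Field.TEXT, WebPage.Field.FETCH_TIME,
        WebPage.Field.PREV_FETCH_TIME, WebPage.Field.FETCH_INTERVAL);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}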

Thanks,
Tony

PS: I am now getting the correct HTML in ParseFilter with the HBase backend.


On Wed, Jun 26, 2013 at 1:13 PM, Tony Mullins <[email protected]> wrote:

> Hi Lewis,
>
> As you said, UpdateDBJob doesn't expect a crawlId, but if I give a crawlId
> like -crawlId c10 then it doesn't create a new 'webpage' table and uses the
> pre-existing 'c10_webpage' table, and the solrindex job also successfully
> inserts the docs into Solr.
>
> I just wonder how a new Nutch user like me could solve these kinds of
> issues without the community's help.
>
> Thanks for the help.
> Tony.
>
>
>
> On Wed, Jun 26, 2013 at 1:10 AM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi Tony,
>>
>> On Tue, Jun 25, 2013 at 1:10 AM, Tony Mullins <[email protected]> wrote:
>>
>> >
>> > So what should I do now to run my complete cycle of Nutch 2.x jobs and
>> > insert my docs into Solr?
>> >
>> >
>> I'm not using HBase as the backend, but I know that, as per the crawl
>> script, updatedb doesn't use the crawlId parameter. Please try adding the
>> parameter and see if it works.
>> Thanks
>> Lewis
>>
>
>
