Hi,

Are those filters applied to all of the data selected from HBase, or are they
sent to HBase as filters so that only a subset of the HBase records is selected?

Thanks.
Alex.


-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:

>
> The generator also does not have filters. Its mapper goes over all
> records as far as I know. If you use hadoop you can see how many records go
> as input to mappers. Also see this
>

I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion, based on the following:
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
I am going to start a new thread on improvements to the GeneratorJob
regarding better configuration as it is a crucial stage in the crawl
process.
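
As a rough sketch of that ordering (illustrative only: the class, method and
parameter names below are hypothetical and are not Nutch's actual
GeneratorMapper code, and treating a negative maximum distance as "no limit"
is an assumption), the three checks could be chained like this:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical illustration of the check order described above.
// None of these names come from the Nutch code base.
final class GeneratorFilterSketch {

    /**
     * Returns the normalized URL if it passes every check, or null if it is
     * rejected. Checks run in the order: distance, normalizers, filters.
     */
    static String select(String url, int distance, int maxDistance,
                         List<UnaryOperator<String>> normalizers,
                         List<Predicate<String>> filters) {
        // 1. Distance: reject URLs that are too far from the seeds.
        //    Assumption: a negative maxDistance means "no limit".
        if (maxDistance >= 0 && distance > maxDistance) {
            return null;
        }
        // 2. Normalizers: each one may rewrite the URL or reject it (null).
        String current = url;
        for (UnaryOperator<String> normalizer : normalizers) {
            current = normalizer.apply(current);
            if (current == null) {
                return null;
            }
        }
        // 3. Filters: every filter must accept the normalized URL.
        for (Predicate<String> filter : filters) {
            if (!filter.test(current)) {
                return null;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        UnaryOperator<String> lowerCase = u -> u.toLowerCase();
        Predicate<String> sameSite = u -> u.startsWith("http://example.org/");
        List<UnaryOperator<String>> normalizers = Arrays.asList(lowerCase);
        List<Predicate<String>> filters = Arrays.asList(sameSite);

        // Accepted: within distance, normalized, then matched by the filter.
        System.out.println(select("HTTP://EXAMPLE.ORG/a", 1, 3, normalizers, filters));
        // Rejected by the distance check before any normalizer runs.
        System.out.println(select("http://example.org/b", 5, 3, normalizers, filters));
        // Rejected by the URL filter after normalization.
        System.out.println(select("http://other.example/c", 1, 3, normalizers, filters));
    }
}
```

Note that a check like this still runs inside the mapper, so every record is
read first and then accepted or rejected.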

So the issue here, as you correctly explain, is with how the Fetcher obtains
the URLs that have been marked with the desired batchId. This would be done
via a scanner in Gora.
