Hi,

Are those filters applied to all the data read from HBase, or are they sent to HBase as filters so that only a subset of the HBase records is selected?
Thanks.
Alex.

-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage

Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:

>
> The generator also does not have filters. Its mapper goes over all
> records as far as I know. If you use hadoop you can see how many records go
> as input to mappers. Also see this
>

I don't think this is true. The GeneratorMapper filters URLs before selecting them for inclusion based on the following
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
I am going to start a new thread on improvements to the GeneratorJob regarding better configuration as it is a crucial stage in the crawl process.

So the issue here, as you correctly explain, is with the Fetcher obtaining the URLs which have been marked with a desired batchId. This would be done via scanner in Gora.
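
To make the ordering described above concrete, here is a minimal, self-contained Java sketch of a per-record check that applies distance, then a normalizer, then a filter. It is not Nutch code: the class GeneratorFilterSketch, the Page type, the select and generate methods, and the toy normalizer and filter are all hypothetical names chosen for illustration. The point it tries to show is that checks of this kind run on each record after that record has already been handed to the mapper, so on their own they would not reduce what is read from the store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

public class GeneratorFilterSketch {

    /** Hypothetical stand-in for one stored row: a URL and its link distance. */
    static final class Page {
        final String url;
        final int distance;

        Page(String url, int distance) {
            this.url = url;
            this.distance = distance;
        }
    }

    /**
     * Applies the three checks in the order named in the thread:
     * distance, then normalization, then URL filtering.
     * Returns the normalized URL if the record passes all of them.
     */
    static Optional<String> select(Page page, int maxDistance,
                                   UnaryOperator<String> normalizer,
                                   Predicate<String> urlFilter) {
        // 1. Distance: drop records that are too far from the seeds.
        if (page.distance > maxDistance) {
            return Optional.empty();
        }
        // 2. Normalization: rewrite the URL into a canonical form.
        String normalized = normalizer.apply(page.url);
        if (normalized == null) {
            return Optional.empty();
        }
        // 3. Filtering: accept or reject the normalized URL.
        return urlFilter.test(normalized) ? Optional.of(normalized) : Optional.empty();
    }

    /** Every input record is visited; the checks only decide what is emitted. */
    static List<String> generate(List<Page> pages, int maxDistance,
                                 UnaryOperator<String> normalizer,
                                 Predicate<String> urlFilter) {
        List<String> selected = new ArrayList<>();
        for (Page page : pages) {
            select(page, maxDistance, normalizer, urlFilter).ifPresent(selected::add);
        }
        return selected;
    }

    /** Small demo with a toy normalizer and a toy filter (both assumptions). */
    static List<String> demo() {
        List<Page> pages = new ArrayList<>();
        pages.add(new Page("HTTP://Example.org/a#top", 1));
        pages.add(new Page("http://example.org/deep", 5));
        pages.add(new Page("http://other.net/x", 0));

        // Toy normalizer: lowercase the whole URL and strip any fragment.
        UnaryOperator<String> normalizer = url -> {
            String lower = url.toLowerCase(Locale.ROOT);
            int hash = lower.indexOf('#');
            return hash >= 0 ? lower.substring(0, hash) : lower;
        };
        // Toy filter: keep only one host.
        Predicate<String> urlFilter = url -> url.startsWith("http://example.org/");

        return generate(pages, 3, normalizer, urlFilter);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch all three records are read and only one is emitted. Restricting which records are read in the first place (for example by batchId) would have to be expressed in the query or scanner sent to the store rather than in the mapper.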

