Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:
>
> The generator also does not have filters. Its mapper goes over all
> records as far as I know. If you use hadoop you can see how many records go
> as input to mappers. Also see this
>

I don't think this is true. The GeneratorMapper filters URLs before selecting
them for inclusion, based on the following:

- distance
- URLNormalizer(s)
- URLFilter(s)

in that order. I am going to start a new thread on improvements to the
GeneratorJob regarding better configuration, as it is a crucial stage in the
crawl process.

So the issue here, as you correctly explain, is with the Fetcher obtaining the
URLs which have been marked with a desired batchId. This would be done via a
scanner in Gora.
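
To make the ordering concrete, here is a rough, standalone sketch of the three checks mentioned above (distance, then normalization, then filtering). It is not the actual GeneratorMapper code and uses no Nutch classes: GeneratorFilterSketch, Candidate, select and the maxDistance parameter are hypothetical names made up for illustration, and the "null means rejected" filter convention and "-1 disables the distance check" are assumptions of this sketch.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; not taken from the Nutch code base.
public class GeneratorFilterSketch {

    // A candidate URL plus its link distance from the seed set (assumed field).
    public static final class Candidate {
        final String url;
        final int distance;

        public Candidate(String url, int distance) {
            this.url = url;
            this.distance = distance;
        }
    }

    // Applies the checks in the order described: distance, normalizer, filter.
    // Returns the (normalized) URL if it survives all three, otherwise empty.
    public static Optional<String> select(Candidate c,
                                          int maxDistance,
                                          UnaryOperator<String> normalizer,
                                          Function<String, String> filter) {
        // 1. Distance: assumed convention that -1 disables the check.
        if (maxDistance > -1 && c.distance > maxDistance) {
            return Optional.empty();
        }
        // 2. Normalization: rewrite the URL into a canonical form.
        String normalized = normalizer.apply(c.url);
        // 3. Filtering: assumed convention that a null result means "rejected".
        return Optional.ofNullable(filter.apply(normalized));
    }

    public static void main(String[] args) {
        UnaryOperator<String> lowerCase = u -> u.toLowerCase(Locale.ROOT);
        Function<String, String> noPdf = u -> u.endsWith(".pdf") ? null : u;

        System.out.println(select(new Candidate("http://Example.com/A", 1), 2, lowerCase, noPdf));
        System.out.println(select(new Candidate("http://example.com/deep", 5), 2, lowerCase, noPdf));
        System.out.println(select(new Candidate("http://example.com/Doc.PDF", 0), 2, lowerCase, noPdf));
    }
}
```

Note that because normalization runs before filtering in this sketch, the filter sees the normalized URL, so the order of the two steps can change which URLs are kept.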

