Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:

>
> The generator also does not have filters. Its mapper goes over all
> records, as far as I know. If you use Hadoop, you can see how many
> records go as input to the mappers. Also see this
>

I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion, applying the following checks in this order:
- distance
- URLNormalizer(s)
- URLFilter(s)
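
As a rough illustration of that ordering, here is a minimal, self-contained
sketch. It is not the actual GeneratorMapper code: the class, the Candidate
type, and the select method are hypothetical, the normalizer and filter are
plain Java functions standing in for URLNormalizer(s) and URLFilter(s), and
treating a negative maxDistance as "no limit" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the per-record checks; not Nutch's real classes.
class GeneratorFilterSketch {

    // A candidate record: its URL and its link distance from the seeds.
    static final class Candidate {
        final String url;
        final int distance;

        Candidate(String url, int distance) {
            this.url = url;
            this.distance = distance;
        }
    }

    // Applies the three checks in order: distance, normalizer, filter.
    static List<String> select(List<Candidate> candidates,
                               int maxDistance,
                               UnaryOperator<String> normalizer,
                               Predicate<String> filter) {
        List<String> selected = new ArrayList<>();
        for (Candidate c : candidates) {
            // 1. distance (assumption: a negative limit disables the check)
            if (maxDistance >= 0 && c.distance > maxDistance) {
                continue;
            }
            // 2. normalization; a null result means the URL is dropped
            String url = normalizer.apply(c.url);
            if (url == null) {
                continue;
            }
            // 3. filtering runs on the already normalized URL
            if (!filter.test(url)) {
                continue;
            }
            selected.add(url);
        }
        return selected;
    }
}
```

Because the filter runs on the normalized form, the order matters: a URL
that only matches a filter rule after normalization is still handled
consistently.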
I am going to start a new thread on improving the configuration of the
GeneratorJob, as it is a crucial stage in the crawl process.

So the issue here, as you correctly explain, is how the Fetcher obtains
the URLs that have been marked with the desired batchId. This would be
done via a scanner in Gora.
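
To make the cost concrete, here is a small sketch of selecting rows by batch
mark. It deliberately does not use Gora's API: an in-memory map stands in
for the table, and the class and method names are hypothetical. The point is
only that every row is visited and compared against the batchId, even though
few may match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; a real implementation would scan the datastore.
class BatchScanSketch {

    // generateMarks maps URL -> batch mark (null if the row was never marked).
    static List<String> urlsForBatch(Map<String, String> generateMarks,
                                     String batchId) {
        List<String> result = new ArrayList<>();
        // Full scan: each row is read, and non-matching rows are discarded.
        for (Map.Entry<String, String> row : generateMarks.entrySet()) {
            if (batchId.equals(row.getValue())) {
                result.add(row.getKey());
            }
        }
        return result;
    }
}
```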
