Hi,

Please head over to the most recent thread on dev@ for potential improvements to the Generator* code.
Thanks for invoking this discussion, it is well overdue.

Lewis

On Wed, Feb 20, 2013 at 12:55 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Alex,
>
> On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:
>
>> The generator also does not have filters. Its mapper goes over all
>> records as far as I know. If you use hadoop you can see how many records go
>> as input to mappers. Also see this
>
> I don't think this is true. The GeneratorMapper filters URLs before
> selecting them for inclusion based on the following
> - distance
> - URLNormalizer(s)
> - URLFilter(s)
> in that order.
> I am going to start a new thread on improvements to the GeneratorJob
> regarding better configuration as it is a crucial stage in the crawl
> process.
>
> So the issue here, as you correctly explain, is with the Fetcher obtaining
> the URLs which have been marked with a desired batchId. This would be done
> via scanner in Gora.

--
*Lewis*
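For readers following the thread, the filtering order described in the quoted reply (distance, then URLNormalizer(s), then URLFilter(s)) can be illustrated with a small sketch. This is not Nutch code: the function `select_for_generation`, the `max_distance` parameter, and the helper functions are hypothetical stand-ins. It also assumes that the distance check means dropping records beyond a maximum distance, which the thread does not spell out.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical stand-ins, not Nutch classes: a normalizer maps a URL to a URL,
# a filter returns the URL to keep it or None to reject it.
Normalizer = Callable[[str], str]
UrlFilter = Callable[[str], Optional[str]]


def select_for_generation(
    records: Iterable[Tuple[str, int]],
    max_distance: int,
    normalizers: List[Normalizer],
    filters: List[UrlFilter],
) -> List[str]:
    """Apply the three checks in the order given in the thread."""
    selected: List[str] = []
    for url, distance in records:
        # 1. Distance check (assumed: skip records beyond max_distance).
        if distance > max_distance:
            continue
        # 2. Run every normalizer, each one rewriting the URL.
        for normalize in normalizers:
            url = normalize(url)
        # 3. Run every filter on the normalized URL; any None rejects it.
        kept: Optional[str] = url
        for url_filter in filters:
            kept = url_filter(kept)
            if kept is None:
                break
        if kept is not None:
            selected.append(kept)
    return selected


def lowercase(url: str) -> str:
    """Toy normalizer: lowercase the whole URL."""
    return url.lower()


def http_only(url: str) -> Optional[str]:
    """Toy filter: keep only http:// URLs."""
    return url if url.startswith("http://") else None


records = [
    ("HTTP://Example.org/a", 1),     # kept, but only because it is normalized first
    ("http://example.org/deep", 5),  # dropped by the distance check
    ("ftp://example.org/b", 0),      # dropped by the filter
]
result = select_for_generation(records, 3, [lowercase], [http_only])
```

The first record shows why the order matters in this sketch: the filter only accepts it because the normalizer has already lowercased the scheme.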

