Hi,

Please head over to the most recent thread on dev@ for potential improvements
to the Generator* code.

Thanks for starting this discussion; it is long overdue.

Lewis



On Wed, Feb 20, 2013 at 12:55 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Alex,
>
>
> On Wed, Feb 20, 2013 at 11:54 AM, <[email protected]> wrote:
>
>>
>> The generator also does not have filters. Its mapper goes over all
>> records as far as I know. If you use Hadoop you can see how many records go
>> as input to mappers. Also see this
>>
>
> I don't think this is true. The GeneratorMapper filters URLs before
> selecting them for inclusion based on the following
> - distance
> - URLNormalizer(s)
> - URLFilter(s)
> in that order.
> I am going to start a new thread on improvements to the GeneratorJob
> regarding better configuration as it is a crucial stage in the crawl
> process.
>
> So the issue here, as you correctly explain, is with the Fetcher obtaining
> the URLs which have been marked with a desired batchId. This would be done
> via a scanner in Gora.
>
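To make the ordering above concrete (distance, then URLNormalizer(s), then
URLFilter(s)), here is a rough sketch in plain Java. It is not the actual
GeneratorMapper code: the class and method names are hypothetical, the
normalizers and filters are stand-in functional interfaces rather than the
Nutch plugin types, and treating a negative maxDistance as "no limit" is an
assumption made for illustration.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; not the Nutch GeneratorMapper.
public class GeneratorFilterSketch {

    // Returns the normalized URL if it survives all three checks,
    // or Optional.empty() if it is rejected at any stage.
    static Optional<String> select(String url, int distance, int maxDistance,
                                   List<UnaryOperator<String>> normalizers,
                                   List<Predicate<String>> filters) {
        // 1. Distance: assumption that a negative maxDistance disables the check.
        if (maxDistance > -1 && distance > maxDistance) {
            return Optional.empty();
        }
        // 2. Normalizers: applied in sequence, each sees the previous result.
        String normalized = url;
        for (UnaryOperator<String> normalizer : normalizers) {
            normalized = normalizer.apply(normalized);
        }
        // 3. Filters: run on the normalized URL; any rejection drops it.
        for (Predicate<String> filter : filters) {
            if (!filter.test(normalized)) {
                return Optional.empty();
            }
        }
        return Optional.of(normalized);
    }

    public static void main(String[] args) {
        // Stand-in normalizer and filter, chosen only for the demo.
        List<UnaryOperator<String>> normalizers = List.of(u -> u.toLowerCase());
        List<Predicate<String>> filters = List.of(u -> u.startsWith("http://"));

        System.out.println(select("HTTP://Example.com/", 1, 3, normalizers, filters));
        System.out.println(select("http://example.com/", 5, 3, normalizers, filters));
    }
}
```

Doing the cheap distance check first avoids running normalizers and filters on
records that would be dropped anyway, and filtering after normalization means
the filters only ever see the canonical form of a URL.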



-- 
*Lewis*
