I may as well drop this in here: a while back I opened an issue to
discuss the differences in the order of filtering and normalization
between the 1.x and 2.x codebases, specifically within the Generator* classes.
https://issues.apache.org/jira/browse/NUTCH-1373
I am not sure how, or if, this applies to other tools. If the order of
normalization and filtering is inconsistent there as well, we would
need to revisit this.
Lewis

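As a rough illustration of the rule Markus suggests in the quoted message
below, and of why the order of the two steps matters, here is a minimal
Python sketch. It is not Nutch code: the session-id rule, the query-string
filter, the function names, and the choice to treat an empty URL as rejected
are all assumptions made for the example. Only the `.{256,}` pattern with an
empty substitution comes from the thread.

```python
import re

# Rule quoted in the thread: match any run of 256 or more characters.
LONG_URL = re.compile(r".{256,}")
# Hypothetical extra normalizer rule: drop a trailing session-id query string.
SESSION_ID = re.compile(r"\?PHPSESSID=[0-9a-f]+$")
# Hypothetical filter rule: reject URLs that still contain a query string.
HAS_QUERY = re.compile(r"\?")


def normalize(url):
    """Apply the long-URL rule first, then the session-id rule."""
    url = LONG_URL.sub("", url)  # empty substitution blanks overlong URLs
    return SESSION_ID.sub("", url)


def accept(url):
    """Return True if the filter keeps the URL (assumption: empty is rejected)."""
    return bool(url) and not HAS_QUERY.search(url)


def normalize_then_filter(url):
    """Normalize first, then filter the normalized URL."""
    normalized = normalize(url)
    return normalized if accept(normalized) else None


def filter_then_normalize(url):
    """Filter the raw URL first, then normalize whatever survives."""
    return normalize(url) if accept(url) else None
```

With these made-up rules, a URL carrying a session id survives only when
normalization runs first, because the filter would otherwise see the raw
query string. An overlong URL is blanked by the first rule either way, but
it is rejected only when the filter runs after the normalizer.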

On Mon, Jun 24, 2013 at 1:14 PM, Markus Jelsma
<[email protected]> wrote:

> Yes, that matters indeed! But if you don't normalize, your URL filters may
> not work, although that should not be a problem in small crawls or with a
> limited number of (good) websites. You could try the following normalizing
> rule as your first rule to remove very long URLs.
>
> .{256,}
>
> With an empty substitution this should `empty` all long URLs.
>
>
> -----Original message-----
> > From: eakarsu <[email protected]>
> > Sent: Monday 24th June 2013 22:03
> > To: [email protected]
> > Subject: Re: Parse reduce stage take forver
> >
> > Sebastian,
> >
> > Does it matter if we reverse the order of the normalize and filter calls?
> > Currently, Nutch normalizes first and then filters.
> >
> > What if we do the reverse: filter and then normalize? Suppose we have
> > very long URLs: does that kill normalize?
> >
> > Thanks
> >
> > Erol Akarsu
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Parse-reduce-stage-take-forver-tp4072755p4072834.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
>



-- 
*Lewis*
