I may as well drop this one in here. I opened an issue a while back to discuss inherent differences in the ordering of filtering and normalization between the 1.x and 2.x codebases, specifically within the Generator* classes:

https://issues.apache.org/jira/browse/NUTCH-1373

I am not sure how, or if, this applies to other tools. If the ordering of normalization and filtering is inconsistent there as well, we would need to revisit this.

Lewis

On Mon, Jun 24, 2013 at 1:14 PM, Markus Jelsma <[email protected]> wrote:

> yes, that matters indeed! But if you don't normalize, your URL filters may
> not work although that should not be a problem in small crawls or a limited
> number of (good) websites. You could try the following normalizing rule to
> remove very long URL's as your first rule.
>
> .{256,}
>
> With an empty substitution this should `empty` all long URL's.
>
> -----Original message-----
> > From:eakarsu <[email protected]>
> > Sent: Monday 24th June 2013 22:03
> > To: [email protected]
> > Subject: Re: Parse reduce stage take forver
> >
> > Sebastian,
> >
> > Does it matter reverse order of normalize and filter calls?
> > Currently, nutch first does normalize and then filter.
> >
> > What about if we do reverse: filter and then normalize? Suppose we have
> > very long urls, does it kill normalize?
> >
> > Thanks
> >
> > Erol Akarsu
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Parse-reduce-stage-take-forver-tp4072755p4072834.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.

--
*Lewis*
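
For anyone who wants to check what the `.{256,}` rule quoted above does before putting it in a normalizer config, here is a minimal standalone sketch in plain Java. It does not use any Nutch classes. The class name `LongUrlRule`, the `normalize` helper and the example URLs are made up for illustration, and it assumes the normalizer applies each rule with replace-all semantics and an empty substitution.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: mimics a single regex
// normalization rule with an empty substitution.
public class LongUrlRule {
    // The pattern from the thread: any run of 256 or more characters.
    private static final Pattern LONG_URL = Pattern.compile(".{256,}");

    // Assumption: the rule is applied with replace-all semantics.
    static String normalize(String url) {
        return LONG_URL.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        String shortUrl = "http://example.com/page";

        // Build a URL well over 256 characters long.
        StringBuilder sb = new StringBuilder("http://example.com/?q=");
        for (int i = 0; i < 300; i++) {
            sb.append('a');
        }
        String longUrl = sb.toString();

        // A short URL does not match and is returned unchanged.
        System.out.println(normalize(shortUrl).equals(shortUrl));
        // A long URL matches as a whole and is replaced by the empty string.
        System.out.println(normalize(longUrl).isEmpty());
    }
}
```

Because the quantifier is greedy, a URL of 256 characters or more is consumed in a single match, so the result is an empty string rather than a truncated URL. Whether an empty URL is then dropped depends on the downstream filter and the tool in question, which is the ordering concern raised in this thread.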

