Hi,

Not sure if it's possible in the 2.x branch to filter/normalize just once, but with a bit of hacking it should not be too difficult. If you filter the input urls (the injected urls), then you only need to filter the newly discovered urls in the parser and never again after that. (Of course, when you change the normalize/filter rules you have to reprocess them all.)

Alternatively, you could try the patch in https://issues.apache.org/jira/browse/NUTCH-1314, which limits url lengths. Usually a handful of urls can stall the process for a long time because the regexes (in the filter/normalizer) go crazy on them. Best is to do both.
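To illustrate, the filter-once cycle could look roughly like this on the 1.x command line (an untested sketch; 2.x option names may differ, and the paths and the -topN value are just examples). It assumes outlink filtering at parse time is left on (the parse.filter.urls / parse.normalize.urls properties, if your version has them, default to true) and that every other step skips the work:

  # seeds are normalized/filtered once, by the injector itself (default behaviour)
  bin/nutch inject crawl/crawldb urls

  # one crawl cycle; generate skips filtering/normalizing entirely
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -noFilter -noNorm
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  # newly discovered outlinks are filtered/normalized here, exactly once
  bin/nutch parse $s

  # on 1.x, updatedb only filters/normalizes when you pass -filter / -normalize
  # (or enable crawldb.url.filters / crawldb.url.normalizers), so leave them off
  bin/nutch updatedb crawl/crawldb $s

And short of applying the NUTCH-1314 patch, you can get a similar effect with a cheap rule at the very top of regex-urlfilter.txt; the rules are tried in order and the first match wins, so overlong urls are rejected before the expensive patterns ever run (the 1000 is an arbitrary cutoff):

  # reject overlong urls before the heavy regexes get to see them
  -.{1000,}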
On Fri, Feb 1, 2013 at 9:04 AM, kemical <[email protected]> wrote:

> Ok, now with generate and the -noFilter -noNorm options, the fetch starts
> almost immediately.
>
> I would really like to have an exhaustive picture of how url
> filtering/normalizing is done across the different steps of a crawl, to
> understand the side effects of what i'm doing.
>
> From what i've found, updatedb can also filter/normalize urls, but it
> then normalizes the existing crawldb urls as well (which should take a
> very long time too). What i want (i think ^^) is to filter/normalize
> only the newly discovered urls, once. Is there a way to do that? Or am
> i completely wrong?

--
Ferdy Galema
Kalooga Development

