Hi,

Not sure if it's possible in the 2.x branch to filter/normalize just once,
but with a bit of hacking it should not be too difficult. If you filter
the input URLs (the injected URLs), then you only need to filter the new
URLs in the parser and never again. (Of course, when you change the
normalize/filter rules you have to reprocess them all.)
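
As a rough sketch of what that could look like, using 1.x-style commands
(treat this as an assumption: flag names vary by version, so verify them
against the usage output of each bin/nutch command, and the segment path
is just a placeholder):

    # inject applies the configured filters/normalizers, so the seed
    # URLs get cleaned exactly once here
    bin/nutch inject crawl/crawldb urls/

    # skip the expensive re-filtering on every generate
    bin/nutch generate crawl/crawldb crawl/segments -noFilter -noNorm

    # fetch/parse as usual; newly discovered outlinks are filtered
    # and normalized when the parse output is written

    # omit updatedb's opt-in -normalize/-filter switches so the whole
    # crawldb is not re-processed (again, check your version's usage)
    bin/nutch updatedb crawl/crawldb crawl/segments/<segment>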

Alternatively, you could try the patch in
https://issues.apache.org/jira/browse/NUTCH-1314, which limits URL
lengths. Usually a handful of URLs can stall the process for a long time
because the regexes (in the filter/normalizer) go crazy on them.
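
If you'd rather not apply the patch, a cheap variant of the same idea (my
own suggestion, not what NUTCH-1314 does) is to put a length cutoff as the
very first rule in conf/regex-urlfilter.txt, so overly long URLs are
rejected before any of the expensive rules below it ever run (the first
matching rule wins):

    # reject any URL longer than roughly 1000 characters up front;
    # tune the threshold to your crawl
    -^.{1000,}$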

Best is to do both.


On Fri, Feb 1, 2013 at 9:04 AM, kemical <[email protected]> wrote:

> Ok, now with generate and the -noFilter -noNorm options, the fetch starts
> almost immediately.
>
> I would really like an exhaustive overview of how URLs are
> filtered/normalized across all the different steps of a crawl, to
> understand the side effects of what I'm doing.
>
> From what I've found, updatedb can also filter/normalize URLs, but it also
> normalizes the crawldb URLs (which should take a very long time too). What
> I want (I think ^^) is to filter/normalize only the newly discovered URLs,
> and only once. Is there a way to do that? Or am I completely wrong?
>
>
>



-- 
*Ferdy Galema*
Kalooga Development
