OK, now with the -noFilter and -noNorm options on the generate step, the fetch
starts almost immediately.
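
For reference, this is roughly the generate call I'm running now (paths and
the -topN value are just placeholders):

    bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -noFilter -noNorm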

I would really like to see an exhaustive description of where URL
filtering/normalization happens across the different steps of a crawl, so I
can understand the side effects of what I'm doing.

From what I've found, updatedb can also filter/normalize URLs, but it then
normalizes the existing crawldb URLs as well (which should also take a very
long time). What I want (I think ^^) is to filter/normalize only the newly
discovered URLs, once. Is there a way to do that? Or am I completely wrong?
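
To make the question concrete, my crawl cycle looks roughly like this (segment
paths are placeholders, and the comments are just my current understanding, so
please correct anything that's off):

    # inject: seed urls are filtered/normalized here, I believe
    bin/nutch inject crawl/crawldb urls/

    # generate: filtering/normalization skipped, fetch starts quickly
    bin/nutch generate crawl/crawldb crawl/segments -noFilter -noNorm

    bin/nutch fetch crawl/segments/<segment>
    bin/nutch parse crawl/segments/<segment>

    # updatedb: this is where I'd like to filter/normalize only the
    # newly discovered urls, not the whole crawldb
    bin/nutch updatedb crawl/crawldb crawl/segments/<segment>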


