Ok, it makes sense, thanks Markus!

Remi

On Saturday, February 18, 2012, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
>> That works just fine!
>>
>> I wonder why crawldb has to be updated first. All these URLs are in
>> segments and similarly the regex-urlfilter works immediately without the
>> need of updating the db.
>
> They work immediately indeed, but the DB is not magically updated. Also,
> segment crawldatum objects are not normalized so they will not match anymore
> with the contents of the crawldb.
>
>>
>> Any particular reason?
>>
>> Remi
>>
>> On Saturday, February 18, 2012, Markus Jelsma <markus.jel...@openindex.io>
>> wrote:
>> > Did you update the entire crawldb with that normalizer?
>> >
>> >> Hi,
>> >>
>> >> I'm witnessing a weird problem. I configured regex-normalize.xml to
>> >> escape whitespaces, curly braces...and it works while checking with
>> >> URLNormalizerChecker:
>> >> *echo "URL non escaped" | bin/nutch org.apache.nutch.net.URLNormalizerChecker*
>> >> *output: escaped URL*
>> >>
>> >> But when I run crawl with Nutch, I can still see "bad" URLs being
>> >> fetched.
>> >>
>> >> Any explanation for this?
>> >>
>> >> Remi
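
To illustrate Markus's point about segment entries no longer matching the crawldb, here is a toy sketch in Python. It is not Nutch code: the `normalize` function, the dictionaries standing in for the crawldb, and the example URL are all assumptions made up for illustration. It only shows why a key that was stored un-normalized stops matching once the other side has been normalized.

```python
import re


def normalize(url: str) -> str:
    """Percent-encode spaces and curly braces.

    Hypothetical stand-in for the rules one might put in regex-normalize.xml.
    """
    replacements = {" ": "%20", "{": "%7B", "}": "%7D"}
    return re.sub(r"[ {}]", lambda m: replacements[m.group(0)], url)


raw_url = "http://example.com/a b/{id}"  # made-up "bad" URL

# Toy crawldb written before the normalizer was configured: keyed by raw URL.
old_crawldb = {raw_url: "db_unfetched"}

# The same crawldb after a one-off pass that rewrites every key.
updated_crawldb = {normalize(url): status for url, status in old_crawldb.items()}

# A segment entry generated earlier still carries the raw URL.
segment_url = raw_url

print(normalize(raw_url))                         # http://example.com/a%20b/%7Bid%7D
print(segment_url in updated_crawldb)             # False: raw key no longer matches
print(normalize(segment_url) in updated_crawldb)  # True: both sides normalized
```

In this toy model the lookup only succeeds when both the stored key and the incoming URL have gone through the same normalizer, which is the mismatch described above.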