Re: URLNormalizer not working properly

Markus Jelsma Sat, 18 Feb 2012 12:54:25 -0800

> That works just fine!
> 
> I wonder why crawldb has to be updated first. All these URLs are in
> segments and similarly the regex-urlfilter works immediately without the
> need of updating the db.


They do work immediately, but the existing crawldb is not magically updated.
Also, the CrawlDatum objects in the segments are not normalized, so they will
no longer match the contents of the crawldb.
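
To illustrate the mismatch, here is a toy Python sketch. It is not Nutch code:
the rules, the example URL, and the dict standing in for the crawldb are all
made up for illustration, assuming a normalizer that percent-encodes spaces and
curly braces.

```python
import re

# Hypothetical rules, loosely mirroring what a regex-normalize.xml might do:
# percent-encode spaces and curly braces.
RULES = [
    (re.compile(r" "), "%20"),
    (re.compile(r"\{"), "%7B"),
    (re.compile(r"\}"), "%7D"),
]


def normalize(url):
    """Apply every rule in order and return the rewritten URL."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


def normalize_db(db):
    """Return a copy of the toy db with every key passed through normalize()."""
    return {normalize(url): status for url, status in db.items()}


# Toy "crawldb" keyed by URL, written before the normalizer was configured.
crawldb = {"http://example.com/a b{1}": "db_unfetched"}

raw = "http://example.com/a b{1}"
clean = normalize(raw)

print(clean)                           # http://example.com/a%20b%7B1%7D
print(clean in crawldb)                # False: the stored key is still the raw URL
print(clean in normalize_db(crawldb))  # True once the stored keys are normalized too
```

The checker only exercises normalize(); the old keys stay as they were until the
db itself is run through the same rules.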

> 
> Any particular reason?
> 
> Remi
> 
> On Saturday, February 18, 2012, Markus Jelsma <[email protected]> wrote:
> > Did you update the entire crawldb with that normalizer?
> > 
> >> Hi,
> >> 
> >> I'm witnessing a weird problem. I configured regex-normalize.xml to
> >> escape whitespaces, curly braces...and it works while checking with
> >> URLNormalizerChecker:
> >> *echo "URL non escaped" | bin/nutch
> >> org.apache.nutch.net.URLNormalizerChecker*
> >> *output: escaped URL*
> >> 
> >> But when I run crawl with Nutch, I can still see "bad" URLs being
> >> fetched.
> >> 
> >> Any explanation for this?
> >> 
> >> Remi
