Re: URLNormalizer not working properly

remi tassing Sat, 18 Feb 2012 12:44:25 -0800

That works just fine!

I wonder why the crawldb has to be updated first. All these URLs are in
segments, and regex-urlfilter similarly takes effect immediately without
needing to update the db.


Any particular reason?

Remi

On Saturday, February 18, 2012, Markus Jelsma <[email protected]> wrote:
> Did you update the entire crawldb with that normalizer?
>
>> Hi,
>>
>> I'm witnessing a weird problem. I configured regex-normalize.xml to
>> escape whitespace, curly braces... and it works when checking with
>> URLNormalizerChecker:
>> echo "URL non escaped" | bin/nutch org.apache.nutch.net.URLNormalizerChecker
>> output: escaped URL
>>
>> But when I run crawl with Nutch, I can still see "bad" URLs being
>> fetched.
>>
>> Any explanation for this?
>>
>> Remi
>
