Re: URLNormalizer not working properly

remi tassing Sat, 18 Feb 2012 12:57:23 -0800

Ok, it makes sense, thanks Markus!

Remi


On Saturday, February 18, 2012, Markus Jelsma <[email protected]>
wrote:
>
>> That works just fine!
>>
>> I wonder why crawldb has to be updated first. All these URLs are in
>> segments and similarly the regex-urlfilter works immediately without the
>> need of updating the db.
>
> They work immediately indeed, but the DB is not magically updated. Also,
> segment crawldatum objects are not normalized so they will not match anymore
> with the contents of the crawldb.
>
>>
>> Any particular reason?
>>
>> Remi
>>
>> On Saturday, February 18, 2012, Markus Jelsma <[email protected]>
>> wrote:
>> > Did you update the entire crawldb with that normalizer?
>> >
>> >> Hi,
>> >>
>> >> I'm witnessing a weird problem. I configured regex-normalize.xml to escape
>> >> whitespaces, curly braces...and it works while checking with
>> >> URLNormalizerChecker:
>> >> echo "URL non escaped" | bin/nutch org.apache.nutch.net.URLNormalizerChecker
>> >> output: escaped URL
>> >>
>> >> But when I run crawl with Nutch, I can still see "bad" URLs being fetched.
>> >>
>> >> Any explanation for this?
>> >>
>> >> Remi
>
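
For anyone reading this thread later, here is a rough sketch of the mismatch
Markus describes. It is not Nutch code: the rules, the dictionary standing in
for the crawldb, and the example URL are all made up for illustration, and the
real regex-normalize.xml rules will differ. It only shows why URLs recorded
in a segment before normalization no longer match the keys of a crawldb that
was normalized afterwards.

```python
import re

# Hypothetical rules, loosely imitating what a regex-normalize.xml that
# escapes whitespace and curly braces might do. Not the actual Nutch rules.
RULES = [
    (re.compile(r" "), "%20"),
    (re.compile(r"\{"), "%7B"),
    (re.compile(r"\}"), "%7D"),
]


def normalize(url):
    """Apply every (pattern, substitution) rule to the URL in order."""
    for pattern, substitution in RULES:
        url = pattern.sub(substitution, url)
    return url


# A made-up URL that was stored before the normalizer was configured.
raw = "http://example.com/a b/{id}"

# Stand-in for the crawldb: URL -> status, still keyed by the raw URL.
crawldb = {raw: "db_unfetched"}

# Stand-in for running the normalizer over the whole crawldb.
normalized_crawldb = {normalize(u): s for u, s in crawldb.items()}

# The segment was generated earlier and still carries the raw URL,
# so a lookup against the normalized crawldb finds nothing.
segment_urls = [raw]
unmatched = [u for u in segment_urls if u not in normalized_crawldb]
```

In this toy model the segment entry only lines up with the crawldb again once
the same normalization has been applied to both sides.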
