Ok, it makes sense, thanks Markus!

Remi

On Saturday, February 18, 2012, Markus Jelsma <markus.jel...@openindex.io>
wrote:
>
>> That works just fine!
>>
>> I wonder why crawldb has to be updated first. All these URLs are in
>> segments and similarly the regex-urlfilter works immediately without the
>> need of updating the db.
>
> They work immediately indeed, but the DB is not magically updated. Also,
> segment crawldatum objects are not normalized so they will not match
> anymore with the contents of the crawldb.
>
>>
>> Any particular reason?
>>
>> Remi
>>
>> On Saturday, February 18, 2012, Markus Jelsma <markus.jel...@openindex.io>
>> wrote:
>> > Did you update the entire crawldb with that normalizer?
>> >
>> >> Hi,
>> >>
>> >> I'm witnessing a weird problem. I configured regex-normalize.xml to
>> >> escape whitespaces, curly braces...and it works while checking with
>> >> URLNormalizerChecker:
>> >> *echo "URL non escaped" | bin/nutch
>> >> org.apache.nutch.net.URLNormalizerChecker*
>> >> *output: escaped URL*
>> >>
>> >> But when I run crawl with Nutch, I can still see "bad" URLs being
>> >> fetched. Any explanation for this?
>> >>
>> >> Remi
>
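
For readers of the archive, the point made in the thread (entries already
in the crawldb are not rewritten just because a normalizer is configured,
and un-normalized keys no longer match normalized ones) can be sketched
with a toy model. This is only an illustration in Python, not Nutch code:
the rule list, the example URL, and the dict-based "crawldb" are all
assumptions made up for the sketch, and the real contents of
regex-normalize.xml are not shown in the thread.

```python
import re

# Hypothetical stand-in for the rules in regex-normalize.xml: percent-escape
# spaces and curly braces. The actual rules used in the thread are unknown.
RULES = [
    (re.compile(r" "), "%20"),
    (re.compile(r"\{"), "%7B"),
    (re.compile(r"\}"), "%7D"),
]


def normalize(url: str) -> str:
    """Apply every substitution rule in order, as a regex normalizer would."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


# Toy "crawldb" populated before the normalizer was configured:
# the keys are still the raw, un-escaped URLs.
crawldb = {"http://example.com/a b/{id}": "db_unfetched"}

raw_url = "http://example.com/a b/{id}"
normalized_url = normalize(raw_url)

# Checking a single URL works immediately (what URLNormalizerChecker shows),
# but the stored key is untouched, so the normalized form does not match it.
found_in_stale_db = normalized_url in crawldb

# "Updating the entire crawldb" corresponds to rewriting every key
# through the normalizer.
updated_crawldb = {normalize(url): status for url, status in crawldb.items()}
found_in_updated_db = normalized_url in updated_crawldb
```

In this model the lookup only succeeds after the keys have been rewritten,
which is why the db has to be passed through the normalizer once rather
than relying on the new configuration alone.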
