Thank you so much again, Julien, for this email and the other.

We are using the bin/crawl script, yes!

From:   Julien Nioche <[email protected]>
To:     "[email protected]" <[email protected]>
Date:   07/10/2014 12:15 PM
Subject:        Re: Nutch local: large crawls, extremely slow, small solr index



Hi again Craig,

There is a deduplicator in Nutch, but it won't prevent you from crawling
these URLs indefinitely. One option would be to change the URLFilters /
URLNormalizers so that they handle the repetition of path elements.
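[For illustration, not part of the original email: Nutch's default regex-urlfilter.txt ships a loop-breaking rule that excludes URLs in which a slash-delimited path segment repeats three or more times. The sketch below demonstrates that rule's regex in Python; the URLs and helper name are hypothetical.]

```python
import re

# The regex from Nutch's default regex-urlfilter.txt rule that skips
# URLs with a path segment repeating 3+ times, to break crawl loops.
LOOP_PATTERN = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")

def is_path_loop(url: str) -> bool:
    """Return True if the URL would be excluded by the loop-breaking rule."""
    return LOOP_PATTERN.match(url) is not None

# A looping path (/a/b repeated) is caught; a normal path is not.
print(is_path_loop("http://example.com/a/b/a/b/a/b/page.html"))  # True
print(is_path_loop("http://example.com/a/b/c/page.html"))        # False
```

A stricter or looser variant of this pattern, added as a `-`-prefixed rule in regex-urlfilter.txt, would stop the repeated-path URLs from ever entering the crawl frontier.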

How do you run your crawl BTW? Do you use the crawl script?

-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
