Thank you so much again, Julien, for this email and the other. We are using the /bin/crawl script, yes!
From: Julien Nioche <[email protected]>
To: "[email protected]" <[email protected]>
Date: 07/10/2014 12:15 PM
Subject: Re: Nutch local: large crawls, extremely slow, small solr index

Hi again Craig,

There is a deduplicator in Nutch, but it won't prevent you from crawling these URLs infinitely. One option would be to change the URLFilters / Normalisers so that they deal with the repetition of two elements in the path.

How do you run your crawl, BTW? Do you use the crawl script?

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
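To illustrate the URLFilter idea Julien mentions: Nutch's regex-urlfilter.txt lets you reject URLs matching a pattern, and a backreference can catch a path segment that repeats (the classic symptom of a crawler trap / infinite URL space). Below is a minimal Python sketch of that matching logic; the pattern and the `accept` helper are illustrative, not Nutch code, and the exact rule shipped in your Nutch's conf/regex-urlfilter.txt may differ.

```python
import re

# Reject URLs containing an immediately repeated slash-delimited path
# segment, e.g. /a/a/... -- the kind of loop a regex-urlfilter rule
# such as "-.*(/[^/]+)\1+" would break. Pattern is an assumption, not
# the stock Nutch rule.
REPEATED_SEGMENT = re.compile(r"(/[^/]+)\1+")

def accept(url: str) -> bool:
    """Return False when a path segment repeats back-to-back."""
    return REPEATED_SEGMENT.search(url) is None

print(accept("http://example.com/a/b/c"))          # no repeat -> kept
print(accept("http://example.com/a/a/page.html"))  # /a/a -> rejected
```

In Nutch itself the equivalent would be a single `-` (deny) line in conf/regex-urlfilter.txt placed before the final catch-all accept rule, so looping URLs are dropped before they are fetched.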

