Hello everybody, I have the following situation: there are over 160 URLs in my seed list. I started crawling a month ago and have been running the bin/crawl script every midnight. By now a lot of pages have been crawled into my storage (HBase), but I can see in my Solr index that some seed URLs have not been crawled at all, or only in very small numbers. (Some of the URLs are restricted by robots.txt, but many of them have no restriction, ban, or anything of the sort.)
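
For reference, the nightly run is a plain cron entry along these lines (the install path, seed directory, crawl id, Solr URL, and number of rounds below are placeholders, not my exact values):

    # runs at midnight; arguments follow bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
    0 0 * * * /opt/nutch/runtime/local/bin/crawl /opt/nutch/urls webcrawl http://localhost:8983/solr/ 3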
The problem is that some URLs have, for example, 1500 indexed documents in Solr, while others have only 15 or 20, and many have none at all. To take a concrete sample:

http://artcyclopedia.com - 8293 docs
http://berlin.de - 12988 docs
http://de.wikipedia.org - 15899 docs
http://imdb.com - 38852 docs
http://jopiehuismanmuseum.nl - 1 doc
http://kasteelgroeneveld.nl - 0 docs
http://kasteelheeswijk.nl - 295 docs
http://kmm.nl - 0 docs
http://kunsthalkade.nl - 157 docs
http://velodrom.de - 232 docs

Is it possible to tell Nutch to prefer some URLs? Or is it possible to tell Nutch to crawl all URLs equally?

Thank you,
Jan
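
P.S. From reading nutch-default.xml I wonder whether generate.max.count / generate.count.mode are what I need for the "crawl all URLs equally" case; a minimal sketch of what I have in mind for nutch-site.xml (the value 500 is an arbitrary guess on my part, and I am not sure these properties behave the way I expect):

    <!-- cap the number of URLs taken from any single host per generate
         cycle, so the big hosts cannot crowd out the small ones -->
    <property>
      <name>generate.max.count</name>
      <value>500</value>
    </property>
    <property>
      <name>generate.count.mode</name>
      <value>host</value>
    </property>

And if preferring certain URLs is the way to go, would per-URL score metadata in the seed list work? Something like this (fields tab-separated; the score value is just an example):

    # boost this host at inject time
    http://jopiehuismanmuseum.nl	nutch.score=10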

