Hello everybody,

I have the following situation: there are over 160 URLs in my seed list. I
started crawling one month ago and have run the bin/crawl script every
midnight since. Now I have a lot of pages crawled into my storage (HBase),
but I can see in my Solr index that some URLs from the seed are not crawled
at all, or only in very small numbers (OK, a few URLs are restricted by
robots.txt, but many of them have no robots.txt restriction, ban or
anything of the kind).
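
For context, this is roughly how I invoke it from cron (the paths, crawl
ID and Solr URL below are placeholders for my local setup, and the
argument order is what my 2.x checkout's bin/crawl usage message shows,
so please double-check it against yours):

  # run every midnight: bin/crawl <seedDir> <crawlId> <solrUrl> <numberOfRounds>
  0 0 * * * /opt/nutch/runtime/local/bin/crawl /opt/nutch/urls myCrawl http://localhost:8983/solr/ 1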

The problem is that some URLs have, for example, 1500 indexed documents in
Solr, while others have only 15 or 20, and a lot of them have just 0 docs.
Let's take an example:

http://artcyclopedia.com - 8293 docs
http://berlin.de - 12988 docs
http://de.wikipedia.org - 15899 docs
http://imdb.com - 38852 docs
http://jopiehuismanmuseum.nl - 1 doc
http://kasteelgroeneveld.nl - 0 docs
http://kasteelheeswijk.nl - 295 docs
http://kmm.nl - 0 docs
http://kunsthalkade.nl - 157 docs
http://velodrom.de - 232 docs

Is it possible to tell Nutch to prefer some URLs? Or is it possible to
tell Nutch to crawl all URLs equally?
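
In case it clarifies what I mean by "equally": I was wondering whether a
per-host cap on the fetch list is the right knob, something like the
following in conf/nutch-site.xml (I am not certain these properties are
honored by the 2.x GeneratorJob, so the property names are my assumption
based on the 1.x documentation):

  <!-- Cap how many URLs a single host may contribute to one fetch list,
       so the big hosts cannot crowd out the small ones.
       Assumption: generate.max.count / generate.count.mode behave in 2.x
       as documented for 1.x. -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>

And for "preferring" some URLs, I understand the injector can read a
nutch.score metadatum from the seed file (again my assumption that 2.x
parses it the same way), e.g. a tab-separated entry per line:

  http://imdb.com/	nutch.score=0.5
  http://kmm.nl/	nutch.score=10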

Thank you,
Jan


