Re: Only a small portion of URLs is indexed in Solr at the end of the crawl

Stefan Scheffler Tue, 26 Feb 2013 01:42:10 -0800

On 26.02.2013 10:19, Amit Sela wrote:

Hi all,


I'm running Nutch 1.6 and Solr 3.6.2, and I'm crawling with depth 1, topN
1000000, and 'db.update.additions.allowed' set to false.
The idea is to fetch, parse and index only the URLs in the seed list.

I seed ~120K URLs, but in Solr I see only ~20K indexed.

The fetch job counters show:

moved 49,937 -> redirects, I think (not followed by default; the Nutch
property http.redirect.max allows this)
robots_denied 1,149 -> forbidden by the robots.txt of the seed URL
robots_denied_maxcrawldelay 267 -> skipped because the Crawl-delay in the
robots.txt of the seed URL is too long
hitByTimeLimit 6,072 -> dropped because the fetcher ran out of time (see
fetcher.timelimit.mins)
exception 4,479 -> other stuff
notmodified 2
access_denied 4 -> login needed
temp_moved 4,658 -> temporary redirects (not followed by default; the same
Nutch property allows this)
success 23,033 -> your 20k, which are indexed
notfound 1,658 -> 404
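
To see where the seeds went, it can help to add the counters up. The sketch
below only uses the numbers quoted above; the 120,000 figure is the
approximate "~120K" from the question, and the names (counters,
SEEDED_APPROX, summarize) are made up for illustration. The counters sum to
91,259, so roughly 29K seeds do not show up in any fetch counter at all; the
thread does not say what happened to those.

```python
# Fetch status counters exactly as quoted in this thread.
counters = {
    "moved": 49_937,
    "robots_denied": 1_149,
    "robots_denied_maxcrawldelay": 267,
    "hitByTimeLimit": 6_072,
    "exception": 4_479,
    "notmodified": 2,
    "access_denied": 4,
    "temp_moved": 4_658,
    "success": 23_033,
    "notfound": 1_658,
}

# Assumption: "~120K" seeds, taken as a round number for illustration only.
SEEDED_APPROX = 120_000


def summarize(counts):
    """Return the total and (name, count, percent of total), largest first."""
    total = sum(counts.values())
    rows = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return total, [(name, n, 100.0 * n / total) for name, n in rows]


total, rows = summarize(counters)
for name, n, share in rows:
    print(f"{name:<30}{n:>8,}{share:>7.1f}%")

redirects = counters["moved"] + counters["temp_moved"]
print(f"{'total fetch attempts':<30}{total:>8,}")
print(f"{'redirects (moved+temp_moved)':<30}{redirects:>8,}")
print(f"{'seeds not in any counter':<30}{SEEDED_APPROX - total:>8,}  (approximate)")
```

Redirects alone (moved plus temp_moved) account for 54,595 of the counted
URLs, which is more than twice the number of successful fetches.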

By the way, if you crawl with a depth of just 1, you don't need to specify a
topN, because you will always crawl only the seed URLs.


and the ParserStatus success count is 22844

What happened to all the URLs? They are all active URLs, not some old
list...

Thanks,

Amit.



--
Stefan Scheffler
Avantgarde Labs GmbH
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: [email protected]
