On 26.02.2013 at 10:19, Amit Sela wrote:
Hi all,
I'm running Nutch 1.6 and Solr 3.6.2, and I'm crawling with depth 1, topN
1000000 and 'db.update.additions.allowed' set to false.
The idea is to fetch, parse and index only the URLs in the seed list.
I seed ~120K URLs but in solr I see only ~20K indexed.
The fetch job counters show:
moved 49,937 -> redirects, I think (these are not crawled; there is a Nutch
property that allows following them)
robots_denied 1,149 -> forbidden by the robots.txt of the seed URL
robots_denied_maxcrawldelay 267 -> skipped because of the crawl delay set in
the robots.txt of the seed URL
hitByTimeLimit 6,072 -> dropped from the fetch queue because the fetcher time
limit (fetcher.timelimit.mins) was reached
exception 4,479 -> other stuff
notmodified 2
access_denied 4 -> login needed
temp_moved 4,658 -> temporary redirects (these are not crawled; there is a
Nutch property that allows following them)
success 23,033 -> your 20k, which are indexed
notfound 1,658 -> 404
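To see where the URLs went, it can help to add the counters up. The following is only a rough sketch in Python: the numbers are copied from the counters above, the seed count of 120,000 is the approximate figure from the question, and the names FETCH_COUNTERS, APPROX_SEEDED and summarize are made up for this example.

```python
# Fetch job counters, copied from the list above.
FETCH_COUNTERS = {
    "moved": 49_937,
    "robots_denied": 1_149,
    "robots_denied_maxcrawldelay": 267,
    "hitByTimeLimit": 6_072,
    "exception": 4_479,
    "notmodified": 2,
    "access_denied": 4,
    "temp_moved": 4_658,
    "success": 23_033,
    "notfound": 1_658,
}

# Assumption: "~120K" seeded URLs, taken as exactly 120,000 for the arithmetic.
APPROX_SEEDED = 120_000


def summarize(counters, seeded):
    """Total the counters and compare them with the number of seeded URLs."""
    total = sum(counters.values())
    redirects = counters["moved"] + counters["temp_moved"]
    return {
        "total": total,
        "redirects": redirects,
        "success": counters["success"],
        # Seeded URLs that do not appear in any fetch counter.
        "unaccounted": seeded - total,
    }


summary = summarize(FETCH_COUNTERS, APPROX_SEEDED)

# Print each counter with its share of all counted fetch attempts, largest first.
for name, count in sorted(FETCH_COUNTERS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {count:>7,d} {count / summary['total']:6.1%}")

print(summary)
```

The counters sum to 91,259, of which 54,595 are redirects (moved plus temp_moved). That leaves roughly 29K of the ~120K seeded URLs that do not show up in any fetch counter; the counters alone do not say what happened to those.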
By the way, if you crawl with a depth of only 1, you don't need to
specify a topN, because you will always crawl just the seed URLs.
The ParserStatus success count is 22,844.
What happened to all the URLs? They are all active URLs, not some old
list...
Thanks,
Amit.
--
Stefan Scheffler
Avantgarde Labs GmbH
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: [email protected]