Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr

Amit Sela Thu, 28 Feb 2013 10:51:37 -0800

Hi everyone,

I'm running Nutch 1.6 with Solr 3.6.2.
I'm trying to crawl only the seed list (depth 1) and it seems that the
process ends with only ~255 of the URLs indexed in Solr.


The seed list is about 120K URLs.
The Fetcher map input is 117K, of which success is 62K and temp_moved is 45K.
Parse shows 62K successes.
CrawlDB after the fetch shows db_redir_perm=56K, db_unfetched=27K
and db_fetched=22K.

And finally, IndexerStatus shows 20K documents added.
What am I missing?
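
To make the drop-off easier to see, here is a rough sketch that puts the counts above side by side as a share of the seed list. The numbers are the rounded figures quoted in this message; the stage labels and the `share` helper are just illustrative names, not Nutch output.

```python
# Rounded counts as reported in this message (K = thousand).
SEED_LIST = 120_000

stages = [
    ("fetcher map input", 117_000),
    ("fetcher success", 62_000),
    ("fetcher temp_moved", 45_000),
    ("parse success", 62_000),
    ("crawldb db_redir_perm", 56_000),
    ("crawldb db_unfetched", 27_000),
    ("crawldb db_fetched", 22_000),
    ("indexer documents added", 20_000),
]


def share(count, total=SEED_LIST):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


# Sum of the three CrawlDB statuses quoted above.
crawldb_total = sum(n for name, n in stages if name.startswith("crawldb"))

for name, n in stages:
    print(f"{name:<26} {n:>8} {share(n):>6}%")
print(f"{'crawldb statuses (sum)':<26} {crawldb_total:>8} {share(crawldb_total):>6}%")
```

Laid out this way, the indexed count tracks db_fetched far more closely than it tracks the fetcher's success count.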

Thanks!

my nutch-site.xml includes:
-----------------------------------------
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i</value>
<name>metatags.names</name>
<value>keywords;Keywords;description;Description</value>
<name>index.parse.md</name>
<value>metatag.keywords,metatag.Keywords,metatag.description,metatag.Description</value>
<name>db.update.additions.allowed</name>
<value>false</value>
<name>generate.count.mode</name>
<value>domain</value>
<name>partition.url.mode</name>
<value>byDomain</value>
<name>file.content.limit</name>
<value>262144</value>
<name>http.content.limit</name>
<value>262144</value>
<name>parse.filter.urls</name>
<value>true</value>
<name>parse.normalize.urls</name>
<value>true</value>
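
As a quick sanity check on the plugin.includes value above, here is a small Python sketch that tests a few plugin ids against it. It assumes Nutch treats the value as a regular expression that must match the whole plugin id, and it uses Python's `re` module as a stand-in for Java regex (the two behave the same for this simple pattern). The `is_included` helper and the list of ids tried are illustrative only.

```python
import re

# plugin.includes value copied verbatim from the config above,
# including the trailing "i" after the urlnormalizer group.
PLUGIN_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)"
    "|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)"
    "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i"
)


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """True if the whole plugin id matches the include pattern.

    Assumption: full-match semantics, as described in the lead-in.
    """
    return re.fullmatch(pattern, plugin_id) is not None


for plugin_id in [
    "protocol-httpclient",
    "parse-html",
    "index-basic",
    "urlnormalizer-pass",
    "urlnormalizer-regex",
    "urlnormalizer-basic",
]:
    print(f"{plugin_id:<22} included={is_included(plugin_id)}")
```

Under that assumption, none of the urlnormalizer-* ids match the value as pasted, because the pattern requires a literal "i" after the group. That may only be a typo in this email, but it is worth checking against the actual nutch-site.xml.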
