This looks odd. From what I know, successfully parsed documents are
sent to Solr. Did you check the logs for any exceptions?
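A quick way to check is to scan the Hadoop log for errors thrown during the indexing job. The path below is the default local-runtime location (`logs/hadoop.log`); adjust it if your logging is configured differently:

```shell
# Show the last few error/exception lines from the Nutch Hadoop log.
grep -iE 'exception|error' logs/hadoop.log | tail -n 20
```

Solr-side rejections (e.g. schema field mismatches) usually show up there as SolrException stack traces.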

What command are you using to index?
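For reference, a typical Nutch 1.x indexing invocation looks like the following; the Solr URL and the crawldb/linkdb/segment paths are placeholders for your own layout:

```shell
# Index all fetched segments into Solr (Nutch 1.x solrindex job).
bin/nutch solrindex http://localhost:8983/solr \
    crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
```

If you are indexing only a single segment instead of the glob, the counts in the IndexerStatus output will reflect just that segment.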


On Thu, Feb 28, 2013 at 1:51 PM, Amit Sela <[email protected]> wrote:

> Hi everyone,
>
> I'm running with nutch 1.6 and Solr 3.6.2.
> I'm trying to crawl only the seed list (depth 1) and it seems that the
> process ends with only ~255 of the URLs indexed in Solr.
>
> Seed list is about 120K.
> Fetcher map input is 117K where success is 62K and temp_moved 45K.
> Parse shows success of 62K.
> CrawlDB after the fetch shows db_redir_perm=56K, db_unfetched=27K
> and db_fetched=22K.
>
> And finally IndexerStatus shows 20K documents added.
> What am I missing?
>
> Thanks!
>
> my nutch-site.xml includes:
> -----------------------------------------
> <name>plugin.includes</name>
>
> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> <name>metatags.names</name>
> <value>keywords;Keywords;description;Description</value>
> <name>index.parse.md</name>
>
> <value>metatag.keywords,metatag.Keywords,metatag.description,metatag.Description</value>
> <name>db.update.additions.allowed</name>
> <value>false</value>
> <name>generate.count.mode</name>
> <value>domain</value>
> <name>partition.url.mode</name>
> <value>byDomain</value>
> <name>file.content.limit</name>
> <value>262144</value>
> <name>http.content.limit</name>
> <value>262144</value>
> <name>parse.filter.urls</name>
> <value>true</value>
> <name>parse.normalize.urls</name>
> <value>true</value>
>
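One aside on the snippet above: for nutch-site.xml to load, each name/value pair has to sit inside a `<property>` element within `<configuration>` (Hadoop configuration format). A sketch of one entry, assuming the rest follow the same pattern:

```xml
<!-- nutch-site.xml: every setting is a <property> inside <configuration> -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>262144</value>
  </property>
  <!-- ...remaining properties follow the same pattern... -->
</configuration>
```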



-- 
Kiran Chitturi
