I tried setting http.redirect.max=30 (since I saw there is a bug that prevents setting -1 for all), but it made little difference. It did help a little, since I now get ~28K, but that's still less than half...
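
For reference, an override like this goes in nutch-site.xml as a regular property entry. This is a minimal sketch, assuming the usual <configuration>/<property> layout that nutch-default.xml uses. The value 30 is simply the one tried above, not a recommended setting.

```xml
<!-- nutch-site.xml: sketch of the redirect override mentioned above -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- 30 is the value tried in this thread; pick a limit that suits the crawl -->
    <value>30</value>
  </property>
</configuration>
```

As far as I know, with the default of 0 the fetcher does not follow redirects immediately but records the targets for a later fetch round, which a depth 1 crawl never reaches.
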
On Sat, Mar 2, 2013 at 9:00 AM, Stefan Scheffler <[email protected]> wrote:

> Hi Amit.
> As I answered you before, there is a config parameter to activate the
> crawling of redirections (db_redir_temp 4,770, db_redir_perm 56,810). You
> have to activate this in the nutch-site.xml.
> Please have a look at the nutch-default.xml to find out which one it is...
> Only the pages with db_fetched will be indexed.
>
> Regards
> Stefan
>
> On 02.03.2013 01:01, Amit Sela wrote:
>
>> I am using the crawl script that executes Solr indexing with:
>> $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb
>> $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
>> and then executes Solr dedup:
>> $bin/nutch solrdedup $SOLRURL
>>
>> I think it has something to do with the CrawlDB job. The job counters
>> show:
>> db_redir_temp 4,770
>> db_redir_perm 56,810
>> db_notmodified 5,343
>> db_unfetched 27,385
>> db_gone 3,741
>> db_fetched 22,065
>>
>> On Thu, Feb 28, 2013 at 10:02 PM, kiran chitturi
>> <[email protected]> wrote:
>>
>>> This looks odd. From what I know, the successfully parsed documents are
>>> sent to Solr. Did you check the logs for any exceptions?
>>>
>>> What command are you using to index?
>>>
>>> On Thu, Feb 28, 2013 at 1:51 PM, Amit Sela <[email protected]> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm running with nutch 1.6 and Solr 3.6.2.
>>>> I'm trying to crawl only the seed list (depth 1) and it seems that the
>>>> process ends with only ~255 of the URLs indexed in Solr.
>>>>
>>>> Seed list is about 120K.
>>>> Fetcher map input is 117K where success is 62K and temp_moved 45K.
>>>> Parse shows success of 62K.
>>>> CrawlDB after the fetch shows db_redir_perm=56K, db_unfetched=27K
>>>> and db_fetched=22K.
>>>>
>>>> And finally IndexerStatus shows 20K documents added.
>>>> What am I missing?
>>>>
>>>> Thanks!
>>>>
>>>> my nutch-site.xml includes:
>>>> -----------------------------------------
>>>> <name>plugin.includes</name>
>>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i</value>
>>>> <name>metatags.names</name>
>>>> <value>keywords;Keywords;description;Description</value>
>>>> <name>index.parse.md</name>
>>>> <value>metatag.keywords,metatag.Keywords,metatag.description,metatag.Description</value>
>>>> <name>db.update.additions.allowed</name>
>>>> <value>false</value>
>>>> <name>generate.count.mode</name>
>>>> <value>domain</value>
>>>> <name>partition.url.mode</name>
>>>> <value>byDomain</value>
>>>> <name>file.content.limit</name>
>>>> <value>262144</value>
>>>> <name>http.content.limit</name>
>>>> <value>262144</value>
>>>> <name>parse.filter.urls</name>
>>>> <value>true</value>
>>>> <name>parse.normalize.urls</name>
>>>> <value>true</value>
>>>
>>> --
>>> Kiran Chitturi

