Hi all,

I'm having problems with Nutch not crawling all the documents in a directory.
The directory in question can be found at:

    http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence(IAE-GUPC)/

There are 2460 documents (PDFs) in the directory. Nutch enters the directory, indexes the first 100 or so documents, and then completes its crawl.

The command issued is:

    HOST=localhost
    PORT=8983
    CORE=collection1
    cd /opt/nutch
    bin/crawl urls crawl http://localhost:8983/solr/collection1 4

Any attempt to recrawl the directory gives the following output:

    Injector: starting at 2014-08-18 14:58:26
    Injector: crawlDb: crawl/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Injector: total number of urls rejected by filters: 0
    Injector: total number of urls injected after normalization and filtering: 1
    Injector: Merging injected urls into crawl db.
    Injector: overwrite: false
    Injector: update: false
    Injector: finished at 2014-08-18 14:58:29, elapsed: 00:00:02
    Mon Aug 18 14:58:29 EST 2014 : Iteration 1 of 4
    Generating a new segment
    Generator: starting at 2014-08-18 14:58:29
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: false
    Generator: normalizing: true
    Generator: topN: 50000
    Generator: 0 records selected for fetching, exiting ...

I have the following in conf/nutch-site.xml:

    <property>
      <name>db.update.additions.allowed</name>
      <value>true</value>
      <description>If true, updatedb will add newly discovered URLs, if false
      only already existing URLs in the CrawlDb will be updated and no new
      URLs will be added.
      </description>
    </property>

    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content using the http://
      protocol, in bytes. If this value is nonnegative (>=0), content longer
      than it will be truncated; otherwise, no truncation at all. Do not
      confuse this setting with the file.content.limit setting.
      </description>
    </property>

I think this must be a config issue, but I am unsure where to look next. Can anyone point me in the right direction?

Thanks,
P

