Hi all,

I have a web site serving a collection of PDF documents, and I am using Nutch 1.8 to index them in Solr. The base URL is http://localhost/ and the documents are stored in a series of directories under http://localhost/doccontrol/. While I was experimenting, this directory initially contained a single subdirectory (http://localhost/doccontrol/DC-10 Incoming Correspondence) holding approximately 2,500 PDF documents. Nutch successfully crawled and indexed that directory and all of the files in it.
I have now added two further directories to doccontrol (http://localhost/doccontrol/DC-11 Outgoing Correspondence and http://localhost/doccontrol/DC-16 MEETINGS MINUTES), each containing about 2,500 PDF documents. However, when I run Nutch, no further documents are added to the index, and Nutch gives the following output:

    bin/crawl urls crawl http://localhost:8983/solr/collection1 4

    Injector: starting at 2014-08-21 14:06:25
    Injector: crawlDb: crawl/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Injector: total number of urls rejected by filters: 0
    Injector: total number of urls injected after normalization and filtering: 1
    Injector: Merging injected urls into crawl db.
    Injector: overwrite: false
    Injector: update: false
    Injector: finished at 2014-08-21 14:06:28, elapsed: 00:00:02
    Thu Aug 21 14:06:28 EST 2014 : Iteration 1 of 4
    Generating a new segment
    Generator: starting at 2014-08-21 14:06:29
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: false
    Generator: normalizing: true
    Generator: topN: 50000
    Generator: 0 records selected for fetching, exiting ...

I have the following in my nutch-site.xml:

    <property>
      <name>db.update.additions.allowed</name>
      <value>true</value>
      <description>If true, updatedb will add newly discovered URLs, if false
      only already existing URLs in the CrawlDb will be updated and no new
      URLs will be added.
      </description>
    </property>

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>-1</value>
      <description>The maximum number of outlinks that we'll process for a page.
      If this value is nonnegative (>=0), at most db.max.outlinks.per.page
      outlinks will be processed for a page; otherwise, all outlinks will be
      processed.
      </description>
    </property>

I am not sure why Nutch is not adding the new URLs. Is it because http://localhost/doccontrol is not the "root" and will only be re-fetched in 30 days' time? I thought db.update.additions.allowed took care of this, but am I missing something? Why are the new directories and files not being added? Can anyone point me in the right direction?

Many thanks,
Paul
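P.S. In case it helps with diagnosis, these are the checks I was going to run next. The readdb options are from the standard Nutch 1.x CLI, and the crawldb path matches the crawl command above:

    # Summary of the crawldb: how many URLs are in each status
    bin/nutch readdb crawl/crawldb -stats

    # CrawlDatum for the directory listing itself: its status, fetch time
    # and retry interval should show when it is next due to be fetched
    bin/nutch readdb crawl/crawldb -url http://localhost/doccontrol/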
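My understanding is that the 30-day figure comes from db.fetch.interval.default, which defaults to 2592000 seconds (30 days), so the already-fetched /doccontrol/ listing would not be due again until then. If that is the cause, would temporarily lowering the interval in nutch-site.xml be a reasonable workaround? A sketch of what I mean (the one-hour value is just an example, not a recommendation):

    <property>
      <name>db.fetch.interval.default</name>
      <!-- example: 3600 s (1 hour) instead of the default 2592000 s (30 days),
           just while re-crawling the updated listing -->
      <value>3600</value>
    </property>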
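Alternatively, if I run the steps by hand rather than through bin/crawl, I believe the generator's -adddays option can make URLs due for fetch early, e.g.:

    # Treat the fetch schedule as if 30 days had already passed, so the
    # /doccontrol/ listing is selected for fetching again
    bin/nutch generate crawl/crawldb crawl/segments -adddays 30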

