Hi guys,

Reposting this since I think it got lost in the tail end of my last post.
I have a web site serving a series of documents (PDFs) and am using Nutch 1.8 to index them in Solr. The base URL is http://localhost/ and the documents are stored in a series of directories under http://localhost/doccontrol/, e.g.

/
|_ doccontrol
   |_ DC-10 Incoming Correspondence
   |_ DC-11 Outgoing Correspondence

If, when I first run Nutch, the folders DC-10 and DC-11 contain all the files to be indexed, then Nutch crawls everything without a problem - GOOD :-)

If I add a new folder or new documents to the root or to the doccontrol folder, then the next time Nutch runs it crawls and indexes all the new files - GOOD :-)

However, any new files added to the DC-10 or DC-11 directories are not indexed. Nutch's output is as follows (summarised):

Injector: starting at 2014-08-29 15:19:59
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: true
Injector: update: false
Injector: finished at 2014-08-29 15:20:02, elapsed: 00:00:02
Fri Aug 29 15:20:02 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:02
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20140829152005
Generator: finished at 2014-08-29 15:20:06, elapsed: 00:00:03
Operating on segment : 20140829152005
Fetching : 20140829152005
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2014-08-29 15:20:06
Fetcher: segment: crawl/segments/20140829152005
Fetcher Timelimit set for : 1409354406733
Using queue mode : byHost
Fetcher: threads: 50
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
fetching http://ws0895/doccontrol/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
. . .
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2014-08-29 15:20:09, elapsed: 00:00:02
Parsing : 20140829152005
ParseSegment: starting at 2014-08-29 15:20:09
ParseSegment: segment: crawl/segments/20140829152005
Parsed (3ms):http://ws0895/doccontrol/
ParseSegment: finished at 2014-08-29 15:20:10, elapsed: 00:00:01
CrawlDB update
CrawlDb update: starting at 2014-08-29 15:20:11
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20140829152005]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2014-08-29 15:20:12, elapsed: 00:00:01
Link inversion
LinkDb: starting at 2014-08-29 15:20:13
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: crawl/segments/20140829152005
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2014-08-29 15:20:15, elapsed: 00:00:02
Dedup on crawldb
Indexing 20140829152005 on SOLR index -> http://localhost:8983/solr/collection1
Indexer: starting at 2014-08-29 15:20:19
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
	solr.server.url : URL of the SOLR instance (mandatory)
	solr.commit.size : buffer size when sending to SOLR (default 1000)
	solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
	solr.auth : use authentication (default false)
	solr.auth.username : use authentication (default false)
	solr.auth : username for authentication
	solr.auth.password : password for authentication
Indexer: finished at 2014-08-29 15:20:20, elapsed: 00:00:01
Cleanup on SOLR index -> http://localhost:8983/solr/collection1
Fri Aug 29 15:20:22 EST 2014 : Iteration 2 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...

BAD - :-(

What I'd like Nutch to do is index any newly added documents, whatever level they were added at.

My Nutch command is as follows:

bin/crawl urls crawl http://localhost:8983/solr/collection1 4

My nutch-site.xml contains:

<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false only already existing URLs in the CrawlDb will be updated and no new URLs will be added.</description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.</description>
</property>
<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
  <description>Whether existing records in the CrawlDB will be overwritten by injected records.</description>
</property>
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule simply adds the original fetchInterval to the last fetch time, regardless of page changes.</description>
</property>
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>86400.0</value>
  <description>Minimum fetchInterval, in seconds.</description>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>1209600</value>
  <description>The default number of seconds between re-fetches of a page (14 days).</description>
</property>

Is what I am trying to do (recrawl any newly added documents at any level) impossible? Or, more likely, am I missing something in the config? Can anyone point me in the right direction?

Many thanks
P

