Hi Paul,

Try expanding your last parameter (which is the # of crawling rounds).
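For example, taking the bin/crawl invocation you posted further down and only bumping the last argument (the value 10 here is purely illustrative; pick whatever discovery depth you need):

  bin/crawl urls crawl http://localhost:8983/solr/collection1 10

Each round generates, fetches and parses roughly one more "hop" of newly discovered links, so with too few rounds the URLs sitting deeper in the directory tree may never be reached.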
Also make sure to check these properties:

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored. This is an effective way to limit the
  size of the link database, keeping only the highest quality links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

The first can be set to false so that Nutch actually processes links from the same host, and the second to true so that Nutch ignores external links (if necessary). Also check your max outlinks per page property.

HTH,
Chris

________________________________________
From: Paul Rogers [[email protected]]
Sent: Monday, September 08, 2014 2:09 PM
To: [email protected]
Subject: Nutch not crawling deep enough into directory structure

Hi Guys

Reposting this since I think it got lost in the tail end of the last post.

I have a web site serving a series of documents (pdf's) and am using Nutch 1.8 to index them in solr. The base url is http://localhost/ and the documents are stored in a series of directories in the directory http://localhost/doccontrol/, e.g.

/
|_doccontrol
  |_DC-10 Incoming Correspondence
  |_DC-11 Outgoing Correspondence

If, when I first run nutch, the folders DC-10 and DC-11 contain all the files to be indexed, then nutch crawls everything without a problem - GOOD :-)

If I add a new folder or documents to the root or doc control folder, then the next time nutch runs it crawls all the new files and indexes them - GOOD :-)

However, any new files that are added to the DC-10 or DC-11 directories are not indexed, with nutch's output as follows (summarised):

Injector: starting at 2014-08-29 15:19:59
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: true
Injector: update: false
Injector: finished at 2014-08-29 15:20:02, elapsed: 00:00:02
Fri Aug 29 15:20:02 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:02
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20140829152005
Generator: finished at 2014-08-29 15:20:06, elapsed: 00:00:03
Operating on segment : 20140829152005
Fetching : 20140829152005
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2014-08-29 15:20:06
Fetcher: segment: crawl/segments/20140829152005
Fetcher Timelimit set for : 1409354406733
Using queue mode : byHost
Fetcher: threads: 50
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
fetching http://ws0895/doccontrol/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
. . .
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2014-08-29 15:20:09, elapsed: 00:00:02
Parsing : 20140829152005
ParseSegment: starting at 2014-08-29 15:20:09
ParseSegment: segment: crawl/segments/20140829152005
Parsed (3ms):http://ws0895/doccontrol/
ParseSegment: finished at 2014-08-29 15:20:10, elapsed: 00:00:01
CrawlDB update
CrawlDb update: starting at 2014-08-29 15:20:11
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20140829152005]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2014-08-29 15:20:12, elapsed: 00:00:01
Link inversion
LinkDb: starting at 2014-08-29 15:20:13
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: crawl/segments/20140829152005
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2014-08-29 15:20:15, elapsed: 00:00:02
Dedup on crawldb
Indexing 20140829152005 on SOLR index -> http://localhost:8983/solr/collection1
Indexer: starting at 2014-08-29 15:20:19
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication
Indexer: finished at 2014-08-29 15:20:20, elapsed: 00:00:01
Cleanup on SOLR index -> http://localhost:8983/solr/collection1
Fri Aug 29 15:20:22 EST 2014 : Iteration 2 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...

BAD - :-(

What I'd like nutch to do is to index any newly added docs whatever level they were added at.
My nutch command is as follows:

bin/crawl urls crawl http://localhost:8983/solr/collection1 4

My nutch-site.xml contains:

<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.
  </description>
</property>

<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
  <description>Whether existing records in the CrawlDB will be overwritten
  by injected records.
  </description>
</property>

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule
  simply adds the original fetchInterval to the last fetch time, regardless
  of page changes.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>86400.0</value>
  <description>Minimum fetchInterval, in seconds.</description>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>1209600</value>
  <description>The default number of seconds between re-fetches of a page
  (14 days).
  </description>
</property>

Is what I am trying to do (recrawl any newly added documents at any level) impossible? Or (more likely) am I missing something in the config? Can anyone point me in the right direction?

Many thanks

P

