Hi All

I have a web site serving a series of PDF documents, and I am using Nutch
1.8 to index them in Solr.  The base URL is http://localhost/ and the
documents are stored in a series of directories under
http://localhost/doccontrol/.  To start with, while I was experimenting,
this directory contained a single subdirectory
(http://localhost/doccontrol/DC-10 Incoming Correspondence) holding
approximately 2500 PDF documents.  Nutch successfully crawled and indexed
this directory and all the files contained in it.

I have now added two further directories to doccontrol
(http://localhost/doccontrol/DC-11 Outgoing Correspondence and
http://localhost/doccontrol/DC-16 MEETINGS MINUTES).  Each contains about
2500 PDF documents.

However, when I run Nutch again, no further documents are added to the
index, and Nutch gives the following output:

bin/crawl urls crawl http://localhost:8983/solr/collection1 4

Injector: starting at 2014-08-21 14:06:25
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: finished at 2014-08-21 14:06:28, elapsed: 00:00:02
Thu Aug 21 14:06:28 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-21 14:06:29
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
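
To see what the crawldb thinks about these URLs, I believe the readdb tool
can dump the per-URL status and fetch times (assuming the crawldb lives at
crawl/crawldb, as in the output above, and that http://localhost/ is the
seed URL):

```shell
# Summary of the crawldb: counts of URLs by status
# (db_fetched, db_unfetched, etc.)
bin/nutch readdb crawl/crawldb -stats

# Status, fetch time, and fetch interval for the seed URL specifically
bin/nutch readdb crawl/crawldb -url http://localhost/
```

If every URL shows db_fetched with a fetch time in the future, the
generator has nothing due, which would explain the "0 records selected"
message.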

I have the following in my nutch-site.xml:

 <property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
 </property>
 <property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.
  </description>
 </property>

I am not sure why Nutch is not adding the new URLs.  Is it because
http://localhost/doccontrol is not the "root" and will only be scanned
again in 30 days' time?
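
If the 30-day default re-fetch interval is indeed the cause, would lowering
it in nutch-site.xml help?  Something like the following (the value is in
seconds; 86400 = 1 day is just an example I picked for testing):

 <property>
  <name>db.fetch.interval.default</name>
  <value>86400</value>
  <description>The default number of seconds between re-fetches of a page
  (here 1 day instead of the 30-day default).
  </description>
 </property>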

I thought db.update.additions.allowed would fix this, but am I missing
something?
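
As a temporary workaround while testing, I gather the generator accepts an
-adddays option that treats URLs as if that many days had already passed,
which should make everything due for re-fetch (paths assumed to match the
crawl output above):

```shell
# Pretend 31 days have passed so all URLs become due for fetching
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -adddays 31
```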

Why are the new directories and files not being added?  Can anyone point
me in the right direction?

Many thanks

Paul
