Hi All

I have a web site serving a series of PDF documents, and I am using Nutch
1.8 to index them in Solr.  The base URL is http://localhost/ and the
documents are stored in a series of directories under
http://localhost/doccontrol/.  To start with, while I was experimenting,
this directory contained a single subdirectory
(http://localhost/doccontrol/DC-10 Incoming Correspondence) holding
approximately 2500 PDF documents.  Nutch successfully crawled and indexed
this directory and all the files contained in it.

I have now added two further directories to doccontrol
(http://localhost/doccontrol/DC-11 Outgoing Correspondence and
http://localhost/doccontrol/DC-16 MEETINGS MINUTES).  Each contains about
2500 PDF documents.

However, when I run Nutch again, no further documents are added to the
index, and Nutch gives the following output:

bin/crawl urls crawl http://localhost:8983/solr/collection1 4

Injector: starting at 2014-08-21 14:06:25
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: finished at 2014-08-21 14:06:28, elapsed: 00:00:02
Thu Aug 21 14:06:28 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-21 14:06:29
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
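
To see what the crawldb thinks about these URLs, I believe the readdb tool
can dump the per-URL status and fetch times (assuming the crawldb lives at
crawl/crawldb, as in the output above, and that http://localhost/ is the
seed URL):

```shell
# Summary of the crawldb: counts of URLs by status
# (db_fetched, db_unfetched, etc.)
bin/nutch readdb crawl/crawldb -stats

# Status, fetch time, and fetch interval for the seed URL specifically
bin/nutch readdb crawl/crawldb -url http://localhost/
```

If every URL shows db_fetched with a fetch time in the future, the
generator has nothing due, which would explain the "0 records selected"
message.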

I have the following in my nutch-site.xml:

 <property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
 </property>
 <property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.
  </description>
 </property>

I am not sure why Nutch is not adding the new URLs.  Is it because
http://localhost/doccontrol is not the "root" and will only be scanned
again in 30 days' time?
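
If the 30-day default re-fetch interval is indeed the cause, would lowering
it in nutch-site.xml help?  Something like the following (the value is in
seconds; 86400 = 1 day is just an example I picked for testing):

 <property>
  <name>db.fetch.interval.default</name>
  <value>86400</value>
  <description>The default number of seconds between re-fetches of a page
  (here 1 day instead of the 30-day default).
  </description>
 </property>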

I thought db.update.additions.allowed would fix this, but am I missing
something?
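
As a temporary workaround while testing, I gather the generator accepts an
-adddays option that treats URLs as if that many days had already passed,
which should make everything due for re-fetch (paths assumed to match the
crawl output above):

```shell
# Pretend 31 days have passed so all URLs become due for fetching
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -adddays 31
```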

Why are the new directories and files not being added?  Can anyone point
me in the right direction?

Many thanks

Paul
