Hi All

I'm having a problem with Nutch not crawling all the documents in a
directory.

The directory in question can be found at:

http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence(IAE-GUPC)/

There are 2460 documents (PDFs) in the directory.  Nutch enters the
directory, indexes the first 100 or so documents, and then completes its
crawl.  The command issued is:

HOST=localhost
PORT=8983
CORE=collection1
cd /opt/nutch
bin/crawl urls crawl http://localhost:8983/solr/collection1 4

Any attempt to recrawl the directory gives the following output:

Injector: starting at 2014-08-18 14:58:26
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: finished at 2014-08-18 14:58:29, elapsed: 00:00:02
Mon Aug 18 14:58:29 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-18 14:58:29
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...

I have the following in conf/nutch-site.xml:

 <property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
 </property>
 <property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
 </property>
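
One property not set above that may be worth checking is
db.max.outlinks.per.page. As far as I know its default in nutch-default.xml
is 100, which would match the roughly 100 documents picked up from the
directory listing, but treat that link as an assumption rather than a
confirmed cause. A minimal sketch of an override for conf/nutch-site.xml:

```xml
 <!-- Assumption: the per-page outlink cap is what limits the crawl. -->
 <property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>Maximum number of outlinks processed per page. A negative
  value means all outlinks are processed.
  </description>
 </property>
```

The directory page would need to be fetched and parsed again (for example
in a fresh crawl directory) before such a change could have any effect.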

I think this must be a config issue but am unsure where to look next.

Can anyone point me in the right direction?

Thanks

P
