Hi Paul,

documents in a directory listing are discovered as outlinks first, and there is a limit on the maximum number of outlinks processed per page. You may guess: the default is 100 :) Increase it, or even set it to -1 to process all outlinks, see below.
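For example, an override along these lines in conf/nutch-site.xml should lift the limit (just a sketch; the default property definition is quoted below for reference):

<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description>Negative value: process all outlinks found on a page instead of
stopping after the first 100.
</description>
</property>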
Cheers,
Sebastian

<property>
<name>db.max.outlinks.per.page</name>
<value>100</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>

On 08/18/2014 10:03 PM, Paul Rogers wrote:
> Hi All
>
> I'm having problems with Nutch not crawling all the documents in a
> directory:
>
> The directory in question can be found at:
>
> http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence(IAE-GUPC)/
>
> There are 2460 documents (pdf's) in the directory. Nutch enters the
> directory and indexes the first 100 or so documents and then completes its
> crawl. The command issued is:
>
> HOST=localhost
> PORT=8983
> CORE=collection1
> cd /opt/nutch
> bin/crawl urls crawl http://localhost:8983/solr/collection1 4
>
> Any attempt to recrawl the directory gives the following output:
>
> Injector: starting at 2014-08-18 14:58:26
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering: 1
> Injector: Merging injected urls into crawl db.
> Injector: overwrite: false
> Injector: update: false
> Injector: finished at 2014-08-18 14:58:29, elapsed: 00:00:02
> Mon Aug 18 14:58:29 EST 2014 : Iteration 1 of 4
> Generating a new segment
> Generator: starting at 2014-08-18 14:58:29
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: topN: 50000
> Generator: 0 records selected for fetching, exiting ...
>
> I have the following in conf/nutch-site.xml
>
> <property>
> <name>db.update.additions.allowed</name>
> <value>true</value>
> <description>If true, updatedb will add newly discovered URLs, if false
> only already existing URLs in the CrawlDb will be updated and no new
> URLs will be added.
> </description>
> </property>
> <property>
> <name>http.content.limit</name>
> <value>-1</value>
> <description>The length limit for downloaded content using the http://
> protocol, in bytes. If this value is nonnegative (>=0), content longer
> than it will be truncated; otherwise, no truncation at all. Do not
> confuse this setting with the file.content.limit setting.
> </description>
> </property>
>
> I think this must be a config issue but am unsure where to look next.
>
> Can anyone point me in the right direction?
>
> Thanks
>
> P
>

