Hey Sebastian

Thank you so much!! You're a star.
P

On 19 August 2014 12:39, Sebastian Nagel <[email protected]> wrote:

> Hi Paul,
>
> documents in a directory are first just links.
> There is a limit on the max. number of links per page.
> You may guess: the default is 100 :)
> Increase it, or even set it to -1, see below.
>
> Cheers,
> Sebastian
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>100</value>
>   <description>The maximum number of outlinks that we'll process for a page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>   will be processed for a page; otherwise, all outlinks will be processed.
>   </description>
> </property>
>
> On 08/18/2014 10:03 PM, Paul Rogers wrote:
> > Hi All
> >
> > I'm having problems with Nutch not crawling all the documents in a
> > directory.
> >
> > The directory in question can be found at:
> >
> > http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence(IAE-GUPC)/
> >
> > There are 2460 documents (PDFs) in the directory. Nutch enters the
> > directory, indexes the first 100 or so documents, and then completes its
> > crawl. The command issued is:
> >
> > HOST=localhost
> > PORT=8983
> > CORE=collection1
> > cd /opt/nutch
> > bin/crawl urls crawl http://localhost:8983/solr/collection1 4
> >
> > Any attempt to recrawl the directory gives the following output:
> >
> > Injector: starting at 2014-08-18 14:58:26
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: total number of urls rejected by filters: 0
> > Injector: total number of urls injected after normalization and filtering: 1
> > Injector: Merging injected urls into crawl db.
> > Injector: overwrite: false
> > Injector: update: false
> > Injector: finished at 2014-08-18 14:58:29, elapsed: 00:00:02
> > Mon Aug 18 14:58:29 EST 2014 : Iteration 1 of 4
> > Generating a new segment
> > Generator: starting at 2014-08-18 14:58:29
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: false
> > Generator: normalizing: true
> > Generator: topN: 50000
> > Generator: 0 records selected for fetching, exiting ...
> >
> > I have the following in conf/nutch-site.xml:
> >
> > <property>
> >   <name>db.update.additions.allowed</name>
> >   <value>true</value>
> >   <description>If true, updatedb will add newly discovered URLs, if false
> >   only already existing URLs in the CrawlDb will be updated and no new
> >   URLs will be added.
> >   </description>
> > </property>
> > <property>
> >   <name>http.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content using the http://
> >   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >   than it will be truncated; otherwise, no truncation at all. Do not
> >   confuse this setting with the file.content.limit setting.
> >   </description>
> > </property>
> >
> > I think this must be a config issue but am unsure where to look next.
> >
> > Can anyone point me in the right direction?
> >
> > Thanks
> >
> > P
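For anyone finding this thread later: the override Sebastian suggests goes in conf/nutch-site.xml. One way to write it is sketched below; the -1 value is just one option (per the property's description it removes the per-page cap entirely), and a fixed higher value such as 5000 would also work. Note that the directory page itself has to be re-fetched before the extra outlinks are discovered, so a recrawl that reports "0 records selected for fetching" will need the fetch interval to elapse or a fresh crawl directory.

<!-- conf/nutch-site.xml override (sketch): remove the outlink cap.
     -1 means all outlinks on a page are processed; a nonnegative
     value caps them at that number. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>Process all outlinks found on a page (no limit).
  </description>
</property>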

