Hey Sebastian

Thank you so much!! You're a star.
P

On 19 August 2014 12:39, Sebastian Nagel <[email protected]> wrote:

> Hi Paul,
>
> documents in a directory are first just links.
> There is a limit on the max. number of links per page.
> You may guess: the default is 100 :)
> Increase it, or even set it to -1, see below.
>
> Cheers,
> Sebastian
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>100</value>
>   <description>The maximum number of outlinks that we'll process for a page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>   will be processed for a page; otherwise, all outlinks will be processed.
>   </description>
> </property>
>
> On 08/18/2014 10:03 PM, Paul Rogers wrote:
> > Hi All
> >
> > I'm having problems with Nutch not crawling all the documents in a
> > directory.
> >
> > The directory in question can be found at:
> >
> > http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence(IAE-GUPC)/
> >
> > There are 2460 documents (PDFs) in the directory. Nutch enters the
> > directory, indexes the first 100 or so documents, and then completes its
> > crawl. The command issued is:
> >
> > HOST=localhost
> > PORT=8983
> > CORE=collection1
> > cd /opt/nutch
> > bin/crawl urls crawl http://localhost:8983/solr/collection1 4
> >
> > Any attempt to recrawl the directory gives the following output:
> >
> > Injector: starting at 2014-08-18 14:58:26
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: total number of urls rejected by filters: 0
> > Injector: total number of urls injected after normalization and filtering: 1
> > Injector: Merging injected urls into crawl db.
> > Injector: overwrite: false
> > Injector: update: false
> > Injector: finished at 2014-08-18 14:58:29, elapsed: 00:00:02
> > Mon Aug 18 14:58:29 EST 2014 : Iteration 1 of 4
> > Generating a new segment
> > Generator: starting at 2014-08-18 14:58:29
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: false
> > Generator: normalizing: true
> > Generator: topN: 50000
> > Generator: 0 records selected for fetching, exiting ...
> >
> > I have the following in conf/nutch-site.xml:
> >
> > <property>
> >   <name>db.update.additions.allowed</name>
> >   <value>true</value>
> >   <description>If true, updatedb will add newly discovered URLs, if false
> >   only already existing URLs in the CrawlDb will be updated and no new
> >   URLs will be added.
> >   </description>
> > </property>
> > <property>
> >   <name>http.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content using the http://
> >   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >   than it will be truncated; otherwise, no truncation at all. Do not
> >   confuse this setting with the file.content.limit setting.
> >   </description>
> > </property>
> >
> > I think this must be a config issue but am unsure where to look next.
> >
> > Can anyone point me in the right direction?
> >
> > Thanks
> >
> > P
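For anyone finding this thread later: the override Sebastian suggests goes in conf/nutch-site.xml. One way to write it is sketched below; the -1 value is just one option (per the property's description it removes the per-page cap entirely), and a fixed higher value such as 5000 would also work. Note that the directory page itself has to be re-fetched before the extra outlinks are discovered, so a recrawl that reports "0 records selected for fetching" will need the fetch interval to elapse or a fresh crawl directory.

<!-- conf/nutch-site.xml override (sketch): remove the outlink cap.
     -1 means all outlinks on a page are processed; a nonnegative
     value caps them at that number. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>Process all outlinks found on a page (no limit).
  </description>
</property>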

