Hi Paul,

the documents in a directory listing are first seen only as outlinks of that
page, and there is a limit on the maximum number of outlinks processed per page.
You may guess: the default is 100 :)
Increase it, or even set it to -1 (no limit), see below.

Cheers,
Sebastian


<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
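For example, the override in your conf/nutch-site.xml could look like this
(-1 removes the limit entirely; the description text here is just my own note):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>No limit: process all outlinks found on a page.
  Needed here because the directory listing contains ~2460 links.
  </description>
</property>

Note that the outlinks are extracted at parse time, so the directory listing
page will likely need to be refetched before the additional links appear in
the CrawlDb.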


On 08/18/2014 10:03 PM, Paul Rogers wrote:
> Hi All
> 
> I'm having problems with Nutch not crawling all the documents in a
> directory:
> 
> The directory in question can be found at:
> 
> http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence(IAE-GUPC)/
> 
> There are 2460 documents (pdf's) in the directory.  Nutch enters the
> directory and indexes the first 100 or so documents and then completes its
> crawl.  The command issued is:
> 
> HOST=localhost
> PORT=8983
> CORE=collection1
> cd /opt/nutch
> bin/crawl urls crawl http://localhost:8983/solr/collection1 4
> 
> Any attempt to recrawl the directory gives the following output:
> 
> Injector: starting at 2014-08-18 14:58:26
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering: 1
> Injector: Merging injected urls into crawl db.
> Injector: overwrite: false
> Injector: update: false
> Injector: finished at 2014-08-18 14:58:29, elapsed: 00:00:02
> Mon Aug 18 14:58:29 EST 2014 : Iteration 1 of 4
> Generating a new segment
> Generator: starting at 2014-08-18 14:58:29
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: topN: 50000
> Generator: 0 records selected for fetching, exiting ...
> 
> I have the following in conf/nutch-site.xml
> 
>  <property>
>   <name>db.update.additions.allowed</name>
>   <value>true</value>
>   <description>If true, updatedb will add newly discovered URLs, if false
>   only already existing URLs in the CrawlDb will be updated and no new
>   URLs will be added.
>   </description>
>  </property>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
>  </property>
> 
> I think this must be a config issue but am unsure where to look next.
> 
> Can anyone point me in the right direction?
> 
> Thanks
> 
> P
> 
