Hello Dave,

First, you should check the CrawlDB using readdb -stats. My bet is that your seed 
set contains some redirects, gone pages (404s), or transient fetch errors. The 
fetched and notModified counts added together should be roughly equal to the 
number of documents indexed.
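A minimal sketch of that check, assuming your crawl directory is at crawl/ (adjust the CrawlDB path to match your setup):

```shell
# Print aggregate status counts for every URL known to the CrawlDB.
# Statuses such as db_gone, db_redir_temp, and db_redir_perm account for
# URLs that were generated and fetched but never sent to Solr.
bin/nutch readdb crawl/crawldb -stats
```

In the output, compare db_fetched plus db_notmodified against your Solr document count; db_gone and the db_redir_* statuses usually explain the gap.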

Regards,
Markus

-----Original message-----
> From:Dave Beckstrom <dbeckst...@collectivefls.com>
> Sent: Wednesday 30th October 2019 20:00
> To: user@nutch.apache.org
> Subject: Nutch not crawling all pages
> 
> Hi Everyone,
> 
> I googled and researched and I am not finding any solutions.  I'm hoping
> someone here can help.
> 
> I have txt files with about 50,000 seed urls that are fed to Nutch for
> crawling and then indexing in SOLR.  However, it will not index more than
> about 39,000 pages no matter what I do.   The robots.txt file gives Nutch
> access to the entire site.
> 
> This is a snippet of the last Nutch run:
> 
> Generator: starting at 2019-10-30 14:44:38
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: topN: 80000
> Generator: 0 records selected for fetching, exiting ...
> Generate returned 1 (no new segments created)
> Escaping loop: no more URLs to fetch now
> 
> I ran that crawl about 5 or 6 times.  It seems to index about 6,000 pages
> per run.  I planned to keep running it until it hit the 50,000+ page mark,
> which would indicate that all of the pages were indexed.  That last run it
> just ended without crawling anything more.
> 
> Below are some of the potentially relevant config settings.  I removed the
> "description" for brevity.
> 
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
> </property>
> <property>
>  <name>db.ignore.external.links</name>
>  <value>true</value>
> </property>
> <property>
>  <name>db.ignore.external.links.mode</name>
>  <value>byDomain</value>
> </property>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
> </property>
> <property>
>   <name>db.update.additions.allowed</name>
>   <value>true</value>
>  </property>
>  <property>
>  <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>  </property>
>  <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
>  </property>
> 
> Anyone have any suggestions?  It's odd that when you give Nutch a specific
> list of urls to be crawled, it wouldn't crawl all of them.
> 
> I appreciate any help you can offer.  Thank you!
> 
> -- 
> Fig Leaf Software is now Collective FLS, Inc.
> 
> Collective FLS, Inc.
> 
> https://www.collectivefls.com/