What is the output of the inject command, i.e., when you inject the 50,000 seeds just before generating the first segment?
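If a chunk of the 50,000 seeds were rejected at inject time, URL filters or normalizers would explain the gap Markus is probing for. A minimal sketch of what to look for in the injector output, assuming Nutch 1.x-style log lines (the messages, paths, and counts below are illustrative, not taken from this thread):

```shell
# The inject step itself would look something like (paths are assumptions):
#   bin/nutch inject crawl/crawldb urls/
# Fake an injector log with made-up counts to show the extraction:
cat > inject.log <<'EOF'
Injector: Total urls rejected by filters: 11000
Injector: Total urls injected after normalization and filtering: 39000
EOF
# A large "rejected by filters" count means conf/regex-urlfilter.txt (or a
# URL normalizer) is dropping seeds before they ever reach the CrawlDB:
rejected=$(awk '/rejected by filters/ {print $NF}' inject.log)
injected=$(awk '/injected after normalization/ {print $NF}' inject.log)
echo "rejected=$rejected injected=$injected"
```

If the rejected count is near zero, the missing seeds must be disappearing somewhere else (e.g. normalization collapsing duplicates).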
On Wed, Oct 30, 2019 at 3:18 PM Dave Beckstrom <dbeckst...@collectivefls.com> wrote:
> Hi Markus,
>
> Thank you so much for the reply and the help! The seed URL list is
> generated from a CMS. I'm doubtful that many of the URLs would be for
> redirects or missing pages, as the CMS only writes out the URLs for valid
> pages. It's got me stumped!
>
> Here is the result of the readdb. Not sure why the dates are wonky; the
> date on the server is correct. Solr shows 39148 pages.
>
> TOTAL urls: 39164
> shortest fetch interval: 30 days, 00:00:00
> avg fetch interval: 30 days, 00:07:10
> longest fetch interval: 45 days, 00:00:00
> earliest fetch time: Mon Nov 25 07:08:00 EST 2019
> avg of fetch times: Wed Nov 27 18:46:00 EST 2019
> latest fetch time: Sat Dec 14 08:18:00 EST 2019
> retry 0: 39164
> score quantile 0.01: 1.8460402498021722E-4
> score quantile 0.05: 1.8460402498021722E-4
> score quantile 0.1: 1.8460402498021722E-4
> score quantile 0.2: 1.8642803479451686E-4
> score quantile 0.25: 1.8642803479451686E-4
> score quantile 0.3: 1.960784284165129E-4
> score quantile 0.4: 1.9663813566079454E-4
> score quantile 0.5: 2.0251113164704293E-4
> score quantile 0.6: 2.037905069300905E-4
> score quantile 0.7: 2.1473052038345486E-4
> score quantile 0.75: 2.1473052038345486E-4
> score quantile 0.8: 2.172968233935535E-4
> score quantile 0.9: 2.429802336152917E-4
> score quantile 0.95: 2.4354603374376893E-4
> score quantile 0.99: 2.542474209925616E-4
> min score: 3.0443254217971116E-5
> avg score: 7.001118352666182E-4
> max score: 1.3120110034942627
> status 2 (db_fetched): 39150
> status 3 (db_gone): 13
> status 4 (db_redir_temp): 1
> CrawlDb statistics: done
>
> On Wed, Oct 30, 2019 at 4:01 PM Markus Jelsma <markus.jel...@openindex.io> wrote:
>
>> Hello Dave,
>>
>> First you should check the CrawlDB using readdb -stats. My bet is that
>> your set contains some redirects and gone (404), or transient errors.
>> The numbers for fetched and notModified, added up, should be about the
>> same as the number of documents indexed.
>>
>> Regards,
>> Markus
>>
>> -----Original message-----
>>> From: Dave Beckstrom <dbeckst...@collectivefls.com>
>>> Sent: Wednesday 30th October 2019 20:00
>>> To: user@nutch.apache.org
>>> Subject: Nutch not crawling all pages
>>>
>>> Hi Everyone,
>>>
>>> I googled and researched and I am not finding any solutions. I'm hoping
>>> someone here can help.
>>>
>>> I have txt files with about 50,000 seed URLs that are fed to Nutch for
>>> crawling and then indexing in Solr. However, it will not index more than
>>> about 39,000 pages no matter what I do. The robots.txt file gives Nutch
>>> access to the entire site.
>>>
>>> This is a snippet of the last Nutch run:
>>>
>>> Generator: starting at 2019-10-30 14:44:38
>>> Generator: Selecting best-scoring urls due for fetch.
>>> Generator: filtering: false
>>> Generator: normalizing: true
>>> Generator: topN: 80000
>>> Generator: 0 records selected for fetching, exiting ...
>>> Generate returned 1 (no new segments created)
>>> Escaping loop: no more URLs to fetch now
>>>
>>> I ran that crawl about 5 or 6 times. It seems to index about 6,000 pages
>>> per run. I planned to keep running it until it hit the 50,000+ page mark,
>>> which would indicate that all of the pages were indexed. That last run
>>> just ended without crawling anything more.
>>>
>>> Below are some of the potentially relevant config settings. I removed
>>> the "description" for brevity.
>>> <property>
>>>   <name>http.content.limit</name>
>>>   <value>-1</value>
>>> </property>
>>> <property>
>>>   <name>db.ignore.external.links</name>
>>>   <value>true</value>
>>> </property>
>>> <property>
>>>   <name>db.ignore.external.links.mode</name>
>>>   <value>byDomain</value>
>>> </property>
>>> <property>
>>>   <name>db.ignore.internal.links</name>
>>>   <value>false</value>
>>> </property>
>>> <property>
>>>   <name>db.update.additions.allowed</name>
>>>   <value>true</value>
>>> </property>
>>> <property>
>>>   <name>db.max.outlinks.per.page</name>
>>>   <value>-1</value>
>>> </property>
>>> <property>
>>>   <name>db.injector.overwrite</name>
>>>   <value>true</value>
>>> </property>
>>>
>>> Anyone have any suggestions? It's odd that when you give Nutch a specific
>>> list of URLs to be crawled, it wouldn't crawl all of them.
>>>
>>> I appreciate any help you can offer. Thank you!
>>>
>>> --
>>> *Fig Leaf Software is now Collective FLS, Inc.*
>>> *Collective FLS, Inc.*
>>> https://www.collectivefls.com/

--
Sent from a mobile device.
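For what it's worth, the readdb -stats figures quoted in the thread are internally consistent, which points the problem at the inject/generate side rather than indexing. A quick arithmetic check, using only the numbers from the stats output above (the 50,000 seed count is Dave's rough figure):

```shell
# Status counts from the readdb -stats output quoted above:
fetched=39150   # status 2 (db_fetched)
gone=13         # status 3 (db_gone)
redir=1         # status 4 (db_redir_temp)
total=39164     # TOTAL urls
solr=39148      # pages reported by Solr
# The CrawlDB is internally consistent:
echo $((fetched + gone + redir))   # prints 39164, matching TOTAL urls
# db_fetched is within a couple of pages of the Solr count:
echo $((fetched - solr))           # prints 2
# So roughly 11,000 of the ~50,000 seeds never entered the CrawlDB at all:
echo $((50000 - total))            # prints 10836
```

That last number is why the inject output matters: seeds missing from the CrawlDB were most likely filtered or normalized away before the first generate.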