What is the output of the inject command, i.e., when you inject the 50,000 seeds just before generating the first segment?
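If a chunk of the 50,000 seeds were rejected at inject time, URL filters or normalizers would explain the gap Markus is probing for. A minimal sketch of what to look for in the injector output, assuming Nutch 1.x-style log lines (the messages, paths, and counts below are illustrative, not taken from this thread):

```shell
# The inject step itself would look something like (paths are assumptions):
#   bin/nutch inject crawl/crawldb urls/
# Fake an injector log with made-up counts to show the extraction:
cat > inject.log <<'EOF'
Injector: Total urls rejected by filters: 11000
Injector: Total urls injected after normalization and filtering: 39000
EOF
# A large "rejected by filters" count means conf/regex-urlfilter.txt (or a
# URL normalizer) is dropping seeds before they ever reach the CrawlDB:
rejected=$(awk '/rejected by filters/ {print $NF}' inject.log)
injected=$(awk '/injected after normalization/ {print $NF}' inject.log)
echo "rejected=$rejected injected=$injected"
```

If the rejected count is near zero, the missing seeds must be disappearing somewhere else (e.g. normalization collapsing duplicates).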
On Wed, Oct 30, 2019 at 3:18 PM Dave Beckstrom <dbeckst...@collectivefls.com> wrote:
> Hi Markus,
>
> Thank you so much for the reply and the help! The seed URL list is
> generated from a CMS. I'm doubtful that many of the URLs would be for
> redirects or missing pages, as the CMS only writes out the URLs for valid
> pages. It's got me stumped!
>
> Here is the result of the readdb. Not sure why the dates are wonky; the
> date on the server is correct. Solr shows 39148 pages.
>
> TOTAL urls: 39164
> shortest fetch interval: 30 days, 00:00:00
> avg fetch interval: 30 days, 00:07:10
> longest fetch interval: 45 days, 00:00:00
> earliest fetch time: Mon Nov 25 07:08:00 EST 2019
> avg of fetch times: Wed Nov 27 18:46:00 EST 2019
> latest fetch time: Sat Dec 14 08:18:00 EST 2019
> retry 0: 39164
> score quantile 0.01: 1.8460402498021722E-4
> score quantile 0.05: 1.8460402498021722E-4
> score quantile 0.1: 1.8460402498021722E-4
> score quantile 0.2: 1.8642803479451686E-4
> score quantile 0.25: 1.8642803479451686E-4
> score quantile 0.3: 1.960784284165129E-4
> score quantile 0.4: 1.9663813566079454E-4
> score quantile 0.5: 2.0251113164704293E-4
> score quantile 0.6: 2.037905069300905E-4
> score quantile 0.7: 2.1473052038345486E-4
> score quantile 0.75: 2.1473052038345486E-4
> score quantile 0.8: 2.172968233935535E-4
> score quantile 0.9: 2.429802336152917E-4
> score quantile 0.95: 2.4354603374376893E-4
> score quantile 0.99: 2.542474209925616E-4
> min score: 3.0443254217971116E-5
> avg score: 7.001118352666182E-4
> max score: 1.3120110034942627
> status 2 (db_fetched): 39150
> status 3 (db_gone): 13
> status 4 (db_redir_temp): 1
> CrawlDb statistics: done
>
> On Wed, Oct 30, 2019 at 4:01 PM Markus Jelsma <markus.jel...@openindex.io> wrote:
>
>> Hello Dave,
>>
>> First you should check the CrawlDB using readdb -stats. My bet is that
>> your set contains some redirects and gone (404), or transient errors.
>> The numbers for fetched and notModified, added up, should be about the
>> same as the number of documents indexed.
>>
>> Regards,
>> Markus
>>
>> -----Original message-----
>>> From: Dave Beckstrom <dbeckst...@collectivefls.com>
>>> Sent: Wednesday 30th October 2019 20:00
>>> To: user@nutch.apache.org
>>> Subject: Nutch not crawling all pages
>>>
>>> Hi Everyone,
>>>
>>> I googled and researched and I am not finding any solutions. I'm hoping
>>> someone here can help.
>>>
>>> I have txt files with about 50,000 seed URLs that are fed to Nutch for
>>> crawling and then indexing in Solr. However, it will not index more than
>>> about 39,000 pages no matter what I do. The robots.txt file gives Nutch
>>> access to the entire site.
>>>
>>> This is a snippet of the last Nutch run:
>>>
>>> Generator: starting at 2019-10-30 14:44:38
>>> Generator: Selecting best-scoring urls due for fetch.
>>> Generator: filtering: false
>>> Generator: normalizing: true
>>> Generator: topN: 80000
>>> Generator: 0 records selected for fetching, exiting ...
>>> Generate returned 1 (no new segments created)
>>> Escaping loop: no more URLs to fetch now
>>>
>>> I ran that crawl about 5 or 6 times. It seems to index about 6,000 pages
>>> per run. I planned to keep running it until it hit the 50,000+ page mark,
>>> which would indicate that all of the pages were indexed. That last run
>>> just ended without crawling anything more.
>>>
>>> Below are some of the potentially relevant config settings. I removed
>>> the "description" for brevity.
>>> <property>
>>>   <name>http.content.limit</name>
>>>   <value>-1</value>
>>> </property>
>>> <property>
>>>   <name>db.ignore.external.links</name>
>>>   <value>true</value>
>>> </property>
>>> <property>
>>>   <name>db.ignore.external.links.mode</name>
>>>   <value>byDomain</value>
>>> </property>
>>> <property>
>>>   <name>db.ignore.internal.links</name>
>>>   <value>false</value>
>>> </property>
>>> <property>
>>>   <name>db.update.additions.allowed</name>
>>>   <value>true</value>
>>> </property>
>>> <property>
>>>   <name>db.max.outlinks.per.page</name>
>>>   <value>-1</value>
>>> </property>
>>> <property>
>>>   <name>db.injector.overwrite</name>
>>>   <value>true</value>
>>> </property>
>>>
>>> Anyone have any suggestions? It's odd that when you give Nutch a specific
>>> list of URLs to be crawled, it wouldn't crawl all of them.
>>>
>>> I appreciate any help you can offer. Thank you!
>>>
>>> --
>>> *Fig Leaf Software is now Collective FLS, Inc.*
>>> *Collective FLS, Inc.*
>>> https://www.collectivefls.com/

--
Sent from a mobile device.
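For what it's worth, the readdb -stats figures quoted in the thread are internally consistent, which points the problem at the inject/generate side rather than indexing. A quick arithmetic check, using only the numbers from the stats output above (the 50,000 seed count is Dave's rough figure):

```shell
# Status counts from the readdb -stats output quoted above:
fetched=39150   # status 2 (db_fetched)
gone=13         # status 3 (db_gone)
redir=1         # status 4 (db_redir_temp)
total=39164     # TOTAL urls
solr=39148      # pages reported by Solr
# The CrawlDB is internally consistent:
echo $((fetched + gone + redir))   # prints 39164, matching TOTAL urls
# db_fetched is within a couple of pages of the Solr count:
echo $((fetched - solr))           # prints 2
# So roughly 11,000 of the ~50,000 seeds never entered the CrawlDB at all:
echo $((50000 - total))            # prints 10836
```

That last number is why the inject output matters: seeds missing from the CrawlDB were most likely filtered or normalized away before the first generate.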