Re: Nutch not crawling all pages

2019-10-30 Thread Dave Beckstrom
You guys were right.  We have one seed URL file which lists the urls to 10
pages.  Each of those 10 pages has roughly 5,000 urls to be crawled.

The links to 3 out of the 10 pages were wrong (missing), which accounts
for the roughly 15,000+ urls that were missing.  I didn't catch it because
there are multiple servers involved: everything was correct on the
server I was working on, but on one of the other servers the links were wrong.

I didn't set that part up, so that was a mistake someone made before my
time.  But you guys clued me in to it.
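
For anyone hitting something similar: a rough sanity check like the one below,
run on each server, would have caught the mismatch sooner (the seed file path
is only an example, not our actual layout). It fetches each listing page named
in the seed file and reports how many links that page actually exposes:

while read page; do
  echo -n "$page: "
  # count href attributes on the listing page as a crude link count
  curl -s "$page" | grep -o 'href=' | wc -l
done < urls/seed.txt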

Thank you!




On Wed, Oct 30, 2019 at 6:11 PM Markus Jelsma 
wrote:

> Hello,
>
> The CrawlDB does not lie, but you are two pages short of being indexed.
> That can happen for various reasons and is hard to debug. But
> Bruno's point is valid. If you inject 50k but end up with 39k in the DB,
> it means some URLs were filtered or multiple URLs were normalized to the
> same URL.
>
> My experience with websites that generate only valid URLs is that this
> assumption is almost never true. In our case, out of thousands of sites,
> maybe only a few of those with just a dozen URLs are free from errors, e.g.
> ambiguous URLs, redirects, 404s or otherwise bogus entries.
>
> Markus
>
>
> -Original message-
> > From:Bruno Osiek 
> > Sent: Wednesday 30th October 2019 23:51
> > To: user@nutch.apache.org
> > Subject: Re: Nutch not crawling all pages
> >
> > What is the output of the inject command, i.e., when you inject the 5
> > seeds just before generating the first segment?
> >
> > On Wed, Oct 30, 2019 at 3:18 PM Dave Beckstrom <dbeckst...@collectivefls.com>
> > wrote:
> >
> > > Hi Markus,
> > >
> > > Thank you so much for the reply and the help!  The seed URL list is
> > > generated from a CMS.  I'm doubtful that many of the urls would be for
> > > redirects or missing pages as the CMS only writes out the urls for valid
> > > pages.  It's got me stumped!
> > >
> > > Here is the result of the readdb.  Not sure why the dates are wonky.  The
> > > date on the server is correct.  SOLR shows 39148 pages.
> > >
> > > TOTAL urls: 39164
> > > shortest fetch interval:30 days, 00:00:00
> > > avg fetch interval: 30 days, 00:07:10
> > > longest fetch interval: 45 days, 00:00:00
> > > earliest fetch time:Mon Nov 25 07:08:00 EST 2019
> > > avg of fetch times: Wed Nov 27 18:46:00 EST 2019
> > > latest fetch time:  Sat Dec 14 08:18:00 EST 2019
> > > retry 0:39164
> > > score quantile 0.01:1.8460402498021722E-4
> > > score quantile 0.05:1.8460402498021722E-4
> > > score quantile 0.1: 1.8460402498021722E-4
> > > score quantile 0.2: 1.8642803479451686E-4
> > > score quantile 0.25:1.8642803479451686E-4
> > > score quantile 0.3: 1.960784284165129E-4
> > > score quantile 0.4: 1.9663813566079454E-4
> > > score quantile 0.5: 2.0251113164704293E-4
> > > score quantile 0.6: 2.037905069300905E-4
> > > score quantile 0.7: 2.1473052038345486E-4
> > > score quantile 0.75:2.1473052038345486E-4
> > > score quantile 0.8: 2.172968233935535E-4
> > > score quantile 0.9: 2.429802336152917E-4
> > > score quantile 0.95:2.4354603374376893E-4
> > > score quantile 0.99:2.542474209925616E-4
> > > min score:  3.0443254217971116E-5
> > > avg score:  7.001118352666182E-4
> > > max score:  1.3120110034942627
> > > status 2 (db_fetched):  39150
> > > status 3 (db_gone): 13
> > > status 4 (db_redir_temp):   1
> > > CrawlDb statistics: done
> > >
> > >
> > >
> > > On Wed, Oct 30, 2019 at 4:01 PM Markus Jelsma <markus.jel...@openindex.io>
> > > wrote:
> > >
> > > > Hello Dave,
> > > >
> > > > First you should check the CrawlDB using readdb -stats. My bet is that
> > > > your set contains some redirects, gone (404) pages, or transient errors.
> > > > The numbers for fetched and notModified, added together, should be about
> > > > the same as the number of documents indexed.
> > > >
> > > > Regards,
> > > > Markus
> > > >
> > > >
> > > >
> > > > -Original message-
> > > > > From:Dave Beckstrom 
> > > > > Sent: Wednesday 30th October 2019 20:00
> > > > > To: user@nutch.apache.org
> > > > > Subject: Nutch not crawling all pages

RE: Nutch not crawling all pages

2019-10-30 Thread Markus Jelsma
Hello,

The CrawlDB does not lie, but you are two pages short of being indexed. That
can happen for various reasons and is hard to debug. But Bruno's
point is valid. If you inject 50k but end up with 39k in the DB, it means
some URLs were filtered or multiple URLs were normalized to the same URL.
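
If you want to see where that gap comes from, a quick sketch (the paths are
placeholders, and this assumes a recent 1.x release where the checker tools
accept -stdin):

# run a sample of the seeds through the configured URL filters; rejected
# URLs should show up with a leading '-'
head -100 urls/seed.txt | bin/nutch filterchecker -stdin

# run the same sample through the URL normalizers to spot URLs that
# collapse to the same normalized form
head -100 urls/seed.txt | bin/nutch normalizerchecker -stdin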

My experience with websites that generate only valid URLs is that this assumption
is almost never true. In our case, out of thousands of sites, maybe only a
few of those with just a dozen URLs are free from errors, e.g.
ambiguous URLs, redirects, 404s or otherwise bogus entries.

Markus 
 
 
-Original message-
> From:Bruno Osiek 
> Sent: Wednesday 30th October 2019 23:51
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all pages
> 
> What is the output of the inject command, i.e., when you inject the 5
> seeds just before generating the first segment?
> 
> On Wed, Oct 30, 2019 at 3:18 PM Dave Beckstrom 
> wrote:
> 
> > Hi Markus,
> >
> > Thank you so much for the reply and the help!  The seed URL list is
> > generated from a CMS.  I'm doubtful that many of the urls would be for
> > redirects or missing pages as the CMS only writes out the urls for valid
> > pages.  It's got me stumped!
> >
> > Here is the result of the readdb.  Not sure why the dates are wonky.  The
> > date on the server is correct.  SOLR shows 39148 pages.
> >
> > TOTAL urls: 39164
> > shortest fetch interval:30 days, 00:00:00
> > avg fetch interval: 30 days, 00:07:10
> > longest fetch interval: 45 days, 00:00:00
> > earliest fetch time:Mon Nov 25 07:08:00 EST 2019
> > avg of fetch times: Wed Nov 27 18:46:00 EST 2019
> > latest fetch time:  Sat Dec 14 08:18:00 EST 2019
> > retry 0:39164
> > score quantile 0.01:1.8460402498021722E-4
> > score quantile 0.05:1.8460402498021722E-4
> > score quantile 0.1: 1.8460402498021722E-4
> > score quantile 0.2: 1.8642803479451686E-4
> > score quantile 0.25:1.8642803479451686E-4
> > score quantile 0.3: 1.960784284165129E-4
> > score quantile 0.4: 1.9663813566079454E-4
> > score quantile 0.5: 2.0251113164704293E-4
> > score quantile 0.6: 2.037905069300905E-4
> > score quantile 0.7: 2.1473052038345486E-4
> > score quantile 0.75:2.1473052038345486E-4
> > score quantile 0.8: 2.172968233935535E-4
> > score quantile 0.9: 2.429802336152917E-4
> > score quantile 0.95:2.4354603374376893E-4
> > score quantile 0.99:2.542474209925616E-4
> > min score:  3.0443254217971116E-5
> > avg score:  7.001118352666182E-4
> > max score:  1.3120110034942627
> > status 2 (db_fetched):  39150
> > status 3 (db_gone): 13
> > status 4 (db_redir_temp):   1
> > CrawlDb statistics: done
> >
> >
> >
> > On Wed, Oct 30, 2019 at 4:01 PM Markus Jelsma 
> > wrote:
> >
> > > Hello Dave,
> > >
> > > First you should check the CrawlDB using readdb -stats. My bet is that
> > > your set contains some redirects, gone (404) pages, or transient errors.
> > > The numbers for fetched and notModified, added together, should be about
> > > the same as the number of documents indexed.
> > >
> > > Regards,
> > > Markus
> > >
> > >
> > >
> > > -Original message-
> > > > From:Dave Beckstrom 
> > > > Sent: Wednesday 30th October 2019 20:00
> > > > To: user@nutch.apache.org
> > > > Subject: Nutch not crawling all pages
> > > >
> > > > Hi Everyone,
> > > >
> > > > I googled and researched and I am not finding any solutions.  I'm hoping
> > > > someone here can help.
> > > >
> > > > I have txt files with about 50,000 seed urls that are fed to Nutch for
> > > > crawling and then indexing in SOLR.  However, it will not index more than
> > > > about 39,000 pages no matter what I do.  The robots.txt file gives Nutch
> > > > access to the entire site.
> > > >
> > > > This is a snippet of the last Nutch run:
> > > >
> > > > Generator: starting at 2019-10-30 14:44:38
> > > > Generator: Selecting best-scoring urls due for fetch.
> > > > Generator: filtering: false
> > > > Generator: normalizing: true
> > > > Generator: topN: 8
> > > > Generator: 0 records selected for fetching, exiting ...
> > > > Generate returned 1 (no new segments created)
> > > > Escaping loop: no more URLs to fetch now

Re: Nutch not crawling all pages

2019-10-30 Thread Bruno Osiek
What is the output of the inject command, i.e., when you inject the 5
seeds just before generating the first segment?
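
On a recent Nutch 1.x the Injector itself reports counters that usually answer
this; roughly (the crawldb and seed paths are placeholders, and the numbers
below are made up):

bin/nutch inject crawl/crawldb urls/
# the end of the Injector log should show totals along these lines:
#   Injector: Total urls rejected by filters: 123
#   Injector: Total urls injected after normalization and filtering: 49877
#   Injector: Total new urls injected: 49877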

On Wed, Oct 30, 2019 at 3:18 PM Dave Beckstrom 
wrote:

> Hi Markus,
>
> Thank you so much for the reply and the help!  The seed URL list is
> generated from a CMS.  I'm doubtful that many of the urls would be for
> redirects or missing pages as the CMS only writes out the urls for valid
> pages.  It's got me stumped!
>
> Here is the result of the readdb.  Not sure why the dates are wonky.  The
> date on the server is correct.  SOLR shows 39148 pages.
>
> TOTAL urls: 39164
> shortest fetch interval:30 days, 00:00:00
> avg fetch interval: 30 days, 00:07:10
> longest fetch interval: 45 days, 00:00:00
> earliest fetch time:Mon Nov 25 07:08:00 EST 2019
> avg of fetch times: Wed Nov 27 18:46:00 EST 2019
> latest fetch time:  Sat Dec 14 08:18:00 EST 2019
> retry 0:39164
> score quantile 0.01:1.8460402498021722E-4
> score quantile 0.05:1.8460402498021722E-4
> score quantile 0.1: 1.8460402498021722E-4
> score quantile 0.2: 1.8642803479451686E-4
> score quantile 0.25:1.8642803479451686E-4
> score quantile 0.3: 1.960784284165129E-4
> score quantile 0.4: 1.9663813566079454E-4
> score quantile 0.5: 2.0251113164704293E-4
> score quantile 0.6: 2.037905069300905E-4
> score quantile 0.7: 2.1473052038345486E-4
> score quantile 0.75:2.1473052038345486E-4
> score quantile 0.8: 2.172968233935535E-4
> score quantile 0.9: 2.429802336152917E-4
> score quantile 0.95:2.4354603374376893E-4
> score quantile 0.99:2.542474209925616E-4
> min score:  3.0443254217971116E-5
> avg score:  7.001118352666182E-4
> max score:  1.3120110034942627
> status 2 (db_fetched):  39150
> status 3 (db_gone): 13
> status 4 (db_redir_temp):   1
> CrawlDb statistics: done
>
>
>
> On Wed, Oct 30, 2019 at 4:01 PM Markus Jelsma 
> wrote:
>
> > Hello Dave,
> >
> > First you should check the CrawlDB using readdb -stats. My bet is that
> > your set contains some redirects, gone (404) pages, or transient errors.
> > The numbers for fetched and notModified, added together, should be about
> > the same as the number of documents indexed.
> >
> > Regards,
> > Markus
> >
> >
> >
> > -Original message-
> > > From:Dave Beckstrom 
> > > Sent: Wednesday 30th October 2019 20:00
> > > To: user@nutch.apache.org
> > > Subject: Nutch not crawling all pages
> > >
> > > Hi Everyone,
> > >
> > > I googled and researched and I am not finding any solutions.  I'm hoping
> > > someone here can help.
> > >
> > > I have txt files with about 50,000 seed urls that are fed to Nutch for
> > > crawling and then indexing in SOLR.  However, it will not index more than
> > > about 39,000 pages no matter what I do.  The robots.txt file gives Nutch
> > > access to the entire site.
> > >
> > > This is a snippet of the last Nutch run:
> > >
> > > Generator: starting at 2019-10-30 14:44:38
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: filtering: false
> > > Generator: normalizing: true
> > > Generator: topN: 8
> > > Generator: 0 records selected for fetching, exiting ...
> > > Generate returned 1 (no new segments created)
> > > Escaping loop: no more URLs to fetch now
> > >
> > > I ran that crawl about 5 or 6 times.  It seems to index about 6,000 pages
> > > per run.  I planned to keep running it until it hit the 50,000+ page mark,
> > > which would indicate that all of the pages were indexed.  That last run it
> > > just ended without crawling anything more.
> > >
> > > Below are some of the potentially relevant config settings.  I removed the
> > > "description" for brevity.
> > >
> > > <property>
> > >   <name>http.content.limit</name>
> > >   <value>-1</value>
> > > </property>
> > > <property>
> > >   <name>db.ignore.external.links</name>
> > >   <value>true</value>
> > > </property>
> > > <property>
> > >   <name>db.ignore.external.links.mode</name>
> > >   <value>byDomain</value>
> > > </property>
> > > <property>
> > >   <name>db.ignore.internal.links</name>
> > >   <value>false</value>
> > > </property>
> > > <property>
> > >   <name>db.update.additions.allowed</name>
> > >   <value>true</value>
> > > </property>
> > > <property>
> > >   <name>db.max.outlinks.per.page</name>
> > >   <value>-1</value>
> > > </property>
> > > <property>
> > >   <name>db.injector.overwrite</name>
> > >   <value>true</value>
> > > </property>
> > >
> > > Anyone have any suggestions?  It's odd that when you give Nutch a specific
> > > list of urls to be crawled, it wouldn't crawl all of them.
> > >
> > > I appreciate any help you can offer.  Thank you!
> > >
> > > --
> > > *Fig Leaf Software is now Collective FLS, Inc.*
> > > *Collective FLS, Inc.*
> > >
> > > https://www.collectivefls.com/ 
> > >
> > >
> > >
> > >
> >
>
> --
> *Fig Leaf Software is now Collective FLS, Inc.*
> *Collective FLS, Inc.*
>
> https://www.collectivefls.com/ 
>
>
>
> --
Sent from a mobile device.


Re: Nutch not crawling all pages

2019-10-30 Thread Dave Beckstrom
Hi Markus,

Thank you so much for the reply and the help!  The seed URL list is
generated from a CMS.  I'm doubtful that many of the urls would be for
redirects or missing pages as the CMS only writes out the urls for valid
pages.  It's got me stumped!
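
The only spot-check I can think of is something like the loop below (the seed
file name is just an example); it flags any seed URL that does not answer with
an HTTP 200:

while read url; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  [ "$code" != "200" ] && echo "$code $url"
done < urls/seed.txt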

Here is the result of the readdb.  Not sure why the dates are wonky.  The
date on the server is correct.  SOLR shows 39148 pages.

TOTAL urls: 39164
shortest fetch interval:30 days, 00:00:00
avg fetch interval: 30 days, 00:07:10
longest fetch interval: 45 days, 00:00:00
earliest fetch time:Mon Nov 25 07:08:00 EST 2019
avg of fetch times: Wed Nov 27 18:46:00 EST 2019
latest fetch time:  Sat Dec 14 08:18:00 EST 2019
retry 0:39164
score quantile 0.01:1.8460402498021722E-4
score quantile 0.05:1.8460402498021722E-4
score quantile 0.1: 1.8460402498021722E-4
score quantile 0.2: 1.8642803479451686E-4
score quantile 0.25:1.8642803479451686E-4
score quantile 0.3: 1.960784284165129E-4
score quantile 0.4: 1.9663813566079454E-4
score quantile 0.5: 2.0251113164704293E-4
score quantile 0.6: 2.037905069300905E-4
score quantile 0.7: 2.1473052038345486E-4
score quantile 0.75:2.1473052038345486E-4
score quantile 0.8: 2.172968233935535E-4
score quantile 0.9: 2.429802336152917E-4
score quantile 0.95:2.4354603374376893E-4
score quantile 0.99:2.542474209925616E-4
min score:  3.0443254217971116E-5
avg score:  7.001118352666182E-4
max score:  1.3120110034942627
status 2 (db_fetched):  39150
status 3 (db_gone): 13
status 4 (db_redir_temp):   1
CrawlDb statistics: done
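
Adding up those status lines: 39,150 db_fetched + 13 db_gone + 1 db_redir_temp
= 39,164, which matches the TOTAL urls line, so the 39,148 pages SOLR shows are
only two short of what was actually fetched.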



On Wed, Oct 30, 2019 at 4:01 PM Markus Jelsma 
wrote:

> Hello Dave,
>
> First you should check the CrawlDB using readdb -stats. My bet is that
> your set contains some redirects, gone (404) pages, or transient errors.
> The numbers for fetched and notModified, added together, should be about
> the same as the number of documents indexed.
>
> Regards,
> Markus
>
>
>
> -Original message-
> > From:Dave Beckstrom 
> > Sent: Wednesday 30th October 2019 20:00
> > To: user@nutch.apache.org
> > Subject: Nutch not crawling all pages
> >
> > Hi Everyone,
> >
> > I googled and researched and I am not finding any solutions.  I'm hoping
> > someone here can help.
> >
> > I have txt files with about 50,000 seed urls that are fed to Nutch for
> > crawling and then indexing in SOLR.  However, it will not index more than
> > about 39,000 pages no matter what I do.   The robots.txt file gives Nutch
> > access to the entire site.
> >
> > This is a snippet of the last Nutch run:
> >
> > Generator: starting at 2019-10-30 14:44:38
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: false
> > Generator: normalizing: true
> > Generator: topN: 8
> > Generator: 0 records selected for fetching, exiting ...
> > Generate returned 1 (no new segments created)
> > Escaping loop: no more URLs to fetch now
> >
> > I ran that crawl about 5 or 6 times.  It seems to index about 6,000 pages
> > per run.  I planned to keep running it until it hit the 50,000+ page mark,
> > which would indicate that all of the pages were indexed.  That last run it
> > just ended without crawling anything more.
> >
> > Below are some of the potentially relevant config settings.  I removed the
> > "description" for brevity.
> >
> > <property>
> >   <name>http.content.limit</name>
> >   <value>-1</value>
> > </property>
> > <property>
> >   <name>db.ignore.external.links</name>
> >   <value>true</value>
> > </property>
> > <property>
> >   <name>db.ignore.external.links.mode</name>
> >   <value>byDomain</value>
> > </property>
> > <property>
> >   <name>db.ignore.internal.links</name>
> >   <value>false</value>
> > </property>
> > <property>
> >   <name>db.update.additions.allowed</name>
> >   <value>true</value>
> > </property>
> > <property>
> >   <name>db.max.outlinks.per.page</name>
> >   <value>-1</value>
> > </property>
> > <property>
> >   <name>db.injector.overwrite</name>
> >   <value>true</value>
> > </property>
> >
> > Anyone have any suggestions?  It's odd that when you give Nutch a specific
> > list of urls to be crawled, it wouldn't crawl all of them.
> >
> > I appreciate any help you can offer.  Thank you!
> >
> > --
> > *Fig Leaf Software is now Collective FLS, Inc.*
> > *Collective FLS, Inc.*
> >
> > https://www.collectivefls.com/ 
> >
> >
> >
> >
>

-- 
*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.* 

https://www.collectivefls.com/  





RE: Nutch not crawling all pages

2019-10-30 Thread Markus Jelsma
Hello Dave,

First you should check the CrawlDB using readdb -stats. My bet is that your set
contains some redirects, gone (404) pages, or transient errors. The numbers for
fetched and notModified, added together, should be about the same as the number
of documents indexed.
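
A minimal check, assuming standard paths such as crawl/crawldb (adjust to your
layout):

bin/nutch readdb crawl/crawldb -stats

# recent 1.x releases can also dump just the problem entries, e.g. the gone
# (404) and redirected URLs, so you can see exactly which pages they are:
bin/nutch readdb crawl/crawldb -dump gone_dump -status db_gone
bin/nutch readdb crawl/crawldb -dump redir_dump -status db_redir_temp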

Regards,
Markus

 
 
-Original message-
> From:Dave Beckstrom 
> Sent: Wednesday 30th October 2019 20:00
> To: user@nutch.apache.org
> Subject: Nutch not crawling all pages
> 
> Hi Everyone,
> 
> I googled and researched and I am not finding any solutions.  I'm hoping
> someone here can help.
> 
> I have txt files with about 50,000 seed urls that are fed to Nutch for
> crawling and then indexing in SOLR.  However, it will not index more than
> about 39,000 pages no matter what I do.   The robots.txt file gives Nutch
> access to the entire site.
> 
> This is a snippet of the last Nutch run:
> 
> Generator: starting at 2019-10-30 14:44:38
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: topN: 8
> Generator: 0 records selected for fetching, exiting ...
> Generate returned 1 (no new segments created)
> Escaping loop: no more URLs to fetch now
> 
> I ran that crawl about 5 or 6 times.  It seems to index about 6,000 pages
> per run.  I planned to keep running it until it hit the 50,000+ page mark,
> which would indicate that all of the pages were indexed.  That last run it
> just ended without crawling anything more.
> 
> Below are some of the potentially relevant config settings.  I removed the
> "description" for brevity.
> 
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
> </property>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
> </property>
> <property>
>   <name>db.ignore.external.links.mode</name>
>   <value>byDomain</value>
> </property>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
> </property>
> <property>
>   <name>db.update.additions.allowed</name>
>   <value>true</value>
> </property>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
> </property>
> <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
> </property>
> 
> Anyone have any suggestions?  It's odd that when you give Nutch a specific
> list of urls to be crawled, it wouldn't crawl all of them.
> 
> I appreciate any help you can offer.  Thank you!
> 
> -- 
> *Fig Leaf Software is now Collective FLS, Inc.*
> *Collective FLS, Inc.* 
> 
> https://www.collectivefls.com/  
> 
> 
> 
>