They could all be 'gone' pages - by default Nutch still tries to refetch
them, but they are not scheduled with high priority.
Maybe try generating a dump of the crawldb and look at the entries with a
high retry value to see what their status looks like.
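Something like the sketch below could do it. The dump itself would come from
`./nutch readdb /nutch/global-crawl/crawldb/ -dump <dir>`; the sample records
here are a hand-written approximation of that dump's text layout (the URLs and
the exact field wording are assumptions), so adjust the patterns to whatever
your dump actually contains:

```shell
# Assumed sketch: filter a crawldb dump for entries with a high retry count.
# A real dump would be produced first with something like:
#   ./nutch readdb /nutch/global-crawl/crawldb/ -dump /tmp/crawldb-dump
# The sample below only approximates the dump's record layout.
cat > /tmp/crawldb-dump.txt <<'EOF'
http://example.com/a	Version: 7
Status: 3 (db_gone)
Retries since fetch: 11
http://example.com/b	Version: 7
Status: 2 (db_fetched)
Retries since fetch: 0
EOF

# Remember each URL line, then print it when its retry count is >= 3
awk '/Version: 7/ { url = $1 }
     /Retries since fetch:/ { if ($4 + 0 >= 3) print url, "retries=" $4 }' \
    /tmp/crawldb-dump.txt
```

With the sample data this prints only the high-retry entry, so you can then
check what status those URLs ended up with.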

On 21 September 2011 13:32, Marek Bachmann <[email protected]> wrote:

> Hello list,
>
> I was wondering why I get this stats from readdb
>
> ./nutch readdb /nutch/global-crawl/crawldb/ -stats
> CrawlDb statistics start: /nutch/global-crawl/crawldb/
> Statistics for CrawlDb: /nutch/global-crawl/crawldb/
> TOTAL urls:     509035
> retry 0:        507118
> retry 1:        591
> retry 10:       341
> retry 11:       239
> retry 2:        144
> retry 3:        95
> retry 4:        75
> retry 5:        149
> retry 6:        7
> retry 7:        102
> retry 8:        40
> retry 9:        134
> min score:      1.0
> avg score:      1.0
> max score:      1.0
> status 1 (db_unfetched):        89037
> status 2 (db_fetched):  279675
> status 3 (db_gone):     2805
> status 4 (db_redir_temp):       13630
> status 5 (db_redir_perm):       4831
> status 6 (db_notmodified):      119057
> CrawlDb statistics: done
>
> since
>
> <property>
>  <name>db.fetch.retry.max</name>
>  <value>2</value>
>  <description>The maximum number of times a url that has encountered
>  recoverable errors is generated for fetch.</description>
> </property>
>
> Any suggestions?
>
> Greetings
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com