They could all be 'gone' — by default Nutch still tries to refetch them, but they are not scheduled with high priority. Maybe try generating a dump of the crawldb and look at the entries with a high retry value to see what their status looks like.
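To illustrate the dump-and-filter step, a minimal sketch (the record below is a made-up sample, but the field names match what `readdb -dump` in `normal` format prints for each CrawlDatum; the real dump would come from `bin/nutch readdb /nutch/global-crawl/crawldb/ -dump crawldb-dump -format normal`):

```shell
# Made-up sample of one record in a "normal"-format crawldb dump:
cat > crawldb-dump.sample <<'EOF'
http://example.com/a	Version: 7
Status: 3 (db_gone)
Fetch time: Wed Sep 21 13:00:00 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 11
Retry interval: 2592000 seconds (30 days)
Score: 1.0
EOF

# Pull out records whose retry counter is 10 or more; -B4 also shows
# the URL and Status lines that precede the "Retries since fetch" line:
grep -B4 -E 'Retries since fetch: (1[0-9]|[2-9][0-9])' crawldb-dump.sample
```

On a real dump you would point the grep at `crawldb-dump/part-*` instead of the sample file; the `-B4` offset assumes the field ordering shown above.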
On 21 September 2011 13:32, Marek Bachmann <[email protected]> wrote:
> Hello list,
>
> I was wondering why I get these stats from readdb
>
> ./nutch readdb /nutch/global-crawl/crawldb/ -stats
> CrawlDb statistics start: /nutch/global-crawl/crawldb/
> Statistics for CrawlDb: /nutch/global-crawl/crawldb/
> TOTAL urls:   509035
> retry 0:      507118
> retry 1:      591
> retry 2:      144
> retry 3:      95
> retry 4:      75
> retry 5:      149
> retry 6:      7
> retry 7:      102
> retry 8:      40
> retry 9:      134
> retry 10:     341
> retry 11:     239
> min score:    1.0
> avg score:    1.0
> max score:    1.0
> status 1 (db_unfetched):      89037
> status 2 (db_fetched):        279675
> status 3 (db_gone):           2805
> status 4 (db_redir_temp):     13630
> status 5 (db_redir_perm):     4831
> status 6 (db_notmodified):    119057
> CrawlDb statistics: done
>
> since
>
> <property>
> <name>db.fetch.retry.max</name>
> <value>2</value>
> <description>The maximum number of times a url that has encountered
> recoverable errors is generated for fetch.</description>
> </property>
>
> Any suggestions?
>
> Greetings

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

