> I mean that if in one cycle N domains show in DB
> in the next cycle there is N - x domains left.
> Number of domains left in crawldb decreases sometimes.

That should not be possible at all. Perhaps the output of stats is incomplete or is being misinterpreted.

> Same with the number of fetched urls.

That is possible, but the numbers should add up. A fetched URL can become a 404 (db_gone) or get a not_modified status.

> My understanding is that after every crawl cycle, the number of fetched
> urls should keep increasing, i.e. the number is cumulative of the number
> from previous cycle and this cycle. But it decreases as well.

Please try the domain statistics tool. You may also want to run readdb -dump between cycles and compare the dumps. URLs change status over time: they can become a 404, not modified, or a redirect. You may also want to limit the number of URLs in a fetch cycle (e.g. 10 or 20) so you have only a few URLs to compare between dumps. Check the changed status of those few URLs.

> Don't know if this is possible.
>
> Gaurav
>
> On Tue, Aug 30, 2011 at 1:24 PM, Markus Jelsma
> <[email protected]> wrote:
> > Hi
> >
> > > I am using nutch 1.0 and after every updatedb, I take the stats with
> > > the sort parameter which gives the details statistics regarding the
> > > domains and their count (number of urls for that domain in crawldb).
> > > But I see that there is a variable number of domains that do not make
> > > into the next round of statistics.
> >
> > Is my understanding of the above correct that you have N domains in the
> > DB but not all N domains have incremented counts after a crawl cycle?
> >
> > > Example:
> > > Suppose a domain will be in 4 rounds of crawling (by looking at readdb
> > > stats -sort usage) but it will disappear from the next rounds.
> > > Or some domain will be there for first two rounds but will disappear
> > > from stats for the next few rounds and then reappear again.
> >
> > Disappear from stats? I am not sure how readdb writes stats but you may
> > want to try the domainstatistics tool (more recent Nutch). That tool can
> > write a complete list of domains and number of url's per domain.
> >
> > > Is it possible that the domains may be removed from the crawldb or/and
> > > then added later?
> > >
> > > Regards
> > > Gaurav
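
To make the dump comparison suggested above concrete, here is a rough Python sketch that reads two text dumps, reports which URLs changed status, and lists URLs or hosts present in the first dump but not in the second. It is only an illustration: the function names are made up, and the parser assumes each record starts with a line of the form "URL<TAB>Version: N" followed by a "Status: N (db_xxx)" line. Check that against your own readdb -dump output and adjust if it differs.

```python
from collections import Counter
from urllib.parse import urlparse


def parse_dump(text):
    """Map each URL to its status name, e.g. 'db_fetched'.

    ASSUMPTION: a record starts with 'URL<TAB>Version: N' and is followed
    by a 'Status: N (db_xxx)' line. Adjust if your dump looks different.
    """
    statuses = {}
    url = None
    for line in text.splitlines():
        if "\tVersion:" in line:
            # First line of a record: keep the part before the tab as the URL.
            url = line.split("\t", 1)[0].strip()
        elif line.startswith("Status:") and url is not None:
            # 'Status: 2 (db_fetched)' -> 'db_fetched'
            statuses[url] = line.split("(", 1)[-1].rstrip(") \t")
            url = None
    return statuses


def host_counts(statuses):
    """Count URLs per host, similar in spirit to per-domain stats."""
    return Counter(urlparse(u).netloc for u in statuses)


def diff_dumps(before, after):
    """Return (changed statuses, URLs gone missing, hosts gone missing)."""
    changed = {
        u: (before[u], after[u])
        for u in before
        if u in after and before[u] != after[u]
    }
    missing_urls = sorted(set(before) - set(after))
    missing_hosts = sorted(set(host_counts(before)) - set(host_counts(after)))
    return changed, missing_urls, missing_hosts


if __name__ == "__main__":
    # Tiny made-up dumps standing in for two real crawl cycles.
    cycle1 = (
        "http://a.example/\tVersion: 7\nStatus: 2 (db_fetched)\n\n"
        "http://b.example/p\tVersion: 7\nStatus: 1 (db_unfetched)\n"
    )
    cycle2 = (
        "http://a.example/\tVersion: 7\nStatus: 3 (db_gone)\n\n"
        "http://b.example/p\tVersion: 7\nStatus: 2 (db_fetched)\n"
    )
    changed, missing_urls, missing_hosts = diff_dumps(
        parse_dump(cycle1), parse_dump(cycle2)
    )
    for url, (old, new) in sorted(changed.items()):
        print(f"{url}: {old} -> {new}")
    print("missing urls:", missing_urls)
    print("missing hosts:", missing_hosts)
```

With real data, read the text files written by readdb -dump for each cycle and pass their contents to parse_dump. If the fetched count dropped while the lists of missing URLs and hosts stay empty, the URLs only changed status (for example db_fetched to db_gone) and nothing was removed from the crawldb.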

