Hi

> I am using Nutch 1.0 and after every updatedb, I take the stats with the
> sort parameter, which gives detailed statistics regarding the domains and
> their count (number of URLs for that domain in crawldb).
> But I see that there is a variable number of domains that do not make it into
> the next round of statistics.
>
Is my understanding of the above correct: you have N domains in the DB, but not all N domains have incremented counts after a crawl cycle?

> Example:
> Suppose a domain will be in 4 rounds of crawling (by looking at readdb
> stats -sort usage) but it will disappear from the next rounds.
> Or some domain will be there for first two rounds but will disappear from
> stats for the next few rounds and then reappear again.

Disappear from the stats? I am not sure how readdb writes its stats, but you may want to try the domainstatistics tool (available in more recent Nutch versions). That tool can write a complete list of domains and the number of URLs per domain.

>
> Is it possible that the domains may be removed from the crawldb or/and then
> added later?
>
> Regards
> Gaurav
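To pin down exactly which domains drop out and in which rounds, it may help to save a per-domain count list after each updatedb and compare the lists. Below is a minimal sketch in Python. It assumes each round's list has been saved as text with one "domain<TAB>count" line per domain. That format is an assumption for illustration, not necessarily what readdb or the domainstatistics tool writes, so the parser may need adjusting. The function names are hypothetical.

```python
from typing import Dict, List


def parse_counts(text: str) -> Dict[str, int]:
    """Parse per-domain counts from text.

    Assumed (hypothetical) format: one 'domain<TAB>count' pair per line.
    Adjust the split if your tool writes the columns differently.
    """
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        domain, count = line.split("\t")
        counts[domain] = int(count)
    return counts


def presence_report(rounds: List[Dict[str, int]]) -> Dict[str, List[int]]:
    """For every domain seen in any round, list the 0-based round
    indexes in which it is absent. Domains present in all rounds
    are left out of the report."""
    all_domains = set().union(*rounds)
    report: Dict[str, List[int]] = {}
    for domain in sorted(all_domains):
        missing = [i for i, r in enumerate(rounds) if domain not in r]
        if missing:
            report[domain] = missing
    return report
```

A domain that shows up in the report with gaps in the middle (absent in round 1 but present in rounds 0 and 2, say) matches the "disappears and then reappears" pattern described above, which would be worth checking against a full dump of the crawldb for that round.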

