Hi

> I am using nutch 1.0 and after every updatedb, I take the stats with the
> sort parameter which gives the details statistics regarding the domains and
> their count(number of urls for that domain in crawldb).
> But I see that there is a variable number of domains that do not make into
> the next round of statistics.
>

Do I understand correctly that you have N domains in the DB, but not all N
domains show incremented counts after a crawl cycle?

> Example:
> Suppose a domain will be in 4 rounds of crawling (by looking at readdb
> stats -sort usage) but it will disappear from the next rounds.
> Or some domain will be there for first two rounds but will disappear from
> stats for the next few rounds and then reappear again.

Disappear from the stats? I am not sure how readdb writes its stats, but you
may want to try the domainstatistics tool (available in more recent Nutch
versions). That tool can write a complete list of domains and the number of
URLs per domain.
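
Once you have a per-domain list for each round, a small script can tell you
exactly which domains went missing between two rounds. Below is a rough sketch
in Python. It assumes each round has been saved as text with one
"count<TAB>domain" pair per line; that layout and the function names are my
assumptions, not something readdb or domainstatistics guarantees, so adjust the
parsing to whatever your output actually looks like.

```python
def parse_domain_counts(lines):
    """Parse lines of the ASSUMED form 'count<TAB>domain' into {domain: count}."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        count, domain = line.split("\t", 1)
        counts[domain] = int(count)
    return counts


def diff_rounds(previous, current):
    """Return (domains missing from current, domains new in current), sorted."""
    missing = sorted(set(previous) - set(current))
    added = sorted(set(current) - set(previous))
    return missing, added


if __name__ == "__main__":
    # Hypothetical file names; replace with your own per-round dumps.
    import sys

    if len(sys.argv) == 3:
        with open(sys.argv[1]) as f_prev, open(sys.argv[2]) as f_cur:
            prev = parse_domain_counts(f_prev)
            cur = parse_domain_counts(f_cur)
        missing, added = diff_rounds(prev, cur)
        print("missing in later round:", missing)
        print("new in later round:", added)
```

If a domain shows up as missing here, you can then look up its URLs in the
crawldb directly to see whether they are really gone or just absent from the
stats output.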

> 
> Is it possible that the domains may be removed from the crawldb or/and then
> added later?
> 
> Regards
> Gaurav
