I am using Nutch 1.0. After every updatedb, I run readdb -stats with the
-sort option, which gives detailed statistics on the domains and their
counts (the number of URLs for each domain in the crawldb).
However, I see that a varying number of domains do not make it into
the next round of statistics.

Example:
A domain appears in four rounds of crawling (according to the readdb -stats
-sort output) but then disappears from the following rounds.
Or a domain is there for the first two rounds, disappears from the
stats for the next few rounds, and then reappears.
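
To make the comparison concrete, here is a rough sketch of the check I mean. It assumes the per-domain counts from each round have already been parsed into a dict; the parsing step is left out, and the host names and counts are made-up placeholders, not real readdb output.

```python
def track_domains(rounds):
    """Map each domain to the 1-based round numbers in which it appears."""
    seen = {}
    for i, counts in enumerate(rounds, start=1):
        for domain in counts:
            seen.setdefault(domain, []).append(i)
    return seen


def missing_rounds(rounds):
    """Return {domain: [rounds where it is absent after its first appearance]}."""
    total = len(rounds)
    gaps = {}
    for domain, present in track_domains(rounds).items():
        absent = [r for r in range(present[0], total + 1) if r not in present]
        if absent:
            gaps[domain] = absent
    return gaps


# Hypothetical per-round counts (domain -> number of URLs), for illustration only.
rounds = [
    {"a.example.com": 10, "b.example.com": 5},
    {"a.example.com": 12, "b.example.com": 7},
    {"a.example.com": 15},
    {"a.example.com": 15, "b.example.com": 9},
]

print(missing_rounds(rounds))
```

With these placeholder values, b.example.com is reported as missing in round 3 only, which is the "disappears and then reappears" pattern described above.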

Is it possible that domains are removed from the crawldb and then
added back later?

Regards
Gaurav
