I mean that if N domains show up in the DB in one cycle,
only N - x domains are left in the next cycle.
The number of domains left in the crawldb sometimes decreases.
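
To make the comparison concrete, here is a rough Python sketch of how two
rounds could be diffed. It assumes each round's statistics have already been
reduced to lines of "domain<TAB>count"; that layout is a hypothetical cleaned-up
format, not the literal readdb output, and the function names are made up for
illustration.

```python
def parse_counts(text):
    """Parse lines of 'domain<TAB>count' into a dict.

    The input layout is an assumption (a pre-cleaned dump),
    not the raw output of any Nutch tool.
    """
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace: domain first, count last.
        domain, count = line.rsplit(None, 1)
        counts[domain] = int(count)
    return counts


def diff_cycles(prev, curr):
    """Compare two cycles of per-domain counts.

    Returns three sorted lists: domains that vanished, domains that are
    new, and domains whose count went down between the two cycles.
    """
    missing = sorted(set(prev) - set(curr))
    new = sorted(set(curr) - set(prev))
    shrunk = sorted(d for d in prev if d in curr and curr[d] < prev[d])
    return missing, new, shrunk
```

Running diff_cycles on the parsed stats of two consecutive rounds would list
exactly which domains dropped out and which ones lost urls.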


The same happens with the number of fetched urls.
My understanding is that after every crawl cycle the number of fetched urls
should keep increasing, i.e. the number is the cumulative total of the
previous cycles plus this cycle. But it decreases as well.

I don't know if this is possible.


Gaurav


On Tue, Aug 30, 2011 at 1:24 PM, Markus Jelsma
<[email protected]> wrote:

> Hi
>
> > I am using Nutch 1.0 and after every updatedb, I take the stats with the
> > sort parameter, which gives detailed statistics regarding the domains
> > and their count (number of urls for that domain in crawldb).
> > But I see that there is a variable number of domains that do not make
> > it into the next round of statistics.
> >
>
> Is my understanding of the above correct that you have N domains in the DB
> but not all N domains have incremented counts after a crawl cycle?
>
> > Example:
> > Suppose a domain shows up in 4 rounds of crawling (by looking at readdb
> > stats -sort usage) but then disappears from the next rounds.
> > Or some domain is there for the first two rounds, disappears from the
> > stats for the next few rounds and then reappears.
>
> Disappear from stats? I am not sure how readdb writes stats but you may
> want to try the domainstatistics tool (more recent Nutch). That tool can
> write a complete list of domains and number of URLs per domain.
>
> >
> > Is it possible that the domains may be removed from the crawldb and/or
> > then added later?
> >
> > Regards
> > Gaurav
>
