Re: Regarding Decrease in number of domains in readdb -stats -sort

Markus Jelsma Tue, 30 Aug 2011 13:53:25 -0700

> I mean that if in one cycle N domains show in DB
> in the next cycle there is N - x domains left.
> Number of domains left in crawldb decreases sometimes.
>


That should not be possible at all. Perhaps the output of stats is not 
complete, or it is being misinterpreted.

> 
> Same with the number of fetched urls.

That is possible, but the totals should still add up. A fetched URL can later 
become a 404 (db_gone) or move to a not-modified status (db_notmodified).
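
To check that the numbers add up, you could capture the -stats output after 
each cycle and compare the per-status counts. A minimal sketch, assuming the 
output contains a "TOTAL urls:" line and lines of the form 
"status 2 (db_fetched): N"; the function names are made up for illustration 
and are not part of Nutch:

```python
import re

# Assumed line shapes in captured `readdb -stats` output (not verified
# against every Nutch version):
#   TOTAL urls:	1234
#   status 2 (db_fetched):	600
TOTAL_RE = re.compile(r"TOTAL urls:\s*(\d+)")
STATUS_RE = re.compile(r"status\s+(\d+)\s+\((\w+)\):\s*(\d+)")


def parse_stats(text):
    """Return (total, {status_name: count}) from captured stats output."""
    total = None
    counts = {}
    for line in text.splitlines():
        m = TOTAL_RE.search(line)
        if m:
            total = int(m.group(1))
            continue
        m = STATUS_RE.search(line)
        if m:
            counts[m.group(2)] = int(m.group(3))
    return total, counts


def status_delta(before, after):
    """Return {status_name: change} for every status whose count changed."""
    delta = {}
    for name in sorted(set(before) | set(after)):
        change = after.get(name, 0) - before.get(name, 0)
        if change != 0:
            delta[name] = change
    return delta
```

If the per-status counts sum to the total in both cycles, a drop in 
db_fetched should show up as a matching rise in another status such as 
db_gone.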

> My understanding is that after every crawl cycle, the number of fetched
> urls should keep increasing, i.e.  the number is cumulative of the number
> from previous cycle and this cycle. But it decreases as well.

Please try the domain statistics tool; you may also want to run readdb -dump 
between cycles and compare the dumps. URLs change status over time: they can 
become a 404, come back as not modified, or turn into a redirect.

You may also want to limit the number of URLs per fetch cycle (e.g. 10 or 20) 
so that you have only a few URLs to compare between dumps. Then check how the 
status of those few URLs changed.
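
Comparing two dumps by hand gets tedious, so here is a rough sketch of how it 
could be scripted. It assumes the text dump starts each record with a line 
"URL<TAB>Version: N" followed by a line "Status: N (db_xxx)"; check this 
against your own dump files first. The function names are hypothetical, not 
part of Nutch:

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed record layout in a `readdb -dump` text file:
#   http://example.com/page	Version: 7
#   Status: 2 (db_fetched)
RECORD_RE = re.compile(r"^(\S+)\tVersion: \d+$")
DUMP_STATUS_RE = re.compile(r"^Status: \d+ \((\w+)\)$")


def parse_dump(text):
    """Return {url: status_name} for every record found in the dump text."""
    statuses = {}
    current_url = None
    for raw in text.splitlines():
        line = raw.rstrip()
        m = RECORD_RE.match(line)
        if m:
            current_url = m.group(1)
            continue
        m = DUMP_STATUS_RE.match(line)
        if m and current_url is not None:
            statuses[current_url] = m.group(1)
    return statuses


def compare_dumps(before, after):
    """Return (urls missing from `after`, {url: (old_status, new_status)})."""
    missing = sorted(set(before) - set(after))
    changed = {
        url: (before[url], after[url])
        for url in before
        if url in after and before[url] != after[url]
    }
    return missing, changed


def host_counts(statuses):
    """Count URLs per host name, to compare against the per-domain stats."""
    return Counter(urlsplit(url).hostname for url in statuses)
```

An empty "missing" list together with a non-empty "changed" map would suggest 
the URLs are still in the crawldb and have only moved to another status.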


> 
> Don't know if this is possible.
> 
> 
> Gaurav
> 
> 
> On Tue, Aug 30, 2011 at 1:24 PM, Markus Jelsma <[email protected]> wrote:
> > Hi
> > 
> > > I am using nutch 1.0 and after every updatedb, I take the stats with
> > > the sort parameter which gives the detailed statistics regarding the
> > > domains and their count (number of urls for that domain in crawldb).
> > > But I see that there is a variable number of domains that do not make
> > > it into the next round of statistics.
> > 
> > Is my understanding of the above correct that you have N domains in the
> > DB but not all N domains have incremented counts after a crawl cycle?
> > 
> > > Example:
> > > Suppose a domain will be in 4 rounds of crawling (by looking at readdb
> > > stats -sort usage) but it will disappear from the next rounds.
> > > Or some domain will be there for first two rounds but will disappear
> > > from stats for the next few rounds and then reappear again.
> > 
> > Disappear from stats? I am not sure how readdb writes stats but you may
> > want to try the domainstatistics tool (more recent Nutch). That tool can
> > write a complete list of domains and the number of URLs per domain.
> > 
> > > Is it possible that the domains may be removed from the crawldb or/and
> > > then added later?
> > > 
> > > Regards
> > > Gaurav
