Hi Sebastian, yes, it's taking 2-3 days. OK, I will consider increasing the depth incrementally and checking the stats at every step. Thanks. Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/ and have removed +.
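For reference, the end of my conf/regex-urlfilter.txt now looks roughly like this (a sketch of my own setup, not the shipped defaults; the commented-out line shows the catch-all I removed):

  # accept only pages under the site host
  +^http://([a-z0-9]*\.)*spicemobiles.co.in/

  # the default "accept anything else" rule was removed:
  # +.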
What should the depth be for the next recrawl? I mean this question: say I have a crawlDb built with a depth of 5 and topN 10. Now I find that 3-4 URLs were deleted and 4 were modified, but I don't know which ones they are. So what I am doing is re-initiating the crawl. What depth should I give this time?

Thanks - David

On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel <[email protected]> wrote:
> Hi David,
>
> > What can be crawl time for very big site, given depth param as 50, topN
> > default (not passed) and default fetch interval as 2 mins..
> afaik, the default of topN is Long.MAX_VALUE which is very large.
> So, the size of the crawl is mainly limited by the number of links you get.
> Anyway, a depth of 50 is a high value; with a delay of 2 min. (which is
> very defensive) your crawl will take a long time.
>
> Try to start with small values for depth and topN, e.g. 3 and 50.
> Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
> and check how the numbers of fetched/unfetched/gone/etc. URLs increase
> to get a feeling which values make sense for your crawl.
>
> > Case: Crawling website spicemobilephones.co.in, and in the
> > regex-urlfilter.txt - added +^ http://(a-z 0-9)spicemobilephones.co.in.
> This doesn't look like a valid Java regex.
> Did you remove these lines:
> # accept anything else
> +.
>
> Sebastian
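P.S. Just to make sure I follow the suggestion about small rounds and checking stats, this is roughly what I plan to run (assuming the 1.x one-shot crawl command and a crawl directory simply named "crawl"; paths are from my setup, so adjust as needed):

  # small round with modest depth/topN, as suggested
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50

  # then inspect the crawlDb counters (db_fetched/db_unfetched/db_gone)
  bin/nutch readdb crawl/crawldb -stats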

