Hi David,

>  What can be crawl time for very big site, given depth param as 50, topN
> default(not passed ) and default fetch interval as 2mins..
AFAIK, the default of topN is Long.MAX_VALUE, which is effectively unlimited.
So, the size of the crawl is mainly limited by the number of links you get.
Anyway, a depth of 50 is a high value, and with a delay of 2 min. (which is
very defensive) your crawl will take a long time.
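
As a rough back-of-the-envelope estimate (assuming a single host and one
fetch per delay interval): 1 fetch / 2 min. = 30 pages/hour, i.e. about
720 pages/day, so even 50,000 pages would already take more than two months.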

Try to start with small values for depth and topN, e.g. 3 and 50.
Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
and check how the numbers of fetched/unfetched/gone/etc. URLs increase
to get a feeling which values make sense for your crawl.
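
For example (paths and the one-shot crawl command are only for illustration,
adjust them to your setup):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50
  bin/nutch readdb crawl/crawldb -stats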

> Case: Crawling website spicemobilephones.co.in, and in the
> regexurlfilter.txt – added +^ http://(a-z 0-9)spicemobilephones.co.in.
This doesn't look like a valid Java regex.
Did you remove these lines:
  # accept anything else
  +.
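
If the goal is to restrict the crawl to that host, a pattern along these
lines should work in regex-urlfilter.txt (just a sketch; note the escaped
dots and the optional subdomain group):

  +^http://([a-z0-9\-]+\.)*spicemobilephones\.co\.in/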

Sebastian
