Hi David,

> What can be crawl time for very big site, given depth param as 50, topN
> default (not passed) and default fetch interval as 2 mins..

Afaik, the default of topN is Long.MAX_VALUE, which is very large. So the
size of the crawl is mainly limited by the number of links you get. Anyway,
a depth of 50 is a high value, and with a delay of 2 min. (which is very
defensive) your crawl will take a long time.
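A rough back-of-envelope calculation illustrates this (a sketch assuming
the 2 min. act as the delay between successive requests to the same host,
so the pages of a single site are fetched one after the other):

  total fetch time ~= number of pages x delay per page
  e.g. 10,000 pages x 120 s = 1,200,000 s, i.e. about 14 days

The 10,000 pages are only an illustrative figure, but they show why a
2 min. delay is prohibitive for a very big site.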
Try to start with small values for depth and topN, e.g. 3 and 50. Then look
at your crawlDb statistics (bin/nutch readdb ... -stats) and check how the
numbers of fetched/unfetched/gone/etc. URLs increase to get a feeling for
which values make sense for your crawl. (A command sketch follows at the
end of this mail.)

> Case: Crawling website spicemobilephones.co.in, and in the
> regex-urlfilter.txt added +^ http://(a-z 0-9)spicemobilephones.co.in.

This doesn't look like a valid Java regex: a character class is written
with square brackets ([a-z0-9]), the literal dots should be escaped, and
the space after ^ prevents any URL from matching. Also, did you remove
these lines:

# accept anything else
+.
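A rule that accepts only URLs of this site could look like this (a minimal
sketch, assuming you want to allow an optional subdomain such as www;
adapt it to your needs):

  # accept URLs on spicemobilephones.co.in
  +^http://([a-z0-9]+\.)?spicemobilephones\.co\.in

  # reject anything else
  -.

Note that if you keep the default "+." line at the end instead, every URL
passes the filter and your site rule has no restricting effect.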
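For the small test crawl suggested above, the commands could look like this
(a sketch for the 1.x one-step crawl command; urls/ and crawl/ are
placeholder names for your seed directory and crawl directory):

  # crawl with conservative limits: depth 3, at most 50 URLs per round
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50

  # print CrawlDb statistics: TOTAL urls and per-status counts
  # (db_fetched, db_unfetched, db_gone, ...)
  bin/nutch readdb crawl/crawldb -stats

Watching how these per-status counts develop between runs tells you when
larger depth/topN values start to pay off.

Sebastian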