Hi Sebastian, yes, it's taking 2-3 days. OK, I will consider increasing the depth incrementally and checking the stats at every step. Thanks. Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/ and have removed +.
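For reference, the end of my conf/regex-urlfilter.txt now looks roughly like this (a sketch of my own setup, not the shipped defaults; the commented-out line shows the catch-all I removed):

  # accept only pages under the site host
  +^http://([a-z0-9]*\.)*spicemobiles.co.in/

  # the default "accept anything else" rule was removed:
  # +.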
What should the depth be for the next recrawl? I mean this question: say I have a crawlDb built with a depth of 5 and topN 10. Now I find that 3-4 URLs were deleted and 4 were modified, but I don't know which ones they are. So what I am doing is re-initiating the crawl. What depth should I give this time?

Thanks - David

On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel <[email protected]> wrote:
> Hi David,
>
> > What can be crawl time for very big site, given depth param as 50, topN
> > default (not passed) and default fetch interval as 2 mins..
> afaik, the default of topN is Long.MAX_VALUE which is very large.
> So, the size of the crawl is mainly limited by the number of links you get.
> Anyway, a depth of 50 is a high value; with a delay of 2 min. (which is
> very defensive) your crawl will take a long time.
>
> Try to start with small values for depth and topN, e.g. 3 and 50.
> Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
> and check how the numbers of fetched/unfetched/gone/etc. URLs increase
> to get a feeling which values make sense for your crawl.
>
> > Case: Crawling website spicemobilephones.co.in, and in the
> > regex-urlfilter.txt - added +^ http://(a-z 0-9)spicemobilephones.co.in.
> This doesn't look like a valid Java regex.
> Did you remove these lines:
> # accept anything else
> +.
>
> Sebastian
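P.S. Just to make sure I follow the suggestion about small rounds and checking stats, this is roughly what I plan to run (assuming the 1.x one-shot crawl command and a crawl directory simply named "crawl"; paths are from my setup, so adjust as needed):

  # small round with modest depth/topN, as suggested
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50

  # then inspect the crawlDb counters (db_fetched/db_unfetched/db_gone)
  bin/nutch readdb crawl/crawldb -stats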

