On Sat, Apr 6, 2013 at 3:31 AM, David Philip <[email protected]> wrote:
> Hi Sebastian,
>
> yes, its taking 2-3 days. Ok I will consider to giving incremental depth
> and check stats every step. Thanks.
> Yes I have given like this +^http://([a-z0-9]*\.)*spicemobiles.co.in/ and
> have removed +.
>
> what should be the depth for next recrawl case? I mean this question: say
> I had crawldb crawled with depth param 5 only and topN 10.. Now I find that
> 3-4 urls were deleted and 4 were modified.. I don’t know which those urls
> are. So what I am doing is re-initate crawl. At this time, what I should
> give depth param?
>

Once those URLs have entered the crawldb, the crawler no longer needs to
reach them from their parent pages. It has stored them in its crawldb /
webtable, and for each URL it maintains a re-crawl interval (30 days by
default). The crawler won't pick a URL for fetching until that interval has
elapsed since the URL was last fetched. The interval can be configured with
the db.fetch.interval.default property (value in seconds) in nutch-site.xml.

>
> Thanks - David
>
>
>
> On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel <[email protected]> wrote:
>
> > Hi David,
> >
> > > What can be crawl time for very big site, given depth param as 50, topN
> > > default(not passed ) and default fetch interval as 2mins..
> >
> > afaik, the default of topN is Long.MAX_VALUE which is very large.
> > So, the size of the crawl is mainly limited by the number of links you get.
> > Anyway, a depth of 50 is a high values, with a delay of 2min. (which is
> > very defensive) your crawl will take a long time.
> >
> > Try to start with small values for depth and topN, e.g. 3 and 50.
> > Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
> > and check how the numbers of fetch/unfetched/gone/etc. URLs increase
> > to get a feeling which values make sense for your crawl.
> >
> > > Case: Crawling website spicemobilephones.co.in, and in the
> > > regexurlfilter.txt – added +^ http://(a-z 0-9)spicemobilephones.co.in.
> >
> > This doesn't look like a valid Java regex.
> >
> > Did you remove these lines:
> >
> > # accept anything else
> > +.
> >
> > Sebastian
> >
>
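To make the re-crawl interval rule concrete, here is a minimal sketch of the "is this URL due again?" check in plain Java. It is only an illustration of the rule described above, not Nutch's actual scheduling code: the class RefetchCheck and the method isDue are hypothetical names, and the constant simply mirrors the 30-day default of db.fetch.interval.default expressed in seconds.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration only; this is not Nutch's scheduling code.
class RefetchCheck {
    // Assumed to mirror the 30-day default of db.fetch.interval.default, in seconds.
    static final long DEFAULT_INTERVAL_SECONDS = Duration.ofDays(30).getSeconds();

    // A URL is due again once the full interval has elapsed since its last fetch.
    static boolean isDue(Instant lastFetch, Instant now, long intervalSeconds) {
        return !now.isBefore(lastFetch.plusSeconds(intervalSeconds));
    }

    public static void main(String[] args) {
        Instant lastFetch = Instant.parse("2013-04-01T00:00:00Z");
        Instant fiveDaysLater = lastFetch.plus(Duration.ofDays(5));
        Instant thirtyOneDaysLater = lastFetch.plus(Duration.ofDays(31));

        // Five days after the fetch the URL is skipped; after 31 days it is eligible.
        System.out.println(isDue(lastFetch, fiveDaysLater, DEFAULT_INTERVAL_SECONDS));
        System.out.println(isDue(lastFetch, thirtyOneDaysLater, DEFAULT_INTERVAL_SECONDS));
    }
}
```

Under this model, a re-crawl started a few days after the first one would skip every URL already fetched, which is why pages that were modified or deleted in the meantime are only noticed after the interval has elapsed or been lowered.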

