Hi Tejas, thank you. So what I understand is: when we initiate a *re-crawl* with depth 1, it will check for all the urls due for fetch in the first loop itself and fetch them. Correct?
What I basically wanted to know is: for the first fresh crawl the depth was 20 and it took 1 day, and now (after a week) I want to re-initiate the crawl (same crawldb and same Solr host, Nutch 1.6). What should the depth be?

Thanks
- David

On Sat, Apr 6, 2013 at 4:53 PM, Tejas Patil <[email protected]> wrote:

> On Sat, Apr 6, 2013 at 3:31 AM, David Philip <[email protected]> wrote:
>
> > Hi Sebastian,
> >
> > Yes, it's taking 2-3 days. OK, I will consider giving an incremental
> > depth and checking the stats at every step. Thanks.
> > Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/
> > and have removed +.
> >
> > What should the depth be for the next re-crawl case? I mean this
> > question: say I had a crawldb crawled with a depth param of 5 only and
> > topN 10. Now I find that 3-4 urls were deleted and 4 were modified. I
> > don't know which those urls are, so what I am doing is re-initiating the
> > crawl. At this time, what depth param should I give?
>
> Once those urls enter the crawldb, the crawler won't need to reach them
> from their parent page again. The crawler has stored those urls in its
> crawldb / webtable. With each url, a re-crawl interval is maintained
> (which is by default set to 30 days). The crawler won't pick a url for
> crawling if its fetch interval hasn't elapsed since the last time the url
> was fetched. The crawl interval can be configured using the
> db.fetch.interval.default property in nutch-site.xml.
>
> > Thanks - David
> >
> > On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel <
> > [email protected]> wrote:
> >
> > > Hi David,
> > >
> > > > What can be the crawl time for a very big site, given a depth param
> > > > of 50, topN default (not passed) and default fetch interval of 2 mins?
> > > afaik, the default of topN is Long.MAX_VALUE, which is very large.
> > > So, the size of the crawl is mainly limited by the number of links you
> > > get. Anyway, a depth of 50 is a high value, and with a delay of 2 min.
> > > (which is very defensive) your crawl will take a long time.
> > >
> > > Try to start with small values for depth and topN, e.g. 3 and 50.
> > > Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
> > > and check how the numbers of fetched/unfetched/gone/etc. URLs increase
> > > to get a feeling for which values make sense for your crawl.
> > >
> > > > Case: Crawling website spicemobilephones.co.in, and in
> > > > regex-urlfilter.txt added +^ http://(a-z 0-9) spicemobilephones.co.in.
> > > This doesn't look like a valid Java regex.
> > > Did you remove these lines:
> > > # accept anything else
> > > +.
> > >
> > > Sebastian
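For the re-crawl question above, a rough sketch of what the command could look like in Nutch 1.6, re-using the existing crawldb (the urls/ seed directory, crawl/ output directory, topN value and Solr URL are placeholders, not taken from the thread):

    # re-use the existing crawl dir (crawldb, linkdb, segments) and index to the same Solr host;
    # a small depth is often enough for a re-crawl, since known urls are picked
    # straight from the crawldb once their fetch interval has elapsed
    bin/nutch crawl urls -dir crawl -depth 1 -topN 1000 \
        -solr http://localhost:8983/solr/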
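The db.fetch.interval.default property Tejas mentions goes into conf/nutch-site.xml; the value is in seconds, and the stock default of 2592000 seconds is the 30 days he refers to. A minimal entry, assuming (for illustration) a one-week interval is wanted:

    <!-- conf/nutch-site.xml -->
    <property>
      <name>db.fetch.interval.default</name>
      <!-- re-fetch interval in seconds; 604800 = 7 days (default 2592000 = 30 days) -->
      <value>604800</value>
    </property>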
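For the URL filter Sebastian comments on, the rule David quotes belongs in conf/regex-urlfilter.txt, with the default "accept anything else" rule removed so the crawl stays on the one host. A sketch, using the host as given in the thread:

    # conf/regex-urlfilter.txt
    # accept only the target host (first matching rule wins)
    +^http://([a-z0-9]*\.)*spicemobiles.co.in/

    # explicit reject-all instead of the default "+." accept-all rule
    -.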
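Sebastian's suggestion to check the crawldb statistics after each round can be run as below (the crawl/crawldb path assumes the default -dir crawl layout):

    # per-status url counts (db_fetched, db_unfetched, db_gone, ...)
    bin/nutch readdb crawl/crawldb -stats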

