Hi Abhishek,

You need to limit your crawl cycles, say to 2 million URLs per fetch cycle. Check the topN parameter for generate.
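One bounded cycle along those lines might look like the following. This is a sketch assuming a Nutch 1.x local installation; the crawldb and segments paths are placeholders to adjust for your setup.

```shell
#!/bin/sh
# One bounded crawl cycle: generate a fetch list capped at 2 million
# URLs, fetch and parse it, then fold the results back into the crawldb.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments

# -topN limits how many URLs generate puts into this cycle's fetch list
bin/nutch generate $CRAWLDB $SEGMENTS -topN 2000000

# Segments are named by timestamp, so the newest one is this cycle's
SEGMENT=$SEGMENTS/$(ls -t $SEGMENTS | head -1)

bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb $CRAWLDB $SEGMENT
```

Wrapping this in a loop gives you the continuous crawl script described below; each pass through the loop produces one segment whose results are visible as soon as that cycle finishes.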
By default a URL becomes eligible for recrawl after 30 days, though this interval is configurable. You can run a continuous crawl script with 2 million URLs per cycle. You can purge old segments after a period, since most of their content will have been recrawled by then.

Hope it helps.

Thanks,
Charan

On Feb 4, 2011, at 6:32 AM, Amine BENHAMZA <[email protected]> wrote:

> Hi,
>
> I saw this page on the wiki, maybe it could help you:
> http://wiki.apache.org/nutch/MonitoringNutchCrawls
>
> Good Luck
>
> Amine.
>
> On 4 February 2011 15:02, .: Abhishek :. <[email protected]> wrote:
>
>> Hi all,
>>
>> Any help on this would be highly appreciated. I am still stuck :(
>>
>> Thanks,
>> Abi
>>
>> On Fri, Feb 4, 2011 at 8:42 AM, .: Abhishek :. <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I am crawling a really huge site, and the crawl has been running for
>>> almost 5 days now and it's still continuing.
>>>
>>> So until this crawl ends, I will not be able to see the results? What
>>> do I do to get the results while the crawl is still going on?
>>>
>>> Also, in this case how do I configure re-crawls? What would be an
>>> optimal re-crawl interval?
>>>
>>> Thanks,
>>> Abi
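For reference, the 30-day default Charan mentions is the db.fetch.interval.default property, expressed in seconds. It can be overridden in conf/nutch-site.xml; the 7-day value below is just an illustration, not a recommendation:

```xml
<!-- nutch-site.xml: recrawl a page every 7 days instead of the
     default 30 days (2592000 seconds) -->
<property>
  <name>db.fetch.interval.default</name>
  <value>604800</value>
</property>
```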

