Hi folks,

Thanks for your help. I will try these and get back if I have more questions.

Regards,
Gokul

On Sat, Feb 5, 2011 at 4:17 AM, Charan K <[email protected]> wrote:

> Hi Abishek,
>
> You need to limit your crawl cycles, say to 2 million URLs per fetch. Check
> the topN parameter for generate.
>
> By default a URL becomes eligible for recrawl after 30 days, though this can
> be configured.
>
> You can have a continuous crawl script with 2M URLs per cycle. You can purge
> old segments after a period, since most of their URLs would have been
> recrawled by then.
>
> Hope it helps.
>
> Thanks,
> Charan
>
> On Feb 4, 2011, at 6:32 AM, Amine BENHAMZA <[email protected]> wrote:
>
>> Hi,
>>
>> I saw this page on the wiki; maybe it could help you:
>> http://wiki.apache.org/nutch/MonitoringNutchCrawls
>>
>> Good luck,
>>
>> Amine.
>>
>> On 4 February 2011 15:02, .: Abhishek :. <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Any help on this would be highly appreciated. I am still stuck :(
>>>
>>> Thanks,
>>> Abi
>>>
>>> On Fri, Feb 4, 2011 at 8:42 AM, .: Abhishek :. <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am crawling a really huge site, and the crawl has been running for
>>>> almost 5 days now and is still continuing.
>>>>
>>>> So until this crawl ends, I will not be able to see the results? What do
>>>> I do to get the results while the crawl is still going on?
>>>>
>>>> Also, in this case how do I configure re-crawls? What would be an
>>>> optimal re-crawl interval?
>>>>
>>>> Thanks,
>>>> Abi
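For anyone finding this thread later: the continuous-crawl approach Charan describes (bounded generate/fetch cycles plus periodic segment purging) can be sketched roughly as below. This is a minimal sketch assuming a Nutch 1.x command-line layout (`bin/nutch generate/fetch/parse/updatedb`); the directory paths and the 30-day purge window are illustrative assumptions, not from the thread, and only the 2M topN figure comes from Charan's advice.

```shell
#!/bin/sh
# Sketch of a continuous crawl loop (assumed Nutch 1.x layout; paths illustrative).
TOPN=2000000            # cap each cycle at ~2M URLs, per Charan's suggestion
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments

# Delete segment directories older than the given age in days. By then most
# of their URLs will have been refetched into newer segments, so the old
# segments can be purged to reclaim space.
purge_old_segments() {
    segdir=$1
    max_age_days=$2
    find "$segdir" -mindepth 1 -maxdepth 1 -type d -mtime +"$max_age_days" \
        -exec rm -rf {} +
}

# One generate/fetch/parse/updatedb cycle per iteration; runs until
# interrupted. Guarded so the sketch is a no-op without a Nutch install.
if [ -x bin/nutch ]; then
    while true; do
        bin/nutch generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN"
        # Newest segment: segment names are timestamps, so they sort lexically.
        segment=$(ls -d "$SEGMENTS"/* | tail -1)
        bin/nutch fetch "$segment"
        bin/nutch parse "$segment"
        bin/nutch updatedb "$CRAWLDB" "$segment"
        purge_old_segments "$SEGMENTS" 30
    done
fi
```

Because each cycle updates the crawldb, you can index and inspect the already-fetched segments between cycles instead of waiting for one monolithic crawl to finish.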

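The 30-day recrawl default Charan mentions is Nutch's `db.fetch.interval.default` property (in seconds; 2592000 s = 30 days). It can be overridden in `conf/nutch-site.xml`; the 7-day value below is purely illustrative, not a recommendation from the thread:

```xml
<!-- conf/nutch-site.xml: how long until a fetched URL is eligible for recrawl -->
<property>
  <name>db.fetch.interval.default</name>
  <value>604800</value> <!-- 7 days (illustrative); default is 2592000 = 30 days -->
</property>
```

A reasonable interval depends on how often the site's content actually changes; shortening it increases crawl load proportionally.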
