Hi Abhishek,

  You need to limit the size of each crawl cycle, say to 2 million URLs per 
fetch. Have a look at the topN parameter of the generate step.
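
  For example, something along these lines (crawl/crawldb and crawl/segments 
are just placeholder paths; substitute your own layout):

    bin/nutch generate crawl/crawldb crawl/segments -topN 2000000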

  By default a URL becomes eligible for recrawl after 30 days, but that 
interval is configurable.
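
  For instance, to make URLs eligible for recrawl after 7 days instead, you 
could override the fetch interval property in conf/nutch-site.xml (the value 
is in seconds):

    <property>
      <name>db.fetch.interval.default</name>
      <value>604800</value>
    </property>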
 
  You can run a continuous crawl script that fetches 2 million URLs per 
cycle. You can purge old segments after a while, since most of their content 
will have been recrawled into newer segments by then.
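
  Each cycle of such a script could look roughly like this (an untested 
sketch; the paths are assumptions, and you may not need the separate parse 
step if your fetcher is configured to parse while fetching):

    #!/bin/bash
    # One crawl cycle: generate up to 2M URLs, fetch them, parse,
    # and fold the results back into the crawldb.
    bin/nutch generate crawl/crawldb crawl/segments -topN 2000000
    segment=$(ls -d crawl/segments/* | tail -1)
    bin/nutch fetch "$segment"
    bin/nutch parse "$segment"
    bin/nutch updatedb crawl/crawldb "$segment"

    # Purge segments older than the recrawl interval (here 30 days),
    # since their content will have been refetched into newer segments.
    find crawl/segments -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} \;

  Run it from cron or in a loop to keep the crawl going continuously.
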
 Hope it helps.

 Thanks
 Charan

On Feb 4, 2011, at 6:32 AM, Amine BENHAMZA <[email protected]> wrote:

> Hi,
> 
> I saw this page on the wiki; maybe it could help you:
> http://wiki.apache.org/nutch/MonitoringNutchCrawls
> 
> Good Luck
> 
> Amine.
> 
> On 4 February 2011 15:02, .: Abhishek :. <[email protected]> wrote:
> 
>> Hi all,
>> 
>> Any help on this would be highly appreciated. I am still stuck :(
>> 
>> Thanks,
>> Abi
>> 
>> On Fri, Feb 4, 2011 at 8:42 AM, .: Abhishek :. <[email protected]> wrote:
>> 
>>> Hi all,
>>> 
>>> I am crawling a really huge site, and the crawl has been running for
>>> almost 5 days now and it's still continuing.
>>> 
>>> So until this crawl ends, I will not be able to see the results? What do
>>> I do to get the results while the crawl is still going on?
>>> 
>>> Also, in this case how do I configure re-crawls? What would be an
>>> optimal re-crawl interval?
>>> 
>>> Thanks,
>>> Abi
>>> 
>> 
