Hi folks,

 Thanks for your help. I will try these and get back if I have more
questions.

Regards,
Gokul

On Sat, Feb 5, 2011 at 4:17 AM, Charan K <[email protected]> wrote:

> Hi Abishek,
>
>  You need to limit your crawl cycles, say to 2 million URLs per fetch. Check
> the topN parameter of the generate step.
>
>  By default a URL becomes eligible for recrawl after 30 days, though this
> interval can be configured.
>
>  You can run a continuous crawl script with 2 million URLs per cycle. You can
> purge old segments after a period, since most of them would have been
> recrawled by then.
>  Hope it helps
>
>  Thanks
>  Charan
>
> On Feb 4, 2011, at 6:32 AM, Amine BENHAMZA <[email protected]>
> wrote:
>
> > Hi,
> >
> > I saw this page on the wiki; maybe it could help you:
> > http://wiki.apache.org/nutch/MonitoringNutchCrawls
> >
> > Good Luck
> >
> > Amine.
> >
> > On 4 February 2011 15:02, .: Abhishek :. <[email protected]> wrote:
> >
> >> Hi all,
> >>
> >> Any help on this would be highly appreciated. I am still stuck :(
> >>
> >> Thanks,
> >> Abi
> >>
> >> On Fri, Feb 4, 2011 at 8:42 AM, .: Abhishek :. <[email protected]>
> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I am crawling a really huge site, and the crawl has been running for
> >>> almost 5 days now and it is still continuing.
> >>>
> >>> So until this crawl ends, I will not be able to see the results? What do
> >>> I do to get the results while the crawl is still going on?
> >>>
> >>> Also, in this case how do I configure re-crawls? What would be an optimal
> >>> re-crawl interval?
> >>>
> >>> Thanks,
> >>> Abi
> >>>
> >>
>
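
For reference, the 30-day default Charan mentions is controlled by the
db.fetch.interval.default property and can be overridden in conf/nutch-site.xml.
The 7-day value below is only an example:

```xml
<!-- conf/nutch-site.xml: override the default recrawl interval -->
<property>
  <name>db.fetch.interval.default</name>
  <!-- Seconds before a fetched URL becomes eligible for recrawl;
       604800 = 7 days (the shipped default is 2592000 = 30 days) -->
  <value>604800</value>
</property>
```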

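The continuous-crawl cycle described in the thread could be sketched roughly as
follows. This is only a sketch, assuming a Nutch 1.x install run from its home
directory with fetcher.parse left at false; the paths and topN value are
illustrative:

```shell
#!/bin/sh
# Sketch of a continuous crawl: each cycle generates at most 2 million
# URLs, fetches and parses them, then folds the results back into the
# crawl db, so results are available after every cycle rather than only
# when the whole crawl finishes.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments

while true; do
  # Select the top 2M URLs due for fetching and create a new segment
  bin/nutch generate "$CRAWLDB" "$SEGMENTS" -topN 2000000
  # The new segment is the most recently created directory
  SEGMENT="$SEGMENTS/$(ls "$SEGMENTS" | sort | tail -1)"
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  # Update the crawl db so fetched URLs get their next-fetch time set
  bin/nutch updatedb "$CRAWLDB" "$SEGMENT"
done
```

Old segments can then be deleted periodically, as Charan suggests, once their
contents have been recrawled into newer segments.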