Try Hadoop'in it up...

http://wiki.apache.org/nutch/NutchHadoopTutorial

The version of Nutch in trunk depends on a project called Gora, which is
supposed to help speed things up as well, but I have yet to make it work.
I'd stick with the tagged version 1.2 and go the Hadoop route.

Best,
Adam

On Wed, Feb 2, 2011 at 7:39 AM, McGibbney, Lewis John
<[email protected]> wrote:
> The best resource for this is the wiki. I managed to improve my crawl
> times by implementing as many of its suggestions as possible:
>
> http://wiki.apache.org/nutch/OptimizingCrawls
> Lewis
>
>
>
> -----Original Message-----
> From: Arjun Kumar Reddy [mailto:[email protected]]
> Sent: 02 February 2011 07:52
> To: [email protected]
> Subject: How to speed up nutch crawling!
>
> Hi list,
>
> I am Arjun.
>
> I am trying to develop an application in which I give a constrained set
> of urls to the urls file in Nutch. I am able to crawl these urls and get
> their contents by reading the data from the segments.
>
> I crawl with depth 1, as I am in no way concerned with the outlinks or
> inlinks of the web pages. I only need the contents of the web pages listed
> in the urls file.
>
> But performing this crawl takes time. Please suggest a way to decrease the
> crawl time. I also don't need indexing, because I am not concerned with
> the search part.
>
> Kindly suggest how to speed up the crawl.
>
> Thanks and regards,
> Ch. Arjun Kumar Reddy
