Hello - maybe you don't need Hadoop. How many URLs do you think you are going 
to crawl, and more importantly, from how many different websites? Nutch in 
local mode with some decent hardware and an SSD can still handle ~20 million 
URLs if spread over ~1000 different hosts.
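
For a sense of scale, a local-mode crawl needs nothing beyond the Nutch 
runtime itself. A rough sketch, assuming a Nutch 1.x build with a urls/ 
seed directory (paths and the round count are just illustrative):

  # run from runtime/local of a Nutch 1.x build
  # urls/ holds a plain-text file with one seed URL per line
  bin/crawl urls crawl 5

All data (CrawlDb, segments) lands on the local filesystem under crawl/, 
which is why a fast SSD matters at this scale.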

Markus
 
-----Original message-----
> From:Sebastian Nagel <[email protected]>
> Sent: Tuesday 12th January 2016 14:21
> To: [email protected]
> Subject: Re: Distributed Crawling
> 
> Hi,
> 
> Nutch was designed as a *distributed* crawler.
> 
> This tutorial should help:
>  https://wiki.apache.org/nutch/NutchHadoopTutorial
> (it may be a little outdated, esp. for 1.11,
>  which switched from Hadoop 1.2 to 2.4
>  -- we are grateful for any updates and additions.
>  Thanks!)
> 
> It's not easy to manage a Hadoop cluster:
> - you may want to start by learning how to run
>   Nutch in pseudo-distributed mode:
>   http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial
> - or run Nutch on a Hadoop cloud (e.g., on AWS).
> Either way, deploy mode looks roughly like the sketch below.
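> 
> A rough sketch of deploy mode once a (pseudo-)cluster is up -- the
> paths are illustrative, and this assumes a Nutch 1.x source checkout:
> 
>   # build the .job file, then work from runtime/deploy
>   ant runtime
>   cd runtime/deploy
>   # in deploy mode the seed list must live on HDFS
>   hadoop fs -put urls urls
>   bin/crawl urls crawl 3
> 
> Here bin/nutch detects the .job file and submits each step as a
> MapReduce job via 'hadoop jar' instead of running it locally.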
> 
> There are many people sharing their experience out there,
> just google for:
>  nutch distributed crawling
>  nutch aws
> or have a look at Julien's recent video tutorial:
>  https://www.youtube.com/watch?v=v9zjcTjjjyU
> 
> Cheers,
> Sebastian
> 
> On 01/12/2016 01:19 AM, Manish Verma wrote:
> > Hello Friends,
> > 
> > I am using Nutch 1.10 and want to do distributed crawling for speed. Is 
> > this supported in Nutch 1.x or 2.x?
> > Is there any documentation on this?
> > 
> > Thanks Manish
> > 
> 
> 
