Re: CrawlDB, very slow

Markus Jelsma Tue, 28 Sep 2010 05:30:44 -0700

Thanks Andrzej,

I will make an effort to get it running on Hadoop, but I'd rather go for a 
fully distributed setup (although with only a single node for now) so I can 
add more machines later. Will the HadoopNutch tutorial on the wiki allow me to 
set up a cluster on a single node? Also, will it then still make use of 
multiple cores?


Cheers,

On Tuesday 28 September 2010 14:20:02 Andrzej Bialecki wrote:
> On 2010-09-28 14:02, Markus Jelsma wrote:
> > Hi,
> >
> > My test setup (local only) now has just over 20 million URLs. I have
> > fetched 3 million already and the rest still needs to be fetched. It is
> > now less wasteful to fetch for 12 hours at a stretch, because merging
> > now takes over 5.5 hours!
> >
> > I've searched but found little information so far. Would now be a good
> > time to try running Nutch on a Hadoop cluster (which I don't have), or to
> > try to let Hadoop take advantage of my multiple cores?
> 
> Even running Hadoop in pseudo-distributed mode (on a single node but
> with real JobTracker/TaskTracker) would be much better. The reason is
> that in local mode tasks are NOT executed in parallel but serially.
> 
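
For reference, a minimal sketch of what the pseudo-distributed setup described
above might look like. It assumes Hadoop 0.20.x-style property names; the
ports and the slot counts are illustrative assumptions, not values taken from
this thread or from the wiki tutorial.

```xml
<!-- conf/core-site.xml: use HDFS on this machine instead of the local FS.
     Port 9000 is an assumed, commonly used value. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: a single node can only hold one replica. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: pointing at a real JobTracker (rather than the
     default value "local") is what switches off local mode. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <!-- Concurrent task slots on this TaskTracker. The value 4 is an
       assumption; it would normally be sized to the number of cores. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
</configuration>
```

With more than one task slot configured, the TaskTracker can run several map
or reduce tasks at the same time, which is how a single multi-core node gets
used in parallel. Adding machines later should mostly be a matter of replacing
localhost with the master's hostname and listing the new nodes as slaves.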

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
