Thanks for your comments. I'll consult this thread later when I've got the 
time to test the distributed mode, and possibly set up HDFS right away since 
I'm going to need it anyway.


On Tuesday 28 September 2010 16:26:48 Andrzej Bialecki wrote:
> On 2010-09-28 14:27, Markus Jelsma wrote:
> > Thanks Andrzej,
> >
> > I will make an effort to get it running on Hadoop, but I'd rather go
> > for a fully distributed setup (although with only a single node for now)
> > so I can add more machines later.
> 
> That's what I meant, sorry for using jargon - pseudo-distributed is a
> "fully distributed Hadoop that runs on a single node". Please note that
> you don't have to use HDFS then - all nodes :) have direct access to the
> same local file system.
> 
> > Will the HadoopNutch tutorial on the wiki allow me to
> > set up a cluster on a single node? Also, will it then still make use
> > of multiple cores?
> 
> Yes, because there will be multiple tasks running in parallel, in
> multiple processes, which will likely run on different cores.
> 
> As I said, the main big difference between using LocalJobTracker and a
> real JobTracker is that with LocalJobTracker:
> 
> * all map tasks run sequentially; there is no parallelism.
> * there is always a single reduce task - if your dataset is large, this
> one task has to sort the whole dataset, which may take
> disproportionately long compared to splitting the data among
> multiple reduce tasks.
> 
> Whereas with the JobTracker/TaskTracker, even when running on a single
> node:
> 
> * tasks run in separate processes and execute in parallel
> * there are multiple reduce tasks (as many as you configure), each
> handling a portion of the output dataset, and these also execute in
> parallel.
> 
> So even on a single node a pseudo-distributed setup should be faster
> than running in local mode.
> 
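For my own notes, a minimal sketch of what the pseudo-distributed setup seems to boil down to, assuming Hadoop 0.20-era property names and example host/port values (localhost:9001, 4 reduce tasks are my placeholders, not anything prescribed above):

```xml
<!-- conf/mapred-site.xml: point jobs at a real JobTracker instead of
     the LocalJobTracker, so tasks run in separate parallel processes -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value> <!-- example address; "local" would mean local mode -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value> <!-- multiple reducers, so sorting is split across tasks -->
  </property>
</configuration>
```

As noted above, HDFS is optional on a single node, so fs.default.name in core-site.xml could presumably stay on the local file system.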

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
