On 2010-09-28 14:27, Markus Jelsma wrote:
> Thanks Andrzej,
> I will make an effort in getting it to run on Hadoop but i'd rather go for a
> fully distributed set up (although with only a single node for now) so i can
> add more machines later.
That's what I meant, sorry for using jargon - pseudo-distributed is a
"fully distributed Hadoop that runs on a single node". Please note that
you don't have to use HDFS then - all nodes :) have direct access to the
same local file system.
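A minimal sketch of what such a single-node setup could look like in the Hadoop 0.20-era config files (the port number is an assumption; the point is that `fs.default.name` can stay on the local file system while `mapred.job.tracker` points at a real JobTracker instead of "local"):

```xml
<!-- core-site.xml: keep the default local file system, no HDFS needed
     on a single node (hypothetical example, not from the tutorial) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>

<!-- mapred-site.xml: a real JobTracker on localhost instead of the
     in-process local runner; port 9001 is just a common choice -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```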
> Will the HadoopNutch tutorial on the wiki allow me to
> set up for a cluster on a single node? Also, will it then still make use of
> multiple cores?
Yes, because there will be multiple tasks running in parallel, in
multiple processes, which will likely run on different cores.
As I said, the main difference between using the LocalJobRunner and a
real JobTracker is that with the LocalJobRunner:
* all map tasks run sequentially; there is no parallelism.
* there is always exactly one reduce task - if your dataset is large, this
single task has to handle the sorting of the whole dataset, which
may take disproportionately longer than if the data were split among
multiple reduce tasks.
Whereas with the JobTracker/TaskTracker, even when running on a single node:
* tasks are run in separate processes and execute in parallel
* there are many reduce tasks (as many as you configured), each handling
a portion of the output dataset, and these also execute in parallel.
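The reduce-task count and the per-node parallelism mentioned above are both configurable; a hedged sketch using the 0.20-era property names (the values shown are only examples - tune them to your core count):

```xml
<!-- mapred-site.xml additions (illustrative values, not recommendations) -->
<configuration>
  <!-- default number of reduce tasks per job -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
  <!-- how many map/reduce tasks one TaskTracker runs concurrently;
       on a multi-core single node this is what spreads work across cores -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```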
So even on a single node a pseudo-distributed setup should be faster
than running in local mode.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com