On 2010-09-28 14:27, Markus Jelsma wrote:
> Thanks Andrzej,
> I will make an effort in getting it to run on Hadoop but i'd rather go for a
> fully distributed set up (although with only a single node for now) so i can
> add more machines later.
That's what I meant, sorry for using jargon - pseudo-distributed is a
"fully distributed Hadoop that runs on a single node". Please note that
you don't have to use HDFS then - all nodes :) have direct access to the
same local file system.
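A minimal sketch of what such a single-node setup could look like in the Hadoop 0.20-era config files (the port number is an assumption; the point is that `fs.default.name` can stay on the local file system while `mapred.job.tracker` points at a real JobTracker instead of "local"):

```xml
<!-- core-site.xml: keep the default local file system, no HDFS needed
     on a single node (hypothetical example, not from the tutorial) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>

<!-- mapred-site.xml: a real JobTracker on localhost instead of the
     in-process local runner; port 9001 is just a common choice -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```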
> Will the HadoopNutch tutorial on the wiki allow me to
> set up for a cluster on a single node? Also, will it then still make use of
> multiple cores?
Yes, because there will be multiple tasks running in parallel, in
multiple processes, which will likely run on different cores.
As I said, the main difference between using the LocalJobRunner and a
real JobTracker is that with the LocalJobRunner:
* all map tasks run sequentially; there is no parallelism.
* there is always exactly one reduce task - if your dataset is large, this
single task has to handle the sorting of the whole dataset, which
may take disproportionately longer than if the data were split among
multiple reduce tasks.
Whereas with the JobTracker/TaskTracker, even when running on a single node:
* tasks are run in separate processes and execute in parallel
* there are many reduce tasks (as many as you configured), each handling
a portion of the output dataset, and these also execute in parallel.
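The reduce-task count and the per-node parallelism mentioned above are both configurable; a hedged sketch using the 0.20-era property names (the values shown are only examples - tune them to your core count):

```xml
<!-- mapred-site.xml additions (illustrative values, not recommendations) -->
<configuration>
  <!-- default number of reduce tasks per job -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
  <!-- how many map/reduce tasks one TaskTracker runs concurrently;
       on a multi-core single node this is what spreads work across cores -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```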
So even on a single node a pseudo-distributed setup should be faster
than running in local mode.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com