Understanding of the hadoop distribution system (tuning)

Elaine Gan Mon, 10 Sep 2012 18:56:42 -0700

Hi,

I'm new to Hadoop and I've just played around with MapReduce.
I would like to check whether my understanding of Hadoop is correct,
and I would appreciate it if anyone could correct me if I'm wrong.


I have around 518MB of data, and I wrote an MR program to process it.
Here are some of the settings in my mapred-site.xml.
---------------------------------------------------------------
mapred.tasktracker.map.tasks.maximum = 20
mapred.tasktracker.reduce.tasks.maximum = 20
---------------------------------------------------------------
My block size is the default, 64MB.
With a data size of 518MB, I guess setting the maximum number of MR
tasks to 20 is far more than enough (518/64 is roughly 8). Did I get
that right?
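
For reference, the 518/64 arithmetic above can be sketched as follows.
This is only a rough estimate under stated assumptions: a single input
file, a split size equal to the 64MB block size, and the 1.1 "slop"
factor that Hadoop's FileInputFormat applies so that a small leftover is
folded into the last split. The function name estimate_map_tasks is made
up for illustration and is not part of Hadoop.

```python
MB = 1024 * 1024
# Assumption: the 10% slop factor used by Hadoop's FileInputFormat.
SPLIT_SLOP = 1.1


def estimate_map_tasks(file_size, split_size, slop=SPLIT_SLOP):
    """Estimate the number of input splits (map tasks) for ONE file.

    Hypothetical helper for illustration; it ignores multiple input
    files, compression, and any min/max split size settings.
    """
    splits = 0
    remaining = file_size
    # Carve off full-size splits while the remainder is still larger
    # than slop * split_size.
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    # Whatever is left (if anything) becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    # 518MB of input with 64MB splits
    print(estimate_map_tasks(518 * MB, 64 * MB))
```

With these assumptions, the last 70MB (518 - 7*64) stays in one split
because 70/64 is below 1.1, so the estimate comes out at 8, which
matches the Maps Total = 8 that I see.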

When I run the MR program, I can see on the Map/Reduce Administration
page that Maps Total = 8, so I assume that everything is going well
here. Once again, if I'm wrong please correct me.
(Sometimes it shows only Maps Total = 3.)

There's one thing I'm uncertain about regarding how Hadoop distributes
work.
Does Maps Total = 8 mean that there are 8 map tasks split among all
the data nodes (task trackers)?
Is there any way I can check whether all the tasks are shared among the
data nodes (where the task trackers are running)?
When I click on each link under Task Id, I can see "Input Split
Locations" listed in each task's details. If the inputs are split
between the data nodes, does that mean that everything is working
well?

I need to make sure everything is running well, because my MR job took
around 6 hours to finish even though the input is small. (I know Hadoop
is not meant for small data.) I'm not sure whether my configuration is
wrong or Hadoop is just not suitable for my case.
I'm actually running a Mahout k-means analysis.

Thank you for your time.
