Hi, I'm new to Hadoop and have just started playing around with MapReduce. I would like to check whether my understanding of Hadoop is correct, and I would appreciate it if anyone could correct me where I'm wrong.
I have around 518 MB of data, and I wrote an MR program to process it. Here are some of the settings in my mapred-site.xml:

---------------------------------------------------------------
mapred.tasktracker.map.tasks.maximum = 20
mapred.tasktracker.reduce.tasks.maximum = 20
---------------------------------------------------------------

My block size is the default, 64 MB. With a data size of 518 MB, I guess that setting the maximum number of MR tasks to 20 is far more than enough (518/64 ≈ 8). Did I get that right?

When I run the MR program, the Map/Reduce Administration page shows Maps Total = 8, so I assume everything is going well here. Once again, if I'm wrong, please correct me. (Sometimes it shows only Maps Total = 3.)

There is one thing I'm uncertain about regarding how Hadoop distributes work. Does Maps Total = 8 mean that there are 8 map tasks split among all the data nodes (task trackers)? Is there any way I can check whether the tasks are shared among the data nodes (where the task trackers are running)? When I click on each link under Task Id, I can see "Input Split Locations" listed in the task details. If the inputs are split across data nodes, does that mean everything is working well?

I need to make sure everything is running properly, because my MR job took around 6 hours to finish even though the input is small. (I know Hadoop is not meant for small data.) I'm not sure whether my configuration is wrong or Hadoop is just not suitable for my case. I'm actually running a Mahout k-means analysis.

Thank you for your time.
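
P.S. For reference, here is a rough sketch of how I arrived at the 518/64 estimate. It is not Hadoop code, just my own arithmetic in Python. It assumes the input is a single splittable file of exactly 518 MB, that the split size equals the 64 MB block size, and that the last split may be up to 1.1 times the split size (my understanding of how FileInputFormat treats the tail of a file; please correct me if that is wrong). The function name estimate_splits is made up for this example.

```python
MB = 1024 * 1024


def estimate_splits(file_size_bytes, split_size_bytes, slop=1.1):
    """Estimate the number of input splits for one splittable file.

    Assumption: the final split is allowed to be up to `slop` times
    the split size, so a small tail is merged into the last split
    instead of becoming a tiny split of its own.
    """
    splits = 0
    remaining = file_size_bytes
    # Carve off full-size splits while the remainder is clearly larger
    # than one split (by more than the slop factor).
    while remaining / split_size_bytes > slop:
        splits += 1
        remaining -= split_size_bytes
    # Whatever is left (if anything) becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    print(estimate_splits(518 * MB, 64 * MB))
```

Under those assumptions this gives the same number of map tasks that the administration page reports for the 518 MB input. If the input were actually several smaller files, the count would be different, since each file is split separately.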
