Hi Hermanth,

Thank you for your detailed answers. They helped me a lot in understanding, especially regarding the Job UI.
Sorry, I left out my specs:

NameNode (JobTracker) : CPUx4
DataNode (TaskTracker) : CPUx4

I am replying inline too.

> > I have a data of around 518MB, and i wrote a MR program to process it.
> > Here are some of my settings in my mapred-site.xml.
> > ---------------------------------------------------------------
> > mapred.tasktracker.map.tasks.maximum = 20
> > mapred.tasktracker.reduce.tasks.maximum = 20
> > ---------------------------------------------------------------
> >
>
> These two configurations essentially tell the tasktrackers that they can
> run 20 maps and 20 reduces in parallel on a machine. Is this what you
> intended ? (Generally the sum of these two values should equal the number
> of cores on your tasktracker node, or a little more).
>
> Also, would help if you can tell us your cluster size - i.e. number of
> slaves.

Cluster size (number of slaves) = 4
Yes, I meant that the maximum number of tasks that can run on one machine is 20, for both map and reduce.

> > My block size is default, 64MB
> > With my data size = 518MB, i guess setting the maximum for MR task to 20
> > is far more than enough (518/64 = 8) , did i get it correctly?
> >
>
> I suppose what you want is to run all the maps in parallel. For that, the
> number of map slots in your cluster should be more than the number of maps
> of your job (assuming there's a single job running). If the number of slots
> is less than number of maps, the maps would be scheduled in multiple waves.
> On your jobtracker main page, the Cluster Summary > Map Task Capacity gives
> you the total slots available in your cluster.

My Map Task Capacity = 80%

So, from the explanation and from my data size and configuration:

Data size = 518MB
Number of map tasks required = 518/64 = 8 tasks

These 8 tasks should be spread among the 4 slaves, which means each node should be able to handle at least 2 tasks. My setting was mapred.tasktracker.map.tasks.maximum = 20, which is more than enough, so does that mean the approach is correct?
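To double-check the arithmetic above, here is a minimal sketch in Python. It assumes that only this one job is running and that the maps are spread evenly over the slaves, which the scheduler does not guarantee. The function name `map_slot_summary` and its parameters are made up for illustration and are not part of Hadoop. The number of maps (8) is simply the "Maps Total" value reported by the UI.

```python
import math


def map_slot_summary(num_maps, num_slaves, map_slots_per_node):
    """Return (total_slots, waves, min_slots_per_node) for a single job.

    total_slots        -- map slots in the whole cluster (what I understand
                          "Map Task Capacity" to report)
    waves              -- how many rounds are needed to schedule all maps
    min_slots_per_node -- slots each node needs so that all maps can run at
                          once, ASSUMING an even spread across the slaves
    """
    total_slots = num_slaves * map_slots_per_node
    waves = math.ceil(num_maps / total_slots)
    min_slots_per_node = math.ceil(num_maps / num_slaves)
    return total_slots, waves, min_slots_per_node


# Values from this thread: 8 maps, 4 slaves,
# mapred.tasktracker.map.tasks.maximum = 20
total_slots, waves, min_per_node = map_slot_summary(8, 4, 20)
print(total_slots, waves, min_per_node)  # prints: 80 1 2
```

With these numbers all 8 maps fit in a single wave, and 2 map slots per node would already be enough under the even-spread assumption.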
(Well, I have CPUx4 in my machine, so in the case of large data, I should divide it by 4 in order to determine the smallest figure for mapred.tasktracker.map.tasks.maximum.)

> > When i run the MR program, i could see in the Map/Reduce Administration
> > page that the number of Maps Total = 8, so i assume that everything is
> > going well here, once again if i'm wrong please correct me.
> > (Sometimes it shows only Maps Total = 3)
> >
> This value tells us the number of maps that will run for the job.

OK

> > There's one thing which i'm uncertain about hadoop distribution.
> > Is the Maps Total = 8 means that there are 8 map tasks split among all
> > the data nodes (task trackers)?
> > Is there anyway i can checked whether all the tasks are shared among
> > datanodes (where task trackers are working).
> >
> There's no easy way to check this. The task page for every task shows the
> attempts that ran for each task and where they ran under the 'Machine'
> column.
>

Thank you, I see that they were processed on different machines (per the "Machine" column), so I guess it's working correctly :)

> > When i clicked on each link under that Task Id, i can see there's "Input
> > Split Locations" stated under each task details, if the inputs are
> > splitted between data nodes, does that means that everything is working
> > well?
> >
>
> I think this is just the location of the splits, including the replicas.
> What you could see is if enough data local maps ran - which means that the
> tasks mostly got their inputs from datanodes running on the same machine as
> themselves. This is given by the counter "Data-local map tasks" on the job
> UI page.
>

There are two cases under the Job UI.

Counter                 Map   Reduce   Total
-----------------------------------------
Case (1)
Launched map tasks      0     0        4
Data-local map tasks    0     0        4
Case (2)
Launched map tasks      0     0        2
Data-local map tasks    0     0        1

Hmm.. I don't quite understand this. In case (2), does it mean that two map tasks are actually reading data from the same datanode?
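For reference, the share of data-local maps in the two cases can be worked out as below. This is only a sketch, assuming that "Data-local map tasks" counts those launched map tasks that read their input from a datanode on the same machine, as described in the quoted explanation. The helper `data_local_fraction` is a made-up name, not a Hadoop API.

```python
def data_local_fraction(launched_maps, data_local_maps):
    """Fraction of launched map tasks that were data-local (0.0 if none ran)."""
    if launched_maps == 0:
        return 0.0
    return data_local_maps / launched_maps


# Counter totals copied from the two cases on the Job UI
cases = [("Case (1)", 4, 4), ("Case (2)", 2, 1)]
for name, launched, local in cases:
    print(name, data_local_fraction(launched, local))
# prints:
# Case (1) 1.0
# Case (2) 0.5
```

Under that assumption, in case (2) one of the two maps read its input locally and the other fetched it from a different node.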
But anyway, is this monitoring needed for tuning performance?

Thank you.
