Hi,

I hope you are doing well. I have a question and thought I would send you a mail about it. I am running Canopy and K-means clustering on the Hadoop dev cluster at my organization. Each time I run them on my data set (around 55 MB to 70 MB of sequence files), I see only 1 mapper and 1 reducer running in the job tracker, both for Canopy and for each iteration of K-means clustering.

Is this dependent on the size of the data file being passed in, or is there a way to configure the number of mappers used by these algorithms? (I suspect I cannot do this, and that the number of mappers to spawn is decided by the job tracker.) I ask because with one mapper my Canopy clustering takes quite a while to run, around 5-6 hours, and I am wondering whether it would speed up if it could use multiple mappers. K-means also uses 1 mapper and 1 reducer, but it is comparatively fast, as the centroid points are taken from the Canopy output.

Thanks and Regards,
Abhik Banerjee
513 364 6591
