Hi Abhik,

Looks like you need to set the Hadoop job conf "-Dmapred.max.split.size=xxx" (value in bytes) to something smaller than the HDFS block size, if passing it through is supported in the Mahout wrapper. That should let your input be split across more than one mapper.
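For reference, here is a minimal sketch of what that property does, set programmatically on a plain Hadoop Configuration rather than through the Mahout CLI. The 16 MB cap and the class name are just illustrative placeholders; on newer Hadoop releases the same property is spelled mapreduce.input.fileinputformat.split.maxsize.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.util.GenericOptionsParser;

  public class SplitSizeExample {
      public static void main(String[] args) throws Exception {
          // GenericOptionsParser is what turns "-Dmapred.max.split.size=..."
          // on the command line into a Configuration entry when a driver is
          // launched through ToolRunner; here the value is set directly.
          Configuration conf = new Configuration();
          new GenericOptionsParser(conf, args);

          // Cap each input split at 16 MB (value is in bytes). With ~55-70 MB
          // of SequenceFile input this should give roughly four map tasks
          // instead of one, provided the job's InputFormat honours it.
          conf.setLong("mapred.max.split.size", 16L * 1024 * 1024);

          System.out.println("mapred.max.split.size = "
                  + conf.getLong("mapred.max.split.size", -1));
      }
  }

If the mahout script does forward generic -D options to the driver, passing the same flag on the command line should have the same effect; otherwise the property can be set cluster-wide in mapred-site.xml or on the job's Configuration as above.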
Shawn

On Thu, Aug 25, 2011 at 11:13 AM, Abhik Banerjee <[email protected]> wrote:
> Hi,
>
> I hope you are doing fine. I have a clarification to make and thought I
> would shoot you a mail about it. I am running Canopy and K-means
> clustering on my Hadoop dev cluster at my organization, but each time I
> run these on my data set (around 55 MB to 70 MB of sequence files), I
> only see 1 mapper and 1 reducer running in the job tracker, both for
> Canopy and for K-means clustering (for each iteration).
>
> Does this depend on the size of the data file being passed in, or is
> there any way I can configure the number of mappers used by these
> algorithms? (Though I suspect I can't do this and it is up to the job
> tracker to decide how many mappers to spawn.) With one mapper it takes
> quite a while to run my Canopy clustering, around 5-6 hours, and I am
> wondering whether it would speed up if it could use multiple mappers
> somehow.
>
> K-means also uses 1 mapper and 1 reducer but is comparatively fast, as
> the centroid points are decided by the Canopy output.
>
> Thanks and regards,
> Abhik Banerjee
>
> 513 364 6591
