This is a common problem with canopy, since it is single-pass and uses a single reducer that must see the outputs of all mappers. You can adjust T2 upward, which will reduce the number of canopies produced by each mapper; T1 does not affect the number of canopies, only which nearby points contribute to each canopy's centroid calculation. You can also specify a T4 value, which affects the T2 threshold used by the reducer. Increasing it will also reduce the number of canopies, but it won't shorten the reduce phase itself, since all of the mapper outputs still need to be reduced; it will, however, affect the amount of memory needed in the reducer.
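
To make the effect of the two thresholds concrete, here is a minimal, self-contained Java sketch of single-pass canopy formation (the class and method names are illustrative, not Mahout's actual CanopyMapper code): a point starts a new canopy only if it is not within T2 of any existing canopy center, so raising T2 directly cuts the number of canopies each mapper emits, while T1 only decides which nearby points are folded into a canopy's centroid.

import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

  // Minimal stand-in for a canopy: the seeding point plus a running
  // sum/count of the points within T1, used for the centroid.
  static class Canopy {
    final double[] center;
    double[] sum;
    int numPoints;
    Canopy(double[] p) { center = p.clone(); sum = p.clone(); numPoints = 1; }
    void observe(double[] p) {
      for (int i = 0; i < sum.length; i++) sum[i] += p[i];
      numPoints++;
    }
    double[] centroid() {
      double[] c = new double[sum.length];
      for (int i = 0; i < sum.length; i++) c[i] = sum[i] / numPoints;
      return c;
    }
  }

  static List<Canopy> buildCanopies(List<double[]> points, double t1, double t2) {
    List<Canopy> canopies = new ArrayList<>();
    for (double[] p : points) {
      boolean coveredByT2 = false;
      for (Canopy c : canopies) {
        double d = euclidean(p, c.center);
        if (d < t1) c.observe(p);        // T1: point contributes to this centroid
        if (d < t2) coveredByT2 = true;  // T2: close enough to suppress a new canopy
      }
      if (!coveredByT2) canopies.add(new Canopy(p)); // new canopy only outside all T2 radii
    }
    return canopies;
  }

  static double euclidean(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return Math.sqrt(s);
  }

  public static void main(String[] args) {
    List<double[]> points = List.of(
        new double[]{0, 0}, new double[]{0.5, 0.2}, new double[]{0.4, 0.1},
        new double[]{8, 8}, new double[]{8.3, 7.9});
    // With T1 = 3.0 and T2 = 1.0 these five points collapse into two canopies.
    List<Canopy> canopies = buildCanopies(points, 3.0, 1.0);
    System.out.println("canopies: " + canopies.size());
  }
}

With a larger T2, more points fall inside an existing canopy's T2 radius, so fewer canopies are created in each mapper and less data has to flow to the single reducer.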
Another alternative posted concurrently (cf. Clustering: Number of Reducers) includes a patch to the CanopyMapper which suppresses output of canopies that have only a single point. This could be the basis of a more general mapper filter that would suppress output of canopies having <n points; a minimal sketch of that idea follows the quoted message below. I've run canopy jobs of this size on a similar cluster and, by adjusting the T values, have been able to run to completion in all cases.

Hope this helps,
Jeff

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Saturday, September 24, 2011 2:32 AM
To: [email protected]
Subject: How much memory do I need? : Clustering : Hadoop

Hi,

I am clustering 5 million vectors (200 dimensions each) on an 8-node cluster with 2 GB of memory each using CanopyDriver. The replication factor is 3. The reduce phase of buildCluster is taking too long to finish.

How can I improve the performance? Is it related to memory? If yes, what configuration do you suggest? I cannot reduce the dimension of the vectors.

Thanks and Regards,
Paritosh Ranjan
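
In the spirit of the CanopyMapper patch mentioned in the reply above (suppressing canopies with only a single point), here is a small, hypothetical generalization that drops any canopy with fewer than n points. In a real mapper this check would sit in cleanup(), just before the canopy centroids are written out to the single reducer; the sketch below reuses the Canopy type from the earlier sketch and is illustrative only, not the actual patch.

import java.util.ArrayList;
import java.util.List;

// Hypothetical mapper-side filter: keep only canopies with at least
// minPoints members, so small, likely-noise canopies never reach the reducer.
public class SmallCanopyFilter {
  static List<CanopySketch.Canopy> keepAtLeast(List<CanopySketch.Canopy> canopies,
                                               int minPoints) {
    List<CanopySketch.Canopy> kept = new ArrayList<>();
    for (CanopySketch.Canopy c : canopies) {
      if (c.numPoints >= minPoints) {  // the posted patch's case is minPoints = 2
        kept.add(c);
      }
    }
    return kept;
  }
}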
