This is a common problem with Canopy, since it is single-pass and uses a single 
reducer that must see the outputs of all mappers. Adjusting T2 upward will 
reduce the number of canopies produced by each mapper; T1 does not affect the 
number of canopies, only which points contribute to each canopy's centroid 
calculation. You can also specify a T4 value, which sets the T2 threshold used 
by the reducer. Increasing T4 will also reduce the number of canopies. It won't 
shorten the reduce phase much, since all mapper outputs still need to be 
reduced, but it will reduce the amount of memory needed in the reducer.
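To make the role of the two thresholds concrete, here is a minimal, 
self-contained sketch of single-pass canopy assignment. This is not Mahout's 
actual CanopyMapper code; the class and method names below are my own. The point 
it illustrates: a point only seeds a new canopy when it is farther than T2 from 
every existing center (so raising T2 shrinks the canopy count), while T1 only 
decides which canopies the point's coordinates are folded into.

import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

  /** Euclidean distance between two dense vectors of equal length. */
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /** One canopy: a center plus running sums used for the final centroid. */
  static class Canopy {
    final double[] center;
    final double[] pointTotal;
    int numPoints;

    Canopy(double[] p) {
      center = p.clone();
      pointTotal = p.clone();
      numPoints = 1;
    }

    void observe(double[] p) {
      for (int i = 0; i < pointTotal.length; i++) {
        pointTotal[i] += p[i];
      }
      numPoints++;
    }
  }

  /** One pass over the points, as each mapper does over its input split. */
  static List<Canopy> buildCanopies(List<double[]> points, double t1, double t2) {
    List<Canopy> canopies = new ArrayList<>();
    for (double[] p : points) {
      boolean stronglyBound = false;
      for (Canopy c : canopies) {
        double d = distance(p, c.center);
        if (d < t1) {
          c.observe(p);            // T1 only decides which centroids p contributes to
        }
        if (d < t2) {
          stronglyBound = true;    // p is within T2 of an existing center
        }
      }
      if (!stronglyBound) {
        canopies.add(new Canopy(p)); // raising T2 makes new canopies rarer
      }
    }
    return canopies;
  }
}

The reducer runs the same logic over the mapper-produced canopy centers, which 
is where the T3/T4 pair comes in.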

Another alternative, posted concurrently (cf. the "Clustering: Number of 
Reducers" thread), includes a patch to the CanopyMapper that suppresses output 
of canopies containing only a single point. This could be the basis of a more 
general mapper-side filter that suppresses output of canopies with fewer than n 
points.
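The filter idea could look roughly like the sketch below. This is only the 
logic, not the actual patch; the Canopy stand-in type and the 
filterSmallCanopies name are mine, and where exactly it would hook into 
CanopyMapper depends on the Mahout version.

import java.util.ArrayList;
import java.util.List;

public class CanopyFilterSketch {

  /** Minimal stand-in for a canopy that tracks how many points it absorbed. */
  static class Canopy {
    final double[] center;
    final int numPoints;

    Canopy(double[] center, int numPoints) {
      this.center = center;
      this.numPoints = numPoints;
    }
  }

  /** Drop canopies covering fewer than minPoints points before emitting them. */
  static List<Canopy> filterSmallCanopies(List<Canopy> canopies, int minPoints) {
    List<Canopy> kept = new ArrayList<>();
    for (Canopy c : canopies) {
      if (c.numPoints >= minPoints) {
        kept.add(c);   // only these would be written as mapper output
      }
    }
    return kept;
  }
}

With minPoints = 2 this reproduces the single-point suppression in the posted 
patch; larger values cut reducer load further at the cost of dropping small, 
possibly legitimate canopies.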

I've run canopy jobs of this size on a similar cluster and, by adjusting the T 
values, have been able to run to completion in all cases.

Hope this helps,
Jeff

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]] 
Sent: Saturday, September 24, 2011 2:32 AM
To: [email protected]
Subject: How much memory do I need? : Clustering : Hadoop

Hi,

I am clustering 5 million vectors (200 dimensions each) on an 8-node cluster 
with 2 GB of memory each, using CanopyDriver. The replication factor is 3.

The reduce phase of buildCluster is taking too long to finish.

How can I improve the performance?

Is it related to memory? If so, what configuration do you suggest? I cannot 
reduce the dimensionality of the vectors.

Thanks and Regards,
Paritosh Ranjan
