Re: Clustering : Number of Reducers

Paritosh Ranjan Tue, 20 Sep 2011 09:46:19 -0700

"but all the canopies gotta fit in memory."

If this is true, then CanopyDriver would not be able to cluster HUGE data (as the memory might blow up).

I am using MeanShiftCanopyDriver of 0.6-snapshot, which can use any number of reducers. Will it also need all the canopies in memory?

Or, which clustering technique would you suggest to cluster really big data (considering performance and data size as parameters)?


Thanks and Regards,
Paritosh Ranjan

On 20-09-2011 21:35, Jeff Eastman wrote:

Well, while it is true that the CanopyDriver writes all its canopies to the 
file system, they are written at the end of the reduce method. The mappers all 
output the same key, so the one reducer gets all the mapper pairs and these 
must fit into memory before they can be output. With T1/T2 values that are too 
small given the data, there will be a very large number of clusters output by 
each mapper and a corresponding deluge of clusters at the reducer. T3/T4 may be 
used to supply different thresholds in the reduce step, but all the canopies 
gotta fit in memory.
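
To illustrate the point above, here is a minimal single-process sketch of the map and reduce steps. It is not Mahout's CanopyDriver code: the function names (canopy_centers, run_job) and the random test data are made up for the example, canopies are represented by their first point rather than a centroid, and T1 is only validated because the sketch tracks centers, not members. It shows that smaller T2 values make every mapper emit more canopies, all of which end up in the single reducer's memory.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t1, t2):
    """One pass of canopy selection; returns the list of canopy centers.

    A point starts a new canopy unless it lies within t2 of an existing
    center. t1 (the looser membership radius) must be larger than t2.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    centers = []
    for p in points:
        if not any(distance(p, c) < t2 for c in centers):
            centers.append(p)
    return centers


def run_job(splits, t1, t2, t3=None, t4=None):
    """Simulate the job: every mapper emits under one key, one reducer merges."""
    t3 = t1 if t3 is None else t3  # T3/T4 default to T1/T2 in this sketch
    t4 = t2 if t4 is None else t4
    # Map phase: each split is clustered independently.
    reducer_input = [c for split in splits for c in canopy_centers(split, t1, t2)]
    # Reduce phase: the single reducer holds all of reducer_input in memory.
    final = canopy_centers(reducer_input, t3, t4)
    return len(reducer_input), final


# Hypothetical data: 4 input splits of 500 random 2-D points each.
rng = random.Random(42)
splits = [[(rng.random(), rng.random()) for _ in range(500)] for _ in range(4)]

for t1, t2 in [(1.0, 0.5), (0.04, 0.02)]:
    held, final = run_job(splits, t1, t2)
    print(f"T1={t1} T2={t2}: reducer holds {held} canopies, outputs {len(final)}")
```

Passing larger t3/t4 values to run_job reduces the number of canopies the reducer outputs, but the reducer_input list is unchanged, which matches the remark that T3/T4 do not remove the memory limit.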

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Tuesday, September 20, 2011 12:31 AM
To: [email protected]
Subject: Re: Clustering : Number of Reducers

"The limit is that all the canopies need to fit into memory."
I don't think so. I think you can use CanopyDriver to write canopies to
the filesystem. This is done as a mapreduce job. The KMeansDriver then
needs these canopy points as input to run KMeans.

On 20-09-2011 01:39, Jeff Eastman wrote:

Actually, most of the clustering jobs (including DirichletDriver) accept the 
-Dmapred.reduce.tasks=n argument as noted below. Canopy is the only job which 
forces n=1 and this is so the reducer will see all of the mapper outputs. 
Generally, by adjusting T2 & T1 to suitably large values you can get canopy to 
handle pretty large datasets. The limit is that all the canopies need to fit into 
memory.

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Sunday, September 18, 2011 10:03 PM
To: [email protected]
Subject: Re: Clustering : Number of Reducers

So, does this mean that Mahout cannot support clustering for large data?

Even in DirichletDriver the number of reducers is hardcoded to 1. And we
need canopies to run KMeansDriver.

Paritosh

On 19-09-2011 01:47, Konstantin Shmakov wrote:

For most of the tasks one can force the number of reducers with
mapred.reduce.tasks=<N>
where <N> is the desired number of reducers.

It will not necessarily increase performance, though: with kmeans and
fuzzykmeans the combiners do the reducers' job, so increasing the number of
reducers won't usually affect performance.
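
As a rough illustration of why combiners take most of the load off the reducers, here is a generic sketch of one k-means iteration with map-side combining. It is not Mahout's KMeansDriver code, and all function names are invented for the example. Each input split sends at most one (sum, count) record per cluster to the reduce step, regardless of how many points the split contains.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def map_and_combine(split, centroids):
    """Map plus combine for one split: one (vector sum, count) per cluster."""
    partial = {}
    for p in split:
        k = nearest(p, centroids)
        s, n = partial.get(k, ([0.0] * len(p), 0))
        partial[k] = ([a + b for a, b in zip(s, p)], n + 1)
    return partial


def reduce_partials(partials, centroids):
    """Merge the per-split partial sums and compute the new centroids."""
    totals = {}
    for partial in partials:
        for k, (s, n) in partial.items():
            ts, tn = totals.get(k, ([0.0] * len(s), 0))
            totals[k] = ([a + b for a, b in zip(ts, s)], tn + n)
    new_centroids = list(centroids)  # clusters with no points keep their centroid
    for k, (s, n) in totals.items():
        new_centroids[k] = tuple(x / n for x in s)
    return new_centroids


# Hypothetical toy input: two splits, two clusters.
centroids = [(0.0, 0.0), (10.0, 10.0)]
splits = [[(0, 0), (1, 0)], [(10, 10), (0, 1)]]
partials = [map_and_combine(s, centroids) for s in splits]
print("records sent to the reduce step:", sum(len(p) for p in partials))
print("new centroids:", reduce_partials(partials, centroids))
```

The reduce step only merges a handful of small records, so spreading it over more reducers gains little.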

With canopy, the distributed algorithm
<http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/canopy/CanopyDriver.java?revision=1134456&view=markup>
has no combiners and has 1 reducer hardcoded. Trying to increase the number of
reducers won't have any effect, as the algorithm doesn't work with >1 reducer.
My experience is that canopy won't scale to large data and needs improvement.

-- Konstantin



On Sun, Sep 18, 2011 at 10:50 AM, Paritosh Ranjan <[email protected]> wrote:

Hi,

I have been trying to cluster some hundreds of millions of records using
Mahout Clustering techniques.

The number of reducers is always one, which I am not able to change. This is
affecting performance. I am using Mahout 0.5.

In 0.6-SNAPSHOT, I see that the MeanShiftCanopyDriver has been changed to
use any number of reducers. Will other ClusterDrivers also get changed to
use any number of reducers in 0.6?

Thanks and Regards,
Paritosh Ranjan

