The wonderful thing about open source is that you can contribute these improvements! 
But look at them in some detail and you may find a few gotchas. For example:

- you still have to store the 1-vector clusters somewhere during processing. 
With Canopy you won't know which ones to throw away until the end of the 
reducer step, and by then you have an OOME (OutOfMemoryError).
- moving to persistent cluster implementations for any of the clustering 
algorithms is possible, but at a huge performance penalty: you basically have 
to stream over the entire set of clusters for each input vector (see the 
sketch below). And on Hadoop, where appending to files is problematic, the 
mechanics of adding to the persistent cluster set seem daunting. Perhaps HBase?
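
To make that streaming cost concrete, here is a minimal sketch of per-vector 
assignment against a persisted cluster set. ClusterRecord and ClusterStore are 
hypothetical stand-ins (think a SequenceFile- or HBase-backed iterator), not 
existing Mahout classes:

    // Sketch only: with clusters persisted outside the JVM, assigning one
    // input vector means streaming the entire cluster set from storage.
    // ClusterRecord and ClusterStore are hypothetical stand-ins.
    final class ClusterRecord {
      final long id;
      final double[] centroid;
      ClusterRecord(long id, double[] centroid) { this.id = id; this.centroid = centroid; }
    }

    interface ClusterStore extends Iterable<ClusterRecord> {}

    final class StreamingAssigner {
      static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
          double d = a[i] - b[i];
          sum += d * d;
        }
        return Math.sqrt(sum);
      }

      // O(#clusters) of I/O for every single input vector - the penalty above.
      static ClusterRecord assign(double[] vector, ClusterStore store) {
        ClusterRecord best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (ClusterRecord c : store) {  // full scan of the persisted set
          double d = distance(vector, c.centroid);
          if (d < bestDist) {
            bestDist = d;
            best = c;
          }
        }
        return best;
      }
    }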

I'm not quibbling with your business case; I just don't think Mahout has a 
solution for it.
Jeff



-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]] 
Sent: Tuesday, September 20, 2011 11:20 AM
To: [email protected]
Subject: Re: Clustering : Number of Reducers

The bigger problem, in my opinion, is the existence of canopies 
containing single vectors. Since canopies with only one vector inside 
them are not clusters, there could be almost a billion canopies 
formed if the vectors are far from each other.

I think two improvements can be applied to the current algorithm.

1) Require a minimum number of vectors inside a canopy/cluster; otherwise 
the cluster is discarded (a sketch of this idea follows below).
2) Change this "in memory" version of clustering to a "persisted" 
one. The current implementation is not scalable. I have a valid business 
scenario with 5 million clusters, and I think there will be more users 
with bigger datasets/cluster counts.
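
For improvement 1), the filter itself is cheap once each canopy carries a 
point count. A minimal sketch (CanopySketch and minPoints are hypothetical 
names, not the current Mahout API):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of improvement 1): discard canopies below a minimum size
    // before they are written out. CanopySketch is a hypothetical stand-in.
    final class CanopySketch {
      final long id;
      final int numPoints;
      CanopySketch(long id, int numPoints) { this.id = id; this.numPoints = numPoints; }
    }

    final class MinSizeFilter {
      static List<CanopySketch> keepRealClusters(List<CanopySketch> canopies, int minPoints) {
        List<CanopySketch> kept = new ArrayList<CanopySketch>();
        for (CanopySketch c : canopies) {
          if (c.numPoints >= minPoints) {  // singletons (numPoints == 1) are dropped
            kept.add(c);
          }
        }
        return kept;
      }
    }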

Thanks and Regards,
Paritosh Ranjan

On 20-09-2011 23:35, Jeff Eastman wrote:
> As all the Mahout clustering implementations keep their clusters in memory, I 
> don't believe any of them will handle that many clusters. I'm a bit 
> skeptical, however, that 5 million clusters over a billion 300-d vectors 
> will produce anything useful by way of analytics. You've got the curse of 
> dimensionality working against you, and your vectors will be nearly 
> equidistant from each other. This means that very small (= noise) differences 
> in distance will be driving the clustering.
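>
> A quick worked illustration of that concentration effect: for vectors with
> i.i.d. coordinates of variance s^2, the expected squared distance between two
> random points is 2*d*s^2, so the mean distance grows like sqrt(d) while its
> standard deviation stays roughly constant. At d = 300 the relative spread is
> already down by a factor of about sqrt(300) ~ 17, which is why the distances
> look nearly equal and small noise dominates the ordering.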
>
>
> -----Original Message-----
> From: Paritosh Ranjan [mailto:[email protected]]
> Sent: Tuesday, September 20, 2011 10:41 AM
> To: [email protected]
> Subject: Re: Clustering : Number of Reducers
>
>
> The max load I expect is 1 billion vectors, around 300 dimensions per
> vector. The number of clusters with more than one vector inside them can
> be around 5 million, with an average of 10-20 vectors per cluster.
>
> But when most of the vectors are really far apart in the worst case
> (apart from the similar ones, which will be inside a canopy), most of
> the canopies might contain only one vector. So the number of canopies
> will be really high (as lots of canopies will turn into clusters
> containing a single vector).
>
> On 20-09-2011 22:56, Jeff Eastman wrote:
>> I guess it depends upon what you expect from your HUGE data set: How many 
>> clusters do you believe it contains? A hundred? A thousand? A million? A 
>> billion? With the right T-values I believe Canopy can handle the first three 
>> but not the last. It will also depend upon the size of your vectors. This is 
>> because, as canopy centroids are calculated, the centroid vectors become 
>> more dense and take up more space in memory. So a million really wide 
>> clusters might have trouble fitting into 4 GB of reducer memory. But what are 
>> you really going to do with a million clusters? This number seems vastly 
>> larger than one might find useful in summarizing a data set. I would think a 
>> couple hundred clusters would be the limit of human-understandable 
>> clustering. Canopy can do that with no problem.
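>>
>> To put rough numbers on the memory point: a dense centroid of 300 doubles
>> (using the 300-d width from this thread) is about 2.4 KB of raw payload,
>> so a million of them is on the order of 2.4 GB before any JVM object
>> overhead - already uncomfortably close to a 4 GB reducer heap.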
>>
>> MeanShiftCanopy, as its name implies, is really just an iterative canopy 
>> implementation. It allows the specification of an arbitrary number of 
>> initial reducers, but it counts them down to 1 in each iteration in order to 
>> properly process all the input. It is an agglomerative clustering algorithm, 
>> and the clusters it builds contain the indices of each of the input points 
>> that have been agglomerated. This makes the mean shift canopy larger in 
>> memory than vanilla canopies, since the list of points is maintained too. The 
>> points accumulation can be avoided, though: it only happens when the 
>> -cl option is provided. Without -cl, the memory consumption will be about 
>> the same as vanilla canopy.
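>>
>> A sketch of the bookkeeping difference (hypothetical stand-in classes, not
>> the actual Mahout types):
>>
>>     // vanilla canopy: centroid plus a point count
>>     class VanillaCanopySketch {
>>       double[] centroid;
>>       long numObservedPoints;
>>     }
>>
>>     // mean shift with -cl: centroid plus every agglomerated point id, so
>>     // memory grows with the number of input points, not just clusters
>>     class MeanShiftCanopySketch {
>>       double[] centroid;
>>       java.util.List<Integer> boundPointIds = new java.util.ArrayList<Integer>();
>>     }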
>>
>> Bottom line: How many clusters do you expect to find?
>>
>>
>>
>>
>> -----Original Message-----
>> From: Paritosh Ranjan [mailto:[email protected]]
>> Sent: Tuesday, September 20, 2011 9:46 AM
>> To: [email protected]
>> Subject: Re: Clustering : Number of Reducers
>>
>> "but all the canopies gotta fit in memory."
>>
>> If this is true, then CanopyDriver would not be able to cluster HUGE
>> data (as the memory might blow up).
>>
>> I am using MeanShiftCanopyDriver of 0.6-snapshot which can use any
>> number of reducers. Will it also need all the canopies in memory?
>>
>> Or, which clustering technique would you suggest to cluster really big
>> data (considering performance and data size as parameters)?
>>
>> Thanks and Regards,
>> Paritosh Ranjan
>>
>> On 20-09-2011 21:35, Jeff Eastman wrote:
>>> Well, while it is true that the CanopyDriver writes all its canopies to the 
>>> file system, they are written at the end of the reduce method. The mappers 
>>> all output the same key, so the one reducer gets all the mapper pairs and 
>>> these must fit into memory before they can be output. With T1/T2 values 
>>> that are too small given the data, there will be a very large number of 
>>> clusters output by each mapper and a corresponding deluge of clusters at 
>>> the reducer. T3/T4 may be used to supply different thresholds in the reduce 
>>> step, but all the canopies gotta fit in memory.
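>>>
>>> In sketch form, that single-reducer funnel is just a constant output key
>>> (hypothetical code, not the actual Mahout source):
>>>
>>>     import java.io.IOException;
>>>     import org.apache.hadoop.io.Text;
>>>     import org.apache.hadoop.mapreduce.Mapper;
>>>
>>>     // Every map() emits the same key, so all mapper outputs land in a
>>>     // single reduce group and must be buffered there together.
>>>     class SingleKeyMapper extends Mapper<Object, Text, Text, Text> {
>>>       private static final Text KEY = new Text("all");
>>>       @Override
>>>       protected void map(Object key, Text value, Context ctx)
>>>           throws IOException, InterruptedException {
>>>         ctx.write(KEY, value);
>>>       }
>>>     }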
>>>
>>> -----Original Message-----
>>> From: Paritosh Ranjan [mailto:[email protected]]
>>> Sent: Tuesday, September 20, 2011 12:31 AM
>>> To: [email protected]
>>> Subject: Re: Clustering : Number of Reducers
>>>
>>> "The limit is that all the canopies need to fit into memory."
>>> I don't think so. I think you can use CanopyDriver to write canopies to
>>> a filesystem. This is done as a mapreduce job. Then the KMeansDriver
>>> needs these canopy points as input to run KMeans.
>>>
>>> On 20-09-2011 01:39, Jeff Eastman wrote:
>>>> Actually, most of the clustering jobs (including DirichletDriver) accept 
>>>> the -Dmapred.reduce.tasks=n argument as noted below. Canopy is the only 
>>>> job which forces n=1, and this is so the reducer will see all of the mapper 
>>>> outputs. Generally, by adjusting T1 and T2 to suitably large values you 
>>>> can get canopy to handle pretty large datasets. The limit is that all the 
>>>> canopies need to fit into memory.
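>>>>
>>>> For reference, the -D flag is the generic Hadoop mechanism; the same thing
>>>> can be set in code. A minimal sketch with the plain Hadoop API (job name
>>>> and reducer count here are illustrative):
>>>>
>>>>     import org.apache.hadoop.conf.Configuration;
>>>>     import org.apache.hadoop.mapreduce.Job;
>>>>
>>>>     public class ReducerCountSketch {
>>>>       public static void main(String[] args) throws Exception {
>>>>         Configuration conf = new Configuration();
>>>>         conf.setInt("mapred.reduce.tasks", 8);  // same as -Dmapred.reduce.tasks=8
>>>>         Job job = new Job(conf, "clustering-sketch");
>>>>         job.setNumReduceTasks(8);               // equivalent, set directly
>>>>       }
>>>>     }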
>>>>
>>>> -----Original Message-----
>>>> From: Paritosh Ranjan [mailto:[email protected]]
>>>> Sent: Sunday, September 18, 2011 10:03 PM
>>>> To: [email protected]
>>>> Subject: Re: Clustering : Number of Reducers
>>>>
>>>> So, does this mean that Mahout cannot support clustering of large data?
>>>>
>>>> Even in DirichletDriver the number of reducers is hardcoded to 1. And we
>>>> need canopies to run KMeansDriver.
>>>>
>>>> Paritosh
>>>>
>>>> On 19-09-2011 01:47, Konstantin Shmakov wrote:
>>>>> For most of the tasks one can force the number of reducers with
>>>>> mapred.reduce.tasks=<N>
>>>>> where <N> is the desired number of reducers.
>>>>>
>>>>> It will not necessarily increase performance, though - with kmeans and
>>>>> fuzzykmeans, combiners do the reducers' job, and increasing the number
>>>>> of reducers won't usually affect performance.
>>>>>
>>>>> With canopy, the distributed
>>>>> algorithm <http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/canopy/CanopyDriver.java?revision=1134456&view=markup>
>>>>> has no combiners and has 1 reducer hardcoded - trying to increase the
>>>>> number of reducers won't have any effect, as the algorithm doesn't work
>>>>> with >1 reducer. My experience is that canopy won't scale to large data
>>>>> and needs improvement.
>>>>>
>>>>> -- Konstantin
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Sep 18, 2011 at 10:50 AM, Paritosh Ranjan <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have been trying to cluster some hundreds of millions of records using
>>>>>> Mahout Clustering techniques.
>>>>>>
>>>>>> The number of reducers is always one, which I am not able to change. This
>>>>>> is affecting the performance. I am using Mahout 0.5.
>>>>>>
>>>>>> In 0.6-SNAPSHOT, I see that the MeanShiftCanopyDriver has been changed to
>>>>>> use any number of reducers. Will other ClusterDrivers also get changed to
>>>>>> use any number of reducers in 0.6?
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Paritosh Ranjan
>>>>>>
>>>>>>
>>>>>>
>>
>
>
