With potentially 5 million clusters, I would suggest a multi-pass hierarchical approach where you cluster the data and then sub-cluster the larger clusters.
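Concretely, the two-pass idea might look like the sketch below: a coarse canopy/k-means pass with large T1/T2, then a tighter second pass over each oversized cluster. This is a sketch under assumptions, not working code: it assumes the Mahout 0.5-era CanopyDriver.run/KMeansDriver.run signatures, the paths and T-values are placeholders, and splitByClusterId is a hypothetical helper (you would partition the clusteredPoints output by cluster id yourself).

    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.canopy.CanopyDriver;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

    public class TwoPassClustering {

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("vectors");   // placeholder input path
        Path coarse = new Path("coarse");

        // Pass 1: T1/T2 large enough that only a coarse,
        // memory-friendly set of canopies is produced; k-means then
        // refines the centroids and assigns every point to a cluster.
        CanopyDriver.run(conf, input, coarse,
            new EuclideanDistanceMeasure(), 50.0, 30.0, false, false);
        KMeansDriver.run(conf, input, new Path(coarse, "clusters-0"),
            new Path("coarse-kmeans"), new EuclideanDistanceMeasure(),
            0.01, 10, true, false);

        // Pass 2: re-cluster each oversized coarse cluster with
        // tighter T1/T2 to recover its internal structure.
        for (Path big : splitByClusterId(conf, new Path("coarse-kmeans"))) {
          CanopyDriver.run(conf, big, new Path(big, "sub"),
              new EuclideanDistanceMeasure(), 10.0, 5.0, true, false);
        }
      }

      // Hypothetical helper: read the clusteredPoints output and write
      // one input directory per oversized cluster. Left unimplemented.
      private static List<Path> splitByClusterId(Configuration conf, Path out) {
        throw new UnsupportedOperationException("sketch only");
      }
    }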
On Tue, Sep 20, 2011 at 12:42 PM, Jeff Eastman <[email protected]> wrote:

> +1 And do it in the reducer too, perhaps controlled by a min-cluster-size
> argument?
>
> -----Original Message-----
> From: Konstantin Shmakov [mailto:[email protected]]
> Sent: Tuesday, September 20, 2011 12:21 PM
> To: [email protected]
> Subject: Re: Clustering : Number of Reducers
>
> This has become technical, but I believe a single product requirement
> should not drive a generic implementation. Canopy is supposed to produce
> a fast "hint" for other clustering techniques; one can experiment with
> custom variations to do just that. For instance, for 1) I'd suggest
> trying to add one line in CanopyMapper so that it outputs only canopies
> with more than one point:
>
>   protected void cleanup(Context context) throws IOException,
>       InterruptedException {
>     for (Canopy canopy : canopies) {
> -     context.write(new Text("centroid"),
> -         new VectorWritable(canopy.computeCentroid()));
> +     if (canopy.getNumPoints() > 1) {
> +       context.write(new Text("centroid"),
> +           new VectorWritable(canopy.computeCentroid()));
> +     }
>     }
>   }
>
> Even though this filters canopies at an earlier stage, and can
> potentially drop canopies holding up to #mappers points, it can be an
> effective data-reduction technique. One could even write these canopies
> under a different key and cluster them separately, but that would be a
> more custom variation.
>
> --Konstantin
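Jeff's reducer-side variant of the same filter, gated by a min-cluster-size argument, might look roughly like the fragment below, dropped into CanopyReducer. This is only a sketch: it assumes the reducer accumulates Canopy objects and writes centroids in cleanup() the same way the mapper does (the real CanopyReducer's output call may differ), and the "min.cluster.size" key is a made-up name, not an existing Mahout option.

    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      // "min.cluster.size" is a made-up key, not an existing Mahout
      // option; default 2 discards single-point canopies.
      int minClusterSize =
          context.getConfiguration().getInt("min.cluster.size", 2);
      for (Canopy canopy : canopies) {
        // Keep only canopies that collected enough points.
        if (canopy.getNumPoints() >= minClusterSize) {
          context.write(new Text("centroid"),
              new VectorWritable(canopy.computeCentroid()));
        }
      }
    }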
> On Tue, Sep 20, 2011 at 11:20 AM, Paritosh Ranjan <[email protected]>
> wrote:
>
>> The bigger problem, in my opinion, is the existence of canopies
>> containing single vectors. Since canopies with only one vector inside
>> them are not clusters, almost a billion canopies would be formed if the
>> vectors are far from each other.
>>
>> I think two improvements can be applied to the current algorithm:
>>
>> 1) Ask for a minimum number of vectors inside a canopy/cluster, or the
>> cluster is discarded.
>> 2) Change this "in-memory" version of clustering to a "persisted" one.
>> The current implementation is not scalable. I have a valid business
>> scenario with 5 million clusters, and I think there will be more users
>> with bigger datasets/cluster counts.
>>
>> Thanks and Regards,
>> Paritosh Ranjan
>>
>> On 20-09-2011 23:35, Jeff Eastman wrote:
>>
>>> As all the Mahout clustering implementations keep their clusters in
>>> memory, I don't believe any of them will handle that many clusters.
>>> I'm a bit skeptical, however, that 5 million clusters over a billion
>>> 300-d vectors will produce anything useful by way of analytics. You've
>>> got the curse of dimensionality working against you, and your vectors
>>> will be nearly equidistant from each other. This means that very small
>>> (= noise) differences in distance will be driving the clustering.
>>>
>>> -----Original Message-----
>>> From: Paritosh Ranjan [mailto:[email protected]]
>>> Sent: Tuesday, September 20, 2011 10:41 AM
>>> To: [email protected]
>>> Subject: Re: Clustering : Number of Reducers
>>>
>>> The max load I expect is 1 billion vectors, with around 300 dimensions
>>> per vector. The number of clusters with more than one vector inside
>>> them can be around 5 million, with an average of 10-20 vectors per
>>> cluster.
>>>
>>> But when, in the worst case, most of the vectors are really far apart
>>> (apart from the similar ones, which will be inside the canopy), most
>>> of the canopies might contain only one vector. So the number of
>>> canopies will be really high, as lots of canopies will result in
>>> clusters holding a single vector.
>>>
>>> On 20-09-2011 22:56, Jeff Eastman wrote:
>>>
>>>> I guess it depends upon what you expect from your HUGE data set: how
>>>> many clusters do you believe it contains? A hundred? A thousand? A
>>>> million? A billion? With the right T-values I believe Canopy can
>>>> handle the first three but not the last. It will also depend upon the
>>>> size of your vectors. This is because, as canopy centroids are
>>>> calculated, the centroid vectors become more dense, and these take up
>>>> more space in memory. So a million really wide clusters might have
>>>> trouble fitting into a 4GB reducer's memory. But what are you really
>>>> going to do with a million clusters? This number seems vastly larger
>>>> than one might find useful in summarizing a data set. I would think a
>>>> couple hundred clusters would be the limit of human-understandable
>>>> clustering. Canopy can do that with no problem.
>>>>
>>>> MeanShiftCanopy, as its name implies, is really just an iterative
>>>> canopy implementation. It allows the specification of an arbitrary
>>>> number of initial reducers, but it counts them down to 1 in each
>>>> iteration in order to properly process all the input. It is an
>>>> agglomerative clustering algorithm, and the clusters it builds
>>>> contain the indices of each of the input points that have been
>>>> agglomerated. This makes a mean shift canopy larger in memory than a
>>>> vanilla canopy, since the list of points is maintained too. It is
>>>> possible to avoid the points accumulation: it won't happen unless the
>>>> -cl option is provided, and without it the memory consumption will be
>>>> about the same as vanilla canopy.
>>>>
>>>> Bottom line: how many clusters do you expect to find?
>>>>
>>>> -----Original Message-----
>>>> From: Paritosh Ranjan [mailto:[email protected]]
>>>> Sent: Tuesday, September 20, 2011 9:46 AM
>>>> To: [email protected]
>>>> Subject: Re: Clustering : Number of Reducers
>>>>
>>>> "but all the canopies gotta fit in memory."
>>>>
>>>> If this is true, then CanopyDriver would not be able to cluster HUGE
>>>> data (as the memory might blow up).
>>>>
>>>> I am using the MeanShiftCanopyDriver of 0.6-SNAPSHOT, which can use
>>>> any number of reducers. Will it also need all the canopies in memory?
>>>>
>>>> Or, which clustering technique would you suggest to cluster really
>>>> big data (considering performance and big size as parameters)?
>>>>
>>>> Thanks and Regards,
>>>> Paritosh Ranjan
>>>>
>>>> On 20-09-2011 21:35, Jeff Eastman wrote:
>>>>
>>>>> Well, while it is true that the CanopyDriver writes all its canopies
>>>>> to the file system, they are written at the end of the reduce
>>>>> method. The mappers all output the same key, so the one reducer gets
>>>>> all the mapper pairs, and these must fit into memory before they can
>>>>> be output. With T1/T2 values that are too small given the data,
>>>>> there will be a very large number of clusters output by each mapper
>>>>> and a corresponding deluge of clusters at the reducer. T3/T4 may be
>>>>> used to supply different thresholds in the reduce step, but all the
>>>>> canopies gotta fit in memory.
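To put rough numbers on the memory concern above: 1,000,000 dense centroids of 300 doubles each come to 1,000,000 x 300 x 8 bytes, or about 2.4 GB of raw vector data alone, before any Java object overhead or the reducer's working set. That matches Jeff's estimate that a million wide clusters would strain a 4GB reducer, and the 5 million clusters discussed earlier would need roughly 12 GB for the centroids alone.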
>>>>>
>>>>> -----Original Message-----
>>>>> From: Paritosh Ranjan [mailto:[email protected]]
>>>>> Sent: Tuesday, September 20, 2011 12:31 AM
>>>>> To: [email protected]
>>>>> Subject: Re: Clustering : Number of Reducers
>>>>>
>>>>> "The limit is that all the canopies need to fit into memory."
>>>>>
>>>>> I don't think so. I think you can use CanopyDriver to write canopies
>>>>> to a filesystem. This is done as a mapreduce job. Then the
>>>>> KMeansDriver needs these canopy points as input to run KMeans.
>>>>>
>>>>> On 20-09-2011 01:39, Jeff Eastman wrote:
>>>>>
>>>>>> Actually, most of the clustering jobs (including DirichletDriver)
>>>>>> accept the -Dmapred.reduce.tasks=n argument as noted below. Canopy
>>>>>> is the only job which forces n=1, and this is so the reducer will
>>>>>> see all of the mapper outputs. Generally, by adjusting T2 & T1 to
>>>>>> suitably large values you can get canopy to handle pretty large
>>>>>> datasets. The limit is that all the canopies need to fit into
>>>>>> memory.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Paritosh Ranjan [mailto:[email protected]]
>>>>>> Sent: Sunday, September 18, 2011 10:03 PM
>>>>>> To: [email protected]
>>>>>> Subject: Re: Clustering : Number of Reducers
>>>>>>
>>>>>> So, does this mean that Mahout cannot support clustering for large
>>>>>> data?
>>>>>>
>>>>>> Even in DirichletDriver the number of reducers is hardcoded to 1.
>>>>>> And we need canopies to run KMeansDriver.
>>>>>>
>>>>>> Paritosh
>>>>>>
>>>>>> On 19-09-2011 01:47, Konstantin Shmakov wrote:
>>>>>>
>>>>>>> For most of the tasks one can force the number of reducers with
>>>>>>> mapred.reduce.tasks=<N>, where <N> is the desired number of
>>>>>>> reducers.
>>>>>>>
>>>>>>> It will not necessarily increase performance, though - with kmeans
>>>>>>> and fuzzykmeans, combiners do the reducers' job, and increasing
>>>>>>> the number of reducers won't usually affect performance.
>>>>>>>
>>>>>>> With canopy, the distributed algorithm
>>>>>>> <http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/canopy/CanopyDriver.java?revision=1134456&view=markup>
>>>>>>> has no combiners and has 1 reducer hardcoded - trying to increase
>>>>>>> #reducers won't have any effect, as the algorithm doesn't work
>>>>>>> with more than one reducer. My experience is that canopy won't
>>>>>>> scale to large data and needs improvement.
>>>>>>>
>>>>>>> -- Konstantin
>>>>>>>
>>>>>>> On Sun, Sep 18, 2011 at 10:50 AM, Paritosh Ranjan
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have been trying to cluster some hundreds of millions of
>>>>>>>> records using Mahout clustering techniques.
>>>>>>>>
>>>>>>>> The number of reducers is always one, which I am not able to
>>>>>>>> change. This is affecting the performance. I am using Mahout 0.5.
>>>>>>>>
>>>>>>>> In 0.6-SNAPSHOT, I see that the MeanShiftCanopyDriver has been
>>>>>>>> changed to use any number of reducers. Will other ClusterDrivers
>>>>>>>> also be changed to use any number of reducers in 0.6?
>>>>>>>>
>>>>>>>> Thanks and Regards,
>>>>>>>> Paritosh Ranjan
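For completeness, the -Dmapred.reduce.tasks knob Konstantin mentions can also be set programmatically before handing the Configuration to a driver. A minimal sketch, assuming the pre-YARN property name that the 0.5-era jobs read:

    import org.apache.hadoop.conf.Configuration;

    public class ReducerCount {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Equivalent to passing -Dmapred.reduce.tasks=8 on the command
        // line; jobs that hardcode a single reducer (e.g. canopy) will
        // ignore it.
        conf.set("mapred.reduce.tasks", "8");
        // ... pass conf to the clustering driver's run(...) call.
      }
    }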
