With potentially 5 million clusters, I would suggest a multi-pass
hierarchical approach where you cluster the data and then sub-cluster the
larger clusters.

On Tue, Sep 20, 2011 at 12:42 PM, Jeff Eastman <[email protected]> wrote:

> +1 And do it in the reducer too, perhaps controlled by a min-cluster-size
> argument?
>
> -----Original Message-----
> From: Konstantin Shmakov [mailto:[email protected]]
> Sent: Tuesday, September 20, 2011 12:21 PM
> To: [email protected]
> Subject: Re: Clustering : Number of Reducers
>
> This became technical but I believe a single product requirement should not
> drive generic implementation. Canopy suppose to produce a fast "hint" for
> other clustering techniques; one can experiment with custom variations to
> do
> just that. For instance for 1) I'd suggest to try adding one line in
> CanopyMapper to output only canopies with >1 points:
>
>   protected void cleanup(Context context) throws IOException,
> InterruptedException {
>     for (Canopy canopy : canopies) {
> -      context.write(new Text("centroid"), new
> VectorWritable(canopy.computeCentroid()));
> +      if(canopy.getNumPoints() > 1) {
> +       context.write(new Text("centroid"), new
> VectorWritable(canopy.computeCentroid()));
> +      }
>     }
>
> Even though it will filter canopies at the earlier stage and can
> potentially
> filter canopies with up to #mappers points it can be an effective data
> reduction technique. One can even write these canopies with the different
> key and cluster them separately but that would be more custom variations.
>
> --Konstantin
>
> On Tue, Sep 20, 2011 at 11:20 AM, Paritosh Ranjan <[email protected]>
> wrote:
>
> > The bigger problem, in my opinion is, the existence of canopies
> containing
> > single vectors. Since, these canopies with only vector inside it are not
> > clusters, so, there would be almost a billion canopies formed, if the
> > vectors are far from each other.
> >
> > I think, two improvements, can be applied to the current algorithm.
> >
> > 1) To ask for minimum number of vectors to be inside a canopy/cluster, or
> > the cluster is discarded.
> > 2) To change this "in memory" version of clustering to a "persisted" one.
> > The current implementation is not scalable. I have a valid business
> scenario
> > with 5 million clusters, and I think there would be more users with
> bigger
> > datasets/cluster numbers.
> >
> >
> > Thanks and Regards,
> > Paritosh Ranjan
> >
> > On 20-09-2011 23:35, Jeff Eastman wrote:
> >
> >> As all the Mahout clustering implementations keep their clusters in
> >> memory, I don't believe any of them will handle that many clusters. I'm
> a
> >> bit skeptical; however, that 5 million clusters over a billion, 300-d
> >> vectors will produce anything useful by way of analytics. You've got the
> >> curse of dimensionality working against you and your vectors will be
> nearly
> >> equidistant from each other. This means that very small (=noise)
> differences
> >> in distance will be driving the clustering.
> >>
> >>
> >> -----Original Message-----
> >> From: Paritosh Ranjan [mailto:[email protected]]
> >> Sent: Tuesday, September 20, 2011 10:41 AM
> >> To: [email protected]
> >> Subject: Re: Clustering : Number of Reducers
> >>
> >>
> >> The max load I expect is 1 billion vectors. Around 300 dimensions per
> >> vector. The number of clusters with more than one vector inside it can
> >> be around 5 million, with an average of 10-20 vector per cluster.
> >>
> >> But, When most of the vectors are really far away in the worst case
> >> (apart from the similar ones, which will be inside the canopy) , most of
> >> the canopies might contain only one vector. So, the number of canopies
> >> will be really high ( As lots of canopies will result into clusters
> >> having single vector ).
> >>
> >> On 20-09-2011 22:56, Jeff Eastman wrote:
> >>
> >>> I guess it depends upon what you expect from your HUGE data set: How
> many
> >>> clusters do you believe it contains? A hundred? A thousand? A million?
> A
> >>> billion? With the right T-values I believe Canopy can handle the first
> three
> >>> but not the last. It will also depend upon the size of your vectors.
> This is
> >>> because, as canopy centroids are calculated, the centroid vectors
> become
> >>> more dense and these take up more space in memory. So a million, really
> wide
> >>> clusters might have trouble fitting into a 4GB reducer memory. But what
> are
> >>> you really going to do with a million clusters? This number seems
> vastly
> >>> larger than one might find useful in summarizing a data set. I would
> think a
> >>> couple hundred clusters would be the limit of human-understandable
> >>> clustering. Canopy can do that with no problem.
> >>>
> >>> MeanShiftCanopy, as its name implies, is really just an iterative
> canopy
> >>> implementation. It allows the specification of an arbitrary number of
> >>> initial reducers, but it counts them down to 1 in each iteration in
> order to
> >>> properly process all the input. It is an agglomerative clustering
> algorithm,
> >>> and the clusters it builds contain the indices of each of the input
> points
> >>> that have been agglomerated. This makes the mean shift canopy larger in
> >>> memory than vanilla canopies since the list of points is maintained
> too. It
> >>> is possible to avoid the points accumulation and it won't happen unless
> the
> >>> -cl option is provided. In this case the memory consumption will be
> about
> >>> the same as vanilla canopy.
> >>>
> >>> Bottom line: How many clusters do you expect to find?
> >>>
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Paritosh Ranjan [mailto:[email protected]]
> >>> Sent: Tuesday, September 20, 2011 9:46 AM
> >>> To: [email protected]
> >>> Subject: Re: Clustering : Number of Reducers
> >>>
> >>> "but all the canopies gotta fit in memory."
> >>>
> >>> If this is true, then CanopyDriver would not be able to cluster HUGE
> >>> data ( as the memory might blow up ).
> >>>
> >>> I am using MeanShiftCanopyDriver of 0.6-snapshot which can use any
> >>> number of reducers. Will it also need all the canopies in memory?
> >>>
> >>> Or, which Clustering technique would you suggest to cluster really big
> >>> data ( considering performance and big size as parameters )?
> >>>
> >>> Thanks and Regards,
> >>> Paritosh Ranjan
> >>>
> >>> On 20-09-2011 21:35, Jeff Eastman wrote:
> >>>
> >>>> Well, while it is true that the CanopyDriver writes all its canopies
> to
> >>>> the file system, they are written at the end of the reduce method. The
> >>>> mappers all output the same key, so the one reducer gets all the
> mapper
> >>>> pairs and these must fit into memory before they can be output. With
> T1/T2
> >>>> values that are too small given the data, there will be a very large
> number
> >>>> of clusters output by each mapper and a corresponding deluge of
> clusters at
> >>>> the reducer. T3/T4 may be used to supply different thresholds in the
> reduce
> >>>> step, but all the canopies gotta fit in memory.
> >>>>
> >>>> -----Original Message-----
> >>>> From: Paritosh Ranjan [mailto:[email protected]]
> >>>> Sent: Tuesday, September 20, 2011 12:31 AM
> >>>> To: [email protected]
> >>>> Subject: Re: Clustering : Number of Reducers
> >>>>
> >>>> "The limit is that all the canopies need to fit into memory."
> >>>> I don't think so. I think you can use CanopyDriver to write canopies
> in
> >>>> a filesystem. This is done as a mapreduce job. Then the KMeansDriver
> >>>> needs these canopy points as input to run KMeans.
> >>>>
> >>>> On 20-09-2011 01:39, Jeff Eastman wrote:
> >>>>
> >>>>> Actually, most of the clustering jobs (including DirichletDriver)
> >>>>> accept the -Dmapred.reduce.tasks=n argument as noted below. Canopy is
> the
> >>>>> only job which forces n=1 and this is so the reducer will see all of
> the
> >>>>> mapper outputs. Generally, by adjusting T2&     T1 to suitably-large
> values
> >>>>> you can get canopy to handle pretty large datasets. The limit is that
> all
> >>>>> the canopies need to fit into memory.
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Paritosh Ranjan [mailto:[email protected]]
> >>>>> Sent: Sunday, September 18, 2011 10:03 PM
> >>>>> To: [email protected]
> >>>>> Subject: Re: Clustering : Number of Reducers
> >>>>>
> >>>>> So, does this mean that Mahout can not support clustering for large
> >>>>> data?
> >>>>>
> >>>>> Even in DirichletDriver the number of reducers is hardcoded to 1. And
> >>>>> we
> >>>>> need canopies to run KMeansDriver.
> >>>>>
> >>>>> Paritosh
> >>>>>
> >>>>> On 19-09-2011 01:47, Konstantin Shmakov wrote:
> >>>>>
> >>>>>> For most of the tasks one can force the number of reducers with
> >>>>>> mapred.reduce.tasks=<N>
> >>>>>> where<N>      the desired number of reducers.
> >>>>>>
> >>>>>> It will not necessary increase the performance though - with kmeans
> >>>>>> and
> >>>>>> fuzzykmeans combiners do reducers job and increasing the number of
> >>>>>> reducers
> >>>>>> won't usually affect performance.
> >>>>>>
> >>>>>> With the canopy the distributed
> >>>>>> algorithm<http://svn.apache.**org/viewvc/mahout/trunk/core/**
> >>>>>> src/main/java/org/apache/**mahout/clustering/canopy/**
> >>>>>> CanopyDriver.java?revision=**1134456&view=markup<
> http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/canopy/CanopyDriver.java?revision=1134456&view=markup
> >
> >>>>>> >has
> >>>>>> no combiners and has 1 reducer hardcoded
> >>>>>> - trying to increase #reducers won't have any effect as the
> algorithm
> >>>>>> doesn't work with>1 reducer. My experience that the canopy won't
> scale
> >>>>>> to
> >>>>>> large data and need improvement.
> >>>>>>
> >>>>>> -- Konstantin
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Sep 18, 2011 at 10:50 AM, Paritosh Ranjan<[email protected]
> >
> >>>>>>      wrote:
> >>>>>>
> >>>>>>  Hi,
> >>>>>>>
> >>>>>>> I have been trying to cluster some hundreds of millions of records
> >>>>>>> using
> >>>>>>> Mahout Clustering techniques.
> >>>>>>>
> >>>>>>> The number of reducers is always one which I am not able to change.
> >>>>>>> This is
> >>>>>>> effecting the performance. I am using Mahout 0.5
> >>>>>>>
> >>>>>>> In 0.6-SNAPSHOT, I see that the MeanShiftCanopyDriver has been
> >>>>>>> changed to
> >>>>>>> use any number of reducers. Will other ClusterDrivers also get
> >>>>>>> changed to
> >>>>>>> use any number of reducers in 0.6?
> >>>>>>>
> >>>>>>> Thanks and Regards,
> >>>>>>> Paritosh Ranjan
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>  -----
> >>>>> No virus found in this message.
> >>>>> Checked by AVG - www.avg.com
> >>>>> Version: 10.0.1410 / Virus Database: 1520/3906 - Release Date:
> 09/19/11
> >>>>>
> >>>> -----
> >>>> No virus found in this message.
> >>>> Checked by AVG - www.avg.com
> >>>> Version: 10.0.1410 / Virus Database: 1520/3908 - Release Date:
> 09/20/11
> >>>>
> >>>
> >>> -----
> >>> No virus found in this message.
> >>> Checked by AVG - www.avg.com
> >>> Version: 10.0.1410 / Virus Database: 1520/3908 - Release Date: 09/20/11
> >>>
> >>>
> >>
> >> -----
> >> No virus found in this message.
> >> Checked by AVG - www.avg.com
> >> Version: 10.0.1410 / Virus Database: 1520/3908 - Release Date: 09/20/11
> >>
> >>
> >
>
>
> --
> ksh:
>

Reply via email to