RE: Incremental clustering - Kmeans + Canopy

Veronica Joh Sat, 22 Jan 2011 17:34:21 -0800

Hi Jeff
I tried KMeansDriver.clusterData(), but I got the same cardinality exception.
So I guess I have to do this with hashed vectors, as Ted mentioned?
Also, can you please explain what a single threshold classification means?
Thank you!
Veronica
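
For reference, one way to read the hashed-vector suggestion is the "hashing trick": every term is hashed into a fixed number of slots, so vectors built from old and new articles always have the same cardinality. The sketch below is plain Java and does not use any Mahout API. The class name HashedTermVectorizer, the method vectorize, and the choice of raw term counts are assumptions made for illustration, and Ted's actual proposal is not quoted in this thread.

```java
import java.util.List;

// Hypothetical helper (not a Mahout class): maps a bag of terms onto a
// fixed-cardinality vector so that unseen terms never change the vector size.
final class HashedTermVectorizer {
    private final int cardinality;

    HashedTermVectorizer(int cardinality) {
        if (cardinality <= 0) {
            throw new IllegalArgumentException("cardinality must be positive");
        }
        this.cardinality = cardinality;
    }

    // Returns a term-count vector of length `cardinality`.
    // Distinct terms may collide in the same slot; that is the price of a fixed size.
    double[] vectorize(List<String> terms) {
        double[] vector = new double[cardinality];
        for (String term : terms) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(term.hashCode(), cardinality);
            vector[index] += 1.0;
        }
        return vector;
    }
}
```

With a vectorizer like this, the batch articles and any later articles would have to be vectorized with the same cardinality before clustering, so the centroids and the new vectors are comparable.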



> Date: Thu, 20 Jan 2011 11:56:45 -0700
> From: [email protected]
> To: [email protected]
> Subject: Re: Incremental clustering - Kmeans + Canopy
> 
> Hi Veronica,
> 
> I've only tried incremental clustering as a thought experiment, but the 
> kind of problem you are attacking has many areas of applicability. The 
> problem you are seeing is that new articles bring new terms with them, and 
> this produces vectors of different cardinality as new articles are 
> added. You can trick the Vector implementation by creating all the 
> vectors with maxInt (Integer.MAX_VALUE) cardinality, but the current Mahout text 
> vectorization (seq2sparse) does not handle the growth in the dictionary 
> which incremental additions would entail. If we could prime seq2sparse 
> with the dictionary from the last addition, we might be able to 
> support incremental vectorization with minimal changes.
> 
> I don't completely agree with MIA 11.3.1's "use canopy clustering" 
> phrase; I think it is a bit misleading. Each of the clustering 
> algorithms (including canopy) has two phases: cluster generation and 
> vector classification using those clusters. I think the best choice for 
> a maximum likelihood classifier would actually be 
> KMeansDriver.clusterData() and not the CanopyDriver version (which 
> requires t1 and t2 values to initialize the clusterer but these are 
> never used for classification).
> 
> To really implement the case study, it seems to me you would need a 
> single threshold classification, to avoid assigning new articles to 
> existing clusters that are too dissimilar to really fit. These 
> leftovers could then be used to generate new clusters, which could be 
> added to the list.
> 
> Perhaps one of the authors can add some clarification on this too?
> 
> Jeff
> 
> On 1/20/11 8:24 AM, Veronica Joh wrote:
> > Hi
> > I have a large number of articles clustered by k-means.
> > For new articles that come in, the Mahout in Action book says I need to 
> > "use canopy clustering to assign it to the cluster whose centroid is 
> > closest based on a very small distance threshold".
> > I'm not sure how to add new article canopies to the existing clusters.
> >
> > So I'm saving batch articles in a list of Cluster like this.
> > List<Cluster>  clusters = new ArrayList<Cluster>();
> >
> > For the new article canopies, I'm trying the following to measure the distance, 
> > but I get an error like this: "Required cardinality 11981 but got 77372"
> > Text key = new Text();
> > Canopy value = new Canopy();
> > DistanceMeasure measure = new EuclideanDistanceMeasure();
> > while (reader.next(key, value)){
> >       for (int i = 0; i < clusters.size(); i++) {
> >          double d = measure.distance(clusters.get(i).getCenter(), value.getCenter());
> >       }
> > }
> >
> > Is this how to compare cluster centroids with new canopies, or did I 
> > misunderstand something?
> > Can you please help me so I can get the online news clustering working?
> > Thank you very much!
>
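
The single threshold classification described in Jeff's quoted reply could be sketched roughly as follows: a new vector is assigned to its nearest centroid only if that distance is within one threshold, and otherwise it is reported as a leftover that can seed a new cluster. This is plain Java over double[] arrays, not Mahout's Cluster or DistanceMeasure classes. The names ThresholdClassifier, classify, and UNASSIGNED, and the use of Euclidean distance, are assumptions made for illustration.

```java
// Hypothetical helper (not a Mahout class): nearest-centroid assignment
// with a single distance threshold.
final class ThresholdClassifier {
    // Returned when no centroid is within the threshold.
    static final int UNASSIGNED = -1;

    // Euclidean distance; both vectors must have the same cardinality,
    // which is the condition violated in the error quoted in this thread.
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Required cardinality " + a.length + " but got " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the closest centroid if it lies within `threshold`,
    // otherwise UNASSIGNED so the caller can treat the point as a leftover.
    static int classify(double[] point, double[][] centroids, double threshold) {
        int best = UNASSIGNED;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            double d = euclidean(centroids[i], point);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return bestDistance <= threshold ? best : UNASSIGNED;
    }
}
```

For example, with centroids (0, 0) and (10, 10) and a threshold of 2.0, the point (1, 1) is assigned to index 0, while (5, 5) comes back as UNASSIGNED and would be kept aside to generate a new cluster.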
