Re: Incremental clustering - Kmeans + Canopy

Jeff Eastman Thu, 20 Jan 2011 10:57:23 -0800

Hi Veronica,

I've only tried incremental clustering as a thought experiment, but the kind of problem you are attacking has many areas of applicability. The problem you are seeing is that the new articles bring new terms with them, and this produces vectors of different cardinality as new articles are added. You can trick the Vector implementation by creating all the vectors with maxInt cardinality, but the current Mahout text vectorization (seq2sparse) does not handle the growth in the dictionary which incremental additions would entail. If we could prime seq2sparse with the dictionary from the last addition, we might be able to support incremental vectorization with minimal changes.
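To make the dictionary idea concrete, here is a minimal sketch in plain Java. It is not Mahout code and not how seq2sparse works today: the class GrowingDictionary and its methods are hypothetical names, and vectors are plain sparse maps from term index to count. The point is only that if term indices from earlier batches are never reassigned, old and new vectors live in the same index space and can be compared without a cardinality mismatch.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: a dictionary that only ever grows,
// so term indices assigned in earlier batches stay valid for later ones.
class GrowingDictionary {
    private final Map<String, Integer> termToIndex = new LinkedHashMap<>();

    // Returns the existing index for a term, or assigns the next free one.
    int indexOf(String term) {
        Integer idx = termToIndex.get(term);
        if (idx == null) {
            idx = termToIndex.size();
            termToIndex.put(term, idx);
        }
        return idx;
    }

    int size() {
        return termToIndex.size();
    }

    // Builds a sparse term-frequency vector (term index -> count).
    // The vector has no fixed cardinality; absent indices count as zero.
    Map<Integer, Double> vectorize(String[] tokens) {
        Map<Integer, Double> vector = new HashMap<>();
        for (String token : tokens) {
            vector.merge(indexOf(token), 1.0, Double::sum);
        }
        return vector;
    }

    // Euclidean distance between two sparse vectors over the shared index space.
    static double distance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            double diff = e.getValue() - b.getOrDefault(e.getKey(), 0.0);
            sum += diff * diff;
        }
        for (Map.Entry<Integer, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                sum += e.getValue() * e.getValue();
            }
        }
        return Math.sqrt(sum);
    }
}
```

Vectorizing the original batch first and the new articles afterwards with the same GrowingDictionary instance keeps every old index stable. In a real setup the dictionary would have to be persisted between runs, which this sketch leaves out.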

I don't completely agree with MIA 11.3.1's "use canopy clustering" phrase; I think it is a bit misleading. Each of the clustering algorithms (including canopy) has two phases: cluster generation and vector classification using those clusters. I think the best choice for a maximum likelihood classifier would actually be KMeansDriver.clusterData() and not the CanopyDriver version (which requires t1 and t2 values to initialize the clusterer, but these are never used for classification).

To really implement the case study, it seems to me you would need a single-threshold classification to avoid assigning new articles to existing clusters which are too dissimilar to really fit. These leftovers could then be used to generate new clusters, which could be added to the list.
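As a rough illustration of that single-threshold step, here is a sketch in plain Java. It deliberately avoids the Mahout classes: ThresholdAssigner, assign and leftovers are hypothetical names, vectors are dense double[] arrays of equal length, and Euclidean distance is just an assumed choice of measure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Mahout API: assign each new point to its nearest
// centroid only when it lies within a single distance threshold.
class ThresholdAssigner {

    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Required cardinality " + a.length + " but got " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the nearest centroid, or -1 when even the nearest
    // centroid is farther away than the threshold.
    static int assign(List<double[]> centroids, double[] point, double threshold) {
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double d = euclidean(centroids.get(i), point);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return bestDistance <= threshold ? best : -1;
    }

    // Collects the points that fit no existing cluster; these are the
    // candidates for generating new clusters in a later pass.
    static List<double[]> leftovers(List<double[]> centroids,
                                    List<double[]> points,
                                    double threshold) {
        List<double[]> rejected = new ArrayList<>();
        for (double[] point : points) {
            if (assign(centroids, point, threshold) < 0) {
                rejected.add(point);
            }
        }
        return rejected;
    }
}
```

The leftovers would then go through a normal clustering run of their own, and the resulting centroids would be appended to the existing list. How to pick the threshold is left open here, as it is in the book's description.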


Perhaps one of the authors can add some clarification on this too?

Jeff

On 1/20/11 8:24 AM, Veronica Joh wrote:

Hi
I have a large number of articles clustered by kmeans.
For the new articles that come in, it says I need to "use canopy clustering to 
assign it to the cluster whose centroid is closest based on a very small distance 
threshold" according to Mahout in Action book.
I'm not sure how to add new article canopies to the existing cluster.

So I'm saving batch articles in a list of Cluster like this.
List<Cluster>  clusters = new ArrayList<Cluster>();

For the new article canopies, I'm trying the following to measure the distance, but I get 
an error like this: "Required cardinality 11981 but got 77372"
Text key = new Text();
Canopy value = new Canopy();
DistanceMeasure measure = new EuclideanDistanceMeasure();
while (reader.next(key, value)) {
    for (int i = 0; i < clusters.size(); i++) {
        double d = measure.distance(clusters.get(i).getCenter(), value.getCenter());
    }
}

Is this how to compare cluster centroids with new canopies, or did I 
misunderstand something?
Can you please help me so I can get the online news clustering working?
Thank you very much!
