Hi Veronica,

I've only tried incremental clustering as a thought experiment, but the kind of problem you are attacking has many areas of applicability. The problem you are seeing is that new articles bring new terms with them, so vectors built after each addition have a different cardinality from the existing ones. You can trick the Vector implementation by creating all the vectors with Integer.MAX_VALUE cardinality, but the current Mahout text vectorization (seq2sparse) does not handle the growth in the dictionary that incremental additions would entail. If we could prime seq2sparse with the dictionary from the last addition, we might be able to support incremental vectorization with minimal changes.
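
To make the dictionary-priming idea concrete, here is a minimal sketch in plain Java. It is not seq2sparse or any Mahout API: the class IncrementalDictionary and its methods are hypothetical names made up for illustration, and the tokenization is deliberately naive. What it tries to show is that a term keeps the index it was first given, so vectors built from later articles stay comparable with earlier ones, and new terms only ever append indices.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a dictionary that survives incremental additions (not Mahout code). */
public class IncrementalDictionary {
    // term -> index; an index is assigned once and never changes afterwards
    private final Map<String, Integer> termToIndex = new LinkedHashMap<>();

    /** Returns the existing index of a term, or appends the term with the next free index. */
    public int indexOf(String term) {
        Integer index = termToIndex.get(term);
        if (index == null) {
            index = termToIndex.size();
            termToIndex.put(term, index);
        }
        return index;
    }

    /** Number of distinct terms seen so far. */
    public int size() {
        return termToIndex.size();
    }

    /** Builds a sparse term-frequency vector (index -> count) using naive tokenization. */
    public Map<Integer, Double> vectorize(String text) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            vector.merge(indexOf(token), 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        IncrementalDictionary dictionary = new IncrementalDictionary();
        // The first batch fixes the indices of its terms.
        System.out.println(dictionary.vectorize("markets fall as markets react"));
        // A later article reuses the index of "markets" and appends its new terms.
        System.out.println(dictionary.vectorize("markets rally on earnings"));
        System.out.println("dictionary size: " + dictionary.size());
    }
}
```

Because indices only grow, sparse vectors created with a very large nominal cardinality would not need to be rebuilt when the dictionary gains terms. This covers term frequencies only; TF-IDF weights would still shift as document frequencies change.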

I don't completely agree with MIA 11.3.1's "use canopy clustering" phrase; I think it is a bit misleading. Each of the clustering algorithms (including canopy) has two phases: cluster generation, and vector classification using those clusters. I think the best choice for a maximum likelihood classifier would actually be KMeansDriver.clusterData() and not the CanopyDriver version, which requires t1 and t2 values to initialize the clusterer even though they are never used for classification.
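
As an illustration of the second phase on its own, here is a small plain-Java sketch of assigning a vector to its closest centroid. This is not what KMeansDriver.clusterData() does internally; the class and method names are hypothetical, and dense double[] arrays stand in for Mahout vectors. It only needs the centroids and a distance measure, which is why t1 and t2 play no part in it. The cardinality check mirrors the error in your message.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the classification phase only (not Mahout code). */
public class NearestCentroidClassifier {

    /** Euclidean distance; both vectors must have the same cardinality. */
    public static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Required cardinality " + a.length + " but got " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns the index of the closest centroid, or -1 if there are no centroids. */
    public static int classify(List<double[]> centroids, double[] point) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double distance = euclidean(centroids.get(i), point);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Centroids as they might come out of a finished k-means run (made-up values).
        List<double[]> centroids = Arrays.asList(new double[] {0, 0}, new double[] {10, 10});
        System.out.println(classify(centroids, new double[] {9, 8}));
    }
}
```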

Really implementing the case study would, it seems to me, require a classification with a single distance threshold, so that new articles are not assigned to existing clusters that are too dissimilar to really fit. The leftovers could then be used to generate new clusters, which would be added to the list.
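
A rough sketch of that idea in plain Java follows. Again, none of this is Mahout API: ThresholdAssigner, assign() and leftovers() are hypothetical names, the threshold value is arbitrary, and dense arrays of equal length are assumed. A point goes to its closest centroid only if that centroid is within the threshold; everything else is collected so it can seed new clusters in a later pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of single-threshold assignment with leftovers (not Mahout code). */
public class ThresholdAssigner {

    /** Euclidean distance between two vectors of equal length. */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns the index of the closest centroid within the threshold, or -1 if none is close enough. */
    public static int assign(List<double[]> centroids, double[] point, double threshold) {
        int best = -1;
        double bestDistance = threshold;
        for (int i = 0; i < centroids.size(); i++) {
            double distance = euclidean(centroids.get(i), point);
            if (distance <= bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    /** Collects the points that no existing cluster accepts; these are candidates for new clusters. */
    public static List<double[]> leftovers(List<double[]> centroids, List<double[]> points, double threshold) {
        List<double[]> unassigned = new ArrayList<>();
        for (double[] point : points) {
            if (assign(centroids, point, threshold) < 0) {
                unassigned.add(point);
            }
        }
        return unassigned;
    }

    public static void main(String[] args) {
        List<double[]> centroids = Arrays.asList(new double[] {0, 0}, new double[] {10, 10});
        List<double[]> incoming = Arrays.asList(new double[] {1, 1}, new double[] {5, 5});
        // With a threshold of 3.0, (1,1) joins the first cluster and (5,5) is left over.
        System.out.println(assign(centroids, incoming.get(0), 3.0));
        System.out.println(leftovers(centroids, incoming, 3.0).size());
    }
}
```

The leftovers could then be fed to whichever cluster-generation step you prefer, and the resulting centroids appended to the existing list.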

Perhaps one of the authors can add some clarification on this too?

Jeff

On 1/20/11 8:24 AM, Veronica Joh wrote:
Hi
I have a large number of articles clustered by kmeans.
For new articles that come in, the Mahout in Action book says I need to
"use canopy clustering to assign it to the cluster whose centroid is closest
based on a very small distance threshold".
I'm not sure how to add new article canopies to the existing clusters.

So I'm saving the batch articles in a list of Cluster like this:
List<Cluster> clusters = new ArrayList<Cluster>();

For the new article canopies, I'm trying the following to measure the distance, but I get
an error like this: "Required cardinality 11981 but got 77372"
Text key = new Text();
Canopy value = new Canopy();
DistanceMeasure measure = new EuclideanDistanceMeasure();
while (reader.next(key, value)) {
  for (int i = 0; i < clusters.size(); i++) {
    double d = measure.distance(clusters.get(i).getCenter(), value.getCenter());
  }
}

Is this how to compare cluster centroids with new canopies? Or did I
misunderstand something?
Can you please help me so I can get the online news clustering working?
Thank you very much!                                    
