Hi Jeff,

I've tried KMeansDriver.clusterData(), but I got the same Cardinality exception. So I guess I have to do this with hash vectors, as Ted mentioned? Also, can you please explain what a single threshold classification means?

Thank you!
Veronica

> Date: Thu, 20 Jan 2011 11:56:45 -0700
> From: [email protected]
> To: [email protected]
> Subject: Re: Incremental clustering - Kmeans + Canopy
>
> Hi Veronica,
>
> I've only tried incremental clustering as a thought-experiment but the
> kind of problem you are attacking has many areas of applicability. The
> problem you are seeing is that the new articles bring new terms with them,
> and this will produce different cardinality vectors as new articles are
> added. You can trick the Vector implementation by creating all the
> vectors with maxInt cardinality but the current Mahout text
> vectorization (seq2sparse) does not handle the growth in the dictionary
> which incremental additions would entail. If we could prime seq2sparse
> with the dictionary from the last addition we might be able to
> support incremental vectorization with minimal changes.
>
> I don't completely agree with MIA 11.3.1's "use canopy clustering"
> phrase; I think it is a bit misleading. Each of the clustering
> algorithms (including canopy) has two phases: cluster generation and
> vector classification using those clusters. I think the best choice for
> a maximum likelihood classifier would actually be
> KMeansDriver.clusterData() and not the CanopyDriver version (which
> requires t1 and t2 values to initialize the clusterer but these are
> never used for classification).
>
> To really implement the case study, it would seem to me to require a
> single threshold classification to avoid assigning new articles to
> existing clusters which are too dissimilar to really fit. Then these
> leftovers could be used to generate new clusters which could then be
> added to the list.
>
> Perhaps one of the authors can add some clarification on this too?
>
> Jeff
>
> On 1/20/11 8:24 AM, Veronica Joh wrote:
> > Hi
> > I have a large number of articles clustered by kmeans.
> > For the new articles that come in, it says I need to "use canopy
> > clustering to assign it to the cluster whose centroid is closest based on a
> > very small distance threshold" according to the Mahout in Action book.
> > I'm not sure how to add new article canopies to the existing cluster.
> >
> > So I'm saving batch articles in a list of Cluster like this.
> > List<Cluster> clusters = new ArrayList<Cluster>();
> >
> > For the new article canopies, I'm trying the following to measure the distance,
> > but I get an error like this: "Required cardinality 11981 but got 77372"
> > Text key = new Text();
> > Canopy value = new Canopy();
> > DistanceMeasure measure = new EuclideanDistanceMeasure();
> > while (reader.next(key, value)){
> >   for (int i=0; i<clusters.size(); i++){
> >     double d = measure.distance(clusters.get(i).getCenter(),
> >         value.getCenter());
> >   }
> > }
> >
> > Is this how to compare cluster centroids with new canopies? Or did I
> > misunderstand something?
> > Can you please help me so I can get the online news clustering working?
> > Thank you very much!
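
As a rough illustration of the "single threshold classification" idea discussed above: each new article is assigned to its nearest existing centroid only if that centroid is within some distance threshold; otherwise it is set aside as a leftover that could later seed new clusters. The sketch below is not Mahout code. The class name ThresholdClassifier, the use of plain double[] arrays in place of Mahout's Vector, and the threshold value are all assumptions made for illustration, and it presumes every vector already has the same cardinality.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of single-threshold classification (not a Mahout API). */
class ThresholdClassifier {

    /** Euclidean distance; fails if the two vectors differ in cardinality. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Required cardinality " + a.length + " but got " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the index of the closest centroid, or -1 when even the closest
     * centroid is farther away than the threshold (the point is a leftover).
     */
    static int classify(double[] point, List<double[]> centroids, double threshold) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double d = distance(centroids.get(i), point);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return bestDistance <= threshold ? best : -1;
    }

    public static void main(String[] args) {
        // Toy centroids standing in for the k-means output.
        List<double[]> centroids = Arrays.asList(
            new double[] {0.0, 0.0},
            new double[] {10.0, 10.0});
        double threshold = 2.0; // assumed value; would need tuning on real data

        List<double[]> leftovers = new ArrayList<>();
        double[][] newArticles = {{1.0, 1.0}, {9.0, 10.0}, {5.0, 5.0}};
        for (double[] article : newArticles) {
            int cluster = classify(article, centroids, threshold);
            if (cluster < 0) {
                leftovers.add(article); // too dissimilar to every existing cluster
            }
            System.out.println(Arrays.toString(article) + " -> " + cluster);
        }
        System.out.println("leftovers: " + leftovers.size());
    }
}
```

The leftovers list is what would be fed back into a clustering run to generate new clusters, as suggested in the reply. Choosing the threshold is the hard part and depends on the distance measure and the data.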

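
On the "hash vectors" suggestion mentioned at the top of this thread: the general idea of feature hashing is to map every term to an index in a vector of fixed size, so that new articles with previously unseen terms still produce vectors of the same cardinality as the existing centroids. The sketch below shows only that general idea. It does not use Mahout's own encoder classes; the class name HashedVectorizer, the use of String.hashCode(), and the chosen cardinality are assumptions for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch of feature hashing into a fixed-cardinality vector. */
class HashedVectorizer {

    /**
     * Counts terms into a vector of the given cardinality. Each term's slot is
     * its hash code reduced modulo the cardinality, so unseen terms never
     * change the vector length. Different terms may collide in the same slot.
     */
    static double[] vectorize(String[] terms, int cardinality) {
        double[] vector = new double[cardinality];
        for (String term : terms) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(term.hashCode(), cardinality);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        int cardinality = 16; // assumed; real text would need a far larger value
        double[] oldArticle = vectorize(new String[] {"mahout", "kmeans", "canopy"}, cardinality);
        double[] newArticle = vectorize(new String[] {"brand", "new", "terms", "kmeans"}, cardinality);
        // Both vectors have the same length, so a distance measure can compare them.
        System.out.println(oldArticle.length + " " + newArticle.length);
        System.out.println(Arrays.toString(newArticle));
    }
}
```

The trade-off is that hash collisions merge unrelated terms into one dimension, and the dictionary mapping from index back to term is lost, which makes cluster labels harder to interpret.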