Hi Veronica,
I've only tried incremental clustering as a thought experiment, but the
kind of problem you are attacking has many areas of applicability. The
problem you are seeing is that new articles bring new terms with them,
so the vectors change cardinality as new articles are added. You can
trick the Vector implementation by creating all the vectors with maxInt
cardinality, but the current Mahout text vectorization (seq2sparse) does
not handle the growth in the dictionary which incremental additions
would entail. If we could prime seq2sparse with the dictionary from the
last addition, we might be able to support incremental vectorization
with minimal changes.
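To make the dictionary-priming idea concrete, here is a minimal sketch in plain Java. It is not Mahout code and not how seq2sparse works today: IncrementalDictionary and its methods are hypothetical names, and the tokenizer is a naive split on non-word characters. The only point is that terms already seen keep their index while new terms are appended, so vectors from different batches stay comparable.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout: a dictionary that only ever grows.
final class IncrementalDictionary {
    // Term -> index, in order of first appearance across all batches.
    private final Map<String, Integer> termToIndex = new LinkedHashMap<>();

    // Returns the existing index of a known term, or assigns the next free one.
    int indexOf(String term) {
        Integer index = termToIndex.get(term);
        if (index == null) {
            index = termToIndex.size();
            termToIndex.put(term, index);
        }
        return index;
    }

    int size() {
        return termToIndex.size();
    }

    // Builds a sparse term-frequency vector (index -> count) for one article.
    // Tokenization is deliberately naive; a real pipeline would use an analyzer.
    Map<Integer, Double> vectorize(String text) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            vector.merge(indexOf(token), 1.0, Double::sum);
        }
        return vector;
    }
}
```

Reusing one dictionary instance (or one persisted and reloaded between runs) for every batch is what would keep the indices stable; building a fresh dictionary per batch is what leads to mismatched cardinalities.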
I don't completely agree with MIA 11.3.1's "use canopy clustering"
phrase; I think it is a bit misleading. Each of the clustering
algorithms (including canopy) has two phases: cluster generation and
vector classification using those clusters. I think the best choice for
a maximum likelihood classifier would actually be
KMeansDriver.clusterData() and not the CanopyDriver version (which
requires t1 and t2 values to initialize the clusterer, even though these
are never used for classification).
To really implement the case study, it seems to me you would need a
single-threshold classification step to avoid assigning new articles to
existing clusters that are too dissimilar to really fit. These leftovers
could then be used to generate new clusters, which would in turn be
added to the list.
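A rough sketch of that single-threshold step, again in plain Java rather than against the Mahout API. ThresholdClassifier, classify and the -1 "leftover" convention are my own hypothetical names, vectors are sparse index -> weight maps, and Euclidean distance is just an assumption carried over from your snippet.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Mahout code: nearest-centroid assignment with a cutoff.
final class ThresholdClassifier {

    // Euclidean distance between two sparse vectors; a missing index counts as 0,
    // so vectors built from different dictionary sizes can still be compared.
    static double distance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> entry : a.entrySet()) {
            double diff = entry.getValue() - b.getOrDefault(entry.getKey(), 0.0);
            sum += diff * diff;
        }
        for (Map.Entry<Integer, Double> entry : b.entrySet()) {
            if (!a.containsKey(entry.getKey())) {
                sum += entry.getValue() * entry.getValue();
            }
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the closest centroid, or -1 when even the closest
    // one is farther away than the threshold (the article is a "leftover").
    static int classify(Map<Integer, Double> article,
                        List<Map<Integer, Double>> centroids,
                        double threshold) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double d = distance(article, centroids.get(i));
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return bestDistance <= threshold ? best : -1;
    }
}
```

Articles that come back as -1 would be collected and clustered separately, and the resulting clusters appended to the centroid list before the next batch arrives.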
Perhaps one of the authors can add some clarification on this too?
Jeff
On 1/20/11 8:24 AM, Veronica Joh wrote:
Hi
I have a large number of articles clustered by kmeans.
For the new articles that come in, the Mahout in Action book says I need to
"use canopy clustering to assign it to the cluster whose centroid is closest
based on a very small distance threshold".
I'm not sure how to add new article canopies to the existing cluster.
So I'm saving batch articles in a list of Cluster like this.
List<Cluster> clusters = new ArrayList<Cluster>();
For the new article canopies, I'm trying the following to measure the distance,
but I get an error like this: "Required cardinality 11981 but got 77372"
Text key = new Text();
Canopy value = new Canopy();
DistanceMeasure measure = new EuclideanDistanceMeasure();
while (reader.next(key, value)) {
  for (int i = 0; i < clusters.size(); i++) {
    double d = measure.distance(clusters.get(i).getCenter(),
                                value.getCenter());
  }
}
Is this how to compare cluster centroids with new canopies? Or did I
misunderstand something?
Can you please help me so I can get the online news clustering working?
Thank you very much!