Ted, can you add a little more about hashed vectorization? Would that
come from hashing the terms to determine their dictionary indices? I can
see how hashing terms into a large-dimensional space could result in
constant-sized vectors, though there might be problems if multiple terms
hashed to the same index. Is this something we could consider for
seq2sparse to make online clustering processes work more smoothly?
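Hashing terms directly to indices is the idea behind the hashing trick; a minimal sketch in plain Java follows (the class name, method, and DIMENSION constant are illustrative only, not part of Mahout or seq2sparse):

```java
// Sketch of hashed ("hashing trick") vectorization: each term's hash
// determines its index in a fixed-size vector, so every document maps
// to the same cardinality no matter how the vocabulary grows. Terms
// that hash to the same index simply share a slot (a collision).
public class HashedVectorizer {

  static final int DIMENSION = 1 << 20; // a large space keeps collisions rare

  // Map a tokenized document to a fixed-size count vector.
  // (A real implementation would use a sparse vector type.)
  static double[] vectorize(String[] terms, int dimension) {
    double[] v = new double[dimension];
    for (String term : terms) {
      // Math.floorMod guards against negative hashCode() values
      int index = Math.floorMod(term.hashCode(), dimension);
      v[index] += 1.0;
    }
    return v;
  }
}
```

Because the output cardinality is fixed by `dimension`, new articles with unseen terms produce vectors of the same size as the old ones, with no dictionary to maintain.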
On 1/20/11 12:08 PM, Ted Dunning wrote:
On Thu, Jan 20, 2011 at 10:56 AM, Jeff Eastman
<[email protected]>wrote:
Hi Veronica,
I've only tried incremental clustering as a thought experiment, but the kind
of problem you are attacking has many areas of applicability. The problem
you are seeing is that new articles bring new terms with them, and this
produces vectors of different cardinality as new articles are added. You can
trick the Vector implementation by creating all the vectors with Integer.MAX_VALUE
cardinality, but the current Mahout text vectorization (seq2sparse) does not
handle the growth in the dictionary which incremental additions would entail.
If we could prime seq2sparse with the dictionary from the last addition,
we might be able to support incremental vectorization with minimal changes.
Jeff, using hashed vectorization would solve this as well, because the
document vectors will always have a constant size. Commonly used distances
should work unchanged with a hashed representation, although you might have a
few scaling surprises with multiple probes.
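The multiple probes Ted mentions can be sketched as hashing each term several times (here by simply salting the term per probe) and splitting its weight across the probe indices; the class and salting scheme are illustrative, not Mahout's actual hashed encoders:

```java
// Multiple-probe hashing: each term is hashed `probes` times, adding
// weight 1/probes at each index, so a collision corrupts only a
// fraction of a term's weight. Note the scaling effect: with k probes
// a single occurrence contributes k * (1/k)^2 = 1/k to the squared
// norm, which is the kind of "scaling surprise" Ted refers to.
public class MultiProbeVectorizer {

  static double[] vectorize(String[] terms, int dimension, int probes) {
    double[] v = new double[dimension];
    for (String term : terms) {
      for (int p = 0; p < probes; p++) {
        // salt the term per probe to get independent indices
        int index = Math.floorMod((term + "#" + p).hashCode(), dimension);
        v[index] += 1.0 / probes;
      }
    }
    return v;
  }
}
```

The total weight per term is preserved (the probe weights sum to 1), but distances computed on these vectors shrink relative to single-probe hashing and may need rescaling when mixing representations.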
I don't completely agree with MIA 11.3.1's "use canopy clustering" phrasing;
I think it is a bit misleading. Each of the clustering algorithms (including
canopy) has two phases: cluster generation and vector classification using
those clusters. I think the best choice for a maximum-likelihood classifier
would actually be KMeansDriver.clusterData() rather than the CanopyDriver
version (which requires t1 and t2 values to initialize the clusterer, but
these are never used for classification).
To really implement the case study would seem to require a single-threshold
classification, to avoid assigning new articles to existing clusters that
are too dissimilar to really fit. The leftovers could then be used to
generate new clusters, which could be added to the list.
Perhaps one of the authors can add some clarification on this too?
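The single-threshold assignment described above might look like the following sketch, with plain double[] centroids and Euclidean distance standing in for Mahout's Cluster and DistanceMeasure types (all names here are illustrative):

```java
// Sketch of single-threshold classification: a new article joins the
// closest existing cluster only if its distance to that centroid is
// within the threshold; otherwise it is held back as a leftover from
// which new clusters can later be generated.
public class ThresholdAssigner {

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Returns the index of the closest centroid within the threshold,
  // or -1 to signal the article belongs in the leftover pool.
  static int assign(double[] article, double[][] centroids, double threshold) {
    int best = -1;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < centroids.length; i++) {
      double d = euclidean(article, centroids[i]);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return bestDistance <= threshold ? best : -1;
  }
}
```

Leftovers (the -1 cases) would then be clustered on their own, and the resulting centroids appended to the existing list.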
Jeff
On 1/20/11 8:24 AM, Veronica Joh wrote:
Hi,
I have a large number of articles clustered by k-means.
For the new articles that come in, the Mahout in Action book says I need
to "use canopy clustering to assign it to the cluster whose centroid is
closest based on a very small distance threshold".
I'm not sure how to add new article canopies to the existing clusters,
so I'm saving the batch articles in a list of Cluster like this:
List<Cluster> clusters = new ArrayList<Cluster>();
For the new article canopies, I'm trying the following to measure the
distance, but I get an error like this: "Required cardinality 11981 but
got 77372"
Text key = new Text();
Canopy value = new Canopy();
DistanceMeasure measure = new EuclideanDistanceMeasure();
while (reader.next(key, value)) {
  for (int i = 0; i < clusters.size(); i++) {
    double d = measure.distance(clusters.get(i).getCenter(),
                                value.getCenter());
  }
}
Is this how to compare cluster centroids with new canopies? Or did I
misunderstand something?
Can you please help me get the online news clustering working?
Thank you very much!