On Thu, Jan 20, 2011 at 10:56 AM, Jeff Eastman
<[email protected]> wrote:

> Hi Veronica,
>
> I've only tried incremental clustering as a thought experiment, but the kind
> of problem you are attacking has many areas of applicability. The problem
> you are seeing is that new articles bring new terms with them, and this
> produces vectors of different cardinality as articles are added. You can
> trick the Vector implementation by creating all the vectors with maxInt
> cardinality, but the current Mahout text vectorization (seq2sparse) does not
> handle the growth in the dictionary which incremental additions would entail.
> If we could prime seq2sparse with the dictionary from the last addition
> we might be able to support incremental vectorization with minimal changes.
>

Jeff, using hashed vectorization would solve this as well because the
document vectors will always have constant size.  Commonly used distances
should work unchanged with a hashed representation although you might have a
few scaling surprises with multiple probes.
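
To make the constant-size point concrete, here is a minimal sketch in plain
Java. It is not Mahout's encoder API: the class name, the cardinality of 1024,
the two probes, and the use of String.hashCode() are all assumptions chosen
for illustration. Every document maps to a vector of the same length no matter
which terms it contains, so the distance call never sees mismatched
cardinalities.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of hashed vectorization; not the Mahout API.
class HashedVectorizer {
    // Fixed vector size (assumed value for this sketch).
    static final int CARDINALITY = 1024;
    // Number of hash probes per term (assumed value for this sketch).
    static final int PROBES = 2;

    // Map a bag of terms to a fixed-size vector of hashed term counts.
    static double[] vectorize(List<String> terms) {
        double[] v = new double[CARDINALITY];
        for (String term : terms) {
            for (int probe = 0; probe < PROBES; probe++) {
                // Salt the term with the probe number to get a different slot.
                int h = (term + "#" + probe).hashCode();
                v[Math.floorMod(h, CARDINALITY)] += 1.0;
            }
        }
        return v;
    }

    // Plain Euclidean distance between two equal-length vectors.
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Required cardinality " + a.length + " but got " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] oldDoc = vectorize(Arrays.asList("mahout", "clustering", "kmeans"));
        // The new document uses terms the old one never saw.
        double[] newDoc = vectorize(Arrays.asList("incremental", "news", "articles"));
        System.out.println("lengths: " + oldDoc.length + " and " + newDoc.length);
        System.out.println("distance: " + euclidean(oldDoc, newDoc));
    }
}
```

In this sketch each term occurrence adds weight at PROBES positions, so vector
norms grow with the probe count. That is one way the scaling of distances can
shift when you change the number of probes.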


>
> I don't completely agree with MIA 11.3.1's "use canopy clustering" phrase;
> I think it is a bit misleading. Each of the clustering algorithms (including
> canopy) has two phases: cluster generation and vector classification using
> those clusters. I think the best choice for a maximum likelihood classifier
> would actually be KMeansDriver.clusterData() and not the CanopyDriver
> version (which requires t1 and t2 values to initialize the clusterer but
> these are never used for classification).
>
> To really implement the case study it would seem to me to require a single
> threshold classification to avoid assigning new articles to existing
> clusters which were too dissimilar to really fit. Then these leftovers could
> be used to generate new clusters which could then be added to the list.
>
> Perhaps one of the authors can add some clarification on this too?
>
> Jeff
>
> On 1/20/11 8:24 AM, Veronica Joh wrote:
>
>> Hi
>> I have a large number of articles clustered by kmeans.
>> For the new articles that come in, it says I need to "use canopy
>> clustering to assign it to the cluster whose centroid is closest based on a
>> very small distance threshold" according to Mahout in Action book.
>> I'm not sure how to add new article canopies to the existing cluster.
>>
>> So I'm saving batch articles in a list of Cluster like this.
>> List<Cluster>  clusters = new ArrayList<Cluster>();
>>
>> For the new article canopies, I'm trying the following to measure the
>> distance, but I get an error like this: "Required cardinality 11981 but got
>> 77372"
>> Text key = new Text();
>> Canopy value = new Canopy();
>> DistanceMeasure measure = new EuclideanDistanceMeasure();
>> while (reader.next(key, value)){
>>      for (int i=0; i<clusters.size(); i++){
>>         double d = measure.distance(clusters.get(i).getCenter(),
>> value.getCenter());
>>      }
>> }
>>
>> Is this how to compare cluster centroids with new canopies? Or did I
>> misunderstand something?
>> Can you please help me so I can get the online news clustering working?
>> Thank you very much!
>>
>
>
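
The single-threshold classification Jeff describes above could be sketched
roughly as follows. This is plain Java rather than the Mahout drivers, and the
class name, the assign() helper, the threshold of 2.0, and the sample points
are all hypothetical. Each new vector goes to its nearest centroid only if
that centroid is within the threshold; otherwise it is kept as a leftover from
which new clusters could later be generated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of threshold-based assignment; not the Mahout API.
class ThresholdAssigner {
    // Plain Euclidean distance; assumes both vectors have the same length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the nearest centroid if it lies within the
    // threshold, or -1 if every centroid is too far away.
    static int assign(double[] point, List<double[]> centroids, double threshold) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double d = distance(point, centroids.get(i));
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return bestDistance <= threshold ? best : -1;
    }

    public static void main(String[] args) {
        List<double[]> centroids = new ArrayList<>();
        centroids.add(new double[] {0.0, 0.0});
        centroids.add(new double[] {10.0, 10.0});
        double threshold = 2.0; // assumed value; tune for real data

        double[][] incoming = { {0.5, 0.5}, {9.0, 10.0}, {5.0, 5.0} };
        List<double[]> leftovers = new ArrayList<>();
        for (double[] p : incoming) {
            int c = assign(p, centroids, threshold);
            if (c < 0) {
                leftovers.add(p); // too dissimilar; candidate for a new cluster
            } else {
                System.out.println("assigned to cluster " + c);
            }
        }
        System.out.println("leftovers: " + leftovers.size());
    }
}
```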
