Re: Canopy Generation

Christoph Brücke Tue, 28 Jun 2011 02:04:00 -0700

Hi Mark,

the T1 threshold should be strictly larger than the T2 one (T1 > T2). And yes, 
the cluster dumper utility should give you more than one cluster if more than 
one is present. The output looks like:
CL-0 { n=116 c=[29.922, 30.407, 30.373, 30.094, 29.886, ...] r=[3.463, 3.351, 
3.452, 3.438, 3.371, ...] }
CL-1 { n=... c=[... , ...] r=[3... , ...] }


Here CL-0 is the cluster id, n is the number of vectors within the cluster, 
c is the centroid and r is the radius.
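
To illustrate why T1 has to be larger than T2, and why overly large thresholds 
collapse everything into a single canopy, here is a minimal Python sketch of 
the general canopy idea. It is not Mahout's implementation; the function names 
and the toy points are made up for illustration only.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopies(points, t1, t2, distance=euclidean):
    """Textbook-style canopy generation (illustrative, not Mahout's code).

    t1 is the loose threshold: points closer than t1 join the canopy.
    t2 is the tight threshold: points closer than t2 are also removed
    from the candidate list, so they cannot start a canopy of their own.
    """
    if t1 <= t2:
        raise ValueError("t1 must be strictly larger than t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)      # next unassigned point becomes a center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)      # loosely belongs to this canopy
            if d >= t2:
                still_remaining.append(p)  # may still seed another canopy
        remaining = still_remaining
        result.append((center, members))
    return result


# Hypothetical toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]

print(len(canopies(points, t1=3.0, t2=1.0)))      # thresholds fit the data
print(len(canopies(points, t1=200.0, t2=100.0)))  # thresholds far too large
```

With the small thresholds the sketch produces two canopies; with the large 
ones every point lies within T2 of the first center, so only one canopy is 
created. If -n 2 L2-normalizes the TF-IDF vectors (which I believe it does, 
but please verify), Euclidean distances between such non-negative unit vectors 
cannot exceed sqrt(2), so thresholds well below that value would be the ones 
to try.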

On 27.06.2011 at 16:33, Mark wrote:

> My input data is a bunch of product item titles, so I first created sparse 
> vectors with seq2sparse:
> 
> bin/mahout seq2sparse -i sequence-input -o sparse-output -ow output 5 -md 2 
> -wt TFIDF -n 2 -ml 50 -nr 2 -ng 4 -seq -nv -x 80
> 
> 
> I then generated canopies:
> 
> mahout canopy -i sequence-input/tfidf-vectors -o canopies -dm 
> org.apache.mahout.common.distance.EuclideanDistanceMeasure -ow -xm sequential 
> -t1 100 -t2 200
> 
> 
> Also tried 1, 2 for t1,t2 respectively.
> 
> I guess I'll have to play with some other sample data and configurations to 
> see the results I get. If everything goes well I should see multiple 
> key/value pairs when inspecting the canopies via ClusterDump, correct?
> 
> Something like this?
> 
> Key: C-0: Value: C-0: ...
> Key: C-1: Value: C-1: ...
> Key: C-2: Value: C-2: ...
> 
> 
> Thanks
> 
> 
> On 6/27/11 2:12 AM, Christoph Brücke wrote:
>> Hi,
>> 
>> usually, depending on the input data, there should be more than just one 
>> cluster. You may use the cluster dumper utility to output the cluster data.  
>> (https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper)
>> 
>> It seems that your t1 and t2 thresholds for the canopies are chosen too large, 
>> so that all data points are assigned to just one canopy. Could you describe 
>> your input data (number of dimensions, range, distribution, ...) and give 
>> the parameters you used for the clustering?
>> 
>> Regards,
>> Christoph
>> 
>> On 27.06.2011 at 00:40, Mark wrote:
>> 
>>> Is there an easy way to know how many canopies were generated after 
>>> running the canopy generation tool?
>>> 
>>> I tried viewing the file clusters-0/part-r-00000 via seqdumper but it 
>>> always returns:
>>> 
>>> Key: C-0: Value: C-0: 
>>> {437:0.005630003188145648,478:0.006034746778989781,591:0.020761514762446885...
>>> Count: 1
>>> 
>>> Should there be multiple key value pairs or just this one?
>>> 
>>> Thanks
>>> 
>>> 
>> 
>
