Re: Canopy Generation

Mark Mon, 27 Jun 2011 07:34:15 -0700

My input data is a bunch of product item titles, so I first created sparse vectors with seq2sparse:


bin/mahout seq2sparse -i sequence-input -o sparse-output -ow output 5 -md 2 -wt 
TFIDF -n 2 -ml 50 -nr 2 -ng 4 -seq -nv -x 80



I then generated canopies:

mahout canopy -i sequence-input/tfidf-vectors -o canopies -dm 
org.apache.mahout.common.distance.EuclideanDistanceMeasure -ow -xm sequential 
-t1 100 -t2 200


I also tried 1 and 2 for t1 and t2, respectively.

I guess I'll have to play with some other sample data and configurations to see the results I get. If everything goes well, I should see multiple key/value pairs when inspecting the canopies via ClusterDump, correct?


Something like this?

Key: C-0: Value: C-0: ...
Key: C-1: Value: C-1: ...
Key: C-2: Value: C-2: ...


Thanks


On 6/27/11 2:12 AM, Christoph Brücke wrote:

Hi,

Usually, depending on the input data, there should be more than just one cluster.
You can use the cluster dumper utility to output the cluster data
(https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper).

It seems that your t1 and t2 thresholds for the canopies are chosen too large, so
that all data points are assigned to just one canopy. Could you describe your
input data (number of dimensions, range, distribution, ...) and give the 
parameters you used for the clustering?
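
To illustrate why large thresholds collapse everything into one canopy, here is a minimal Python sketch of the canopy idea. It is a simplified illustration, not Mahout's implementation: the function names and the toy points are made up for this example, and it assumes the vectors are non-negative and L2-normalized (as TF-IDF vectors produced with -n 2 would be). Under that assumption the Euclidean distance between any two vectors is at most sqrt(2), about 1.414, so a t2 of 200 or even 2 can never start a second canopy. Canopy clustering also conventionally expects t1 > t2.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def canopies(points, t1, t2):
    """Simplified sequential canopy pass (hypothetical helper).

    A point joins every canopy whose center is closer than t1.
    It starts a new canopy only if no existing center is closer than t2.
    Returns a list of (center, members) pairs.
    """
    result = []
    for p in points:
        strongly_bound = False
        for center, members in result:
            d = euclidean(center, p)
            if d < t1:
                members.append(p)
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            result.append((p, [p]))
    return result


# Toy data: three mutually orthogonal unit vectors plus one near-duplicate.
points = [normalize(v) for v in ([1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0.1, 0])]

print(len(canopies(points, 100, 200)))  # thresholds far above the max distance
print(len(canopies(points, 1, 2)))      # t2 = 2 is still above sqrt(2)
print(len(canopies(points, 1.0, 0.5)))  # thresholds below sqrt(2), t1 > t2
```

With these toy points the first two calls yield a single canopy and the third yields three, which suggests trying thresholds well below 1.414 (with t1 > t2) on normalized data.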

Regards,
Christoph

On 27.06.2011 at 00:40, Mark wrote:

Is there an easy way to know how many canopies were generated after running
the canopy generation tool?

I tried viewing the file clusters-0/part-r-00000 via seqdumper but it always 
returns:

Key: C-0: Value: C-0: 
{437:0.005630003188145648,478:0.006034746778989781,591:0.020761514762446885...
Count: 1

Should there be multiple key value pairs or just this one?

Thanks
