My input data is a bunch of product item titles, so I first created sparse vectors with seq2sparse:

bin/mahout seq2sparse -i sequence-input -o sparse-output -ow output 5 -md 2 -wt TFIDF -n 2 -ml 50 -nr 2 -ng 4 -seq -nv -x 80


I then generated canopies:

mahout canopy -i sequence-input/tfidf-vectors -o canopies -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -ow -xm sequential -t1 100 -t2 200


I also tried t1 = 1 and t2 = 2.

I guess I'll have to play with some other sample data and configurations to see what results I get. If everything goes well, I should see multiple key/value pairs when inspecting the canopies via ClusterDump, correct?

Something like this?

Key: C-0: Value: C-0: ...
Key: C-1: Value: C-1: ...
Key: C-2: Value: C-2: ...
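
One rough way to get the number of canopies is to save the dump as text and count the key lines. The Python sketch below assumes each canopy appears on its own line starting with "Key: C-", as in the output shown in this thread; the function name count_canopy_keys and the sample text are only illustrative.

```python
def count_canopy_keys(dump_text):
    """Count lines that look like canopy entries ("Key: C-<n>: ...")."""
    return sum(
        1 for line in dump_text.splitlines()
        if line.strip().startswith("Key: C-")
    )

# Sample text mirroring the output format quoted in this thread.
sample_dump = """Key: C-0: Value: C-0: ...
Key: C-1: Value: C-1: ...
Key: C-2: Value: C-2: ...
Count: 3
"""

print(count_canopy_keys(sample_dump))  # prints 3
```

If the dump ends with a "Count:" line, as the seqdumper output quoted further down does, that number should match.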


Thanks


On 6/27/11 2:12 AM, Christoph Brücke wrote:
Hi,

Usually, depending on the input data, there should be more than just one cluster. 
You may use the cluster dumper utility to output the cluster data.  
(https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper)

It seems that the t1 and t2 thresholds for your canopies are too large, so 
that all data points are assigned to just one canopy. Could you describe your 
input data (number of dimensions, range, distribution, ...) and give the 
parameters you used for the clustering?
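
To illustrate why oversized thresholds collapse everything into one canopy, here is a minimal Python sketch of the canopy idea. It is a simplified illustration, not Mahout's implementation: the function build_canopies and the sample points are hypothetical. It also assumes that the -n 2 option in the seq2sparse command from this thread L2-normalizes the TF-IDF vectors. If that holds, any two non-negative unit vectors are at most sqrt(2) (about 1.414) apart in Euclidean distance, so a t2 of 2 or more binds every point to the first canopy.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_canopies(points, t1, t2):
    """Simplified single-pass canopy generation (illustrative only).

    A point joins every canopy whose center is closer than t1.
    A point starts a new canopy only if no existing center is closer than t2.
    """
    canopies = []  # each entry: {"center": point, "members": [points]}
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = euclidean(c["center"], p)
            if d < t1:
                c["members"].append(p)
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append({"center": p, "members": [p]})
    return canopies

# Hypothetical L2-normalized vectors (each has length 1).
points = [
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
    (0.6, 0.8, 0.0),
]

for t1, t2 in [(100, 200), (1, 2), (2, 1), (1.0, 0.5)]:
    print(t1, t2, len(build_canopies(points, t1, t2)))
# 100 200 1
# 1 2 1
# 2 1 3
# 1.0 0.5 4
```

In this sketch only t2 decides how many canopies are created, while t1 only controls membership. Canopy clustering is normally described with t1 > t2, so thresholds well below sqrt(2), with t1 the larger of the two, would be the range to experiment with under the normalization assumption above.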

Regards,
Christoph

On 27.06.2011 at 00:40, Mark wrote:

Is there an easy way to know how many canopies were generated after running 
the canopy generation tool?

I tried viewing the file clusters-0/part-r-00000 via seqdumper but it always 
returns:

Key: C-0: Value: C-0: 
{437:0.005630003188145648,478:0.006034746778989781,591:0.020761514762446885...
Count: 1

Should there be multiple key/value pairs, or just this one?

Thanks


