Hi Mark,
the T1 threshold should be strictly larger than the T2 one (T1 > T2). And yes, the
cluster dumper utility should give you more than one cluster if present. The
output looks like:
CL-0 { n=116 c=[29.922, 30.407, 30.373, 30.094, 29.886, ...] r=[3.463, 3.351,
3.452, 3.438, 3.371, ...] }
CL-1 { n=... c=[... , ...] r=[3... , ...] }
Here CL-0 is the cluster id, n is the number of vectors within the cluster,
c is the centroid and r is the radius.
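
To illustrate why the thresholds matter, here is a rough sketch of the textbook canopy procedure in Python. It is my own simplified illustration, not Mahout's code: the function names (euclidean, canopies) and the sample points are made up for the example. It shows that thresholds far larger than the spread of the data collapse everything into a single canopy, and it rejects T1 <= T2 up front.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopies(points, t1, t2):
    """Simplified canopy generation (illustrative only, not Mahout's code).

    t1 is the loose threshold (canopy membership), t2 the tight one
    (points this close to a center can no longer start a new canopy).
    """
    if not t1 > t2:
        raise ValueError("T1 must be strictly larger than T2")

    remaining = list(points)
    result = []
    while remaining:
        # Take the next unassigned point as a new canopy center.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = euclidean(center, p)
            if d < t1:
                members.append(p)          # loosely bound: canopy member
            if d >= t2:
                still_remaining.append(p)  # not tightly bound: stays a candidate
        remaining = still_remaining
        result.append((center, members))
    return result


# Hypothetical sample data: two well separated groups.
points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]

print(len(canopies(points, t1=3.0, t2=1.0)))      # 2 canopies
print(len(canopies(points, t1=200.0, t2=100.0)))  # 1 canopy: thresholds too large
```

With thresholds on the scale of the actual distances in the data, the two groups end up in separate canopies. With very large values every point falls within T2 of the first center, so only one canopy is ever created.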
On 27.06.2011 at 16:33, Mark wrote:
> My input data is a bunch of product item titles, so I first created sparse
> vectors with seq2sparse:
>
> bin/mahout seq2sparse -i sequence-input -o sparse-output -ow output 5 -md 2
> -wt TFIDF -n 2 -ml 50 -nr 2 -ng 4 -seq -nv -x 80
>
>
> I then generated canopies:
>
> mahout canopy -i sequence-input/tfidf-vectors -o canopies -dm
> org.apache.mahout.common.distance.EuclideanDistanceMeasure -ow -xm sequential
> -t1 100 -t2 200
>
>
> Also tried 1, 2 for t1,t2 respectively.
>
> I guess I'll have to play with some other sample data and configurations to
> see the results I get. If everything goes well, I should see multiple
> key/value pairs when inspecting the canopies via ClusterDump, correct?
>
> Something like this?
>
> Key: C-0: Value: C-0: ...
> Key: C-1: Value: C-1: ...
> Key: C-2: Value: C-2: ...
>
>
> Thanks
>
>
> On 6/27/11 2:12 AM, Christoph Brücke wrote:
>> Hi,
>>
>> Usually, depending on the input data, there should be more than just one
>> cluster. You may use the cluster dumper utility to output the cluster data.
>> (https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper)
>>
>> It seems that your t1 and t2 thresholds for the canopies are chosen too large,
>> so that all data points are assigned to just one canopy. Could you describe
>> your input data (number of dimensions, range, distribution, ...) and give
>> the parameters you used for the clustering?
>>
>> Regards,
>> Christoph
>>
>> On 27.06.2011 at 00:40, Mark wrote:
>>
>>> Is there an easy way to know how many canopies were generated after
>>> running the canopy generation tool?
>>>
>>> I tried viewing the file clusters-0/part-r-00000 via seqdumper but it
>>> always returns:
>>>
>>> Key: C-0: Value: C-0:
>>> {437:0.005630003188145648,478:0.006034746778989781,591:0.020761514762446885...
>>> Count: 1
>>>
>>> Should there be multiple key value pairs or just this one?
>>>
>>> Thanks
>>>
>>>
>>
>