I ran into the same issue. What I found is that many of the initial centers
are empty, and the final number of clusters is determined by the number of
nonempty centers.
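To illustrate the effect (this is not Mahout's code, just a toy Lloyd-style k-means under my own assumptions: dense vectors, squared Euclidean distance, ties going to the lowest-index center), here is a minimal sketch of how duplicate all-zero seeds can leave a center with no points. An implementation that drops unpopulated clusters from its output would then report fewer than k. All names below are hypothetical.

```python
def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length dense vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    # For each center, collect the points nearest to it.
    # Assumption: ties are broken in favor of the lowest-index center.
    members = [[] for _ in centers]
    for p in points:
        distances = [squared_distance(p, c) for c in centers]
        members[distances.index(min(distances))].append(p)
    return members


def kmeans(points, centers, max_iterations=20):
    for _ in range(max_iterations):
        members = assign(points, centers)
        new_centers = []
        for center, group in zip(centers, members):
            if group:
                dim = len(center)
                new_centers.append(
                    [sum(p[i] for p in group) / len(group) for i in range(dim)]
                )
            else:
                # An unpopulated center has nothing to average, so it stays put.
                new_centers.append(center)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# k = 3, but two of the seeds are identical all-zero ("empty") vectors.
seeds = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]

centers, members = kmeans(points, seeds)
populated = sum(1 for group in members if group)
print(f"requested k={len(seeds)}, populated clusters={populated}")
```

In this toy run the second all-zero seed never wins a point, so it ends with n=0 even though k=3 was requested.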
Here are some examples from my runs:
...
CL-34358205{n=0 c=[] r=[]}
CL-34358207{n=0 c=[] r=[]}
CL-34358209{n=0 c=[] r=[]}
CL-34358213{n=0 c=[0:1.000] r=[]}
CL-34358215{n=0 c=[] r=[]}
CL-34358216{n=0 c=[] r=[]}
CL-34358217{n=0 c=[] r=[]}
CL-34358220{n=0 c=[] r=[]}
CL-34358221{n=0 c=[] r=[]}
CL-34358222{n=0 c=[] r=[]}
CL-34358223{n=0 c=[] r=[]}
CL-34358224{n=0 c=[] r=[]}
CL-34358227{n=0 c=[0:1.000] r=[]}
CL-34358228{n=0 c=[] r=[]}
CL-34358229{n=0 c=[] r=[]}
...
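A quick way to tally the empty centers in a listing like the one above, as a rough sketch: it assumes the dump text has been saved somewhere and that every line has exactly the `CL-<id>{n=... c=[...] r=[...]}` shape shown here. Real clusterdump output may contain other line formats, which this simply skips. The function and pattern names are my own.

```python
import re

# Assumed line shape, taken from the listing above: CL-<id>{n=<count> c=[...] r=[...]}
CLUSTER_LINE = re.compile(r"^(CL-\d+)\{n=(\d+) c=\[(.*?)\] r=\[(.*?)\]\}$")


def count_centers(lines):
    # Split cluster ids into those with an empty center (c=[]) and the rest.
    empty, nonempty = [], []
    for line in lines:
        match = CLUSTER_LINE.match(line.strip())
        if not match:
            continue  # not a cluster line in the assumed format
        cluster_id, _n, center, _radius = match.groups()
        if center.strip():
            nonempty.append(cluster_id)
        else:
            empty.append(cluster_id)
    return empty, nonempty


sample = """CL-34358205{n=0 c=[] r=[]}
CL-34358213{n=0 c=[0:1.000] r=[]}
CL-34358215{n=0 c=[] r=[]}
CL-34358227{n=0 c=[0:1.000] r=[]}""".splitlines()

empty, nonempty = count_centers(sample)
print(f"empty centers: {len(empty)}, nonempty centers: {len(nonempty)}")
```

Comparing the nonempty count against the final cluster count for a few runs would show whether the two really track each other.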
Could this be a bug in the initialization?
Thanks.
Dan
-----Original Message-----
From: Pat Ferrel [mailto:[email protected]]
Sent: Tuesday, May 08, 2012 9:13 AM
To: [email protected]
Subject: Re: kmeans not returning k clusters
Here is a sample data set. In this case I asked for 30 clusters and got 28, but
in other cases the discrepancy has been greater: for example, I asked for 200
and got 38, though that was for a much larger data set.
Running on my Mac laptop in a single-node pseudo-cluster: Hadoop 0.20.205,
Mahout 0.6.
Command line:
mahout kmeans \
-i b2/bixo-vectors/tfidf-vectors/ \
-c b2/bixo-kmeans-centroids \
-cl \
-o b2/bixo-kmeans-clusters \
-k 30 \
-ow \
-cd 0.01 \
-x 20 \
-dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
BTW, when I run rowsimilarity asking for 20 similar docs I get at most 20, but
sometimes far fewer. Shouldn't this always return the requested number?
I'll post this question again to get the attention of the right person.
On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
> I looked at the 0.6 version's code but was not able to find any reason.
> If possible, can you share the data you are trying to cluster along
> with the execution parameters?
>
> You can also open a Jira for this and provide the info there.
>
> On 07-05-2012 19:45, Pat Ferrel wrote:
>> 0.6
>>
>> I take it this is not expected behavior? I could be doing something
>> stupid. I only look in the "final" directory. Looking in the others
>> with clusterdump shows the same number of clusters and I assumed they
>> were iterations.
>>
>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>> Which version are you using? 0.6 or the current 0.7-snapshot?
>>>
>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>> What would cause kmeans to not return k clusters? As I tweak
>>>> parameters I get different numbers of clusters but it's usually
>>>> less than the k I pass in. Since I am not using canopies at present
>>>> I would expect k to always be honored but the quality of the
>>>> clusters would depend on the convergence amount and number of
>>>> iterations allowed. No?
>>>
>>>
>>>
>
>
>