Re: kmeans not returning k clusters

Pat Ferrel Wed, 09 May 2012 17:26:11 -0700

I have not checked with canopy since you don't really tell kmeans how many to create; it's a little hidden. That's why I said I don't care about the number, just that I'm not losing real/important clusters.

The size of the vectors in the data set is something like 3000, I think. Very little pruning, just a bixo + boilerpipe crawl of a few sites at a minimum depth. Here is the seq2sparse command I ran:


mahout seq2sparse \
    -i b2/bixo-seqfiles/ \
    -o b2/bixo-vectors/ \
    -ow -chunk 2000 \
    -x 90 \
    -seq \
    -n 2 \
    -nv

I agree if no one else is seeing this, it may be weirdness of my own creation.


On 5/9/12 12:24 PM, Jeff Eastman wrote:

Does this cluster reduction happen when you prime k-means with canopy? Can you first adjust T1==T2 to get about 200 canopies and feed that to k-means? How wide are your term vectors? Have you tried other distance measures?
If anybody else out there is experiencing similar problems, please chime in.
Jeff

On 5/9/12 1:07 PM, Pat Ferrel wrote:
That's what I'm doing now. Random seeds are not really the best way to do kmeans. However, my results are repeatable as far as I've gone. And canopy wants to generate a much larger set of clusters, with a wide range of T1 and T2 for this data set, so the theory that it does not support 30 clusters seems unlikely, although they may be a fair distance apart.
I've tried several times with several random seeds, so the "seeds are too close" theory doesn't seem likely. Given canopy wants to generate more clusters, the "doesn't support k = 30" theory doesn't seem likely either.
I'm not saying that there is a real problem here, but when I noticed it I had 16,000 documents and was asking for 200 clusters and got 38. If there is some good reason for this it would be nice to find it and report it to the user. The "good reason" might be very helpful in the analysis. Or it could be a bug.
At least it's out there in case others are seeing lost clusters.

On 5/9/12 7:49 AM, Jeff Eastman wrote:
Paritosh is correct in his analysis. K-means can work itself into a situation where there are some empty clusters if the initial cluster centers are too closely spaced or if the data really doesn't support k clusters. This is because it assigns each vector to the most likely (closest) cluster. If two prior clusters are very close together this can cause one of them to become empty.
Have you tried priming k-means with canopy instead of the random sampler?
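
To make the empty-cluster mechanism above concrete, here is a minimal sketch in plain Python. It is not Mahout's code: the 1-D data, the seeds, and the function names are all made up for illustration, and it takes the extreme case where two of the k = 3 seeds coincide. Because every point goes to the single closest center, the duplicate seed never wins a point and its cluster stays empty.

```python
# Illustrative sketch only: Lloyd-style k-means on 1-D points.
# This is NOT Mahout's implementation; data and names are hypothetical.

def assign(points, centers):
    """Return, for each center index, the list of points closest to it."""
    members = [[] for _ in centers]
    for p in points:
        # min() returns the first index on a tie, so a duplicate center
        # that comes later in the list never receives a point.
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        members[nearest].append(p)
    return members

def kmeans(points, centers, iterations=10):
    """Run a fixed number of assign/update rounds and return the result."""
    for _ in range(iterations):
        members = assign(points, centers)
        # An empty cluster has nothing to average, so it keeps its old center.
        centers = [sum(m) / len(m) if m else c
                   for m, c in zip(members, centers)]
    return centers, assign(points, centers)

points = [1.0, 2.0, 3.0, 9.0, 10.0, 11.0]
seeds = [2.0, 2.0, 10.0]  # k = 3, but two seeds coincide
centers, members = kmeans(points, seeds)
non_empty = sum(1 for m in members if m)
print(non_empty)  # 2
```

If empty clusters are dropped from the final output, as the thread suggests may be happening, a run like this would report 2 clusters although 3 were requested.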
On 5/9/12 10:35 AM, Pat Ferrel wrote:
I suspect you are right, Paritosh. I ran the random seed with kmeans several times on the supplied data set and always got 28 rather than 30 clusters. I don't care so much about the number, but it might mean that some clusters are thrown out, and without looking you couldn't tell if they were important ones or not. Just upping k to 32 doesn't really work if you still get some thrown out.
At least I think the issue is repeatable with this data.

On 5/9/12 1:14 AM, Paritosh Ranjan wrote:
Printouts of Mahout vectors print only the non-zero elements.
So, the centers are not empty, rather they are zero.
Prima facie, I suspect that you are getting a lot of empty clusters. This might be occurring due to the combination of distance measure, convergence threshold and distances between vectors.
Can you try to analyze and change/play around with these parameters?
I will try to look into how the Random Cluster Initialization is working. I will log a jira if I find some issue. However, I think that there will be no problem in the cluster initialization part.
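
The point about printouts can be sketched as follows. This is a hypothetical formatter written only to mimic the look of the `c=[...]` fields in the dump quoted further down this thread; it is not Mahout's actual printing code. It lists only non-zero entries as index:value, so an all-zero center prints as `[]` even though the vector itself exists.

```python
# Hypothetical formatter, for illustration only (not Mahout's code).

def format_sparse(values):
    """Format a vector, listing only its non-zero entries as index:value."""
    entries = [f"{i}:{v:.3f}" for i, v in enumerate(values) if v != 0.0]
    return "[" + ", ".join(entries) + "]"

zero_center = [0.0, 0.0, 0.0, 0.0]  # a center whose elements are all zero
unit_center = [1.0, 0.0, 0.0, 0.0]  # a center with one non-zero element

print(format_sparse(zero_center))  # []
print(format_sparse(unit_center))  # [0:1.000]
```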
On 09-05-2012 03:21, Danfeng Li wrote:
I got the same issue. What I found is that the initial centers have many empty ones; the final number of clusters is decided by the number of nonempty centers.
Here are some examples of my cases:

...
CL-34358205{n=0 c=[] r=[]}
CL-34358207{n=0 c=[] r=[]}
CL-34358209{n=0 c=[] r=[]}
CL-34358213{n=0 c=[0:1.000] r=[]}
CL-34358215{n=0 c=[] r=[]}
CL-34358216{n=0 c=[] r=[]}
CL-34358217{n=0 c=[] r=[]}
CL-34358220{n=0 c=[] r=[]}
CL-34358221{n=0 c=[] r=[]}
CL-34358222{n=0 c=[] r=[]}
CL-34358223{n=0 c=[] r=[]}
CL-34358224{n=0 c=[] r=[]}
CL-34358227{n=0 c=[0:1.000] r=[]}
CL-34358228{n=0 c=[] r=[]}
CL-34358229{n=0 c=[] r=[]}
...

Is it the case there is a bug in initialization?

Thanks.
Dan

-----Original Message-----
From: Pat Ferrel [mailto:[email protected]]
Sent: Tuesday, May 08, 2012 9:13 AM
To: [email protected]
Subject: Re: kmeans not returning k clusters
Here is a sample data set. In this case I asked for 30 and got 28, but in other cases the discrepancy has been greater, like ask for 200 and get 38, but that was for a much larger data set.
Running on my mac laptop in a single node pseudo cluster, hadoop 0.20.205, mahout 0.6
command line:

mahout kmeans \
      -i b2/bixo-vectors/tfidf-vectors/ \
      -c b2/bixo-kmeans-centroids \
      -cl \
      -o b2/bixo-kmeans-clusters \
      -k 30 \
      -ow \
      -cd 0.01 \
      -x 20 \
      -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
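
For readers unfamiliar with the distance measure named in the command, here is a minimal sketch of the textbook Tanimoto distance, 1 - a.b / (|a|^2 + |b|^2 - a.b), in plain Python. It makes no claim about how Mahout's TanimotoDistanceMeasure handles edge cases; in particular, returning 0.0 for two all-zero vectors is an assumption made here only to avoid dividing by zero.

```python
# Illustrative sketch of the standard Tanimoto distance on dense lists.
# Not Mahout's implementation; the zero-vector convention is an assumption.

def tanimoto_distance(a, b):
    """Return 1 - a.b / (|a|^2 + |b|^2 - a.b) for equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        # Both vectors are all zeros; treating them as identical is
        # an assumption of this sketch.
        return 0.0
    return 1.0 - dot / denom

doc = [1.0, 2.0, 0.0]
print(tanimoto_distance(doc, doc))              # 0.0
print(tanimoto_distance(doc, [0.0, 0.0, 3.0]))  # 1.0
print(tanimoto_distance(doc, [0.0, 0.0, 0.0]))  # 1.0
```

Under this formula an all-zero center sits at distance 1.0 from every non-zero document, the same as a vector that shares no terms with it.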

Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
BTW when I run rowsimilarity asking for 20 similar docs I get a max of 20 but sometimes many fewer. Shouldn't this always return the requested number? I'll post this question again to the attention of the right person.
On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
I looked at the 0.6 version's code but was not able to find any reason.
If possible, can you share the data you are trying to cluster along
with the execution parameters?

You can also open a Jira for this and provide the info there.

On 07-05-2012 19:45, Pat Ferrel wrote:
0.6
I take it this is not expected behavior? I could be doing something stupid. I only look in the "final" directory. Looking in the others with clusterdump shows the same number of clusters and I assumed they were iterations.

On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
Which version are you using ? 0.6 or the current 0.7-snapshot?

On 07-05-2012 02:19, Pat Ferrel wrote:
What would cause kmeans to not return k clusters? As I tweak
parameters I get different numbers of clusters but it's usually
less than the k I pass in. Since I am not using canopies at present
I would expect k to always be honored but the quality of the
clusters would depend on the convergence amount and number of
iterations allowed. No?
