Re: [scikit-learn] Issues with kmeans: Difference in centroid values

2018-04-16 Thread Andreas Mueller



On 04/16/2018 04:07 PM, Sidak Pal Singh wrote:

Hi everyone,

I was using scikit-learn KMeans algorithm to cluster pretrained 
word-vectors. There are a few things which I found to be surprising 
and wanted to get some feedback on.


- Based upon the 'labels_' assigned to each word-vector (i.e. cluster 
memberships), I compute every cluster centroid as the average of the 
word-vectors (corresponding to that cluster). Surprisingly, this seems 
to be pretty different from the 'cluster_centers_'. Is there anything 
that I am missing here?

If the algorithm did not fully converge, averaging the word-vectors by 
'labels_' amounts to one more update step, so the result is expected to 
differ from 'cluster_centers_'.
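
To make the "one more step" point concrete, here is a minimal NumPy sketch of a Lloyd-style k-means iteration on toy 1-D data. It is not scikit-learn's implementation: the helper names assign and update are made up for illustration, and it assumes the returned labels are the nearest-center assignment for the returned centers.

```python
import numpy as np

def assign(X, centers):
    """E-step: label each point with the index of its nearest center."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

def update(X, labels, centers):
    """M-step: move each center to the mean of its assigned points."""
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = X[labels == j]
        if len(members):  # leave an empty cluster's center where it was
            new_centers[j] = members.mean(axis=0)
    return new_centers

# Toy data: two obvious groups, with a deliberately poor starting point.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
centers = np.array([[0.0], [1.0]])

# Run a single iteration and stop early, as if the iteration limit was hit.
centers = update(X, assign(X, centers), centers)
labels = assign(X, centers)  # labels consistent with the returned centers

# Averaging the points by label is exactly one more M-step.
label_means = update(X, labels, centers)

print("returned centers:", centers.ravel())
print("means by label:  ", label_means.ravel())
```

Here the centers returned after the early stop are 0 and 7.2, while the per-label means are 1 and 11. The two only coincide once the update step no longer moves the centers.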


- I was later using the verbose option to see if the clustering has 
converged or not. I saw on the console log messages such as "center 
shift 7.994126e-04 within tolerance 1.243425e-06". It seems that this 
corresponds to some code in *kmeans_elkan.pyx* 
(https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/_k_means_elkan.pyx). 

- Lastly, another thing that seems strange is that I hadn't set the 
tolerance value. So the default of 1e-4 should have been used. But if 
you look again at the above log, it says "within tolerance 
1.243425e-06" instead of 1e-4.

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/k_means_.py#L159
The tolerance is scaled by the variance of the data to be independent of 
the scale.
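
A small sketch of what that scaling could look like, assuming the effective tolerance is the user-supplied tol multiplied by the mean per-feature variance of the data. That is my reading of the linked line and may differ between versions. scaled_tolerance is a hypothetical helper name, not a scikit-learn function.

```python
import numpy as np

def scaled_tolerance(X, tol=1e-4):
    """Hypothetical helper: tol times the mean per-feature variance of X."""
    return tol * np.mean(np.var(X, axis=0))

# Toy data: per-feature variances are 1.0 and 4.0, so their mean is 2.5.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]])
print(scaled_tolerance(X))

# Under the same assumption, the log line quoted in this thread implies
# this mean per-feature variance for the word-vectors:
implied_mean_variance = 1.243425e-06 / 1e-4
print(implied_mean_variance)
```

Under that assumption, a logged tolerance of 1.243425e-06 with the default tol of 1e-4 would correspond to a mean per-feature variance of about 0.0124.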


___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Issues with kmeans: Difference in centroid values

2018-04-16 Thread Sidak Pal Singh
Hi everyone,

I was using scikit-learn KMeans algorithm to cluster pretrained
word-vectors. There are a few things which I found to be surprising and
wanted to get some feedback on.

- Based upon the 'labels_' assigned to each word-vector (i.e. cluster
memberships), I compute every cluster centroid as the average of the
word-vectors (corresponding to that cluster). Surprisingly, this seems to
be pretty different from the 'cluster_centers_'. Is there anything that I
am missing here?

- I was later using the verbose option to see if the clustering has
converged or not. I saw on the console log messages such as "center
shift 7.994126e-04 within tolerance 1.243425e-06". It seems that this
corresponds to some code in *kmeans_elkan.pyx* (
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/_k_means_elkan.pyx
).

- Lastly, another thing that seems strange is that I hadn't set the
tolerance value. So the default of 1e-4 should have been used. But if you
look again at the above log, it says "within tolerance 1.243425e-06" instead
of 1e-4.

It would be great if you can look into this and help me out.

Thank you so much! :)

Best,
Sidak Pal Singh
EPFL