Re: [R] Kmeans centers

2007-03-30 Thread Sergio Della Franca
My problem is simply that kmeans gives me different results each time I
run it because, if centers is a number, a random set of (distinct) rows
in x is chosen as the initial centres.

To me the problem is simple.

My question is whether centers can be something other than a number,
i.e. instead of giving the number of centers, is it possible to give
different character labels to identify the clusters I want to obtain?

Thank you






[[alternative HTML version deleted]]

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Kmeans centers

2007-03-30 Thread Gavin Simpson
On Fri, 2007-03-30 at 09:07 +0200, Sergio Della Franca wrote:
 My problem is simply that kmeans gives me different results each time I
 run it because, if centers is a number, a random set of (distinct) rows
 in x is chosen as the initial centres.

You can stop this and make it reproducible by setting the seed for the
random number generator before calling kmeans - this way the same
(pseudo)random set of rows gets selected each time:

dat <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
set.seed(1234)
km <- kmeans(dat, 2)
set.seed(1234)
km2 <- kmeans(dat, 2)
all.equal(km, km2) ## TRUE

But ask yourself: is this helpful? If you run the function several times
without setting the seed, are the solutions similar? If the runs give
very different results then it is likely that you are finding local
minima rather than an optimal solution - a common problem with iterative
algorithms using random starts.
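
A minimal sketch of how one might check this, assuming simulated data and
arbitrary seeds (the names seeds and tot_wss are illustrative only). It runs
kmeans from several random starts and compares the total within-cluster sum
of squares of each run:

```r
# Simulated data with three loose groups (an assumption for illustration).
set.seed(42)
dat <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 3), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)

# Run kmeans once per seed, so each run uses a different random start.
seeds <- 1:10
tot_wss <- sapply(seeds, function(s) {
  set.seed(s)
  kmeans(dat, centers = 3)$tot.withinss
})

# A wide spread between the smallest and largest value suggests that
# some runs stopped in a local minimum.
range(tot_wss)
```

If the values are essentially identical across seeds, the random start is
probably not affecting the solution much for these data.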

One solution to this /is/ to use several random starts and see if you
get similar results. Some samples may switch clusters, but if the bulk
of the samples are assigned to the same cluster (i.e. grouped together,
not necessarily in cluster 1, as the cluster numbering is arbitrary)
then you can be happy with the result. That some samples switch clusters
may just indicate that there isn't a clearly defined clustering of all
your samples - some are intermediate between clusters.
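
As a rough sketch of that comparison (the seeds and object names here are
arbitrary choices, not part of the original advice), two runs can be
cross-tabulated to see whether the same samples end up together, whatever
the cluster numbers are. Current versions of kmeans() also have an nstart
argument that tries several random starts and keeps the best one:

```r
set.seed(1)
dat <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))

# Two runs from different random starts.
set.seed(10)
km_a <- kmeans(dat, 2)
set.seed(20)
km_b <- kmeans(dat, 2)

# Cross-tabulate memberships. If most counts sit in one cell per row,
# the two runs agree up to a relabelling of the clusters.
agreement <- table(run_a = km_a$cluster, run_b = km_b$cluster)
agreement

# Let kmeans try 25 random starts itself and keep the solution with the
# smallest total within-cluster sum of squares.
km_best <- kmeans(dat, 2, nstart = 25)
km_best$tot.withinss
```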

Another is to use a hierarchical cluster analysis (via hclust()). Cut it
at the number of clusters you want and use the centers (sic) of those
clusters as the starting points for kmeans. This way the hclust()
results get you close to a good solution, which kmeans then updates as
it is not constrained by having a hierarchical structure.
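
A hedged sketch of this idea, assuming simulated data with two
well-separated groups and average linkage (both are choices made for
illustration, and this is not a copy of the MASS script):

```r
set.seed(1)
dat <- data.frame(
  a = c(rnorm(50), rnorm(50, mean = 4)),
  b = c(rnorm(50), rnorm(50, mean = 4))
)
k <- 2

# Hierarchical clustering, then cut the tree into k groups.
hc <- hclust(dist(dat), method = "average")
groups <- cutree(hc, k = k)

# Centroid of each hclust group: one row per cluster, one column per
# variable (the first column of the aggregate() result is the group id).
init_centers <- as.matrix(
  aggregate(dat, by = list(group = groups), FUN = mean)[, -1]
)

# Start kmeans from those centroids. No random start is involved, so
# repeated calls give the same partition.
km1 <- kmeans(dat, centers = init_centers)
km2 <- kmeans(dat, centers = init_centers)
identical(km1$cluster, km2$cluster)
```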

There is an example of this in Modern Applied Statistics with S
(Venables and Ripley, 2002, Springer), but if you don't have this book,
you can see the MASS scripts for Chapter 11 of the book. The MASS scripts
should have been provided with your copy of R, in
RINSTALL/library/MASS/scripts/ where RINSTALL is where your version of R
is installed. You want ch11.R in that directory. Look at section 11.2,
Cluster Analysis, in that file.

  
 To me the problem is simple.
 
 My question is whether centers can be something other than a number,
 i.e. instead of giving the number of centers, is it possible to give
 different character labels to identify the clusters I want to obtain?

No. And this is why, despite how clear and simple the problem is to you,
you need to show us an example of your data! Surely, if you have
information that exactly identifies the clusters you want to find, why
do you need a clustering algorithm to find them for you?
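
To illustrate the point with a sketch: the built-in iris data is used here
as a stand-in, with Species playing the role of known group labels (an
assumption, since the poster's data was never shown). With labels in hand,
group summaries need no clustering, and the labels can still be used to
interpret a kmeans result afterwards:

```r
# iris ships with R; the four measurement columns are numeric.
x <- iris[, 1:4]
labels <- iris$Species

# With known labels, the group centroids can be computed directly.
centroids <- aggregate(x, by = list(Species = labels), FUN = mean)
centroids

# The labels are not passed to kmeans, but they can be compared with
# the clusters it finds.
set.seed(1234)
km <- kmeans(x, centers = 3)
comparison <- table(labels, cluster = km$cluster)
comparison
```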

G



Re: [R] Kmeans centers

2007-03-30 Thread Sergio Della Franca
Thank you very much Gavin,

set.seed is exactly the function I need.

Now the kmeans result is stable and doesn't change every time I run it.






[R] Kmeans centers

2007-03-29 Thread Sergio Della Franca
Dear R-Helpers,

I read in the R documentation, about kmeans:

  centers

Either the number of clusters or a set of initial (distinct) cluster
centres. *If a number*, a random set of (distinct) rows in x is chosen as
the initial centres.
My question is: is it possible for the centers to be character values
rather than a number?


Thank you in advance.


Sergio Della Franca



Re: [R] Kmeans centers

2007-03-29 Thread Gavin Simpson
On Thu, 2007-03-29 at 15:02 +0200, Sergio Della Franca wrote:
 Dear R-Helpers,
 
 I read in the R documentation, about kmeans:
 
   centers
 
 Either the number of clusters or a set of initial (distinct) cluster
 centres. *If a number*, a random set of (distinct) rows in x is chosen as
 the initial centres.
 My question is: is it possible for the centers to be character values
 rather than a number?

I think you misunderstand - centers is the number of clusters you want
to partition your data into. How else would you specify the number of
clusters other than by a number? So no, it has to be numeric.

The alternative use of centers is to provide known starting points for
the algorithm, such as from the results of a hierarchical cluster
analysis. These are the locations of the cluster centroids: one row for
each cluster, giving its position on each of the feature variables.
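
A small sketch of the two forms of centers, using simulated data and
hand-picked starting points (the values in start are assumptions chosen to
sit near the two simulated groups):

```r
set.seed(1)
dat <- data.frame(
  a = c(rnorm(50), rnorm(50, mean = 4)),
  b = c(rnorm(50), rnorm(50, mean = 4))
)

# Form 1: a single number. kmeans picks that many random rows of dat
# as the initial centres.
km_num <- kmeans(dat, centers = 2)

# Form 2: a numeric matrix of starting centroids, one row per cluster
# and one column per variable in dat.
start <- rbind(c(0, 0), c(4, 4))
km_mat <- kmeans(dat, centers = start)

# The fitted centroids after the algorithm has updated the start values.
km_mat$centers
```

In both forms centers is numeric; the second simply removes the random
choice of starting rows.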

Also, argument x to kmeans() is specific about requiring a numeric
matrix (or something coercible to one), so characters here are not
allowed either.
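
If the data carry a character label column, one possible approach (a sketch
with made-up data and a hypothetical site column) is to cluster only the
numeric columns and attach the labels to the result afterwards:

```r
# Made-up data: one label column plus two numeric variables.
dat <- data.frame(
  site = c("s1", "s2", "s3", "s4", "s5", "s6"),
  a = c(1.0, 1.2, 0.9, 5.0, 5.1, 4.8),
  b = c(2.0, 2.1, 1.9, 8.0, 8.2, 7.9)
)

# Keep only the numeric columns and coerce them to a numeric matrix.
num_cols <- sapply(dat, is.numeric)
x <- as.matrix(dat[, num_cols])

set.seed(1234)
km <- kmeans(x, centers = 2)

# Pair each label with the cluster its row was assigned to.
result <- data.frame(site = dat$site, cluster = km$cluster)
result
```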

But then again, I may not have understood what it is that you are
asking, but that is not surprising given that you have not provided an
example of what you are trying to do, and how you tried to do it but
failed.

 and provide commented, minimal, self-contained, reproducible code.
  ^^^
G
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Gavin Simpson [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London  [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
