Re: [R] Kmeans centers
My problem is simply that each time I run kmeans it gives me different results, because if centers is a number, a random set of (distinct) rows in x is chosen as the initial centres.

To me the problem is simple. My question is whether centers can be something other than a number, i.e. instead of giving a number of centers, is it possible to give character labels identifying the clusters I want to obtain?

Thank you.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Kmeans centers
On Fri, 2007-03-30 at 09:07 +0200, Sergio Della Franca wrote:
> My simple problem is that when I run kmeans it gives me different
> results, because if centers is a number, a random set of (distinct)
> rows in x is chosen as the initial centres.

You can stop this and make it reproducible by setting the seed for the random number generator before calling kmeans(). This way the same (pseudo)random set of rows gets selected each time:

    dat <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
    set.seed(1234)
    km <- kmeans(dat, 2)
    set.seed(1234)
    km2 <- kmeans(dat, 2)
    all.equal(km, km2) ## TRUE

But ask yourself whether this is helpful. When you run the function several times without setting the seed, are the solutions similar or very different? If the runs give very different results then it is likely that you are finding local minima, not an optimal solution - a common problem with iterative algorithms using random starts.

One solution to this /is/ to use several random starts and see if you get similar results. Some samples may switch clusters, but if the bulk of the samples are assigned to the same cluster (i.e. grouped together, not necessarily in "cluster 1", since the cluster numbering is arbitrary) then you can be happy with the result. That some samples switch clusters may just indicate that there isn't a clearly defined clustering of all your samples - some are intermediate between clusters.

Another is to use a hierarchical cluster analysis (via hclust()). Cut it at the number of clusters you want and use the centers (sic) of those clusters as the starting points for kmeans. This way the hclust() results get you close to a good solution, which kmeans then updates, as it is not constrained by having a hierarchical structure.

There is an example of this in Modern Applied Statistics with S (2002 - Venables and Ripley, Springer), but if you don't have this book, you can see the MASS scripts for Chapter 11 of the book. The MASS scripts should have been provided with your copy of R, in RINSTALL/library/MASS/scripts/ where RINSTALL is where your version of R is installed. You want ch11.R in that directory. Look at section 11.2 Cluster Analysis in that file.

> To me the problem is simple. The question I ask is whether it is
> possible for centers to be something other than a number, i.e.
> instead of indicating a number of centers, is it possible to
> indicate different character labels to identify the clusters I want
> to obtain?

No. And this is why, despite how clear and simple the problem is to you, you need to show us an example of your data! Surely, if you have information that exactly identifies the clusters you want to find, why do you need a clustering algorithm to find them for you?

G
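The hclust()-then-kmeans() idea described in this reply could be sketched roughly as below. This is an illustration on simulated data, not the code from the MASS scripts: the data, the object names (`dat`, `groups`, `init_centers`) and the choice of average linkage are all assumptions made for the example.

```r
# Simulated data with two loose groups (illustrative only).
set.seed(42)
dat <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 3), ncol = 2)
)
colnames(dat) <- c("a", "b")

k <- 2

# Hierarchical clustering on Euclidean distances, cut into k groups.
# Average linkage is an arbitrary choice for this sketch.
hc <- hclust(dist(dat), method = "average")
groups <- cutree(hc, k = k)

# Mean of each variable within each hclust group: a k x p numeric matrix.
init_centers <- apply(dat, 2, function(col) tapply(col, groups, mean))

# Use those means as the starting centres, so no random rows are drawn.
km <- kmeans(dat, centers = init_centers)

# Compare the hierarchical partition with the k-means update of it.
table(hclust = groups, kmeans = km$cluster)
```

Because the starting centres are supplied explicitly, repeated calls with the same `init_centers` should give the same partition without any call to set.seed().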
Re: [R] Kmeans centers
Thank you very much, Gavin.

set.seed is the function I need. Now the kmeans result is fixed and doesn't change every time I run it.
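A fixed seed makes one run repeatable, but it does not show whether the partition is stable. The "several random starts" check suggested earlier in the thread might look something like the sketch below. The seeds, the object names and the value `nstart = 25` are arbitrary choices for illustration; `nstart` is the kmeans() argument that runs several random starts and keeps the best.

```r
set.seed(1)
dat <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))

# Two runs, each from a different single random start.
set.seed(101)
km_a <- kmeans(dat, centers = 2)
set.seed(202)
km_b <- kmeans(dat, centers = 2)

# Cross-tabulate memberships. Cluster numbers are arbitrary, so look for
# most samples falling into one cell per row, not for matching numbers.
table(run_a = km_a$cluster, run_b = km_b$cluster)

# nstart = 25 tries 25 random starts and keeps the solution with the
# smallest total within-cluster sum of squares.
set.seed(303)
km_multi <- kmeans(dat, centers = 2, nstart = 25)

# Compare the objective value of a single start with the multi-start run.
c(single = km_a$tot.withinss, multi = km_multi$tot.withinss)
```

If the cross-tabulation is spread evenly across cells, the runs disagree and the data may not have a clearly defined clustering at that number of clusters.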
[R] Kmeans centers
Dear R-Helpers,

I read in the R documentation for kmeans:

  centers  Either the number of clusters or a set of initial (distinct) cluster centres. *If a number*, a random set of (distinct) rows in x is chosen as the initial centres.

My question is: is it possible for centers to be character rather than a number?

Thank you in advance.

Sergio Della Franca
Re: [R] Kmeans centers
On Thu, 2007-03-29 at 15:02 +0200, Sergio Della Franca wrote:
> Dear R-Helpers,
>
> I read in the R documentation, about kmeans:
>
> centers  Either the number of clusters or a set of initial (distinct)
> cluster centres. *If a number*, a random set of (distinct) rows in x
> is chosen as the initial centres.
>
> My question is: could it be possible that the centers are character
> and not number?

I think you misunderstand - centers is the number of clusters you want to partition your data into. How else would you specify the number of clusters other than by a number? So no, it has to be numeric.

The alternative use of centers is to provide known starting points for the algorithm, such as from the results of a hierarchical cluster analysis. These are the locations of the cluster centroids, for each cluster, on each of the feature variables.

Also, argument x to kmeans() is specific about requiring a numeric matrix (or something coercible to one), so characters are not allowed here either.

But then again, I may not have understood what it is that you are asking. That is not surprising, given that you have not provided an example of what you are trying to do, and how you tried to do it but failed.

> and provide commented, minimal, self-contained, reproducible code.
  ^^^

G

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                 [t] +44 (0)20 7679 0522
ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
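To make the two uses of centers concrete, here is a small sketch on simulated data. The matrix `x`, the starting points in `start` and the label names "low", "mid" and "high" are all hypothetical. It also shows one possible way to attach character labels to the clusters after fitting, which may be closer to what the question was after than passing characters to centers.

```r
set.seed(1234)
x <- matrix(rnorm(200), ncol = 2, dimnames = list(NULL, c("a", "b")))

# centers as a single number: that many rows of x are drawn at random
# and used as the initial centres.
km_k <- kmeans(x, centers = 3)

# centers as a numeric matrix: one row per cluster, one column per
# variable. Cluster i in the result starts from row i of this matrix.
start <- rbind(c(-1, -1), c(0, 0), c(1, 1))
km_m <- kmeans(x, centers = start)

# Character labels can be attached afterwards by converting the integer
# cluster codes to a factor. The label names are made up for this sketch.
cluster_label <- factor(km_m$cluster, levels = 1:3,
                        labels = c("low", "mid", "high"))
table(cluster_label)
```

The labels here are only names for the integer codes returned by kmeans(); they play no part in the clustering itself.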