[R] Clustering with clara
Hello everyone,

I am trying to use the CLARA method to find clusters in my spatial surface temperature data and have noticed a problem. My data are in the form lat, lon, temperature. I extract lat, lon and the cluster number for each point in the dataset. When I plotted a map of cluster numbers I found empty areas in the map: fewer points were assigned a cluster number than there were temperature points in the original data.

Why are there fewer points in the clustering results? Is there an option in the CLARA method to retain every single point? Is there another clustering method that preserves all the points?

Thanks in advance,
Paco

--
El ponent la mou, el llevant la plou
Registered Linux user: 363952
Photos: http://picasaweb.google.es/pacomet

R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
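One way to locate the missing points, assuming the gaps come from rows that were dropped before clustering (for example by filtering out missing temperatures) rather than by clara() itself: clara() returns one cluster label per row of its input, so the labels can be written back into the full table and the unlabeled rows counted. The data frame pts and all of its values are made up for illustration.

```r
library(cluster)

set.seed(7)
# Hypothetical stand-in for the lat/lon/temperature table
pts <- data.frame(
  lat = runif(200, 35, 45),
  lon = runif(200, -5, 5),
  temperature = rnorm(200, mean = 18, sd = 3)
)
pts$temperature[sample(nrow(pts), 20)] <- NA  # simulate 20 missing readings

# Cluster only the complete rows, but remember which rows they were
keep <- complete.cases(pts)
fit <- clara(pts[keep, ], k = 3)

# clara() gives one label per input row
stopifnot(length(fit$clustering) == sum(keep))

# Write the labels back into the full table; dropped rows stay NA
pts$cluster <- NA_integer_
pts$cluster[keep] <- fit$clustering
n_unlabeled <- sum(is.na(pts$cluster))
cat("points without a cluster label:", n_unlabeled, "\n")
```

Plotting pts with the NA labels in a separate colour would show whether the empty areas on the map coincide with the dropped rows.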
[R] CLARA and determining the right number of clusters
Hi everyone,

I have a question about clustering. I've managed to run a clustering analysis of a large data set with CLARA, but now I want to find the right number of clusters. The clara.object gives some information, such as the ratio between maximal and minimal dissimilarity, which indicates (maybe if lower than 1?) whether a cluster is well separated from the others. I've also read something about silhouette and about cluster.stats, but I can't work out how to use them to find the right number of clusters.

I've tried a suggestion from the mailing list, but when I compute the distances with

    d1 <- dist(mydata$sst)

it says that the specified vector size is too big.

Is there any method to find the right number of clusters when using clara? Maybe it is something I've already tried, and I'm just missing a small, simple trick.

Thanks in advance
Re: [R] CLARA and determining the right number of clusters
Hi Christian, and thanks.

I've tried your suggestion and it seems promising, but I have a couple of questions.

I am reading a three-column ASCII file (lon, lat, sst):

    mydata <- read.table(INFILE, header=FALSE, sep="", na.strings="99.00", dec=".", strip.white=TRUE, col.names=c("lon","lat","sst"))

Then I extract a subset of the data and try to get the right number of clusters for the third variable only, sst:

    x <- mydata$sst
    asw <- numeric(10)
    for (k in 4:10)
      asw[k] <- clara(x, k)$silinfo$avg.width
    k.best <- which.max(asw)
    cat("silhouette-optimal number of clusters:", k.best, "\n")

    silhouette-optimal number of clusters: 5

I've changed the maximum number of clusters in your example from 20 to 10, as I expect the right number to be between 5 and 8. Is there any problem with this change? Maybe this restriction is too strict if the data are treated as plain numbers, but since they are sea surface temperatures under certain environmental and meteorological conditions, I think there should not be more than 8 or 9 clusters in this particular case. (If 20 is retained I get 11 clusters.)

The second question is how one should understand the plot. Is the right number the one with the greatest average silhouette width?

Thanks again

2008/9/30 Christian Hennig [EMAIL PROTECTED]:

> Hi there,
>
> generally finding the right number of clusters is a difficult problem and depends heavily on the cluster concept needed for the particular application. No outcome of any automatic method should be taken for granted.
>
> Having said that, I guess that something like the example given in ?pam.object (replacing pam by clara) should work with clara, too.
>
> Regards,
> Christian
>
> On Tue, 30 Sep 2008, pacomet wrote:
>
> > [...]
>
> Christian Hennig
> University College London, Department of Statistical Science
> Gower St., London WC1E 6BT, phone +44 207 679 1698
> [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche
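The loop from ?pam.object that Christian refers to, with pam replaced by clara, can be sketched as follows. The data are simulated (three groups of temperatures), so the names sst and ks and all values are assumptions for illustration. Note that clara() computes silinfo from its best sample only, not from the full data set, so the average width is an estimate.

```r
library(cluster)

set.seed(1)
# Simulated one-column data set with three well-separated groups
sst <- data.frame(sst = c(rnorm(200, 10, 0.5),
                          rnorm(200, 15, 0.5),
                          rnorm(200, 20, 0.5)))

ks <- 2:10                       # candidate numbers of clusters
asw <- numeric(max(ks))          # asw[k] = average silhouette width for k
for (k in ks)
  asw[k] <- clara(sst, k)$silinfo$avg.width

k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")

# Average silhouette width against k; the peak marks k.best
plot(ks, asw[ks], type = "b",
     xlab = "k (number of clusters)", ylab = "average silhouette width")
```

Restricting k to a smaller range only limits the candidates that are compared; which.max() then picks the k with the largest average silhouette width within that range, which is the criterion the plot is meant to visualise.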
Re: [R] Exporting data to a text file
Hi John,

I don't get an error message, but I do get a warning:

    write.table(myclara$clustering, "cluster.dat", append=TRUE)
    Warning message:
    In write.table(myclara$clustering, "cluster.dat", append = TRUE) :
      appending column names to file

Here is the output of str(myclara); it looks strange to me. I think clustering holds integers and data holds real numbers.

    str(myclara)
    List of 10
     $ sample    : chr [1:56] "32356" "33277" "43230" "52386" ...
     $ medoids   : num [1:8, 1:14] 7.888 12.019 5.427 0.725 17.688 ...
      ..- attr(*, "dimnames")=List of 2
      .. ..$ : chr [1:8] "109056" "98194" "56959" "109806" ...
      .. ..$ : chr [1:14] "lon" "lat" "sst01" "sst02" ...
     $ i.med     : int [1:8] 20482 16158 5137 20722 48599 56033 68028 64308
     $ clustering: Named int [1:75459] 1 1 1 1 1 1 1 1 1 1 ...
      ..- attr(*, "names")= chr [1:75459] "12296" "12297" "12298" "12299" ...
     $ objective : num 3.22
     $ clusinfo  : num [1:8, 1:4] 15055 9474 5164 13702 11340 ...
      ..- attr(*, "dimnames")=List of 2
      .. ..$ : NULL
      .. ..$ : chr [1:4] "size" "max_diss" "av_diss" "isolation"
     $ diss      :Classes 'dissimilarity', 'dist'  atomic [1:1540] 1.11 6.54 4.62 3.30 4.32 ...
      .. ..- attr(*, "Size")= int 56
      .. ..- attr(*, "Metric")= chr "euclidean"
      .. ..- attr(*, "Labels")= chr [1:56] "32356" "33277" "43230" "52386" ...
     $ call      : language clara(x = mydata, k = 8)
     $ silinfo   :List of 3
      ..$ widths         : num [1:56, 1:3] 1 1 1 1 1 1 1 1 2 2 ...
      .. ..- attr(*, "dimnames")=List of 2
      .. .. ..$ : chr [1:56] "96250" "109056" "130058" "116317" ...
      .. .. ..$ : chr [1:3] "cluster" "neighbor" "sil_width"
      ..$ clus.avg.widths: num [1:8] 0.343 0.355 0.533 0.265 0.308 ...
      ..$ avg.width      : num 0.362
     $ data      : num [1:75459, 1:14] 8.68 8.72 8.77 8.81 8.86 ...
      ..- attr(*, "dimnames")=List of 2
      .. ..$ : chr [1:75459] "12296" "12297" "12298" "12299" ...
      .. ..$ : chr [1:14] "lon" "lat" "sst01" "sst02" ...
     - attr(*, "class")= chr [1:2] "clara" "partition"

I can output the two variables to two different files without any problem.

Thanks

2008/8/1 John Kane [EMAIL PROTECTED]:

> Try str(myclara) to see what you have: a data frame, a matrix, etc.
>
> Are you getting any error messages? I tried your write.table commands and they work okay.
>
> --- On Fri, 8/1/08, pacomet [EMAIL PROTECTED] wrote:
>
> > From: pacomet [EMAIL PROTECTED]
> > Subject: [R] Exporting data to a text file
> > To: r-help@r-project.org
> > Received: Friday, August 1, 2008, 12:49 PM
> >
> > [...]
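With append = TRUE, write.table() adds the second object as extra rows at the end of the file, not as an extra column, which is consistent with the warning about appended column names. A sketch of one way to get both into a single file, assuming the goal is one row per point with its cluster label: bind them into one data frame first and write that. The data are simulated and the output goes to a temporary file; out, outfile and back are illustrative names.

```r
library(cluster)

set.seed(42)
# Simulated stand-in for the lon/lat/sst table
mydata <- data.frame(lon = runif(300, -5, 5),
                     lat = runif(300, 35, 45),
                     sst = rnorm(300, mean = 18, sd = 3))
myclara <- clara(mydata, k = 3)

# One row per observation: the original columns plus the cluster label
out <- data.frame(myclara$data, cluster = myclara$clustering)

# Single call, single file
outfile <- tempfile(fileext = ".dat")
write.table(out, outfile, row.names = FALSE, quote = FALSE)

# Read it back to confirm the layout
back <- read.table(outfile, header = TRUE)
head(back)
```

Because myclara$data and myclara$clustering are in the same row order, binding them column-wise keeps each label next to its point.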
[R] Exporting data to a text file
Hi R users,

With the clara function I get a data frame (maybe this is not the exact word, I'm new to R) with the following variables:

    names(myclara)
     [1] "sample"     "medoids"    "i.med"      "clustering" "objective"
     [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"

I want to export clustering and data to a new text file, so I try

    write.table(myclara$data, "cluster.dat")
    write.table(myclara$clustering, "cluster.dat", append=TRUE)

The variable data is exported properly, but clustering is not appended to the output file.

Where is the mistake? Is it possible to export the two variables in a single statement?

Thanks in advance,
Paco
Re: [R] About clustering techniques
Hi Christian,

I've been reading about daisy and think I need to do something like this:

    mydaisydata <- daisy(mydata, metric=c("euclidean"), stand=FALSE)
    Error en vector("double", length) :
      tamaño del vector especificado es muy grande

(which means that the specified vector size is too big).

mydata is an annual file with 14 columns and 124716 rows. Is it possible that daisy can't handle this much data? Maybe I'm missing something when using daisy.

Another question: if I get daisy running, can I use kmeans like this?

    mykmeansdata <- kmeans(mydaisydata, 5)

Or pamk, which I've read gives the optimal number of clusters.

Thanks again
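A rough size estimate suggests why daisy() fails here: it stores one dissimilarity per pair of rows, i.e. n(n-1)/2 double-precision values. The sketch below does that arithmetic for n = 124716 and then runs clara() and kmeans() directly on a small simulated data matrix, since both take the data themselves rather than a dissimilarity object. The toy matrix and its column names are assumptions for illustration.

```r
library(cluster)

# Size of a full pairwise dissimilarity object for the annual file
n <- 124716
n_pairs <- n * (n - 1) / 2       # number of row pairs
gb <- n_pairs * 8 / 1e9          # 8 bytes per double
cat(sprintf("%.0f dissimilarities, about %.1f GB\n", n_pairs, gb))

# clara() avoids this by working on small subsamples of the rows,
# so it is called on the data matrix, not on a daisy() result
set.seed(5)
toy <- cbind(lon = runif(500, -5, 5),
             lat = runif(500, 35, 45),
             sst = rnorm(500, mean = 18, sd = 3))
cl <- clara(toy, k = 5)

# kmeans() also expects the raw numeric data as its first argument
km <- kmeans(toy, centers = 5)

table(cl$clustering)
table(km$cluster)
```

For the file described above this comes to roughly 7.8e9 dissimilarities, or about 62 GB, which is far more than a single vector could hold in R at the time.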
[R] About clustering techniques
Hello R users,

I have been working with a data set for some time, trying to do some cluster analysis. The data set consists of 14 columns, geographical coordinates plus monthly temperatures, in annual files:

    latitude - longitude - temperature 1 - ... - temperature 12

I have missing values in some cases; at some points there may be 8 valid monthly values and 4 invalid ones. I don't want to suppress a whole row with 8 good and 4 bad values, as I want to try both annual and monthly analyses.

I first tried kmeans but ran into a problem with missing values. Without omitting the missing values kmeans gives an error, and when the invalid data are excluded too many values are lost in some years of the data series.

Now I have been reading about pam, pamk and clara, which I think can handle missing values, but I can't work out how to perform the analysis with these functions. As I'm neither a statistics expert nor an R expert, the fpc and cluster package documentation is not enough for me.

If you know of a website or tutorial that explains how to use these functions, ideally with examples to try, please post it. Any other help or suggestion is greatly appreciated.

Thanks in advance,
Paco
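A minimal sketch of running clara() on a table with partially missing rows, assuming the layout described above (lat, lon and 12 monthly temperatures) and that every row has at least some valid values. The data are simulated, and the correct.d argument assumes a reasonably recent version of the cluster package; older versions do not have it.

```r
library(cluster)

set.seed(3)
n <- 300
# Hypothetical annual table: lat, lon and 12 monthly temperatures
x <- cbind(
  lat = runif(n, 35, 45),
  lon = runif(n, -5, 5),
  matrix(rnorm(n * 12, mean = 18, sd = 3), ncol = 12,
         dimnames = list(NULL, sprintf("sst%02d", 1:12)))
)

# Knock out 10% of the monthly values; lat and lon stay complete
monthly <- x[, 3:14]
monthly[sample(length(monthly), 360)] <- NA
x[, 3:14] <- monthly

# kmeans() stops with an error when the data contain NAs
km_try <- try(kmeans(x, centers = 4), silent = TRUE)
inherits(km_try, "try-error")

# clara() accepts NAs: distances use the columns both rows have in common
fit <- clara(x, k = 4, correct.d = TRUE)
table(fit$clustering)
```

Every row receives a cluster label here, including rows with several missing months, so no observations need to be discarded beforehand.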