Re: [R] About clustering techniques
Hi Christian,

I've been reading about daisy and think I need to do something like this:

    mydaisydata <- daisy(mydata, metric = "euclidean", stand = FALSE)

but I get:

    Error en vector("double", length) : tamaño del vector especificado es muy grande

(which means: specified vector size is too big).

mydata is an annual file with 14 columns by 124716 rows. Is it possible that daisy can't handle this much data? Maybe I'm missing something when using daisy.

Another question: if I get daisy running, can I use kmeans like this?

    mykmeansdata <- kmeans(mydaisydata, 5)

Or pamk, which I've read gives the optimal number of clusters?

Thanks again

-- 
El ponent la mou, el llevant la plou
Usuari Linux registrat: 363952
Fotos: http://picasaweb.google.es/pacomet

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
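A rough back-of-the-envelope check of the error above, plus one possible way around it. daisy returns all n(n-1)/2 pairwise dissimilarities, so 124716 rows need about 7.8e9 doubles, roughly 62 GB. clara, from the same cluster package, works on subsamples of the raw data, and its help page states that NAs are allowed. Note also that kmeans expects the data matrix itself, not a dissimilarity object. The data below are simulated stand-ins (toy and its value ranges are hypothetical), and k = 5 is simply taken from the question, not a recommendation.

```r
library(cluster)

# Size of the full dissimilarity object daisy() would have to allocate
n <- 124716
n_pairs <- n * (n - 1) / 2          # number of pairwise dissimilarities
gb_needed <- n_pairs * 8 / 1e9      # 8 bytes per double
cat(sprintf("%.0f pairs, about %.1f GB\n", n_pairs, gb_needed))

# Simulated stand-in for one annual file: lat, lon, 12 monthly temperatures
set.seed(1)
n_toy <- 2000
toy <- cbind(lat = runif(n_toy, 36, 44),
             lon = runif(n_toy, -9, 4),
             matrix(rnorm(n_toy * 12, mean = 15, sd = 5), ncol = 12,
                    dimnames = list(NULL, paste0("t", 1:12))))

# Knock out about 10% of the monthly values, leaving coordinates intact
temps <- toy[, 3:14]
temps[sample(length(temps), 0.1 * length(temps))] <- NA
toy[, 3:14] <- temps

# clara() clusters subsamples, so no n x n dissimilarity matrix is built
fit <- clara(toy, k = 5, metric = "euclidean", stand = FALSE,
             samples = 20, sampsize = 200)
table(fit$clustering)
```

Larger values of samples and sampsize cost more time but make the result depend less on which subsample happened to be drawn.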
[R] About clustering techniques
Hello R users,

For some time I have been working with a data set to do some cluster analysis. The data set consists of 14 columns, geographical coordinates and monthly temperatures, in annual files:

    latitude - longitude - temperature 1 - ... - temperature 12

Some cases have missing values; at some points there may be 8 valid monthly values and four invalid ones. I don't want to drop the whole row with 8 good and 4 bad values, as I want to try both annual and monthly analysis.

I first tried kmeans but ran into a problem with missing values. Without omitting the missing values, kmeans gives an error, and when invalid data are excluded, too many values are lost in some years of the data series.

Now I have been reading about pam, pamk and clara, which I think can handle missing values, but I can't work out how to perform the analysis with these functions. As I'm neither a statistics nor an R expert, the fpc and cluster package documentation is not enough for me. If you know of a website or tutorial explaining how to use those functions, with examples to check if possible, please post it.

Any other help or suggestion is greatly appreciated.

Thanks in advance
Paco
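The two problems described above can be reproduced on a small scale. This is only a sketch on simulated data (the matrix x and the amount of missingness are made up), showing that kmeans stops on NA input and how quickly na.omit discards rows when only a few monthly values are missing.

```r
set.seed(3)
# 100 hypothetical locations with 12 monthly temperatures each
x <- matrix(rnorm(1200, mean = 15, sd = 5), ncol = 12)
x[sample(length(x), 60)] <- NA     # 5% of the values go missing

# kmeans refuses input that contains NA
res <- try(kmeans(x, centers = 3), silent = TRUE)
inherits(res, "try-error")

# Dropping incomplete rows loses every row with at least one NA
complete <- na.omit(x)
c(all_rows = nrow(x), complete_rows = nrow(complete))
```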
Re: [R] About clustering techniques
Hi Paco,

I had the same problem before, so I just imputed the missing values. For example:

    newdata <- as.matrix(impute(olddata, fun = "random"))

After that, I believe you can analyze your data. Hope it helps.

Chunhao

Quoting pacomet [EMAIL PROTECTED]:
> I first tried kmeans but found a problem with missing values.
> [...]
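The post does not say which package impute comes from (Hmisc has a function of that name that accepts fun = "random", but that is an assumption here). A minimal base-R sketch of the same idea, filling each column's NAs with random draws from that column's observed values, could look like this. impute_random is a hypothetical helper, and olddata is simulated.

```r
# Hypothetical helper: fill NAs in each column with random draws
# from that column's observed (non-missing) values
impute_random <- function(x) {
  x <- as.matrix(x)
  for (j in seq_len(ncol(x))) {
    miss <- is.na(x[, j])
    obs <- x[!miss, j]
    if (any(miss) && length(obs) > 0) {
      x[miss, j] <- obs[sample.int(length(obs), sum(miss), replace = TRUE)]
    }
  }
  x
}

set.seed(7)
olddata <- matrix(rnorm(60, mean = 15, sd = 5), ncol = 6)
olddata[sample(length(olddata), 10)] <- NA
newdata <- impute_random(olddata)

anyNA(newdata)                       # check that no missing values remain
fit <- kmeans(newdata, centers = 3)  # kmeans now runs on the complete matrix
fit$size
```

The filled-in values are made up, so clusters found this way should be read with some caution.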
Re: [R] About clustering techniques
Dear Paco,

in order to use the methods in the cluster package (including pam), look up the help page of daisy, which is able to compute dissimilarity matrices handling missing values appropriately (in most situations). A good reference is the Kaufman and Rousseeuw book cited on that help page.

Christian

On Tue, 29 Jul 2008, pacomet wrote:
> Now I have been reading about pam, pamk and clara, I think they can handle missing values. But can't find out the way to perform the analysis with these functions.
> [...]

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
[EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche
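The daisy-then-pam route suggested above can be sketched as follows. This assumes a data set small enough for the full dissimilarity matrix to fit in memory. The data frame small is simulated, and k = 5 is an arbitrary choice for illustration.

```r
library(cluster)

set.seed(42)
# Hypothetical small sample: 300 points, lat/lon plus 12 monthly temperatures
n <- 300
temps <- matrix(rnorm(n * 12, mean = 15, sd = 5), ncol = 12)
temps[sample(length(temps), 300)] <- NA     # scatter some missing months
small <- data.frame(lat = runif(n, 36, 44), lon = runif(n, -9, 4), temps)

# daisy() leaves out the missing columns pair by pair instead of dropping rows
d <- daisy(small, metric = "euclidean", stand = FALSE)

# pam() accepts the dissimilarity object directly
fit <- pam(d, k = 5)
table(fit$clustering)
fit$silinfo$avg.width    # average silhouette width, one way to compare values of k
```

Repeating the pam call for several values of k and comparing the average silhouette width is one common way to pick the number of clusters; as far as its help page indicates, pamk in fpc automates a search of this kind.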
Re: [R] About clustering techniques
A quick comment on this: imputation is an option to make things technically work, but it is not necessarily good. Imputation always introduces some noise, i.e., it fakes information that is not really there. Whether it is good depends strongly on the data, the situation and the imputation method (random often not being a very sensible choice).

Christian

On Tue, 29 Jul 2008, [EMAIL PROTECTED] wrote:
> Thus, I just impute the missing values. For example:
> newdata <- as.matrix(impute(olddata, fun = "random"))
> [...]