[R] CLARA and determining the right number of clusters

2008-09-30 Thread pacomet
Hi everyone

I have a question about clustering. I've managed to run a clustering
analysis of a large data set using CLARA. Now I want to find the right
number of clusters.

The clara.object gives some information, like the ratio between maximal and
minimal dissimilarity, that says (maybe if lower than 1??) whether a cluster
is well separated from the others. I've also read something about silhouette
and about cluster.stats, but I can't work out how to use them to find the
right number of clusters.

I've tried a suggestion from the mailing list, but when using dist

d1 <- dist(mydata$sst)

it says that the specified vector size is too big.
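
For context, dist() stores every pairwise dissimilarity, i.e. n*(n-1)/2
double-precision values, which is why it fails on large data sets while
clara (which works on sub-samples) does not. A rough sketch of the
arithmetic follows; the row count n is a hypothetical value, since the
thread does not state the size of the data, and dist_bytes is a helper name
made up for this illustration.

```r
# Approximate memory needed by dist() for n observations:
# n * (n - 1) / 2 pairwise distances, 8 bytes per double.
dist_bytes <- function(n) n * (n - 1) / 2 * 8

n <- 100000  # hypothetical number of rows (not given in the thread)
dist_gib <- dist_bytes(n) / 1024^3
cat(sprintf("dist() on %d points needs roughly %.1f GiB\n", n, dist_gib))
```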

Is there any method to find the right number of clusters when using clara?
Maybe it is something I've already tried, but with a small and simple trick
that I'm missing.

Thanks in advance

-- 
_
El ponent la mou, el llevant la plou
Usuari Linux registrat: 363952
---
Fotos: http://picasaweb.google.es/pacomet


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] CLARA and determining the right number of clusters

2008-09-30 Thread Christian Hennig

Hi there,

Generally, finding the right number of clusters is a difficult problem and
depends heavily on the cluster concept needed for the particular
application.

No outcome of any automatic method should be taken for granted.

Having said that, I guess that something like the example given in

?pam.object

(replacing pam by clara) should work with clara, too.
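
A minimal sketch of what that substitution might look like is given here.
The simulated one-dimensional data and the range 2:10 are assumptions for
illustration only; they are not the poster's data. Note that for clara the
silinfo component refers to the best sub-sample, not to the full data set.

```r
library(cluster)

set.seed(1)  # makes the simulated data reproducible
# Simulated one-dimensional data with three groups (illustrative assumption)
x <- matrix(c(rnorm(50), rnorm(50, mean = 5), rnorm(30, mean = 15)), ncol = 1)

k.max <- 10
asw <- numeric(k.max)  # asw[1] stays 0 because silhouettes need k >= 2
for (k in 2:k.max)
  asw[k] <- clara(x, k)$silinfo$avg.width  # average silhouette width, best sample
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")
```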

Regards,
Christian


*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
[EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche



Re: [R] CLARA and determining the right number of clusters

2008-09-30 Thread pacomet
Hi Christian and thanks

I've tried your suggestion and it seems promising. But I have a couple of
questions. I am reading a three-column ASCII file (lon, lat, sst)

mydata <- read.table(INFILE, header=FALSE, sep="", na.strings="99.00",
                     dec=".", strip.white=TRUE,
                     col.names=c("lon","lat","sst"))

then I extract a subset of the data and try to get the right number of
clusters just for third var, sst

x <- mydata$sst
asw <- numeric(10)
for (k in 4:10)
  asw[k] <- clara(x, k)$silinfo$avg.width
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")

which prints

silhouette-optimal number of clusters: 5


I've lowered the maximum number of clusters in your example from 20 to 10,
as I expect the right number to be between 5 and 8. Is there any problem
with this change? The restriction might be too strict if I treated the data
as mere numbers, but since they are sea surface temperatures under certain
environmental and meteorological conditions, I think there should not be
more than 8-9 clusters in this particular case. (If 20 is kept as the
maximum, I get 11 clusters.)

The second question is how one should read the plot. Is the right number
the one with the greatest average silhouette width?
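
Assuming the plot in question is the one from the help-page example (average
silhouette width against k), a sketch of how it can be drawn for clara is
given here. The data are simulated for illustration and are not the sst
values. Average silhouette width lies between -1 and 1, and larger values
indicate clusters that are, on average, better separated, so the tallest bar
marks the k that is optimal by this one criterion only.

```r
library(cluster)

set.seed(1)  # makes the simulated data reproducible
# Simulated one-dimensional data (illustrative assumption)
x <- matrix(c(rnorm(50), rnorm(50, mean = 5), rnorm(30, mean = 15)), ncol = 1)

ks <- 2:10
asw <- sapply(ks, function(k) clara(x, k)$silinfo$avg.width)
k.best <- ks[which.max(asw)]

# One vertical bar per candidate k; the tallest bar is highlighted
plot(ks, asw, type = "h", xlab = "k (number of clusters)",
     ylab = "average silhouette width")
points(k.best, max(asw), col = "red", pch = 16)
```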

Thanks again




Re: [R] CLARA and determining the right number of clusters

2008-09-30 Thread Christian Hennig

Hi there,


> I've tried your suggestion and it seems promising. But I have a couple of
> questions. I am reading a three-column ASCII file (lon, lat, sst)
>
> mydata <- read.table(INFILE, header=FALSE, sep="", na.strings="99.00",
>                      dec=".", strip.white=TRUE,
>                      col.names=c("lon","lat","sst"))
>
> then I extract a subset of the data and try to get the right number of
> clusters just for third var, sst


I'm not sure whether you feel that you have to look at a single variable at 
a time, but the whole thing should work for more than one as well.
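
As a rough sketch of the multi-variable case: clara accepts a matrix or data
frame with several columns, and its stand argument standardizes the
variables before dissimilarities are computed, which matters when the
columns are on different scales. The data frame below is a simulated
stand-in for the lon/lat/sst table, and k = 3 is an arbitrary choice for
illustration.

```r
library(cluster)

set.seed(1)  # makes the simulated data reproducible
# Hypothetical stand-in for the three-column lon/lat/sst table
mydata <- data.frame(lon = runif(500, -10, 5),
                     lat = runif(500, 35, 45),
                     sst = c(rnorm(250, mean = 15), rnorm(250, mean = 22)))

# Cluster on all three variables at once, standardizing each column first
fit <- clara(mydata, 3, stand = TRUE)
print(fit$silinfo$avg.width)   # average silhouette width of the best sample
print(table(fit$clustering))   # cluster sizes over the full data set
```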



> x <- mydata$sst
> asw <- numeric(10)
> for (k in 4:10)
>   asw[k] <- clara(x, k)$silinfo$avg.width
> k.best <- which.max(asw)
> cat("silhouette-optimal number of clusters:", k.best, "\n")
>
> silhouette-optimal number of clusters: 5
>
> I've lowered the maximum number of clusters in your example from 20 to 10,
> as I expect the right number to be between 5 and 8. Is there any problem
> with this change? The restriction might be too strict if I treated the data
> as mere numbers, but since they are sea surface temperatures under certain
> environmental and meteorological conditions, I think there should not be
> more than 8-9 clusters in this particular case. (If 20 is kept as the
> maximum, I get 11 clusters.)


This kind of problem concerns your data and your application and cannot be
solved on a mailing list like this one. Perhaps you should seek professional
advice about your particular data. Quite obviously, if you restrict the
number of clusters to at most 10, you cannot find eleven. How strongly you
believe that there should not be more than 8-9 clusters, and how good your
arguments against 11 are, nobody on this list can decide.
The general problem is that there is no unique statistical definition of
what a true cluster is; whether your dataset contains 5 or 11 clusters (or
any other number) depends on what you want to call a cluster.



> The second question is how one should read the plot. Is the right number
> the one with the greatest average silhouette width?


I don't know which plot you refer to but you may have a look at the Kaufman 
and Rousseeuw book quoted on the help page.


Best wishes,
Christian


