[R] clustering on scaled dataset or not?

2010-10-28 Thread array chip
Hi, just a general question: when we do hierarchical clustering, should we 
compute the dissimilarity matrix based on scaled dataset or non-scaled dataset? 
daisy() in cluster package allow standardizing the variables before calculating 
dissimilarity matrix; but dist() doesn't have that option at all. Appreciate if 
you can share your thoughts?

Thanks

John



  
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] clustering on scaled dataset or not?

2010-10-28 Thread Claudia Beleites

John,



Hi, just a general question: when we do hierarchical clustering, should we
compute the dissimilarity matrix based on scaled dataset or non-scaled dataset?




daisy() in cluster package allow standardizing the variables before calculating
dissimilarity matrix;


I'd say that should depend on your data.

- if your data is all (physically) different kinds of things (and thus 
different orders of magnitude), then you should probably scale.


- On the other hand, I cluster spectra. Thus my variates are all the 
same unit, and moreover I'd be afraid that scaling would blow up 
noise-only variates (i.e. the spectra do have low or no intensity 
regions), thus I usually don't scale.


- It also depends on your distance. E.g. Mahalanobis should do the 
scaling by itself, if think correctly at this time of the day...


What I do frequently, though, is subtracting something like the minimum 
spectrum (in practice, I calculate the 5th percentile for each variate - 
it's less noisy). You can also center, but I'm strongly for having a 
physical meaning, and for my samples that's the minimum spectrum is 
better interpretable (it represents the matrix composition).



but dist() doesn't have that option at all. Appreciate if
you can share your thoughts?

but you could call scale () and then dist ().

Claudia




Thanks

John




[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.