subject:"\[R\] Clustering"

Re: [R] Clustering Functions used by Reverse-Dependencies

2024-02-29 Thread Leo Mada via R-help

Dear Ivan,

Thank you very much for this interesting information.

Regarding:
"For well-behaved packages that declare their dependencies correctly,
parsing the NAMESPACE for importFrom() and import() calls should give
you the explicit imports."

I did learn something new (I am not very experienced in package writing). 
Unfortunately, roxygen2, as of the current version, still suggests using the 
pkg::fname approach:
"If you are using just a few functions from another package, we recommend 
adding the package to the Imports: field of the DESCRIPTION file and calling 
the functions explicitly using ::, e.g., pkg::fun()."
https://roxygen2.r-lib.org/articles/namespace.html

Regarding analysing the actual code: it is good to know that R CMD check also 
has some functionality for this. I will look into it when I find some free time.

tools:::.check_packages_used is a few pages of code. On the other hand, the 
help page for codetools::checkUsage is quite cryptic. But it's good to know at 
least where to look.

Sincerely,

Leonard



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Clustering Functions used by Reverse-Dependencies

2024-02-28 Thread Ivan Krylov via R-help

On Sat, 24 Feb 2024 03:08:26 +
Leo Mada via R-help wrote:

> Are there any tools to extract the function names called by
> reverse-dependencies?

For well-behaved packages that declare their dependencies correctly,
parsing the NAMESPACE for importFrom() and import() calls should give
you the explicit imports. (What if the package imports the whole
dependency? The safe assumption is that all functions are used, but it
comes with false positives. You could also walk the package code
looking for function names that may belong to the imported package, but
that may involve both false positives and false negatives.)
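
As a rough illustration of the NAMESPACE approach, the sketch below parses a
made-up NAMESPACE (the package and function names are hypothetical) with base
R's parse() and tabulates the import() and importFrom() directives. It is a
simplified sketch: it ignores conditional directives and the except= argument
of import(), and it is not how R CMD check does it.

```r
# Hypothetical NAMESPACE contents of a reverse dependency (not a real package).
ns_lines <- c(
  "export(my_fun)",
  "import(stats)",
  "importFrom(pracma, trapz, cumtrapz)",
  "importFrom(utils, head)"
)

# NAMESPACE directives are syntactically R calls, so parse() can read them.
directives <- as.list(parse(text = ns_lines, keep.source = FALSE))

rows <- lapply(directives, function(e) {
  args <- vapply(as.list(e)[-1], as.character, character(1))
  switch(as.character(e[[1]]),
    # whole-package import: no individual function names are known
    import     = data.frame(package = args, name = NA_character_,
                            stringsAsFactors = FALSE),
    # explicit import: first argument is the package, the rest are names
    importFrom = data.frame(package = args[1], name = args[-1],
                            stringsAsFactors = FALSE),
    NULL  # export() and other directives are skipped
  )
})
imports <- do.call(rbind, Filter(Negate(is.null), rows))
print(imports)
```

A row whose name is NA marks a whole-package import, where every exported
function has to be assumed used. For installed packages, base R's
parseNamespaceFile() may be a more robust starting point.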

For the rest of the imports and uses of weak dependencies, you'll have
to walk the package code looking for the uses of the `::` operator. See
how R CMD check walks the package code in functions
tools:::.check_packages_used and codetools::checkUsage.
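
The search for `::` calls could be sketched as follows. This does not
reproduce tools:::.check_packages_used or codetools::checkUsage; it swaps in
the parse data from utils::getParseData(), and the code being scanned is a
made-up example (the function f and its body are hypothetical).

```r
# Hypothetical source code of a reverse dependency; it is only parsed, never run,
# so the packages it mentions do not need to be installed.
code <- c(
  "f <- function(x) {",
  "  y <- pracma::trapz(x, x^2)",
  "  if (requireNamespace('survival', quietly = TRUE)) survival::Surv(x, x > 0)",
  "  stats::median(y)",
  "}"
)

# Parse with source references kept so that token-level parse data is available.
pd  <- utils::getParseData(parse(text = code, keep.source = TRUE))
tok <- pd[pd$terminal, c("token", "text")]

# In "pkg::name" the tokens are SYMBOL_PACKAGE, then "::" (or ":::"), then the name.
i <- which(tok$token == "SYMBOL_PACKAGE")
calls <- data.frame(package = tok$text[i], name = tok$text[i + 2L],
                    stringsAsFactors = FALSE)
print(calls)
```

Calls assembled at run time, for example through get() or do.call() with a
string, are invisible to this kind of static scan, which is the limitation
Ivan describes for less-well-behaved packages.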

A less-well-behaved package can always load a namespace during runtime
and choose the functions to call depending on the phase of the moon or
weather on Jupiter. For these, like for the halting problem, there's no
general solution: the package could be written to say, "if Leonard's
function says I'm about to call foo::bar, I won't do it, otherwise I
will".

-- 
Best regards,
Ivan


[R] Clustering Functions used by Reverse-Dependencies

2024-02-23 Thread Leo Mada via R-help

Dear R Users,

Are there any tools to extract the function names called by 
reverse-dependencies?

I would like to group these functions using clustering methods based on the 
co-occurrence in the reverse-dependencies.

Utility: It may be possible to split complex packages into modules with fewer 
reverse-dependencies.
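
A minimal sketch of the clustering idea, assuming the function names used by
each reverse dependency have already been extracted. The usage list below is
invented for illustration and does not describe real packages; the choice of
the binary (Jaccard-type) distance and average linkage is also just one
possibility.

```r
# Hypothetical input: for each reverse dependency, the functions it calls.
usage <- list(
  pkgA = c("trapz", "cumtrapz", "integral"),
  pkgB = c("trapz", "integral"),
  pkgC = c("fzero", "newtonRaphson"),
  pkgD = c("fzero", "newtonRaphson", "bisect")
)

# Incidence matrix: one row per function, one column per reverse dependency.
funs <- sort(unique(unlist(usage)))
m <- vapply(usage, function(u) as.numeric(funs %in% u), numeric(length(funs)))
rownames(m) <- funs

# Functions that are used by the same packages end up close together.
d  <- dist(m, method = "binary")
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

With this toy input the integration functions and the root-finding functions
fall into separate groups, which is the kind of split that might suggest
candidate modules.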

Package pkgdepR may offer some of the functionality; but I did not have time to 
fully explore it:
https://cran.r-project.org/web/packages/pkgdepR/index.html

If such tools are not yet available, I have opened an issue on GitHub and would 
like to collect any ideas:
https://github.com/discoleo/PackageBrowser/issues/1

There are some model packages that could be used to test the clustering:
1) Rcpp, xml: the core-functionality will always be needed; splitting into 
modules may not be possible/useful;

2) pracma: most reverse-dependencies are likely using only a small subset of 
the functions in pracma;
(there was some discussion on R-package-devel about reverse dependencies a few 
weeks ago)

3) survival: I have no idea in which category it falls - but it has a lot of 
reverse-dependencies;

Note:
I missed some important functionality from the pkgsearch package. I have 
therefore started work on a new package (PackageBrowser), although it is not 
yet published on CRAN:
https://github.com/discoleo/PackageBrowser

It does not yet process the reverse-dependencies recursively, although this 
does not seem to be a big challenge. The real challenge is to parse the code 
and extract function names. I did some work in the past, but never had time 
for a full implementation.

Many thanks,

Leonard



Re: [R] Clustering of datasets

2022-09-05 Thread Rui Barradas


Hello,

I am not at all sure that the following answers the question.
The code below tries to find the optimal number of clusters. One of the 
changes I have made to your call to kmeans is to subset DMs without dropping 
the dim attribute.



library(cluster)

max_clust <- 10
wss <- numeric(max_clust)

for(k in 1:max_clust) {
  km <- kmeans(DMs[,2], centers = k, nstart = 25)
  wss[k] <- km$tot.withinss
}
plot(wss, type = "b")

dm <- DMs[, 2, drop = FALSE]
# Where is the elbow, at 2 or at 4?
factoextra::fviz_nbclust(dm, kmeans, method = "wss")
factoextra::fviz_nbclust(dm, kmeans, method = "silhouette")

k2 <- kmeans(dm, centers = 2, nstart = 25)
k3 <- kmeans(dm, centers = 3, nstart = 25)
k4 <- kmeans(dm, centers = 4, nstart = 25)

main2 <- paste(length(k2$centers), "clusters")
main3 <- paste(length(k3$centers), "clusters")
main4 <- paste(length(k4$centers), "clusters")

old_par <- par(mfcol = c(1, 3))
plot(DMs[,2], col = k2$cluster, pch = 19, main = main2)
plot(DMs[,2], col = k3$cluster, pch = 19, main = main3)
plot(DMs[,2], col = k4$cluster, pch = 19, main = main4)
par(old_par)


Hope this helps,

Rui Barradas


At 12:31 on 05/09/2022, Subhamitra Patra wrote:

Dear all,

I am trying to cluster my dataset using k-means clustering in R, but I am
getting scattered results. I have pasted my code below. Please tell me what
is wrong with it. I entered my data before applying the k-means method as
follows.

DMs<-read.table(text="Country DATA
   IS -0.0092
   BA -0.0235
   HK -0.0239
   JA -0.0333
   KU -0.0022
   OM -0.0963
   QA -0.0706
   SK -0.0322
   SA -0.1233
   SI -0.0141
   TA -0.0142
   UAE -0.0656
   AUS -0.0230
  BEL -0.0006
  CYP -0.0085
  CR  -0.0398
 DEN  -0.0423
   EST -0.0604
   FIN -0.0227
   FRA -0.0085
  GER -0.0272
  GrE -0.3519
  ICE -0.0210
  IRE -0.0057
  LAT -0.0595
 LITH -0.0451
 LUXE -0.0023
 MAL  -0.0351
 NETH -0.0048
   NOR -0.0495
   POL -0.0081
 PORT -0.0044
 SLOVA -0.1210
 SLOVE -0.0031
   SPA -0.0213
   SWE -0.0106
 SWIT -0.0152
   UK -0.0030
 HUNG -0.0086
   CAN -0.0144
 CHIL -0.0078
   USA -0.0042
 BERM -0.0035
 AUST -0.0211
 NEWZ -0.0538" ,
  header = TRUE,stringsAsFactors=FALSE)
library(cluster)
k1<-kmeans(DMs[,2],centers=2,nstart=25)
plot(DMs[,2],col=k1$cluster,pch=19,xlim=c(1,46), ylim=c(-0.12,0))
text(1:46,DMs[,2],DMs[,1],col=k1$cluster)
legend(10,1,c("cluster 1: Highly Integrated","cluster 2: Less Integrated"),
col=1:2,pch=19)





Re: [R] Clustering of datasets

2022-09-05 Thread Jim Lemon

Hi Subhamitra,
I think the fact that you are passing a vector of values rather than a
matrix is part of the problem. As you have only one value for each
country, the points plotted will be the index on the x-axis and the
value for each country on the y-axis. Passing a value for ylim= means
that you are cutting off the lowest points. Here is an example that
will give you two clusters and show the values for the centers in the
middle of the plot. Perhaps this is all you need, but I suspect there
is more work to be done.

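k2 <- kmeans(DMs[, 2], centers = 2)
n <- nrow(DMs)  # number of countries; avoids a hard-coded length
plot(DMs[, 2], col = k2$cluster, pch = 19, xlim = c(1, n + 1))
# label each point with its country code
text(seq_len(n), DMs[, 2], DMs[, 1], col = k2$cluster)
# mark the two cluster centers in the middle of the plot
points(rep(n / 2, 2), k2$centers, pch = 1:2, cex = 2, col = 1:2)
# a keyword position keeps the legend inside the plot region; the cluster
# numbers are arbitrary, so check which one holds the larger values
legend("bottomleft", c("cluster 1: Highly Integrated","cluster 2: Less Integrated"),
       col = 1:2, pch = 19)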

Jim



[R] clustering levels using Tukey HSD in a one way anova

2017-12-31 Thread Ashim Kapoor

Dear all,

I am doing a one way between subjects anova in an unbalanced data set.
Suppose we have "a" levels of the one factor. I want to merge the not so
significantly different  levels into the same cluster.

Can I do a Tukey Kramer HSD and then use the following algorithm:

For i in 2 : a
    For j in 1 : (i-1)
        if the mean of level i is not significantly different from the mean
        of level j, then put i and j in the same cluster. The first time the
        mean of level i is not different from the mean of level j, just go to
        the next i; there is no need to compare with the remaining j's.
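
The first variant could be sketched in R roughly as follows, under stated
assumptions: the data frame d, its columns g and y, and the threshold alpha
are hypothetical, and the toy data are constructed so that A is close to B
and C is close to D. TukeyHSD() on an unbalanced one-way aov fit is used here
as the Tukey-Kramer step. Whether this greedy merging is statistically sound
is the open question of this post, not something the sketch settles.

```r
# Deterministic, unbalanced toy data (hypothetical values).
spread <- function(n) seq(-0.3, 0.3, length.out = n)
d <- data.frame(
  g = factor(rep(c("A", "B", "C", "D"), times = c(8, 10, 6, 9))),
  y = c(spread(8), spread(10) + 0.02, spread(6) + 5, spread(9) + 5.02)
)

fit <- aov(y ~ g, data = d)
tk  <- TukeyHSD(fit)$g            # rows named like "B-A"; column "p adj"

lev     <- levels(d$g)
alpha   <- 0.05
cluster <- seq_along(lev)         # start with every level in its own cluster
for (i in 2:length(lev)) {
  for (j in 1:(i - 1)) {
    p <- tk[paste(lev[i], lev[j], sep = "-"), "p adj"]
    if (p >= alpha) {             # not significantly different
      cluster[i] <- cluster[j]    # put i in the cluster of j
      break                       # first match: go to the next i
    }
  }
}
names(cluster) <- lev
print(cluster)
```

Note that the result can depend on the order of the levels, because each
level joins the cluster of the first earlier level it matches.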

Alternately,

I do not do Tukey Kramer HSD. I only run the above algorithm. At each
iteration of the inner loop, compute the contrast: mean of level i = mean of
level j. At the first match I come out of the inner loop. To control for the
(at most) 1 + 2 + ... + (a-1) comparisons we can use Bonferroni, Scheffe, or
some other technique.

Since this is a statistics query I have posted on stackexchange.  I have
not received a reply so I am posting my query here. Can some one please
answer my query here or on stackexchange?

The link to the query on stackexchange is:

https://stats.stackexchange.com/questions/320930/one-way-anova-clustering-levels-using-tukey-kramer-hsd

Best Regards,
Ashim


Re: [R] Clustering methods for data that has bimodal distribution

2016-12-05 Thread Ranjan Maitra

Hello Adrian,

It all depends on what the structure of the dataset is. For instance, you said 
that all your values are between -1 and 1. Do the squared values in each data 
row sum to 1? How about the means? Are they zero? I guess all this has to 
depend on the application, how the data were processed, and what is sought to 
be answered. Even if Euclidean space is most apt, you then need to figure out 
what sort of structure you would like in your derived groups/clusters. For 
example, k-means has an underlying philosophy: homogeneous spherical clusters 
of roughly equal sizes. Is this what you want?

HTH,
Ranjan


-- 
Important Notice: This mailbox is ignored: e-mails are set to be deleted on 
receipt. Please respond to the mailing list if appropriate. For those needing 
to send personal or professional e-mail, please use appropriate addresses.


[R] Clustering methods for data that has bimodal distribution

2016-12-04 Thread Adrian Johnson

Dear group,
pardon me for a naive question. I have a data matrix (11K rows, 4K columns).
The data range is between -1 and 1. The values are not integers but real
numbers with precision to at least the millionths place.

The data distribution is peculiar: if I do plot(density(myMatrix)), I
get a nice bimodal curve (a nice standard distribution between -1 and 0
and another curve between 0 and 1).

I am interested in clustering the data using consensus clustering
(which uses K-means).

My questions are:

1. If my data range is between -1 and 1, is K-means an appropriate
method, considering that the data might have ties?

2. Although K-means is non-parametric, would bimodally distributed
data be okay as input to K-means?

I appreciate any suggestion.
Thanks
Adrian.


[R] Clustering of clients (retail) - Free data sets?

2015-09-17 Thread Omar André Gonzáles Díaz

Hi all,

I'm learning how to cluster clients.

I've found this nice presentation on the subject, but the data is not
available to use. I've contacted the author and hope he'll answer soon.

https://ds4ci.files.wordpress.com/2013/09/user08_jimp_custseg_revnov08.pdf

Does anyone know of similar tutorials or good books on the subject?

Thanks very much!


[R] clustering with hclust

2014-07-25 Thread Marianna Bolognesi

Hi everybody, I have a problem with a cluster analysis.

I am trying to use hclust with method="ward".

The Ward method works with SQUARED Euclidean distances.

Hclust demands a dissimilarity structure as produced by dist.

Yet, dist does not seem to produce a table of squared Euclidean distances,
starting from cosines.
In fact, manually computing the squared Euclidean distances from cosines
(d=2(1-cos)) produces a different outcome.

As a consequence, using hclust with the Ward method on a table of cosines
transformed into distances with dist produces a different dendrogram than
other programs for hierarchical clustering with the Ward method (i.e.
MultiDendrograms). Weird, right??

Manually computing the distances and then feeding them to hclust produces
an error message. So, I am wondering, what the hell is this dist function
doing?!

thanks!

marianna


Re: [R] clustering with hclust

2014-07-25 Thread Christian Hennig


Dear Marianna,

the function agnes in library cluster can compute Ward's method from a raw 
data matrix (at least this is what the help page suggests).


Also, you may not be using the most recent version of hclust. The most 
recent version has a note in its help page that states:


Two different algorithms are found in the literature for Ward clustering. 
The one used by option "ward.D" (equivalent to the only Ward option "ward" 
in R versions <= 3.0.3) does not implement Ward's (1963) clustering 
criterion, whereas option "ward.D2" implements that criterion (Murtagh and 
Legendre 2013). With the latter, the dissimilarities are squared before 
cluster updating. Note that agnes(*, method="ward") corresponds to 
hclust(*, "ward.D2").
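
To make the note concrete, here is a small sketch on random data (the matrix
X is made up). It checks two things: that for unit-length rows the squared
Euclidean distance equals 2(1-cos), and that "ward.D2" on plain distances
gives the same tree as "ward.D" on squared distances, with heights on the
square-root scale. The use of as.dist() reflects an assumption about the
error reported in the question, namely that a plain matrix rather than a
"dist" object was passed to hclust.

```r
set.seed(42)
X <- matrix(rnorm(40), nrow = 10)
X <- X / sqrt(rowSums(X^2))        # scale every row to unit length

cosine <- tcrossprod(X)            # cosine similarity between rows
d2 <- as.dist(2 * (1 - cosine))    # squared Euclidean distances as a "dist" object

# Same numbers as squaring what dist() returns:
print(all.equal(as.vector(d2), as.vector(dist(X)^2)))

h_w2 <- hclust(sqrt(d2), method = "ward.D2")  # plain distances in, squared internally
h_w1 <- hclust(d2, method = "ward.D")         # squared distances supplied by hand

same_merges  <- identical(h_w2$merge, h_w1$merge)
same_heights <- isTRUE(all.equal(h_w2$height, sqrt(h_w1$height)))
print(c(same_merges, same_heights))
```

So dist() returns unsquared Euclidean distances, and which Ward option is
appropriate depends on whether the input has already been squared.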


The Murtagh and Legendre paper has more details on this and is here:
http://arxiv.org/abs/.6285
F. Murtagh and P. Legendre, Ward's hierarchical clustering method: 
clustering criterion and agglomerative algorithm


It's not clear to me why one would want to use Ward's method for this kind 
of data, but that's your decision of course.


Best wishes,
Christian


*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
c.hen...@ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


[R] Clustering of data set documentation files in package description

2013-09-20 Thread Thiem Alrik

Dear R help list,

I was just wondering whether there is a way to cluster the documentation files 
of data sets in the package documentation index file, so that common prefixes 
such as dat... are not necessary.

Best wishes,
Alrik



Dr. Alrik Thiem
Post-Doctoral Researcher

Department of Humanities, Social and Political Sciences
Swiss Federal Institute of Technology Zurich (ETHZ)
Building IFW, Office C 29.2
Haldeneggsteig 4
CH-8092 Zurich

+41 44 63 20937 (landline)
+41 76 52 78083 (mobile)

http://www.alrik-thiem.net
http://www.compasss.org




[R] Clustering with uneven variables

2013-05-09 Thread Elizabeth McKenzie

 Hello,

 I am new to R (and a novice at statistics).  I have a list of objects, with
 (ideally) 10 different attributes measured per object.  However, in reality,
 I was not able to obtain all 10 attributes for every object, so there is
 some data missing (unequal number of measured attributes per object).  I
 would like to cluster my objects based on the measured attributes.   Can I
 still cluster my objects even though the data is of unequal lengths?  If so,
 what would be a good R function to do this?

 An example of the setup of the data with the columns ATT1, ATT2, etc as the
 different attributes:
 OBJ   ATT1  ATT2  ATT3
 obj1  3     1     3
 obj2  NA    2     4
 obj3  1     NA    NA
 :
 :

 Thank you for your help!


[R] Clustering newbie question

2012-12-18 Thread Anton Ashanin



Hello,
Please advise on encoding data for the following clustering problem. 
I have a dataset with car usage info. The dataset has the following fields:
1. Car model  (Toyota Celica, BMW, Nissan X-Trail, Mazda Cosmo, etc.)
2. Year built 
3. Country where the car runs 
4. Distance run by car before major repairs 

Important: The above dataset is sparse. 
In most cases Distance is not known for all countries for a given car.   

Problem: 
For a given car predict the Distance it will run before major repairs in a 
country for which Distance is unknown.

My approach:
I want to represent each record in the dataset as a sparse vector with the 
following components:
1. Binary (1/0) car model components. Number of these components equals the 
number of all possible models in the dataset.
2. Binary (1/0) country where the car runs. Number of these components equals 
the number of all possible countries in the dataset.
3. Distance. A single integer component, equals the distance run by car.

Next I want to cluster (k-means) these vectors and analyze resulting groups. 
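
The encoding described above might look like this in R. It is only a sketch:
the data frame cars and the helper one_hot() are made up for illustration,
and standardising the distance column is an added assumption so that it does
not dominate the 0/1 columns. It does not settle whether Euclidean k-means on
such mixed vectors is appropriate, which is the question asked below.

```r
# Hypothetical records: model, country, distance before major repairs.
cars <- data.frame(
  model    = c("Celica", "X-Trail", "Celica", "Cosmo"),
  country  = c("JP", "DE", "DE", "JP"),
  distance = c(120000, 90000, 110000, 150000),
  stringsAsFactors = FALSE
)

# One indicator (1/0) column per distinct value of a categorical field.
one_hot <- function(x, prefix) {
  lev <- sort(unique(x))
  m <- outer(x, lev, "==") * 1
  colnames(m) <- paste(prefix, lev, sep = "_")
  m
}

X <- cbind(
  one_hot(cars$model, "model"),
  one_hot(cars$country, "country"),
  distance = as.numeric(scale(cars$distance))  # centred and scaled
)
print(X)

set.seed(1)
km <- kmeans(X, centers = 2, nstart = 10)
print(km$cluster)
```

A dissimilarity designed for mixed data, such as Gower's coefficient in
cluster::daisy(), is one commonly suggested alternative to one-hot encoding.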

Questions:
1) In my vectors I mix components of different nature - binary (model, 
country)  and continuous (distance). How to calculate component-wise distance 
between vectors? Cosine similarity?
2) Other ways to encode components with finite set of values (model, country) 
to work well with continuous components (such as distance)?

Thanks!
Anton

[R] clustering of binary data

2012-12-06 Thread marco milella

Good morning,
I am analyzing a dataset composed of 364 subjects and 13 binary variables
(0,1 = absence,presence).
I am testing possible associations (co-presence) among my variables. To do
this, I was trying cluster analysis.

My main interest is to check for the significance of the obtained clusters.

First, I tried the pvclust() function, using method.hclust="ward"
and method.dist="binary". Altogether it works (clusters and significance
obtained). However, I'm not convinced by the distance matrix. Associations
between variables are indeed different from the results obtained in PAST by
using Ward on a Jaccard matrix (which should be OK for binary data).
Moreover, when I try to obtain a Jaccard matrix in R from my data, by using
the vegan package

mydistance <- vegdist(t(data), method="jaccard")

 I receive the following error message:

Error in rowSums(x, na.rm = TRUE) : 'x' must be numeric


Below is a subset of my dataset:

       variable1 variable2 variable3 variable4 variable5 variable6 variable7 variable8 variable9 variable10 variable11 variable12 variable13
case1          0         0         0         0         0         1         0         0         1          1          0          0          0
case2          0         0         0         0         0         1         0        NA        NA          1          0          0          0
case3          0         0         0         0         0         1         0         0         1          1          0          0          0
case4          1         0         0         0         0         1         0         1         0          1          0          0          0
case5          0         0         0         0         0         1         0         0         1          1          0          0          0
case6          0         1         0         0         0         1         0         1         0          1          0          0          0
case7          0         1         0         0         0         1         0         0         1          1          0          0          0
case8          0         0         0         0         0         1         0         1         0          1          0          0          0
case9          0         0         0         0         0         1         0         1         0          1          0          0          0
case10         0         0         0         0         0         1         0         0         1          1          0          0          0
case11         1         0         0         1         0         1         1         1         0          1          0          0          0
case12         0         0         0         1         1         0         1         1         0          1          0          0          0
...

So, my questions are the following: Is the Jaccard index a good strategy
for my kind of data? Is the binary distance used in pvclust theoretically
more correct? Is there any alternative to pvclust for testing the
significance of my clusters?

Thanks in advance
Marco


Re: [R] clustering of binary data

2012-12-06 Thread David L Carlson

Do not use HTML in R-help emails: it scrambles the layout of posted data, as
happened to yours. The error message is telling you that t(data) is not
numeric.

 str(data)

That will tell you what kind of data you have. 

--
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352
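
One way the error in this thread ('x' must be numeric) could arise is if the
0/1 columns were read in as character. That is an assumption, since the real
data are not available; the small data frame dat below is made up to mimic
that situation. The sketch shows what str() would reveal, converts the
columns to numbers, and computes a distance between variables with base R's
dist(method = "binary"), which as far as the documentation in ?dist goes
matches the Jaccard dissimilarity for 0/1 data.

```r
# Hypothetical data: binary variables accidentally stored as character.
dat <- data.frame(
  variable1 = c("0", "0", "1", "0"),
  variable2 = c("0", NA, "0", "1"),
  variable3 = c("1", "1", "1", "0"),
  stringsAsFactors = FALSE
)
str(dat)                         # the columns show up as chr, not num

m <- as.matrix(dat)
mode(m) <- "numeric"             # "0"/"1" strings become 0/1; NA stays NA

# Distance between variables (columns), hence the transpose.
jac <- dist(t(m), method = "binary")
print(jac)
```

After the same conversion, t(m) is a numeric matrix, so the rowSums() call
named in the error message should no longer complain about its type.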


[R] Clustering groups according to multiple variables

2012-10-31 Thread Matthew Ouellette

Dear R help,


I am trying to cluster my data according to group in a data frame such as
the following:

df=data.frame(group=rep(c("a","b","c","d"),10),replicate(100,rnorm(40)))


I'm not sure how to tell hclust() that I want to cluster according to the
group variable.  For example:

dfclust=hclust(dist(df),"ave")

plot(dfclust)

Clusters according to each individual row.  What I'm looking for is an
unrooted tree that will show similarity/dissimilarity among groups
according to the data set as a whole.
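
One possible reading of the request is to cluster group centroids rather than
individual rows. The sketch below does that with base R only; averaging each
column within a group is an assumption about what "according to the data set
as a whole" should mean, and the object names centroids and hc are made up.

```r
set.seed(1)
df <- data.frame(group = rep(c("a", "b", "c", "d"), 10),
                 replicate(100, rnorm(40)))

# One row per group: the mean of every numeric column within that group.
centroids <- aggregate(. ~ group, data = df, FUN = mean)
rownames(centroids) <- centroids$group

# Cluster the four group centroids; drop the label column before dist().
hc <- hclust(dist(centroids[, -1]), method = "average")
plot(hc)
```

plot() draws a rooted dendrogram; drawing an unrooted tree would need an
additional package and is not shown here.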

I appreciate the help,


MO


[R] Clustering groups according to multiple variables

2012-10-31 Thread Matthew Ouellette

Dear R help,


I am trying to cluster my data according to group in a data frame such as
the following:

df=data.frame(group=rep(c(a,b,c,d),10),(replicate(100,rnorm(40


I'm not sure how to tell hclust() that I want to cluster according to the
group variable.  For example:

dfclust=hclust(dist(df),ave)

plot(dfclust)

Clusters according to each individual row.  What I'm looking for is an
unrooted tree that will show similarity/dissimilarity among groups
according to the data set as a whole.

I appreciate the help,


MO

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] clustering spline-based models

2012-10-01 Thread Wyatt McMahon

Hello playeRs!

 

I'm working on a project for a client.  She's modeling hormone levels
periodically, and trying to develop a model and fit her data to that
model, and subsequently she's trying to cluster individuals based on how
well each fits the model.

 

I've been looking at grofit for this, but it doesn't appear that it can do
the sort of post-hoc analysis I'm looking for, although it can use a
model-free spline fit nicely (reportedly).  

 

Does anyone have any packages that can accomplish what I'm looking to
accomplish?
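
A minimal base-R sketch of the workflow described above, not a package recommendation: fit one pooled smoothing spline, measure how far each individual sits from it, and cluster individuals on that lack-of-fit score. The simulated data, the variable names, and the choice of RMSE as the fit measure are all assumptions made for illustration; grofit is not used here.

```r
set.seed(11)
# Hypothetical data: 10 individuals, 48 time points each
dat <- expand.grid(time = 1:48, id = 1:10)
# Individuals 6-10 are simulated as much noisier than individuals 1-5
noise_sd <- ifelse(dat$id <= 5, 0.1, 1.0)
dat$hormone <- sin(dat$time / 4) + rnorm(nrow(dat), sd = noise_sd)

# Pooled, model-free spline fit over all individuals
fit <- smooth.spline(dat$time, dat$hormone)
dat$res <- dat$hormone - predict(fit, dat$time)$y

# Lack of fit per individual (root mean squared residual)
rmse <- sapply(split(dat$res, dat$id), function(r) sqrt(mean(r^2)))

# Cluster individuals on how well they follow the common curve
groups <- cutree(hclust(dist(rmse), method = "average"), k = 2)
```

Clustering on a single summary number is the simplest possible choice; clustering on per-individual spline coefficients or fitted curves would be a natural alternative.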

 

Thanks in advance for any help you can offer,

 

Wyatt



Re: [R] Clustering analysis with ordination plots

2012-05-02 Thread Gavin Simpson

Please read the posting guide for future questions.

I presume you mean using the vegan package? If so, then see this blog
post of mine which shows how to do something similar:

http://wp.me/pZRQ9-73

If you post more details and an example I will help further if the blog
post is not sufficient for you to get the solution you want.
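
For readers without access to the blog post, the general idea can be sketched in base R: compute cluster memberships first, then use them as point colours in the ordination plot. This uses prcomp() on a built-in dataset rather than vegan, and the data and the choice of three clusters are assumptions for illustration only.

```r
# Stand-in for an environmental matrix (assumption: numeric, rows = sites)
env <- scale(USArrests)

# Cluster the rows and cut the tree into three groups
grp <- cutree(hclust(dist(env), method = "average"), k = 3)

# Ordinate with PCA and colour the site scores by cluster membership
pca <- prcomp(env)
plot(pca$x[, 1:2], col = grp, pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = paste("cluster", sort(unique(grp))),
       col = sort(unique(grp)), pch = 19)
```

The same pattern (a vector of memberships passed as col) should carry over to other ordination plots, since it only relies on the rows being in the same order in both analyses.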

G


-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London  [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%


Re: [R] Clustering analysis with ordination plots

2012-05-01 Thread Uwe Ligges




On 30.04.2012 18:44, borinot wrote:
> Hello to all,
>
> I'm new to R so I have a lot of problems with it, but I'll only ask the main
> one.
>
> I have clustered an environmental matrix

We do not know what that is. Where is the example data? See the posting
guide.

> with 2 different methods,

Which? Where is the reproducible code? See the posting guide.

> and I'd
> like to plot them in a PCA

PCA is a method for reduction of dimensions. Frequently, if reduction
works nicely, you can find clusters when plotting the first few PCs. But
"plot them [clusters] in a PCA" is semantically invalid.

> and a db-RDA.

What is a db-RDA?

> I mean, I want to see these clusters
> in the plots as points of different colours, together with the rest of the
> information in the plot, but I don't know how to do this.

Tell us how you clustered the data and how you visualized it so far and
we may be able to show you how to go ahead.

> I've checked a lot of bibliography and forums, and I haven't found the
> solution... it can't be so hard!
>
> Well, thanks in advance! :)

But really, read the posting guide in advance of follow-up questions: It
helps to improve the way you ask questions and hence you will probably
get more useful answers.


Uwe Ligges




[R] Clustering analysis with ordination plots

2012-04-30 Thread borinot

Hello to all, 

I'm new to R so I have a lot of problems with it, but I'll only ask the main
one. 

I have clustered an environmental matrix with 2 different methods, and I'd
like to plot them in a PCA and a db-RDA. I mean, I want to see these clusters
in the plots as points of different colours, together with the rest of the
information in the plot, but I don't know how to do this. 

I've checked a lot of bibliography and forums, and I haven't found the
solution... it can't be so hard! 

Well, thanks in advance! :) 


[R] clustering and the region of integration

2012-02-10 Thread Barbara Uszczynska

Dear R users,

I'm having trouble calculating p-values for my 2d dataset. First I
performed clustering, and I would like to get some information about the
strength of cluster membership for each point. I've calculated (thanks to
some kind people's help) the multivariate normal densities (mnd) using the
dmvnorm function:

pd11=mvtnorm::dmvnorm(dataset1,mean=dataset1MC$parameters$mean[,1],sigma=dataset1MC$parameters$variance$sigma[,,1])

I've obtained a vector of mnds for each cluster:

     NA12043      NA12249      NA12264      NA12707      NA12716      NA12717
8.627681e+00 8.465797e+00 1.522724e+01 2.047262e+01 1.780368e+01 2.443946e+01
     NA12751      NA12762      NA12864      NA12873      NA07034      NA07048
8.687642e+00 5.024366e+00 2.163811e+01 6.093326e-01 1.503374e+00 2.263341e+00
     NA07055      NA07345      NA07348      NA07357      NA10830      NA10835
2.177880e+01 2.851877e+01 1.240402e+01 7.498245e+00 1.186389e+01 1.229760e+01

     NA12154      NA12234      NA12236      NA12763      NA12801      NA12812
8.293616e+00 4.019101e-19 2.733848e+01 2.623284e+01 2.320810e+01 5.112927e-01
     NA12813      NA12878      NA10851      NA10854      NA10857      NA10859
1.432336e+01 1.000314e+01 1.675454e+01 8.239816e+00 2.449679e+01 2.655419e+01
     NA10861      NA10863      NA11839      NA11840      NA11881      NA11882
2.294064e+01 2.218329e-17 8.844933e+00 2.911991e+00 2.170381e+01 3.089883e+00

     NA11994      NA12044      NA12056      NA12057      NA12891      NA12892
1.668749e+01 1.588963e+01 5.913443e+00 2.924297e+01 1.765777e+01 7.935129e+00

Next, I would like to calculate the p-value for each point that was
assigned to a particular cluster. To do this I'm using the
pmvnorm function, but I find it difficult to set the region of
integration. As I understand it, to get the probability of cluster
membership I should define how 'far' my point is from the cluster mean.
However, I've got a 2d dataset and my mean is also 2d:

   [1,] [2,]
 1.348992  1.269590

but I've got only one density value for each point.

Using:

pmvnorm(mean=dataset1MC$parameters$mean[,1],sigma=dataset1MC$parameters$variance$sigma[,,1],
lower=2.218329e-17, upper=as.vector(dataset1MC$parameters$mean[,1]))

gives strange results, since for 2.218329e-17 the output is:

[1] 0.348126
attr(,"error")
[1] 1e-15
attr(,"msg")
[1] "Normal Completion"

and

pmvnorm(mean=dataset1MC$parameters$mean[,1],sigma=dataset1MC$parameters$variance$sigma[,,1],
lower= as.vector(dataset1MC$parameters$mean[,1]) , upper=2.924297e+01)

gives:

[1] 0.348126
attr(,"error")
[1] 1e-15
attr(,"msg")
[1] "Normal Completion"

If it is possible I would like to get some info about:

Is my idea of calculating the probability of cluster membership
correct? How can I properly set the region of integration?
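
One common way to express "how far is this point from the cluster mean" as a probability, offered here as a sketch and not as the only valid answer: for a d-dimensional normal cluster, the squared Mahalanobis distance of a point follows a chi-squared distribution with d degrees of freedom, so its upper tail gives the probability of landing at least that far out. Note that the lower and upper arguments of pmvnorm are limits in the coordinates of the data, so passing density values there does not describe a point. The covariance matrix and the three test points below are hypothetical; only the mean is taken from the post.

```r
mu <- c(1.348992, 1.269590)            # cluster mean quoted in the post
Sigma <- matrix(c(0.05, 0.01,
                  0.01, 0.04), 2, 2)   # hypothetical cluster covariance

# Hypothetical points: at the mean, close to it, and far away
pts <- rbind(at_mean = mu,
             nearby  = mu + c(0.1, -0.1),
             far     = mu + c(1, 1))

# Squared Mahalanobis distance of each point from the cluster centre
d2 <- mahalanobis(pts, center = mu, cov = Sigma)

# Probability that a draw from the cluster lies at least this far out
pval <- pchisq(d2, df = ncol(pts), lower.tail = FALSE)
```

This treats the fitted mean and covariance as known; with few points per cluster the resulting values are only approximate.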

I would be grateful for any help.

Best Wishes,

Bas.


[R] Clustering and visualising a wordcloud

2012-01-23 Thread Sachinthaka Abeywardana

Is there a package (and, for that matter, a function) that I can use to
create clustered wordclouds? The current wordcloud package simply draws more
frequent words larger, whereas what I want is for the cluster centres
to be the more frequent words and for words to sit closer together the
more often they co-occur.

For example, if the words {bus, drive, eat, pizza} were the most frequent
words in a document, you would expect {bus, drive} to be close to each
other, whereas {eat, pizza} would be away from that cluster but close to
each other.
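
A rough base-R sketch of that layout idea, under the assumption that co-occurrence counts can be turned into distances and placed with classical multidimensional scaling. The toy documents and the particular distance (maximum count minus co-occurrence count) are invented for illustration; this is not a feature of the wordcloud package.

```r
# Toy corpus: each element lists the words that occur in one document
docs <- list(c("bus", "drive"), c("bus", "drive"),
             c("eat", "pizza"), c("eat", "pizza"),
             c("bus", "pizza"))
terms <- sort(unique(unlist(docs)))

# Document-by-term incidence matrix (1 = word occurs in document)
tdm <- t(sapply(docs, function(d) as.integer(terms %in% d)))
colnames(tdm) <- terms

# Term-by-term co-occurrence counts; the diagonal holds word frequencies
cooc <- crossprod(tdm)
freq <- diag(cooc)

# Words that co-occur often get a small distance (hypothetical choice)
d <- as.dist(max(cooc) - cooc)

# Place the words in 2D and scale the text by frequency
xy <- cmdscale(d, k = 2)
plot(xy, type = "n", axes = FALSE, xlab = "", ylab = "")
text(xy, labels = terms, cex = 1 + freq / max(freq))
```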

Thanks,
Sachin


[R] Clustering Large Applications..sort of

2011-08-10 Thread Ken Hutchison

Hello all,
   I am using the clustering functions in R in order to work with large
masses of binary time series data, however the clustering functions do not
seem able to fit this size of practical problem. Library 'hclust' is good
(though it may be sub par for this size of problem, thus doubly poor for
this application) in that I do not want to make assumptions about the number
of clusters present, also due to computational resources and time hclust is
not functionally good enough; furthermore k-means works fine assuming the
number of clusters within the data, which is not realistic. The silhouette
functions in 'Pam' and 'Clara' and (if I remember correctly) 'cluster' seem
to be really bad through very thorough experimentation of data generation
with known clusters. I am left then with either theoretical abstractions
such as pruning hclust trees with minimal spanning trees or perhaps
hand-rolling a hierarchical k-medoids which works extremely efficiently and
without cluster number assumptions. Anybody have any suggestions as to
possible libraries which I have missed or suggestions in general? Note: this
is not a question for 'Bigkmeans' unless there exists a
'findbigkmeansnumberofclusters' function also.
Thank you in advance for your assistance,
Ken


Re: [R] Clustering Large Applications..sort of

2011-08-10 Thread Thomas Lumley

Try the flow cytometry clustering functions in Bioconductor.

 -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland


Re: [R] Clustering Large Applications..sort of

2011-08-10 Thread Peter Langfelder

On Wed, Aug 10, 2011 at 12:07 PM, Ken Hutchison vicvoncas...@gmail.com wrote:
 Hello all,
   I am using the clustering functions in R in order to work with large
 masses of binary time series data, however the clustering functions do not
 seem able to fit this size of practical problem. Library 'hclust' is good
 (though it may be sub par for this size of problem, thus doubly poor for
 this application) in that I do not want to make assumptions about the number
 of clusters present, also due to computational resources and time hclust is
 not functionally good enough;

How big is your problem? If your distance (dissimilarity) matrix fits in the
memory of your machine, the packages flashClust and fastcluster provide
much faster implementations of hierarchical clustering than the stock
R function hclust.
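
A rough sketch of how one might check whether the dissimilarity fits in memory and then swap in a faster hclust. The data sizes are invented for illustration; fastcluster::hclust is used only if that package happens to be installed, and the code otherwise falls back to stats::hclust.

```r
set.seed(5)
n <- 500                      # number of series (hypothetical)
p <- 60                       # length of each binary series (hypothetical)
x <- matrix(rbinom(n * p, 1, 0.3), nrow = n)

# A dist object stores n * (n - 1) / 2 doubles of 8 bytes each
dist_mib <- n * (n - 1) / 2 * 8 / 2^20

# Binary (Jaccard-type) dissimilarity between the rows
d <- dist(x, method = "binary")

# Use the faster drop-in replacement when it is available
hc <- if (requireNamespace("fastcluster", quietly = TRUE)) {
  fastcluster::hclust(d, method = "average")
} else {
  stats::hclust(d, method = "average")
}
```

Because the memory estimate grows with the square of n, it is worth computing before calling dist() on the full data.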

Peter


Re: [R] Clustering Large Applications..sort of

2011-08-10 Thread Christian Hennig

There are a number of methods in the literature for deciding the number of 
clusters for k-means. Probably the most popular one is the Calinski and 
Harabasz index, implemented as calinhara in package fpc. A distance-based 
version (and several other indexes for this purpose) is in function 
cluster.stats in the same package.
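
To make the index concrete, here is a small sketch that computes the Calinski-Harabasz value by hand from kmeans() output, as the ratio of between-cluster to within-cluster sum of squares, each divided by its degrees of freedom. It does not call fpc; the helper name ch_index and the simulated three-cluster data are assumptions for illustration.

```r
set.seed(42)
# Three well-separated simulated clusters of 50 points each in 2D
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2))

# Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k))
ch_index <- function(x, k) {
  km <- kmeans(x, centers = k, nstart = 20)
  n <- nrow(x)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

ks <- 2:6
ch <- sapply(ks, function(k) ch_index(x, k))
best_k <- ks[which.max(ch)]   # number of clusters with the largest index
```

On real data the maximum is usually less clear-cut than in this simulation, so it helps to look at the whole curve of values.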


Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


Re: [R] Clustering Large Applications..sort of

2011-08-10 Thread Christian Hennig

PS to my previous posting: Also have a look at kmeansruns in fpc. This 
runs kmeans for several numbers of clusters and decides the number of 
clusters by either the Calinski-Harabasz index or the average silhouette width.


Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


Re: [R] clustering based on most significant pvalues does not separate the groups!

2011-07-06 Thread S Ellison

t-tests and the like test for a difference in mean value, not for 
non-overlapping populations or data sets.

The fact that the mean  of one data set differs significantly from the mean of 
the other does not mean that the ranges of the individual points in each data 
set are disjoint.

set.seed(1023)

x <- rnorm(60, 10)
y <- x + 0.75
boxplot(x, y)
# Lots of overlap for individual points
t.test(x, y)
# Strongly significant difference

Does that correspond to your situation well enough to account for your 
puzzlement?
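
To put a number on the overlap in the example above, one can compare the test result with the share of values in one group that still fall inside the range of the other. The overlap measure below is an ad hoc choice made for illustration, not a standard statistic.

```r
set.seed(1023)
x <- rnorm(60, 10)
y <- x + 0.75

# The difference in means is clearly detectable ...
p_value <- t.test(x, y)$p.value

# ... yet most individual y values are still below the largest x value
overlap <- mean(y < max(x))
```

A probe can therefore pass a significance filter while the two groups remain heavily interleaved on that probe.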


S Ellison


Re: [R] clustering based on most significant pvalues does not separate the groups!

2011-07-06 Thread pguilha

Yes absolutely, your explanation makes sense. Thanks very much.
rgds
Paul


[R] clustering based on most significant pvalues does not separate the groups!

2011-07-04 Thread pguilha

Hi all,

I have some microarray data on 40 samples that fall into two groups. I have
a value for 480k probes for each of those samples. I performed a t test
(rowttests) on each row(giving the indices of the columns for each group)
then used p.adjust() to adjust the pvalues for the number of tests
performed. I then selected only the probes with adj-p.value<=0.05. I end up
with roughly 2000 probes to do the clustering on but using pvclust, and
hclust, the samples do not split up into the two groups. I would have
imagined that using only those values that are significantly different
between the two groups, the clustering should surely reflect that?

Please, what am I missing???

Thanks!

Paul

PS: I am hoping I have just thought this through in the wrong way and there
is a simple explanation, but can provide the code I am using for clustering
if necessary!




[R] Clustering help in Heat Maps

2011-04-12 Thread khush ........

Dear Experts,

I am using the script below to generate a heat map of gene expression
data. I am using hierarchical clustering (hclust) for clustering. Now I want
to compare different clustering methods, such as k-means clustering and
model-based clustering.

I have two queries:

1. How can I incorporate a different clustering method in the same code?
2. Is it possible to use pvclust in the same code and cluster
accordingly?

library(gplots)

#===Cyto=
x=read.table("Cyto_shoot.txt", header=TRUE)
mat=data.matrix(x)
heatmap.2(mat,
# c('red','green','orange','blue','yellow',
# 'gray','black','brown','aquamarine3','cyan',
# 'darkmagenta','darkviolet','green4'))
col=colorRampPalette(c("green","white","red"))(256),

#col=greenred(75),
#col = cm.colors(256),
#bgStyle="3D Rectangle",
#bgGradientMode="Diagonal Edge",
Rowv=TRUE,

Colv=FALSE,
distfun = dist,
hclustfun = hclust,
dendrogram = "row",
scale = "column",
na.rm=TRUE,
trace="none",
sepwidth=c(0.05,0.05),
margins = c(3, 40),
xlab = "", ylab = "",
labRow = NULL,
labCol = NULL,
key=TRUE,
keysize=1,
density.info="none"
)
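
On the first query, a sketch of two ways to change the clustering behind a heat map. It uses the base heatmap() function and a built-in dataset so that it runs without gplots; heatmap.2 accepts the same distfun, hclustfun and RowSideColors arguments. The dataset, the Manhattan/Ward combination and the choice of three k-means clusters are assumptions for illustration.

```r
mat <- scale(as.matrix(mtcars))   # stand-in for the expression matrix

# 1. Hierarchical clustering with a different distance and linkage:
#    pass functions instead of the defaults
heatmap(mat,
        distfun   = function(m) dist(m, method = "manhattan"),
        hclustfun = function(d) hclust(d, method = "ward.D2"),
        scale = "none")

# 2. k-means has no tree, so order the rows by cluster, switch the
#    dendrograms off and mark the clusters with a colour bar
set.seed(1)
km  <- kmeans(mat, centers = 3, nstart = 20)
ord <- order(km$cluster)
heatmap(mat[ord, ], Rowv = NA, Colv = NA, scale = "none",
        RowSideColors = c("red", "green", "blue")[km$cluster[ord]])
```

The second pattern works for any method that returns a vector of cluster labels, which includes model-based clustering.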


Thanks in advance

Kamal


[R] Clustering problem

2011-03-21 Thread Abhishek Pratap

Hi Guys

I want to apply a clustering algorithm to my dataset in order to find the
regions of points (X, Y) which have similar values (percent_GC and
mean_phred_quality). Details below.

I have sampled 1% of the points from my main data set of 85 million
points.  The result is still somewhat large (800K points) and looks
like the following.


       X    Y  percent_GC  mean_phred_quality
1   4286  930        0.50                0.13
2   4825  947        0.50               20.33
3   8207  932        0.32               26.50
4   8451  940        0.48               24.81
5   9331  931        0.38               16.93
6  11501  949        0.49               31.28

What I want to do is find local regions in which there are associations
between these 4 values, i.e. points X,Y that have a close correlation with
percent_GC and mean_phred_quality.
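
Before reaching for a clustering algorithm, one simple way to look for local structure is to cut the X,Y plane into tiles and compute the correlation inside each tile. The sketch below does this on simulated data with the same column names as above; the 4 x 4 grid and the simulated values are assumptions for illustration.

```r
set.seed(7)
n <- 5000
# Simulated stand-in for the sampled points
pts <- data.frame(X = runif(n, 0, 20000),
                  Y = runif(n, 0, 2000),
                  percent_GC = runif(n, 0.3, 0.6),
                  mean_phred_quality = rnorm(n, 25, 5))

# Assign every point to one cell of a 4 x 4 grid over X and Y
pts$tile <- interaction(cut(pts$X, 4), cut(pts$Y, 4), drop = TRUE)

# Pearson correlation between the two measurements within each tile
local_cor <- sapply(split(pts, pts$tile), function(p) {
  cor(p$percent_GC, p$mean_phred_quality)
})
```

Tiles whose correlation stands out from the rest are candidates for the local regions described above, although with many tiles some will look extreme by chance.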

PS:  I did calculate the overall pearson correlation coeff between
percent_GC and mean_phred_quality and it is not statistically
significant which got me interested into finding local regions where
it may be.

I would really appreciate your help as I am still a rookie in applying
clustering algorithms.

Thanks!
-Abhi


[R] clustering problem

2011-03-02 Thread Maxim

Hi,

I have a gene expression experiment with 20 samples and 25000 genes each.
I'd like to perform clustering on these. It turned out to become much faster
when I transpose the underlying matrix with t(matrix). Unfortunately, I'm then
no longer able to use cutree to access individual clusters. In general
I do something like this:

hc <- hclust(dist(USArrests), "ave")

library(RColorBrewer)
library(gplots)
clrno=3
cols <- rainbow(clrno, alpha = 1)
clstrs <- cutree(hc, k=clrno)
ccols <- cols[as.vector(clstrs)]
heatcol <- colorRampPalette(c(3,1,2), bias = 1.0)(32)
heatmap.2(as.matrix(USArrests), Rowv=as.dendrogram(hc),col=heatcol,
trace="none",RowSideColors=ccols)

Nice, I can access 3 main clusters with cutree. But what about a situation
when I perform hclust like

hc <- hclust(dist(t(USArrests)), "ave")

which I have to do in order to speed up the clustering process. This I can
plot with:

heatmap.2(as.matrix(USArrests), Colv=as.dendrogram(hc),col=heatcol,
trace="none")

But where do I find information about the clustering that was applied to the
rows?
cutree(hc, k=clrno) delivers the clustering on the columns, so what can I do
to access the levels for the rows?
I guess the solution is easy, but after hours of playing around I thought it
might be a good time to contact the mailing list!

Maxim


Re: [R] clustering problem

2011-03-02 Thread rex.dwyer

Don't you expect it to be a lot faster if you cluster 20 items instead of 25000?


Re: [R] clustering problem

2011-03-02 Thread Maxim

Sure,

but in the end I would like to call clusters of genes, not of samples. Actually
the experiment is a time-lapse experiment, so the order of the samples (columns)
is fixed anyway.

I guess my misunderstanding is that I get a clustering of rows in the latter
case (with dist(t(matrix))) because it is actually the heatmap function
itself that does the clustering on rows, right?
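
A minimal sketch of one way to get at the row clusters, assuming the plot is otherwise left at the defaults of dist() and hclust() (Euclidean distance, complete linkage): cluster the rows yourself, keep the hclust object, and cut it with cutree. The names m, hc_rows, hc_cols and row_clusters are only illustrative, and USArrests stands in for the expression matrix.

```r
# Cluster rows and columns explicitly so that both trees can be cut afterwards.
m <- as.matrix(USArrests)

hc_rows <- hclust(dist(m))       # rows (the genes in the real data)
hc_cols <- hclust(dist(t(m)))    # columns (the samples)

k <- 3
row_clusters <- cutree(hc_rows, k = k)   # one cluster label per row
table(row_clusters)

# The same trees can then be handed to the plot (needs gplots), e.g.
# heatmap.2(m, Rowv = as.dendrogram(hc_rows), Colv = as.dendrogram(hc_cols),
#           trace = "none", RowSideColors = rainbow(k)[row_clusters])
```

Because the row tree is built outside the plotting call, the labels in row_clusters refer to exactly the dendrogram that is drawn next to the rows.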

But my question stays the same: how can I cluster 25000 genes for 20
samples on a normal (i7) processor without running into several hours of
clustering, or presumably the process freezing altogether?
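
A back-of-the-envelope sketch of why the two cases differ so much: dist() has to hold one value per pair of items, so its size grows with the square of the number of items being clustered. The helper name dist_cost is made up for this illustration, and it assumes 8 bytes per stored distance.

```r
# Number of pairwise distances, and the memory needed to store them,
# when clustering n items (one 8-byte double per pair).
dist_cost <- function(n) {
  pairs <- n * (n - 1) / 2
  c(pairs = pairs, gib = pairs * 8 / 2^30)
}

dist_cost(25000)   # clustering the genes (rows)
dist_cost(20)      # clustering the samples (columns)
```

For 25000 genes this is about 312 million distances (roughly 2.3 GiB) before hclust even starts, against 190 distances for the 20 samples.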
Best

Maxim

2011/3/2 rex.dw...@syngenta.com

 Don't you expect it to be a lot faster if you cluster 20 items instead of
 25000?

 -Original Message-
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
 On Behalf Of Maxim
 Sent: Wednesday, March 02, 2011 4:08 PM
 To: r-help@r-project.org
 Subject: [R] clustering problem

 Hi,

 I have a gene expression experiment with 20 samples and 25000 genes each.
 I'd like to perform clustering on these. It turned out to be much faster
 when I transpose the underlying matrix with t(matrix). Unfortunately I am
 then no longer able to use cutree to access individual clusters. In general
 I do something like this:

 hc <- hclust(dist(USArrests), "ave")

 library(RColorBrewer)
 library(gplots)
 clrno=3
 cols<-rainbow(clrno, alpha = 1)
 clstrs <- cutree(hc, k=clrno)
 ccols <- cols[as.vector(clstrs)]
 heatcol<-colorRampPalette(c(3,1,2), bias = 1.0)(32)
 heatmap.2(as.matrix(USArrests), Rowv=as.dendrogram(hc),col=heatcol,
 trace="none",RowSideColors=ccols)

 Nice, I can access 3 main clusters with cutree. But what about a situation
 when I perform hclust like

 hc <- hclust(dist(t(USArrests)), "ave")

 which I have to do in order to speed up the clustering process. This I can
 plot with:

 heatmap.2(as.matrix(USArrests), Colv=as.dendrogram(hc),col=heatcol,
 trace="none")

 But where do I find information about the clustering that was applied to
 the rows?
 cutree(hc, k=clrno) delivers the clustering on the columns, so what can I
 do to access the levels for the rows?
 I guess the solution is easy, but after hours of playing around I thought it
 might be a good time to contact the mailing list!

 Maxim


Re: [R] clustering fuzzy

2011-02-05 Thread pete


After sorting the table of membership degrees, I need the difference
between the first and second columns, that is, between the largest and
second-largest membership degree of object i. I need this for K=2, K=3, up to
K.max=6.
This difference is multiplied by the crisp silhouette index vector (s_i), which
also depends on K=2,...,K.max=6, and the result is divided by the sum of these
differences.
I need a final vector containing the index for each clustering
(K=2,...,K.max=6).
I think classe.memb is the way to do this, but I cannot solve the problem,
because the transformed membership degrees matrix (ris$membership) and the
transformed list object (ris$silinfo) do not let me use the output of
classe.memb.

FS = \frac{\sum_i (u_{i1} - u_{i2}) s_i}{\sum_i (u_{i1} - u_{i2})}


Membership degrees table, each row sorted from max to min value:

> head(t(A.sort))
  [,1] [,2] [,3] [,4]
1 0.66 0.30 0.04 0.01
2 0.89 0.09 0.02 0.00
3 0.92 0.06 0.01 0.01
4 0.71 0.21 0.07 0.01
5 0.85 0.10 0.04 0.01
6 0.91 0.04 0.02 0.02
> H.Asort=head(t(A.sort))
> H.Asort[,1]-H.Asort[,2]
   1    2    3    4    5    6
0.36 0.80 0.86 0.50 0.75 0.87
> H.Asort=t(H.Asort[,1]-H.Asort[,2])
This is the vector of differences, which has to be multiplied by the
transformed table ris$silinfo.
> ris$silinfo
$widths
   cluster neighbor  sil_width
72       1        3 0.43820207
54       1        3 0.43427773
29       1        6 0.41729079
62       1        6 0.40550562
64       1        6 0.32686757
32       1        3 0.30544722
45       1        3 0.30428723
79       1        3 0.30192624
12       1        3 0.30034472
60       1        6 0.29642495
41       1        3 0.29282778
1        1        3 0.28000788
85       1        3 0.24709237
74       1        3 0.239




> P=ris$silinfo
> P=P[1]
> P=as.data.frame(P)
> V4=rownames(P)
> mode(V4)="numeric"
> P[,4]=V4
> P[order(P$V4),]
   widths.cluster widths.neighbor widths.sil_width V4
1               1               3       0.28000788  1
2               2               4       0.07614849  2
3               2               3      -0.11676440  3
4               2               4       0.15436648  4
5               2               3       0.14693927  5
6               3               1       0.57083836  6
7               4               5       0.36391826  7
8               5               4       0.63491118  8
9               4               2       0.54458733  9
10              5               4       0.51059626 10
11              2               5       0.03908952 11
12              1               3       0.30034472 12
13              1               3      -0.04928562 13
14              4               3       0.20337180 14
15              3               4       0.46164324 15
18              5               4       0.52066782 18
20              4               3       0.45517287 20
21              3               4       0.39405507 21
22              4               5       0.05574547 22
23              6               1      -0.06750403 23
> P=P[order(P$V4),]
> P=P[,3]
P is now the transformed ris$silinfo vector.
I cannot use this vector object in classe.memb.
K=2
K.max=6
while (K<=K.max)
 {
  ris=fanny(frj,K,memb.exp=m,metric="SqEuclidean",stand=TRUE,maxit=1000,tol=1e-6)
  ris$centroid=matrix(0,nrow=K,ncol=J)
  for (k in 1:K)
   {
    ris$centroid[k,]=(t(ris$membership[,k]^m)%*%as.matrix(frj))/sum(ris$membership[,k]^m)
   }
  rownames(ris$centroid)=1:K
  colnames(ris$centroid)=colnames(frj)
  print(K)
  print(round(ris$centroid,2))
  print(classe.memb(ris$membership)$table.U)
  print(ris$silinfo$avg.width)
  K=K+1
 }
This is the overall scheme: for each K the centroids are computed and the
memberships are summarized with classe.memb.

classe.memb=function(U)
{
 # column 1: cluster with the largest membership; column 2: that membership
 info.U=cbind(max.col(U),apply(U,1,max))
 i=1
 while (i <= nrow(U))
  {
   # objects whose largest membership is below 0.5 are assigned to no cluster
   if (apply(U,1,max)[i]<0.5) info.U[i,1]=0
   i=i+1
  }
 K=ncol(U)
 table.U=matrix(0,nrow=K,ncol=4)
 cl=1
 while (cl <= K)
  {
   table.U[cl,1] = length(which(info.U[info.U[,1]==cl,2]>=.90))
   table.U[cl,2] = length(which(info.U[info.U[,1]==cl,2]>=.70)) - table.U[cl,1]
   table.U[cl,3] = length(which(info.U[info.U[,1]==cl,2]>=.50)) - table.U[cl,1] - table.U[cl,2]
   table.U[cl,4] = sum(table.U[cl,])
   cl = cl+1
  }
 rownames(table.U) = c(1:K)
 colnames(table.U) = c("Alto", "Medio", "Basso", "Totale")
 out=list()
 out$info.U=round(info.U,2)
 out$table.U=table.U
 return(out)
}
-- 
View this message in context: 
http://r.789695.n4.nabble.com/clustering-fuzzy-tp3229853p3261837.html
Sent from the R help mailing list archive at Nabble.com.


[R] clustering fuzzy

2011-02-05 Thread pete


-- 
View this message in context: 
http://r.789695.n4.nabble.com/clustering-fuzzy-tp3262027p3262027.html
Sent from the R help mailing list archive at Nabble.com.


[R] clustering with finite mixture model

2011-02-02 Thread karuna m

Dear R-help,
I am doing clustering via a finite mixture model. Please suggest some packages
in R for finding clusters via finite mixture models with continuous variables.
I would also like to verify the distributional properties of the mixture
distributions by fitting the model with lognormal, gamma, exponential, etc.
components.
Thanks in advance,
Warm regards,
Ms. Karunambigai M
PhD Scholar
Dept. of Biostatistics
NIMHANS
Bangalore
India 



Re: [R] clustering with finite mixture model

2011-02-02 Thread Matt Shotwell

There are quite a few packages that work with finite mixtures, as 
evidenced by the descriptions here:


http://cran.r-project.org/web/packages/index.html

These might be useful:

http://cran.r-project.org/web/packages/flexmix/index.html
http://cran.r-project.org/web/packages/mclust/index.html
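
To make the idea concrete, here is a rough base-R sketch of what such packages do internally: an EM fit of a two-component univariate normal mixture. It is only an illustration under simple assumptions (two components, well-separated simulated data); the function name em_two_normals and all variable names are made up, and the packages above are the better choice for real work.

```r
# EM for a two-component univariate Gaussian mixture (illustration only).
em_two_normals <- function(x, iters = 200) {
  mu <- as.numeric(quantile(x, c(0.25, 0.75)))  # crude starting means
  sigma <- rep(sd(x), 2)
  p <- c(0.5, 0.5)                              # mixing proportions
  for (it in seq_len(iters)) {
    # E-step: posterior probability that each point belongs to component 1
    d1 <- p[1] * dnorm(x, mu[1], sigma[1])
    d2 <- p[2] * dnorm(x, mu[2], sigma[2])
    r1 <- d1 / (d1 + d2)
    r2 <- 1 - r1
    # M-step: weighted updates of proportions, means and standard deviations
    p <- c(mean(r1), mean(r2))
    mu <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
    sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
               sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
  }
  list(p = p, mu = mu, sigma = sigma,
       cluster = ifelse(r1 >= 0.5, 1L, 2L))     # hard assignment per point
}

set.seed(1)
x <- c(rnorm(200, mean = 0), rnorm(200, mean = 5))  # simulated data
fit <- em_two_normals(x)
fit$mu
table(fit$cluster)
```

Fitting other component families (lognormal, gamma, exponential) follows the same E-step with a different density, but the M-step changes with the family.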

-Matt

On 02/02/2011 04:28 AM, karuna m wrote:

Dear R-help,
I am doing clustering via a finite mixture model. Please suggest some packages
in R for finding clusters via finite mixture models with continuous variables.
I would also like to verify the distributional properties of the mixture
distributions by fitting the model with lognormal, gamma, exponential, etc.
components.
Thanks in advance,
Warm regards,
Ms. Karunambigai M
PhD Scholar
Dept. of Biostatistics
NIMHANS
Bangalore
India





--
Matthew S Shotwell   Assistant Professor   School of Medicine
 Department of Biostatistics   Vanderbilt University


Re: [R] clustering fuzzy

2011-02-02 Thread pete


-- 
View this message in context: 
http://r.789695.n4.nabble.com/clustering-fuzzy-tp3229853p3255223.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] clustering fuzzy

2011-01-22 Thread pete


I need to compute an index (the fuzzy silhouette), which is a weighted average
of the crisp silhouette s of every row (i). The weight of each term is the
difference between the membership degrees of the corresponding object in its
first and second best-matching fuzzy clusters.
So I need the difference between the values in the first and second columns
for every row (i), multiplied by the crisp silhouette index s of that unit
(row), and the sum of these products divided by the sum of all the differences.
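
The weighted average described above can be sketched as follows. This is only an illustration: the function name fuzzy_silhouette is made up, the silhouette values in s are invented toy numbers, and the code assumes that s is in the same row order as the membership matrix U.

```r
# Fuzzy silhouette as a weighted average of crisp silhouette widths.
# U: membership matrix (objects in rows, clusters in columns)
# s: crisp silhouette width of each object, in the same row order as U
fuzzy_silhouette <- function(U, s) {
  U_sorted <- t(apply(U, 1, sort, decreasing = TRUE))
  w <- U_sorted[, 1] - U_sorted[, 2]   # largest minus second-largest membership
  sum(w * s) / sum(w)
}

# Toy data: six rows of membership degrees with made-up silhouette widths.
U <- rbind(c(0.66, 0.04, 0.01, 0.30),
           c(0.02, 0.89, 0.09, 0.00),
           c(0.06, 0.92, 0.01, 0.01),
           c(0.07, 0.71, 0.21, 0.01),
           c(0.10, 0.85, 0.04, 0.01),
           c(0.91, 0.04, 0.02, 0.02))
s <- c(0.28, 0.08, -0.12, 0.15, 0.15, 0.57)   # illustrative values only
fuzzy_silhouette(U, s)

# With a fanny() result whose membership matrix has row names, the silhouette
# widths would first have to be put back into the original row order, e.g.
# s <- ris$silinfo$widths[rownames(ris$membership), "sil_width"]
```

Calling this once per K (2 to K.max) and collecting the results would give the vector of indexes, one per clustering.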

 
-- 
View this message in context: 
http://r.789695.n4.nabble.com/clustering-fuzzy-tp3229853p3231477.html
Sent from the R help mailing list archive at Nabble.com.


[R] clustering fuzzy

2011-01-21 Thread pete


Hello,
I'm Pete. How can I sort the rows of a matrix from max to min value?
I have a matrix of membership degrees with 82 rows (i) and K columns, where K
is the number of clusters.
I need the first and second largest elements of the i-th row.

For example:
1  0.66 0.04 0.01 0.30
2  0.02 0.89 0.09 0.00
3  0.06 0.92 0.01 0.01
4  0.07 0.71 0.21 0.01
5  0.10 0.85 0.04 0.01
6  0.91 0.04 0.02 0.02
7  0.00 0.01 0.98 0.00
8  0.02 0.05 0.92 0.01
9  0.05 0.54 0.40 0.01
10 0.02 0.06 0.92 0.00
11 0.05 0.55 0.39 0.01
12 0.77 0.02 0.01 0.20
13 0.95 0.01 0.00 0.04
14 0.43 0.33 0.18 0.06
15 0.79 0.10 0.08 0.03
18 0.02 0.04 0.94 0.00
20 0.09 0.15 0.76 0.01
21 0.80 0.10 0.07 0.03
22 0.06 0.15 0.79 0.01
23 0.05 0.01 0.00 0.94
24 0.83 0.02 0.01 0.15
25 0.87 0.05 0.03 0.04
27 0.76 0.10 0.11 0.03
28 0.17 0.68 0.10 0.05
29 0.10 0.01 0.00 0.90
30 0.09 0.29 0.60 0.01
31 0.05 0.01 0.00 0.94
32 0.53 0.04 0.01 0.43
33 0.85 0.04 0.02 0.09
34 0.82 0.06 0.02 0.10
35 0.76 0.07 0.02 0.14
37 0.36 0.31 0.30 0.02
38 0.01 0.02 0.97 0.00
39 0.12 0.04 0.02 0.82
40 0.02 0.00 0.00 0.97
41 0.57 0.15 0.02 0.25
42 0.14 0.03 0.02 0.82
43 0.89 0.06 0.01 0.03
44 0.02 0.00 0.00 0.98
45 0.61 0.02 0.01 0.36
46 0.03 0.00 0.00 0.97
47 0.88 0.07 0.02 0.03
48 0.06 0.60 0.32 0.02
49 0.01 0.98 0.01 0.00
50 0.06 0.88 0.05 0.01
51 0.01 0.05 0.93 0.00
52 0.02 0.08 0.90 0.00
53 0.11 0.01 0.01 0.87
54 0.27 0.01 0.00 0.72
55 0.94 0.03 0.01 0.02
58 0.45 0.41 0.05 0.09
59 0.12 0.61 0.22 0.05
60 0.26 0.07 0.02 0.64
61 0.17 0.19 0.62 0.02
62 0.08 0.00 0.00 0.92
63 0.02 0.94 0.03 0.00
64 0.08 0.01 0.00 0.91
65 0.98 0.01 0.00 0.01
67 0.22 0.69 0.08 0.01
68 0.96 0.02 0.00 0.02
69 0.96 0.02 0.01 0.01
71 0.00 0.01 0.98 0.00
72 0.56 0.05 0.01 0.37
73 0.10 0.01 0.01 0.88
74 0.91 0.01 0.00 0.08
75 0.36 0.38 0.21 0.05
76 0.15 0.40 0.44 0.01
77 0.02 0.06 0.91 0.00
78 0.48 0.43 0.03 0.06
79 0.51 0.02 0.01 0.45
80 0.04 0.01 0.00 0.95
81 0.47 0.03 0.01 0.49
82 0.98 0.01 0.00 0.01
83 0.05 0.01 0.01 0.93
84 0.03 0.00 0.00 0.96
85 0.76 0.07 0.01 0.15
86 0.95 0.03 0.01 0.01
88 0.03 0.00 0.00 0.96
90 0.79 0.13 0.02 0.06
91 0.37 0.50 0.05 0.09
92 0.86 0.10 0.02 0.02
93 0.13 0.82 0.03 0.01


> A[1,][order(A[1,],decreasing=TRUE)]
[1] 0.66 0.30 0.04 0.01

I want this for every row.
Thank you.
-- 
View this message in context: 
http://r.789695.n4.nabble.com/clustering-fuzzy-tp3229853p3229853.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] clustering fuzzy

2011-01-21 Thread jim holtman

use 'apply':

> head(x.m)
       V2   V3   V4   V5
[1,] 0.66 0.04 0.01 0.30
[2,] 0.02 0.89 0.09 0.00
[3,] 0.06 0.92 0.01 0.01
[4,] 0.07 0.71 0.21 0.01
[5,] 0.10 0.85 0.04 0.01
[6,] 0.91 0.04 0.02 0.02
> x.m.sort <- apply(x.m, 1, sort, decreasing = TRUE)
> head(t(x.m.sort))
     [,1] [,2] [,3] [,4]
[1,] 0.66 0.30 0.04 0.01
[2,] 0.89 0.09 0.02 0.00
[3,] 0.92 0.06 0.01 0.01
[4,] 0.71 0.21 0.07 0.01
[5,] 0.85 0.10 0.04 0.01
[6,] 0.91 0.04 0.02 0.02
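
A self-contained version of the same idea, as a small sketch: apply() over rows returns each sorted row as a column, which is why the result has to be transposed back. The object names A, A_sort, top2 and best_cluster are illustrative only.

```r
# Three rows of the membership table from the question.
A <- rbind(
  c(0.66, 0.04, 0.01, 0.30),
  c(0.02, 0.89, 0.09, 0.00),
  c(0.06, 0.92, 0.01, 0.01)
)

# apply() returns one sorted row per column, so transpose back.
A_sort <- t(apply(A, 1, sort, decreasing = TRUE))

top2 <- A_sort[, 1:2]                               # largest and second largest per row
best_cluster <- max.col(A, ties.method = "first")   # column holding the largest value
```

Sorting loses the information about which cluster each value came from, so best_cluster is kept separately.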




-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?


Re: [R] clustering fuzzy

2011-01-21 Thread pete


Thank you, you have been very kind.
-- 
View this message in context: 
http://r.789695.n4.nabble.com/clustering-fuzzy-tp3229853p3230228.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] clustering association rules

2010-11-11 Thread Michael Hahsler


Jüri,

How did you create the output?

An example to cluster transactions with arules can be found in:

Michael Hahsler and Kurt Hornik. Building on the arules infrastructure 
for analyzing transaction data with R. In R. Decker and H.-J. Lenz, 
editors, /Advances in Data Analysis, Proceedings of the 30th Annual 
Conference of the Gesellschaft für Klassifikation e.V., Freie 
Universität Berlin, March 8-10, 2006/, Studies in Classification, Data 
Analysis, and Knowledge Organization, pages 449-456. Springer-Verlag, 2007.


URL: http://michael.hahsler.net/research/arules_gfkl2006/arules_gfkl2006.pdf

Hope this helps,
-Michael

--
  Dr. Michael Hahsler, Visiting Assistant Professor
  Department of Computer Science and Engineering
  Lyle School of Engineering
  Southern Methodist University, Dallas, Texas

  (214) 768-8878 * mhahs...@lyle.smu.edu * http://lyle.smu.edu/~mhahsler


[R] clustering association rules

2010-11-10 Thread Kuusik , Jüri


Hello.

I have a general question regarding the clustering of association rules.
According to section 4.7, "Distance based clustering transactions and
associations", of
http://cran.r-project.org/web/packages/arules/vignettes/arules.pdf
it is possible to create clusters of association rules.
I do not understand how I should interpret clusters of rules.

Consider the following association rules and clusters (see table).
My final goal is to have clusters of items.
Does it mean that items 11180, 10378, 52848, 11451, ..., 50417 are all in
cluster 1?
What is the point of clustering association rules, and how should I use the
rule clusters to cluster the items that appear in the rules?

rules                         support      confidence   lift         cluster
{11180} => {10378}            0.000529646  1            1708.238095  1
{10378} => {11180}            0.000529646  0.904761905  1708.238095  1
{52848} => {11451}            0.000557522  1            61.74354561  1
{50417} => {22939}            0.000543584  1            1304.472727  1
{22939} => {50417}            0.000543584  0.709090909  1304.472727  1
{10838} => {540}              0.00050177   1            21.07696827  2
{10924} => {2930}             0.000599337  0.95556      117.5939775  3
{28113} => {540}              0.000669027  0.941176471  19.83714661  2
{11483} => {540}              0.00057146   1            21.07696827  2
{93799} => {13041}            0.000585398  1            1024.942857  4
{13041} => {93799}            0.000585398  0.6          1024.942857  4
{8482} => {540}               0.00050177   0.87804878   18.50660629  2
{16018} => {540}              0.00057146   0.931818182  19.63990225  2
{6837} => {540}               0.000515708  0.925        19.49619565  2
{7709,8699,94762} => {94764}  0.001031416  1            564.9291339  5
{8699,94762,94764} => {7709}  0.001031416  1            658.2201835  5
{7709,8699,94764} => {94762}  0.001031416  0.98667      786.5487407  5
{7709,94762,94764} => {8699}  0.001031416  1            874.9512195  5
{7709,8699,94762} => {410}    0.001031416  1            203.2464589  6
{410,8699,94762} => {7709}    0.001031416  0.98667      649.4439144  6
{410,7709,8699} => {94762}    0.001031416  0.98667      786.5487407  6
{410,7709,94762} => {8699}    0.001031416  1            874.9512195  6
{7709,8699,94762} => {2883}   0.000919912  0.891891892  59.69186164  7
{2883,8699,94762} => {7709}   0.000919912  1            658.2201835  7
{2883,7709,8699} => {94762}   0.000919912  1            797.178      7
{2883,7709,94762} => {8699}   0.000919912  1            874.9512195  7
{8699,94762,94764} => {410}   0.001031416  1            203.2464589  5
{410,8699,94762} => {94764}   0.001031416  0.98667      557.3967454  5
{410,8699,94764} => {94762}   0.001031416  0.98667      786.5487407  5


Best regards,

Jüri Kuusik





__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Clustering

2010-10-30 Thread dpender



David Winsemius wrote:
> On Oct 29, 2010, at 12:08 PM, David Winsemius wrote:
>> On Oct 29, 2010, at 11:37 AM, dpender wrote:
>>> Apologies for being vague,
>>>
>>> The structure of the output is as follows:
>>
>> Still no code?

I am using the Clusters function from the evd package

>>> $ cluster1  : Named num [1:131] 3.05 2.71 3.26 2.91 2.88 3.11 3.21 -1 2.97 3.39 ...
>>>  ..- attr(*, "names")= chr [1:131] "6667" "6668" "6669" "6670" ...
>>>
>>> With 613 clusters.  What I require is abstracting the first and
>>> last value of
>>>
>>> - attr(*, "names")= chr [1:131] "6667" "6668" "6669" "6670"
>>
>> Those values are in an attribute:
>
> Corrections:
>
>> ? attribute
>
> ?attributes
>
>> ? attr
>>
>> Your specific request may (perhaps) be addressed by something like:
>>
>> attrnames <- attr(objname["cluster1"], "names")
>                           ^          ^   should be doubled square-brackets

This works to abstract the part that I am looking for but in order to loop
this over every cluster I need an output object of the same form as
clusters to write the names to.

>> attrnames[c(1, length(attrnames)]
>                                  ^  missing right-paren
>
> Might work:
> attrnames <- attr(clusobj[["cluster1"]], "names")
> attrnames[c(1, length(attrnames))]
> --
> David Winsemius, MD
> West Hartford, CT
 
 
 


Additionally I can get the output as a matrix in form

 atomic [1:613] 3.01 4.1 3.04 3.81 3.55 3.37 3.09 4.1 3.61 6.36 ...
 - attr(*, "acs")= num 47.6

where acs is the average size.  Each height value in the vector has a
corresponding number relating to the location in the dataset.  When I change
the vector to matrix this looks like the row name but it isn't as
rownames(clusters) yields NULL.  

Do you have any idea how to abstract these values?

Doug
-- 
View this message in context: 
http://r.789695.n4.nabble.com/Clustering-tp3017056p3020216.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] Clustering

2010-10-30 Thread David Winsemius



On Oct 30, 2010, at 7:49 AM, dpender wrote:


> David Winsemius wrote:
>> On Oct 29, 2010, at 12:08 PM, David Winsemius wrote:
>>> On Oct 29, 2010, at 11:37 AM, dpender wrote:
>>>> Apologies for being vague,
>>>>
>>>> The structure of the output is as follows:
>>>
>>> Still no code?
>
> I am using the Clusters function from the evd package
>
>>>> $ cluster1  : Named num [1:131] 3.05 2.71 3.26 2.91 2.88 3.11 3.21 -1 2.97 3.39 ...
>>>>  ..- attr(*, "names")= chr [1:131] "6667" "6668" "6669" "6670" ...
>>>>
>>>> With 613 clusters.  What I require is abstracting the first and
>>>> last value of
>>>>
>>>> - attr(*, "names")= chr [1:131] "6667" "6668" "6669" "6670"
>>>
>>> Those values are in an attribute:
>>
>> Corrections:
>>
>>> ? attribute
>>
>> ?attributes
>>
>>> ? attr
>>> Your specific request may (perhaps) be addressed by something like:
>>> attrnames <- attr(objname["cluster1"], "names")
>>                           ^          ^   should be doubled square-brackets
>
> This works to abstract the part that I am looking for but in order
> to loop this over every cluster I need an output object of the same
> form as clusters to write the names to.

That is rather difficult to implement since the phrase "same form as
the clusters" is still undetermined in the absence of full output from
str() or an actual data object. The help page for the clusters
function (not Clusters, BTW) could be used for a concrete example:

require(evd)
data(portpirie)
clusobj <- clusters(portpirie, 4.2, 3)
lapply(clusobj, attr, "names")
nclusters <- length(clusobj)
# This gives the locations (in the names) and values at the beginning
# and end of the 6 clusters

lapply(clusobj, function(x) c(head(x,1), tail(x,1)))
$cluster1
   9   12
4.36 4.69

$cluster2
  20   26
4.25 4.37

$cluster3
  31   31
4.55 4.55

$cluster4
  38   43
4.21 4.21

$cluster5
  58   59
4.33 4.55

$cluster6
  65   65
4.33 4.33

# If you used sapply you could get the values as a matrix:

sapply(clusobj, function(x) c(head(x,1), tail(x,1)))
   cluster1 cluster2 cluster3 cluster4 cluster5 cluster6
9      4.36     4.25     4.55     4.21     4.33     4.33
12     4.69     4.37     4.55     4.21     4.55     4.33

# (I don't know what the 9 and 12 represent.)

# You can also get the sequence boundary locations in a (character)
# matrix:

sapply(clusobj, function(x) names(c(head(x,1), tail(x,1))))
     cluster1 cluster2 cluster3 cluster4 cluster5 cluster6
[1,] "9"      "20"     "31"     "38"     "58"     "65"
[2,] "12"     "26"     "31"     "43"     "59"     "65"

>>> attrnames[c(1, length(attrnames)]
>>                                  ^  missing right-paren
>>
>> Might work:
>> attrnames <- attr(clusobj[["cluster1"]], "names")
>> attrnames[c(1, length(attrnames))]
>> --
>> David Winsemius, MD
>> West Hartford, CT






> Additionally I can get the output as a matrix in form
>
> atomic [1:613] 3.01 4.1 3.04 3.81 3.55 3.37 3.09 4.1 3.61 6.36 ...
> - attr(*, "acs")= num 47.6
>
> where acs is the average size.  Each height value in the vector
> has a corresponding number relating to the location in the dataset.

Better would be to tell us _how_ each height value has a
corresponding number relating to the location. It is not apparent
from the above. Some other object you are not naming or describing for
us?

> When I change
> the vector to matrix this looks like the row name but it isn't as
> rownames(clusters) yields NULL.
>
> Do you have any idea how to abstract these values?

First you need to figure out what these are. Better than guessing and
applying extractor functions would be to use str() and class() on the
result ... and for Pete's sake, include the full console output
rather than your guess at what is needed.

> Doug
--


David Winsemius, MD
West Hartford, CT
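
A minimal base-R sketch of the extraction discussed in this message, in case it helps to see start, end, duration and spacing computed in one go. The list `clusobj` below is a made-up stand-in for what evd's clusters() returns (a list of named numeric vectors whose names are positions in the series); the names `bounds`, `duration` and `spacing` are only illustrative.

```r
# Stand-in for the list returned by evd::clusters(): each element is a
# named numeric vector whose names are positions in the original series.
# (Values and names are invented for illustration.)
clusobj <- list(
  cluster1 = c("9" = 4.36, "10" = 4.40, "12" = 4.69),
  cluster2 = c("20" = 4.25, "26" = 4.37),
  cluster3 = c("31" = 4.55)
)

# First and last position of each cluster; names are character, so
# coerce them to numeric before doing arithmetic.
bounds <- t(sapply(clusobj, function(x) {
  pos <- as.numeric(names(x))
  c(start = pos[1], end = pos[length(pos)])
}))

# Duration: number of time steps spanned by each cluster
duration <- bounds[, "end"] - bounds[, "start"] + 1

# Spacing: gap between the end of one cluster and the start of the next
spacing <- bounds[-1, "start"] - bounds[-nrow(bounds), "end"]

print(bounds)
print(duration)
print(spacing)
```

With the real object, replacing the stand-in by the result of clusters() should be enough, provided its elements carry position names as in the str() output quoted above.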


Re: [R] Clustering

2010-10-29 Thread dpender



That's helpful but the reason I'm using clusters in evd is that I need to
specify a time condition to ensure independence.  

I therefore have an output in the form Cluster[[i]][j-k]  where i is the
cluster number and j-k is the range of values above the threshold taking
account of the time condition.

From this I can get durations easily enough but the spacing is proving quite
difficult.

The data is for ocean waves and therefore it may be possible that the wave
height drops below the threshold for a short period but should still be
considered part of the same event, hence the time condition.

Hope this clarifies the problem.

Doug
-- 
View this message in context: 
http://r.789695.n4.nabble.com/Clustering-tp3017056p3018744.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] Clustering

2010-10-29 Thread David Winsemius



On Oct 29, 2010, at 5:14 AM, dpender wrote:




> That's helpful but the reason I'm using clusters in evd is that I
> need to specify a time condition to ensure independence.

I believe this is the first we heard about any particular function or
package.

> I therefore have an output

We would need to know the code that produced that output and either
the data to which that code was applied or the structure of the
output. (Use the str() function)

> in the form Cluster[[i]][j-k]  where i is the
> cluster number and j-k is the range of values above the threshold
> taking account of the time condition.

Unless you find someone who uses that package in the manner you have,
you will need to explain in much greater detail than you have so far.

> From this I can get durations easily enough but the spacing is
> proving quite difficult.

There are quite possibly methods using rle() or possibly something
like rollapply() from the zoo package, but you need to provide a
specific and richer test case.

> The data is for ocean waves and therefore it may be possible that
> the wave height drops below the threshold for a short period but
> should still be considered part of the same event, hence the time
> condition.
>
> Hope this clarifies the problem.

It clarifies it to the extent that it shows how much more you will need
to further clarify.


--

David Winsemius, MD
West Hartford, CT


Re: [R] Clustering

2010-10-29 Thread dpender


 
Apologies for being vague,

The structure of the output is as follows:

$ cluster1  : Named num [1:131] 3.05 2.71 3.26 2.91 2.88 3.11 3.21 -1 2.97 3.39 ...
  ..- attr(*, "names")= chr [1:131] "6667" "6668" "6669" "6670" ...

With 613 clusters.  What I require is abstracting the first and last value
of

- attr(*, "names")= chr [1:131] "6667" "6668" "6669" "6670"

This will give a start and end point of the cluster and allow for the
spacing to be determined.

Thanks,

Doug
-- 
View this message in context: 
http://r.789695.n4.nabble.com/Clustering-tp3017056p3019323.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] Clustering

2010-10-29 Thread David Winsemius



On Oct 29, 2010, at 11:37 AM, dpender wrote:




> Apologies for being vague,
>
> The structure of the output is as follows:

Still no code?

> $ cluster1  : Named num [1:131] 3.05 2.71 3.26 2.91 2.88 3.11 3.21 -1 2.97 3.39 ...
>  ..- attr(*, "names")= chr [1:131] "6667" "6668" "6669" "6670" ...
>
> With 613 clusters.  What I require is abstracting the first and last
> value of
>
> - attr(*, "names")= chr [1:131] "6667" "6668" "6669" "6670"

Those values are in an attribute:

? attribute
? attr

Your specific request may (perhaps) be addressed by something like:

attrnames <- attr(objname["cluster1"], "names")
attrnames[c(1, length(attrnames)]

I don't really think this is optimal. For one thing it won't
generalize to the rest of the clusters. Generally when data are put
into an attribute, the programmer also provides an extraction
function. Since Nabble posters are prone to not retaining thread
context, the name of your function and package are probably elsewhere
further up the thread but not available as I read this on a
mail-client. Perhaps if you read more of the documentation? (Just a guess.)

> This will give a start and end point of the cluster and allow for the
> spacing to be determined.

Those would be character values and, if you want to do calculations,
would obviously need to be coerced to numeric.

> Thanks,
>
> Doug
--


David Winsemius, MD
West Hartford, CT


Re: [R] Clustering

2010-10-29 Thread David Winsemius



On Oct 29, 2010, at 12:08 PM, David Winsemius wrote:



On Oct 29, 2010, at 11:37 AM, dpender wrote:


>> Apologies for being vague,
>>
>> The structure of the output is as follows:
>
> Still no code?
>
>> $ cluster1  : Named num [1:131] 3.05 2.71 3.26 2.91 2.88 3.11 3.21 -1 2.97 3.39 ...
>>  ..- attr(*, "names")= chr [1:131] "6667" "6668" "6669" "6670" ...
>>
>> With 613 clusters.  What I require is abstracting the first and
>> last value of
>>
>> - attr(*, "names")= chr [1:131] "6667" "6668" "6669" "6670"
>
> Those values are in an attribute:

Corrections:

> ? attribute

?attributes

> ? attr
>
> Your specific request may (perhaps) be addressed by something like:
>
> attrnames <- attr(objname["cluster1"], "names")
                           ^          ^   should be doubled square-brackets

> attrnames[c(1, length(attrnames)]
                                  ^  missing right-paren

Might work:
attrnames <- attr(clusobj[["cluster1"]], "names")
attrnames[c(1, length(attrnames))]
--

David Winsemius, MD
West Hartford, CT


[R] Clustering

2010-10-28 Thread dpender


I am looking to use R in order to determine the number of extreme events for
a high frequency (20 minutes) dataset of wave heights that spans 25 years
(657,432 data points).

I require the number, spacing and duration of the extreme events as an
output.

I have briefly used the clusters function in evd package.

Can anyone suggest a more appropriate package to use for such a large
dataset?

Thanks,

Doug

-- 
View this message in context: 
http://r.789695.n4.nabble.com/Clustering-tp3017056p3017056.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] Clustering

2010-10-28 Thread Albyn Jones

I have worked with seismic data measured at 100 Hz, and had no trouble
locating events in long records (several times the size of your
dataset).  20 minutes is high frequency?  What kind of waves are
these?  What is the wavelength? Some details would help.

albyn

On Thu, Oct 28, 2010 at 05:00:10AM -0700, dpender wrote:
 
> I am looking to use R in order to determine the number of extreme events for
> a high frequency (20 minutes) dataset of wave heights that spans 25 years
> (657,432) data points.
>
> I require the number, spacing and duration of the extreme events as an
> output.
>
> I have briefly used the clusters function in evd package.
>
> Can anyone suggest a more appropriate package to use for such a large
> dataset?
>
> Thanks,
>
> Doug
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Clustering-tp3017056p3017056.html
> Sent from the R help mailing list archive at Nabble.com.
 
 

-- 
Albyn Jones
Reed College
jo...@reed.edu


Re: [R] Clustering

2010-10-28 Thread David Winsemius



On Oct 28, 2010, at 8:00 AM, dpender wrote:



> I am looking to use R in order to determine the number of extreme
> events for a high frequency (20 minutes) dataset of wave heights that
> spans 25 years (657,432) data points.
>
> I require the number, spacing and duration of the extreme events as an
> output.


If you created a test vector and then used rle on the test, you
may get what you want.

This yields the intervals between events (an event being a value
greater than 0.9):

 wave <- runif(100)
 test <- wave > 0.9
 rle(test)
Run Length Encoding
  lengths: int [1:11] 74 1 5 1 1 1 6 1 4 1 ...
  values : logi [1:11] FALSE TRUE FALSE TRUE FALSE TRUE ...
 rle(test)$lengths[ !rle(test)$values ]
[1] 74  5  1  6  4  5

You can also get the duration of an extreme event by not using the  
negation of the values. (Sorry for the double-negative.)
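
The rle() idea above can be pushed a little further to get number, duration, spacing and start positions together. This is only a sketch on simulated data: the seed, the 0.9 threshold and the variable names are arbitrary, and it does not merge events separated by short dips below the threshold (the time condition that evd's clusters() provides).

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
wave <- runif(100)               # simulated stand-in for wave heights
test <- wave > 0.9               # TRUE where the threshold is exceeded

runs <- rle(test)

n_events <- sum(runs$values)              # number of runs above threshold
duration <- runs$lengths[runs$values]     # length of each run above threshold
gaps     <- runs$lengths[!runs$values]    # length of each run below threshold

# Start index of every run, keeping only the runs above threshold
starts <- cumsum(c(1, head(runs$lengths, -1)))[runs$values]

print(n_events)
print(data.frame(start = starts, duration = duration))
print(gaps)
```

Note that `gaps` includes any leading or trailing stretch below the threshold, so the first and last entries are not necessarily intervals between two events.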

--
David.



> I have briefly used the clusters function in evd package.
>
> Can anyone suggest a more appropriate package to use for such a large
> dataset?



David Winsemius, MD
West Hartford, CT


[R] clustering on scaled dataset or not?

2010-10-28 Thread array chip

Hi, just a general question: when we do hierarchical clustering, should we
compute the dissimilarity matrix based on the scaled or the non-scaled dataset?
daisy() in the cluster package allows standardizing the variables before
calculating the dissimilarity matrix, but dist() doesn't have that option at
all. I would appreciate it if you could share your thoughts.

Thanks

John



  

Re: [R] clustering on scaled dataset or not?

2010-10-28 Thread Claudia Beleites


John,



> Hi, just a general question: when we do hierarchical clustering, should we
> compute the dissimilarity matrix based on scaled dataset or non-scaled dataset?
> daisy() in cluster package allow standardizing the variables before calculating
> dissimilarity matrix;


I'd say that should depend on your data.

- If your data is all (physically) different kinds of things (and thus
different orders of magnitude), then you should probably scale.


- On the other hand, I cluster spectra. Thus my variates all have the
same unit, and moreover I'd be afraid that scaling would blow up
noise-only variates (i.e. the spectra do have low or no intensity
regions), thus I usually don't scale.


- It also depends on your distance. E.g. Mahalanobis should do the
scaling by itself, if I think correctly at this time of the day...


What I do frequently, though, is subtract something like the minimum
spectrum (in practice, I calculate the 5th percentile for each variate;
it's less noisy). You can also center, but I'm strongly for having a
physical meaning, and for my samples the minimum spectrum is
better interpretable (it represents the matrix composition).



> but dist() doesn't have that option at all. Appreciate if
> you can share your thoughts?

but you could call scale () and then dist ().
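
A small sketch of that scale()-then-dist() route, using the built-in USArrests data purely as an example of variables on different scales. The choice of average linkage and k = 4 is arbitrary.

```r
x <- as.matrix(USArrests)        # built-in data; columns have different ranges

d_raw    <- dist(x)              # dominated by the column with the largest spread
d_scaled <- dist(scale(x))       # each column centred and divided by its sd

hc_raw    <- hclust(d_raw, method = "average")
hc_scaled <- hclust(d_scaled, method = "average")

# Cross-tabulate the 4-cluster solutions to see how much scaling matters
print(table(raw = cutree(hc_raw, k = 4), scaled = cutree(hc_scaled, k = 4)))
```

Whether the scaled or the unscaled version is appropriate still depends on the data, as discussed above.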

Claudia




> Thanks
>
> John





[R] Clustering with ordinal data

2010-10-19 Thread Steve_Friedman


Hello

I've been asked to help evaluate a vegetation data set, specifically to
examine it for community similarity. The initial problem I see is that the
data is ordinal.  At best this only captures a relative ranking of
abundance, and ordinal ranks are assigned after data collection.  I've
been trying to find a procedure in R that can handle ordinal-based
classification and so far have not found one.

Does one exist?  If there is one, which package supports this type of
analysis and what is the function?

Thanks in advance.
Steve




Steve Friedman Ph. D.
Spatial Statistical Analyst
Everglades and Dry Tortugas National Park
950 N Krome Ave (3rd Floor)
Homestead, Florida 33034

steve_fried...@nps.gov
Office (305) 224 - 4282
Fax (305) 224 - 4147


Re: [R] Clustering with ordinal data

2010-10-19 Thread Phil Spector


Steve -
   Take a look at daisy() in the cluster package.
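
A minimal sketch of what that could look like, assuming the ordinal scores are stored as ordered factors. The data frame `veg`, its cover-class values and the species names are all invented for illustration.

```r
library(cluster)   # recommended package, shipped with most R installations

# Hypothetical cover-class scores (1 = rare ... 5 = dominant) for 6 plots
# and 3 species; the values are invented for illustration.
lev <- 1:5
veg <- data.frame(
  sp_a = factor(c(1, 2, 5, 4, 1, 3), levels = lev, ordered = TRUE),
  sp_b = factor(c(5, 4, 1, 2, 5, 3), levels = lev, ordered = TRUE),
  sp_c = factor(c(2, 2, 4, 5, 1, 3), levels = lev, ordered = TRUE),
  row.names = paste0("plot", 1:6)
)

# For ordered factors daisy() works on the ranks and, with Gower's
# coefficient, returns dissimilarities between 0 and 1.
d <- daisy(veg, metric = "gower")

# Any method that accepts a dist object can be used from here on
hc <- hclust(as.dist(d), method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

The same dissimilarity object can also be passed to pam() or agnes() from the cluster package.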

- Phil Spector
 Statistical Computing Facility
 Department of Statistics
 UC Berkeley
 spec...@stat.berkeley.edu

On Tue, 19 Oct 2010, steve_fried...@nps.gov wrote:



> Hello
>
> I've been asked to help evaluate a vegetation data set, specifically to
> examine it for community similarity. The initial problem I see is that the
> data is ordinal.   At best this only captures a relative ranking of
> abundance and ordinal ranks are assigned after data collection.  I've
> been trying to find a procedure in R that can handle ordinal based
> classification and so far have not found one.
>
> Does one exist ?  If there is one, which package supports this type of
> analysis and  what is the function ?
>
> Thanks in advance.
> Steve
>
> Steve Friedman Ph. D.
> Spatial Statistical Analyst
> Everglades and Dry Tortugas National Park
> 950 N Krome Ave (3rd Floor)
> Homestead, Florida 33034
>
> steve_fried...@nps.gov
> Office (305) 224 - 4282
> Fax (305) 224 - 4147


Re: [R] Clustering with ordinal data

2010-10-19 Thread Steve_Friedman

Thanks Phil,

I'll do so now.

Much appreciated.

Steve

Steve Friedman Ph. D.
Spatial Statistical Analyst
Everglades and Dry Tortugas National Park
950 N Krome Ave (3rd Floor)
Homestead, Florida 33034

steve_fried...@nps.gov
Office (305) 224 - 4282
Fax (305) 224 - 4147


   
From: Phil Spector, spec...@stat.berkeley.edu
Date: 10/19/2010 02:23 PM
To: steve_fried...@nps.gov
cc: r-help@r-project.org
Subject: Re: [R] Clustering with ordinal data

> Steve -
> Take a look at daisy() in the cluster package.
>
> - Phil Spector
>   Statistical Computing Facility
>   Department of Statistics
>   UC Berkeley
>   spec...@stat.berkeley.edu
>
> On Tue, 19 Oct 2010, steve_fried...@nps.gov wrote:
>
>> Hello
>>
>> I've been asked to help evaluate a vegetation data set, specifically to
>> examine it for community similarity. The initial problem I see is that
>> the data is ordinal.   At best this only captures a relative ranking of
>> abundance and ordinal ranks are assigned after data collection.  I've
>> been trying to find a procedure in R that can handle ordinal based
>> classification and so far have not found one.
>>
>> Does one exist ?  If there is one, which package supports this type of
>> analysis and  what is the function ?
>>
>> Thanks in advance.
>> Steve
>>
>> Steve Friedman Ph. D.
>> Spatial Statistical Analyst
>> Everglades and Dry Tortugas National Park
>> 950 N Krome Ave (3rd Floor)
>> Homestead, Florida 33034
>>
>> steve_fried...@nps.gov
>> Office (305) 224 - 4282
>> Fax (305) 224 - 4147


Re: [R] Clustering with ordinal data

2010-10-19 Thread Michael Bedward

Hello Steve,

> I've been asked to help evaluate a vegetation data set, specifically to
> examine it for community similarity. The initial problem I see is that the
> data is ordinal.   At best this only captures a relative ranking of
> abundance and ordinal ranks are assigned after data collection.

Just about every vegetation survey ever conducted uses either presence
absence or ordinal data collection (e.g. Braun-Blanquet scores or
importance scores from nested quadrats).

A large number of distance metrics are in the literature to deal with
such data. As well as Phil's suggestion you should definitely look at
the vegan package which contains a good selection of these metrics
plus numerous functions frequently used in classification and
ordination of veg data.

Michael


On 20 October 2010 05:14,  steve_fried...@nps.gov wrote:

> Hello
>
> I've been asked to help evaluate a vegetation data set, specifically to
> examine it for community similarity. The initial problem I see is that the
> data is ordinal.   At best this only captures a relative ranking of
> abundance and ordinal ranks are assigned after data collection.  I've
> been trying to find a procedure in R that can handle ordinal based
> classification and so far have not found one.
>
> Does one exist ?  If there is one, which package supports this type of
> analysis and  what is the function ?
>
> Thanks in advance.
> Steve
>
> Steve Friedman Ph. D.
> Spatial Statistical Analyst
> Everglades and Dry Tortugas National Park
> 950 N Krome Ave (3rd Floor)
> Homestead, Florida 33034
>
> steve_fried...@nps.gov
> Office (305) 224 - 4282
> Fax    (305) 224 - 4147


[R] clustering with cosine correlation

2010-10-11 Thread l.mohammadikhankahdani


Dear All

 Do you know how to make a heatmap and use cosine correlation for 
clustering? This is what my colleague can do in gene-math and I want to 
do in R but I don't know how to.

Thanks a lot
Leila


[R] Clustering groups

2010-07-21 Thread syrvn


Hi,

is there a way in R to identify the cluster methods / distance measures
that best reflect predefined cluster groups?

Given 10 observations O1...O10. Optimally, these 10 observations cluster as
follows:
cluster1: O1, O2, O3, O4
cluster2: O5, O6
cluster3: O7, O8, O9, O10.

What I want is a method that identifies which cluster method / distance
measure best reflects my predefined groups.

Is that somehow possible?
Cheers
-- 
View this message in context: 
http://r.789695.n4.nabble.com/Clustering-groups-tp2297210p2297210.html
Sent from the R help mailing list archive at Nabble.com.


[R] Clustering

2010-06-23 Thread Ralph Modjesch

Hi,

I use the following clustering methods and get the
corresponding dendrograms for single, complete, average, ward and
kmeans clustering.

This gives the dendrograms, but doesn't show how they were calculated.

My question: is it possible to show these calculation steps
(cluster steps) in matrix or graphical form?


Kind regards

Ralph Modjesch


Re: [R] Clustering

2010-06-23 Thread Tal Galili

Hi Ralph,

In case of hclust, the dendrogram does show the steps (they are the
heights presented in the graph).
You can present them also in a matrix using cutree, for example:

dat <- (USArrests)
n <- (dim(dat)[1])
hc <- hclust(dist(USArrests))
cutree(hc, k=1:n)


You might then visualize the results using a clustergram (I wrote about it
recently here:
http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
)
It would take a bit of R coding, but it's supposed to be relatively easy.

Regarding the case of kmeans, you could try the kmeans.ani function from
the {animation} package.
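
For the hclust case, the agglomeration steps themselves are stored in the fitted object, so a sketch along these lines may be a useful complement to cutree(). The data frame name `steps` and its column names are only illustrative.

```r
hc <- hclust(dist(USArrests))    # complete linkage is the default

# One row per agglomeration step. In hc$merge, negative entries are
# single observations and positive entries refer to the cluster formed
# at that earlier step; hc$height is the distance at which they merged.
steps <- data.frame(step   = seq_along(hc$height),
                    left   = hc$merge[, 1],
                    right  = hc$merge[, 2],
                    height = hc$height)
print(head(steps))

# Cluster membership at every level, from 1 cluster up to n singletons
memb <- cutree(hc, k = 1:nrow(USArrests))
print(dim(memb))
```

Reading `steps` from top to bottom follows the dendrogram from the leaves upwards.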

Hope this helps,

Cheers,
Tal

Contact
Details:---
Contact me: tal.gal...@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
--




On Wed, Jun 23, 2010 at 8:25 PM, Ralph Modjesch 
ralph.modje...@pfeiffer-koberstein-immobilien.de wrote:

> Hi,
>
> I use the following clustering methods and get the
> corresponding dendrograms for single, complete, average, ward and
> kmeans clustering.
>
> This gives the dendrograms, but doesn't show the calculation-way.
>
> My question: is there a possibility to show this calculation steps
> (cluster steps) in matrix or graphical form?
>
>
> Kind regards
>
> Ralph Modjesch


Re: [R] Clustering algorithms don't find obvious clusters

2010-06-14 Thread Henrik Aldberg

Thank you Etienne, this seems to work like a charm. Also thanks to the rest
of you for your help.


Henrik

On 11 June 2010 13:51, Cuvelier Etienne ecuscim...@gmail.com wrote:



 Hello Henrik,
 You can use a graph clustering using the igraph package.
 Example:

 library(igraph)
 simM <- NULL
 simM <- rbind(simM, c(0, 4, 0, 1))
 simM <- rbind(simM, c(6, 0, 0, 0))
 simM <- rbind(simM, c(0, 1, 0, 5))
 simM <- rbind(simM, c(0, 0, 4, 0))
 G <- graph.adjacency(simM, weighted=TRUE, mode="directed")
 plot(G, layout=layout.kamada.kawai)

 ### walktrap.community
 wt <- walktrap.community(G, modularity=TRUE)
 wmemb <- community.to.membership(G, wt$merges,
                                  steps=which.max(wt$modularity)-1)

 V(G)$color <- rainbow(3)[wmemb$membership+1]
 plot(G)

 I hope  it helps

 Etienne


Re: [R] Clustering algorithms don't find obvious clusters

2010-06-13 Thread Joris Meys

Henrik,

the methods you use are NOT applicable to directed graphs; on the
contrary, they will split up what you want to put together. In
your data, an author never cites himself. Hence, A and B are far more
different than B and D according to the techniques you use.

Please check out Etienne's solution; that is what you want.
Cheers
Joris

On Sat, Jun 12, 2010 at 8:43 PM, Henrik Aldberg
henrik.aldb...@gmail.com wrote:
 Dave,

 I used daisy with the default settings (daisy(M) where M is the matrix).


 Henrik





-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


Re: [R] Clustering algorithms don't find obvious clusters

2010-06-12 Thread Henrik Aldberg

Dave,

I used daisy with the default settings (daisy(M) where M is the matrix).


Henrik

On 11 June 2010 21:57, Dave Roberts dvr...@ecology.msu.montana.edu wrote:

 Henrik,

The clustering algorithms you refer to (and almost all others) expect
 the matrix to be symmetric.  They do not seek a graph-theoretic solution,
 but rather proximity in geometric or topological space.

How did you convert your matrix to a dissimilarity?

 Dave Roberts


Re: [R] Clustering algorithms don't find obvious clusters

2010-06-12 Thread Dave Roberts


Henrik,

Given your initial matrix, that should tell you which authors are
similar/dissimilar to which other authors in terms of which authors they
cite.  In this case authors 1 and 3 are most similar because they both
cite authors 2 and 4.  Authors 2 and 3 are most different because they
both make 6 citations but cite none of the same authors
(sqrt(6^2+5^2+1^2)=7.87).  1 and 2 are next most different because 1
makes only 5 citations and shares no cited author with 2
(sqrt(6^2+4^2+1^2)=7.28) etc.


If you want to know which authors are similar in terms of who has
cited them, simply transpose the matrix


daisy(t(M))

I'm guessing none of this is actually what you are looking for 
however, and Etienne's graph theoretic approach may be more what you 
have in mind.
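
The distances quoted above can be checked with a short sketch. It uses base R's dist() rather than cluster::daisy(), on the assumption that daisy()'s default for an all-numeric matrix is the same Euclidean distance; the author labels A to D are taken from the original post.

```r
# Citation matrix from the original post: row i = citing author, column j = cited author
M <- matrix(c(0, 4, 0, 1,
              6, 0, 0, 0,
              0, 1, 0, 5,
              0, 0, 4, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# Euclidean distances between rows: authors compared by whom they cite
d_citing <- as.matrix(dist(M))
round(d_citing, 2)

# Euclidean distances between columns: authors compared by who cites them
d_cited <- as.matrix(dist(t(M)))
round(d_cited, 2)
```

In `d_citing`, the B-C entry is sqrt(62) (about 7.87) and the A-B entry is sqrt(53) (about 7.28), matching the figures in the message, while A-C is the smallest off-diagonal value.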


Dave

David W. Roberts office 406-994-4548
Department of Ecology email drobe...@montana.edu
Montana State University
Bozeman, MT 59717-3460


Henrik Aldberg wrote:

Dave,

I used daisy with the default settings (daisy(M) where M is the matrix).


Henrik


[R] Clustering algorithms don't find obvious clusters

2010-06-11 Thread Henrik Aldberg

I have a directed graph which is represented as a matrix of the form

0 4 0 1
6 0 0 0
0 1 0 5
0 0 4 0

Each row corresponds to an author (A, B, C, D) and the values say how many
times this author has cited the other authors. Hence the first row says
that author A has cited author B four times and author D one time. Thus the
matrix represents two groups of authors, (A,B) and (C,D), who cite each
other. But there is also a weak link between the groups. In reality this
matrix is much bigger and very sparse, but it still consists of distinct
groups of authors.


My problem is that when I cluster the matrix using pam, clara or agnes, the
algorithms do not find the obvious clusters. I have tried to turn it into
a dissimilarity matrix before clustering, but that did not help either.


The layout of the clustering is not that important to me; my primary
interest is to get the right nodes into the right clusters.



Sincerely


Henrik


Re: [R] Clustering algorithms don't find obvious clusters

2010-06-11 Thread Cuvelier Etienne




On 11/06/2010 12:45, Henrik Aldberg wrote:

I have a directed graph which is represented as a matrix of the form

0 4 0 1
6 0 0 0
0 1 0 5
0 0 4 0

Each row corresponds to an author (A, B, C, D) and the values say how many
times this author has cited the other authors. Hence the first row says
that author A has cited author B four times and author D one time. Thus the
matrix represents two groups of authors, (A,B) and (C,D), who cite each
other. But there is also a weak link between the groups. In reality this
matrix is much bigger and very sparse, but it still consists of distinct
groups of authors.


My problem is that when I cluster the matrix using pam, clara or agnes, the
algorithms do not find the obvious clusters. I have tried to turn it into
a dissimilarity matrix before clustering, but that did not help either.


The layout of the clustering is not that important to me; my primary
interest is to get the right nodes into the right clusters.



   

Hello Henrik,
You can use a graph clustering using the igraph package.
Example:

library(igraph)
simM <- NULL
simM <- rbind(simM, c(0, 4, 0, 1))
simM <- rbind(simM, c(6, 0, 0, 0))
simM <- rbind(simM, c(0, 1, 0, 5))
simM <- rbind(simM, c(0, 0, 4, 0))
G <- graph.adjacency(simM, weighted=TRUE, mode="directed")
plot(G, layout=layout.kamada.kawai)

### walktrap.community
wt <- walktrap.community(G, modularity=TRUE)
wmemb <- community.to.membership(G, wt$merges,
                                 steps=which.max(wt$modularity)-1)

V(G)$color <- rainbow(3)[wmemb$membership+1]
plot(G)

I hope  it helps

Etienne
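
The function names in the example above are those of the igraph release current in 2010. For readers on a recent igraph, the following is a rough sketch of the same idea using what are, to my knowledge, the newer names: graph_from_adjacency_matrix(), cluster_walktrap() and membership(). The vertex labels A to D are added here for readability and are not in the original code. As far as I know, the walktrap method ignores edge directions, so treat this as an approximation for a directed citation graph.

```r
library(igraph)

# Citation matrix from the original post, with author labels added for illustration
M <- matrix(c(0, 4, 0, 1,
              6, 0, 0, 0,
              0, 1, 0, 5,
              0, 0, 4, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# Non-zero entries become edges; the counts are stored in the 'weight' edge attribute
G <- graph_from_adjacency_matrix(M, mode = "directed", weighted = TRUE)

# Random-walk community detection; the 'weight' attribute is used by default
wt <- cluster_walktrap(G)
memb <- membership(wt)
print(memb)

# Colour vertices by community and plot
V(G)$color <- rainbow(max(memb))[memb]
plot(G)
```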



Re: [R] Clustering algorithms don't find obvious clusters

2010-06-11 Thread Dave Roberts


Henrik,

The clustering algorithms you refer to (and almost all others) 
expect the matrix to be symmetric.  They do not seek a graph-theoretic 
solution, but rather proximity in geometric or topological space.


How did you convert your matrix to a dissimilarity?

Dave Roberts
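
To illustrate the point about symmetry, here is one possible way (not proposed by anyone in this thread, so purely an assumption) to turn the directed counts into a symmetric dissimilarity that hclust() or the cluster package can accept: add the matrix to its transpose, so that citations in either direction count as closeness, then subtract from the maximum. The direction of the citations is lost in the process, so this is only a rough sketch.

```r
# Citation matrix from the original post
M <- matrix(c(0, 4, 0, 1,
              6, 0, 0, 0,
              0, 1, 0, 5,
              0, 0, 4, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))

S <- M + t(M)          # symmetric similarity: citations in either direction
D <- max(S) - S        # more mutual citations -> smaller dissimilarity
diag(D) <- 0           # an author is at distance 0 from himself

hc <- hclust(as.dist(D), method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

On this toy matrix the two groups come out as (A,B) and (C,D); whether the same recipe is adequate for a large, sparse matrix is a separate question.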

Henrik Aldberg wrote:

I have a directed graph which is represented as a matrix of the form

0 4 0 1
6 0 0 0
0 1 0 5
0 0 4 0

Each row corresponds to an author (A, B, C, D) and the values say how many
times this author has cited the other authors. Hence the first row says
that author A has cited author B four times and author D one time. Thus the
matrix represents two groups of authors, (A,B) and (C,D), who cite each
other. But there is also a weak link between the groups. In reality this
matrix is much bigger and very sparse, but it still consists of distinct
groups of authors.


My problem is that when I cluster the matrix using pam, clara or agnes, the
algorithms do not find the obvious clusters. I have tried to turn it into
a dissimilarity matrix before clustering, but that did not help either.


The layout of the clustering is not that important to me; my primary
interest is to get the right nodes into the right clusters.



Sincerely


Henrik


Re: [R] clustering in R

2010-05-28 Thread Tal Galili

Hi Ayesha,
hclust is the way to go (much better than trying to reinvent the wheel here).

Please add what you used to create:
distA

And create a sample data set to show us what you did, using
dput

Best,
Tal



Contact
Details:---
Contact me: tal.gal...@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
--




On Fri, May 28, 2010 at 2:41 AM, Ayesha Khan ayesha.diamond...@gmail.com wrote:

 I have a matrix with the following dimensions
 136   3

 and it looks something like

       [,1] [,2]     [,3]
  [1,]  402  675 1.802758
  [2,]  402  696 1.938902
  [3,]  402  699 1.994253
  [4,]  402  945 1.898619
  [5,]  424  470 1.812857
  [6,]  424  905 1.816345
  [7,]  470  905 1.871252
  [8,]  504  780 1.958191
  [9,]  504  848 1.997111...

 so you get the idea. I want to group similar items in one group/cluster
 following the friends-of-friends approach. I tried doing

 distclust <- hclust(distA,method="single")
 However, I got the following error.

 Error in if (n < 2) stop("must have n >= 2 objects to cluster") :  argument
 is of length zero
 which probably means there's something wrong with my input here. Is there
 another way of doing this kind of clustering without getting into all the
 looping and ifelse etc.? Basically, if 402 is close to 675, 696, and 699 and
 thus falls in cluster A, then all items close to 675, 696, and 699 should also
 fall into the same cluster A, following a friends-of-friends strategy.
 Any help would be highly appreciated.

 --
 Ayesha Khan

 MS Bioengineering
 Dept. of Bioengineering
 Rice University, TX


Re: [R] clustering in R

2010-05-28 Thread Joris Meys

As Tal said.

Next to that, I read that column1 (and column2?) are supposed to be seen as
factors, not as numerical variables. Did you take that into account somehow?

It's easy to reproduce the error :
> n <- NULL
> if(n<2)print("This is OK")
Error in if (n < 2) print("This is OK") : argument is of length zero

In the hclust code, you find the following line :
n <- as.integer(attr(d, "Size"))
where d is the distance object entered in the hclust function. Looking at
the error you get, this means that the Size attribute of your distance is
NULL. Which tells me that distA is not a dist object.

> A <- matrix(1:4,ncol=2)
> A
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> hclust(A,method="single")
Error in if (n < 2) stop("must have n >= 2 objects to cluster") :
  argument is of length zero

Did you actually put in a distance object? See also ?dist or ?as.dist.

Cheers
Joris
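
A minimal sketch of the distinction made above, using made-up data: a "dist" object carries a "Size" attribute, a plain matrix does not, and as.dist() converts a square distance matrix back into something hclust() accepts. The exact wording of the error for a plain matrix may differ between R versions, so it is not reproduced here.

```r
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)   # 5 made-up observations, 4 variables

d <- dist(x)                       # a proper "dist" object
class(d)
attr(d, "Size")                    # number of observations the distances refer to

m <- as.matrix(d)                  # plain 5 x 5 matrix: the "Size" attribute is gone
is.null(attr(m, "Size"))

# Convert the square matrix back to a "dist" object before clustering
hc <- hclust(as.dist(m), method = "single")
print(hc)
```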




On Fri, May 28, 2010 at 1:41 AM, Ayesha Khan ayesha.diamond...@gmail.com wrote:

 I have a matrix with the following dimensions
 136   3

 and it looks something like

       [,1] [,2]     [,3]
  [1,]  402  675 1.802758
  [2,]  402  696 1.938902
  [3,]  402  699 1.994253
  [4,]  402  945 1.898619
  [5,]  424  470 1.812857
  [6,]  424  905 1.816345
  [7,]  470  905 1.871252
  [8,]  504  780 1.958191
  [9,]  504  848 1.997111...

 so you get the idea. I want to group similar items in one group/cluster
 following the friends-of-friends approach. I tried doing

 distclust <- hclust(distA,method="single")
 However, I got the following error.

 Error in if (n < 2) stop("must have n >= 2 objects to cluster") :  argument
 is of length zero
 which probably means there's something wrong with my input here. Is there
 another way of doing this kind of clustering without getting into all the
 looping and ifelse etc.? Basically, if 402 is close to 675, 696, and 699 and
 thus falls in cluster A, then all items close to 675, 696, and 699 should also
 fall into the same cluster A, following a friends-of-friends strategy.
 Any help would be highly appreciated.

 --
 Ayesha Khan

 MS Bioengineering
 Dept. of Bioengineering
 Rice University, TX





-- 
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


Re: [R] clustering in R

2010-05-28 Thread Ayesha Khan

Thanks Tal & Joris!
I created my distance matrix distA by using the dist() function in R and
then converting its output to a matrix:
distA = as.matrix(dist(t(x2))) # x2 being my original dataset
as, according to the documentation on dist():

"For the default method, a dist object, or a matrix (of distances) or an
object which can be coerced to such a matrix using as.matrix()"

On Fri, May 28, 2010 at 6:34 AM, Joris Meys jorism...@gmail.com wrote:

 As Tal said.

 Next to that, I read that column1 (and column2?) are supposed to be seen as
 factors, not as numerical variables. Did you take that into account somehow?

 It's easy to reproduce the error :
 > n <- NULL
 > if(n<2)print("This is OK")
 Error in if (n < 2) print("This is OK") : argument is of length zero

 In the hclust code, you find the following line :
 n <- as.integer(attr(d, "Size"))
 where d is the distance object entered in the hclust function. Looking at
 the error you get, this means that the Size attribute of your distance is
 NULL. Which tells me that distA is not a dist object.

 > A <- matrix(1:4,ncol=2)
 > A
      [,1] [,2]
 [1,]    1    3
 [2,]    2    4
 > hclust(A,method="single")
 Error in if (n < 2) stop("must have n >= 2 objects to cluster") :
   argument is of length zero

 Did you actually put in a distance object? See also ?dist or ?as.dist.

 Cheers
 Joris








-- 
Ayesha Khan

MS Bioengineering
Dept. of Bioengineering
Rice University, TX


Re: [R] clustering in R

2010-05-28 Thread Tal Galili

Hi Ayesha,
I wish to help you, but without a simple self contained example that shows
your issue, I will not be able to help.
Try using the ?dput command to create some simple data, and let us see what
you are doing.

Best,
Tal
Contact
Details:---
Contact me: tal.gal...@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
--




On Fri, May 28, 2010 at 9:04 PM, Ayesha Khan ayesha.diamond...@gmail.com wrote:

 Thanks Tal & Joris!
 I created my distance matrix distA by using the dist() function in R and
 then converting its output to a matrix:
 distA = as.matrix(dist(t(x2))) # x2 being my original dataset
 as, according to the documentation on dist():

 "For the default method, a dist object, or a matrix (of distances) or an
 object which can be coerced to such a matrix using as.matrix()"


Re: [R] clustering in R

2010-05-28 Thread Joris Meys

errr, forget about the output of dput(q), but keep it in mind for next time.

f = dist(t(q))
hclust(f,method="single")

it's as simple as that.
Cheers
Joris
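
If only a three-column list of close pairs (item i, item j, distance) is available, as with distB in the quoted message below, one possible workaround is to rebuild a square matrix first. The sketch below is an assumption-laden illustration: the pairs are made up, and the value `big` used for pairs that are not listed is an arbitrary placeholder chosen to be larger than any real distance.

```r
# Made-up pair list in the same layout as distB: item i, item j, distance
pairs <- rbind(c(1, 2, 1.6),
               c(1, 3, 0.5),
               c(2, 3, 1.7),
               c(4, 5, 0.8))

n   <- max(pairs[, 1:2])   # number of items
big <- 1e6                 # placeholder distance for unlisted pairs (assumption)

D <- matrix(big, n, n)
diag(D) <- 0
D[pairs[, 1:2, drop = FALSE]] <- pairs[, 3]   # fill entry (i, j)
D[pairs[, 2:1, drop = FALSE]] <- pairs[, 3]   # mirror into (j, i)

hc <- hclust(as.dist(D), method = "single")
groups <- cutree(hc, h = 2)   # items linked by chains of distances up to 2
print(groups)
```

Here items 1, 2 and 3 end up in one group and items 4 and 5 in another. Starting from the full dist object, as in the two lines above, avoids this detour entirely.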

On Fri, May 28, 2010 at 10:39 PM, Ayesha Khan
ayesha.diamond...@gmail.com wrote:

 v <- dput(x,"sampledata.txt")
 dim(v)
 q <- v[1:10,1:10]
 f = as.matrix(dist(t(q)))

 distB=NULL
 for(k in 1:(nrow(f)-1)) for( m in (k+1):ncol(f)) {
 if(f[k,m] < 2) distB=rbind(distB,c(k,m,f[k,m]))
 }
 #now distB looks like this

 > distB
       [,1] [,2]      [,3]
  [1,]    1    2 1.6275568
  [2,]    1    3 0.5252058
  [3,]    1    4 0.7323116
  [4,]    1    5 1.9966001
  [5,]    1    6 1.6664110
  [6,]    1    7 1.0800540
  [7,]    1    8 1.8698925
  [8,]    1   10 0.5161808
  [9,]    2    3 1.7325811
 [10,]    2    5 0.8267843
 [11,]    2    6 0.5963280
 [12,]    2    7 0.8787230

 #now from this output I want to cluster all 1's, friends of 1 and friends
 of friends of 1 in one cluster. The same goes for 2, 3 and so on.
 But when I do that using hclust, I get the following error. I think what I
 need to do is convert my current matrix somehow into a format that would be
 accepted by the hclust function, but I don't know how to achieve that.
 > distclust <- hclust(distB,method="single")

 Error in if (n < 2) stop("must have n >= 2 objects to cluster") :
   argument is of length zero

 P.S.: Please let me know if this makes things more clear. I don't know
 how looking at the original data set would help, because the matrix under
 consideration right now is the distance matrix and how it can be altered. I
 have tried as.dist; it doesn't work because my matrix, as I mentioned earlier,
 is not a square matrix.

Re: [R] clustering in R

2010-05-28 Thread Ayesha Khan

Yes Joris. I did try that and it does produce the results. I am now
wondering why I wanted a matrix-like structure in the first place. However,
I do want 'f' to contain only values less than 2. But when I try to get rid
of values greater than 2 by doing N <- f[f < 2], the structure of f is
disrupted and hclust no longer recognizes it, because the data structure
obviously changes with that. Any ideas on how to do that?
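
A small sketch of why the subsetting breaks things, on made-up data (q below is random, and the cut height h is an arbitrary choice, the median distance, standing in for the poster's threshold of 2). Subsetting a dist object returns a bare numeric vector without the attributes hclust() needs; one possible alternative, assuming the goal is "linked if closer than the threshold", is to cluster the full dist object and cut the single-linkage tree at that height:

```r
set.seed(7)
q <- matrix(rnorm(50), nrow = 10)   # made-up data: 10 rows, 5 columns
f <- dist(t(q))                     # dist object on the 5 columns

h <- median(as.vector(f))           # arbitrary threshold for this sketch

# Subsetting keeps the values but loses the class and the "Size" attribute.
N <- f[as.vector(f) < h]

# Alternative: keep f intact and apply the threshold when cutting the tree.
hc <- hclust(f, method = "single")
groups <- cutree(hc, h = h)         # one group label per column of q
```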

On Fri, May 28, 2010 at 4:13 PM, Joris Meys jorism...@gmail.com wrote:

 errr, forget about the output of dput(q), but keep it in mind for next
 time.

 f = dist(t(q))
 hclust(f, method="single")

 it's as simple as that.
 Cheers
 Joris


  On Fri, May 28, 2010 at 2:37 PM, Tal Galili tal.gal...@gmail.comwrote:

 Hi Ayesha,
 I wish to help you, but without a simple self contained example that
 shows your issue, I will not be able to help.
 Try using the ?dput command to create some simple data, and let us see
 what you are doing.

 Best,
 Tal
 Contact
 Details:---
 Contact me: tal.gal...@gmail.com |  972-52-7275845
 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
 www.r-statistics.com (English)

 --




   On Fri, May 28, 2010 at 9:04 PM, Ayesha Khan 
 ayesha.diamond...@gmail.com wrote:

 Thanks Tal & Joris!
 I created my distance matrix distA with the dist() function in R and
 converted the output to a matrix:
 distA = as.matrix(dist(t(x2))) # x2 being my original dataset
 as, according to the documentation on dist():

 For the default method, a dist object, or a matrix (of distances) or
 an object which can be coerced to such a matrix using as.matrix()

Re: [R] clustering in R

2010-05-28 Thread Ayesha Khan

v <- dput(x, "sampledata.txt")
dim(v)
q <- v[1:10,1:10]
f = as.matrix(dist(t(q)))

distB = NULL
for(k in 1:(nrow(f)-1)) for(m in (k+1):ncol(f)) {
if(f[k,m] < 2) distB = rbind(distB, c(k,m,f[k,m]))
}
# now distB looks like this

> distB
      [,1] [,2]      [,3]
 [1,]    1    2 1.6275568
 [2,]    1    3 0.5252058
 [3,]    1    4 0.7323116
 [4,]    1    5 1.9966001
 [5,]    1    6 1.6664110
 [6,]    1    7 1.0800540
 [7,]    1    8 1.8698925
 [8,]    1   10 0.5161808
 [9,]    2    3 1.7325811
[10,]    2    5 0.8267843
[11,]    2    6 0.5963280
[12,]    2    7 0.8787230

# Now, from this output, I want to cluster all the 1's, the friends of 1, and
# the friends of friends of 1 into one cluster. The same goes for 2, 3 and so on.
But when I do that using hclust, I get the following error. I think what I
need to do is convert my current matrix into a format that the hclust
function accepts, but I don't know how to achieve that.
> distclust <- hclust(distB, method="single")

Error in if (n < 2) stop("must have n >= 2 objects to cluster") :
  argument is of length zero

P.S.: Please let me know if this makes things clearer. I don't see how
looking at the original data set would help, because the matrix under
consideration right now is the distance matrix, and the question is how it
can be altered. I have tried as.dist; it doesn't work because my matrix, as I
mentioned earlier, is not a square matrix.

Re: [R] clustering in R

2010-05-28 Thread Ayesha Khan

I assume my matrix should look something like this?..

round(distance, 4)
       P00A   P00B   M02A   M02B   P04A   P04B   M06A   M06B   P08A   P08B   M10A
P00B 0.9678
M02A 1.0054 1.0349
M02B 1.0258 1.0052 1.2106
P04A 1.0247 0.9928 1.0145 0.9260
P04B 0.9898 0.9769 0.9875 0.9855 0.6075
M06A 1.0159 0.9893 1.0175 0.9521 0.9266 0.9660
M06B 0.9837 0.9912 1.0124 1.0402 1.0272 1.0367 1.5693
P08A 1.0279 1.0303 0.9865 0.9748 1.0184 1.0452 0.9799 1.0400
P08B 1.0248 1.0299 0.9717 0.9673 1.0048 1.0329 1.0280 0.9907 0.2158
M10A 0.9850 0.9603 1.0246 0.9708 1.0231 0.9771 0.9916 1.0168 0.9722 0.9525
M10B 1.0150 1.0397 0.9754 1.0292 0.9769 1.0229 1.0084 0.9832 1.0278 1.0475 2.




Re: [R] clustering in R

2010-05-28 Thread Joris Meys

I can't run your code.
Please, just give me whatever comes on your screen when you run:
dput(q)



Re: [R] clustering in R

2010-05-28 Thread Joris Meys

Ah OK, I didn't get your question then.

A dist object is actually a vector of numbers with a couple of attributes.
You can't just cut out values like that. The hclust function needs a complete
distance matrix for its calculations.

The shortcut is easy: just do f <- 2 * f / max(f), and no value exceeds 2.

Otherwise this function could do that for you:

to.dist <- function(x){
  # x: matrix or data frame with columns item, item, distance
  x.names <- sort(unique(c(x[, 1], x[, 2])))
  n <- length(x.names)
  x.dist <- matrix(0, n, n)   # pairs not listed in x keep distance 0
  dimnames(x.dist) <- list(x.names, x.names)
  x.ind <- rbind(cbind(match(x[, 1], x.names), match(x[, 2], x.names)),
                 cbind(match(x[, 2], x.names), match(x[, 1], x.names)))
  x.dist[x.ind] <- rep(x[, 3], 2)
  x.dist <- as.dist(x.dist)
  return(x.dist)
}

> d <- to.dist(distB)
> hclust(d)


Cheers
Joris
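
A hedged variant of the conversion above, for the friends-of-friends grouping asked about in this thread. Two things here are assumptions rather than part of the original advice: pairs missing from the list are treated as far apart (filled with a large value, big, instead of 0, since a 0 would make single linkage merge them immediately), and the tree is cut at height 2, the threshold used to build the pair list. The pair list itself is made up:

```r
# Made-up pair list in the same three-column layout as distB:
# item, item, distance (only pairs closer than 2 are listed).
pairs <- rbind(c(1, 2, 1.6),
               c(1, 3, 0.5),
               c(2, 3, 1.7),
               c(4, 5, 0.8),
               c(6, 7, 1.2))

ids <- sort(unique(c(pairs[, 1], pairs[, 2])))
n   <- length(ids)

big <- 10 * max(pairs[, 3])            # assumption: unlisted pairs are far apart
m   <- matrix(big, n, n, dimnames = list(ids, ids))
diag(m) <- 0

i <- match(pairs[, 1], ids)
j <- match(pairs[, 2], ids)
m[cbind(i, j)] <- pairs[, 3]           # fill both triangles symmetrically
m[cbind(j, i)] <- pairs[, 3]

hc     <- hclust(as.dist(m), method = "single")
groups <- cutree(hc, h = 2)            # items chained by links below 2 share a label
```

With single linkage, cutting at the threshold keeps together exactly those items connected through a chain of listed pairs, which is the friends-of-friends idea.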




[R] clustering in R

2010-05-27 Thread Ayesha Khan

i have a matrix with the following dimensions
136   3

and it looks something like

 [,1] [,2] [,3]
  [1,]  402  675 1.802758
  [2,]  402  696 1.938902
  [3,]  402  699 1.994253
  [4,]  402  945 1.898619
  [5,]  424  470 1.812857
  [6,]  424  905 1.816345
  [7,]  470  905 1.871252
  [8,]  504  780 1.958191
  [9,]  504  848 1.997111...

so you get the idea. I want to group similar items in one group/cluster
following the friends-of-friends approach. I tried doing

distclust <- hclust(distA, method="single")

However, I got the following error:

Error in if (n < 2) stop("must have n >= 2 objects to cluster") :
  argument is of length zero

which probably means there's something wrong with my input here. Is there
another way of doing this kind of clustering without getting into all the
looping and ifelse etc.? Basically, if 402 is close to 675, 696, and 699 and
they thus fall in cluster A, then all items close to 675, 696, and 699 should
also fall into the same cluster A, following a friends-of-friends strategy.
Any help would be highly appreciated.

-- 
Ayesha Khan

MS Bioengineering
Dept. of Bioengineering
Rice University, TX

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Clustering with clara

2010-01-14 Thread pacomet

Hello everyone

I am trying to use the CLARA method to find clusters in my spatial surface
temperature data and noticed one problem. My data are in the form
lat,lon,temperature. I extract lat, lon and cluster number for each point in
the dataset. When I plotted a map of cluster numbers I found empty areas in
the map. The point is that the number of points that were assigned a cluster
number is smaller than the number of temperature points originally analyzed.

Why are there fewer points in the clustering results? Is there any option in
the CLARA method to retain every single point? Is there another clustering
method that preserves all the points?

Thanks in advance

Paco

-- 
_
El ponent la mou, el llevant la plou
Usuari Linux registrat: 363952
---
Fotos: http://picasaweb.google.es/pacomet


Re: [R] Clustering with clara

2010-01-14 Thread Christian Hennig


Dear Paco,

as far as I know, there is no such problem with clara, but I may be wrong.
However, in order to help you (though I'm not sure whether I'll be able to
do that), we'd need to understand precisely what you were doing in R and
what your data look like (code and data; you can show us a relevant bit
of the data using the str command). Chances are that the problem is
not in clara but that some other thing you do doesn't do what you expect
it to do.

Christian
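
A small diagnostic sketch along these lines. The original post shows no code, so everything here is an assumption: the data frame sst, its column names, and the guess that rows with missing temperatures were dropped before clustering. It checks that clara() returns one label per row it was given, and shows one way to keep the labels aligned with the full data set:

```r
library(cluster)

# Made-up stand-in for the lat/lon/temperature data.
set.seed(42)
sst <- data.frame(lat  = runif(200, 36, 44),
                  lon  = runif(200, -2, 6),
                  temp = rnorm(200, 18, 2))
sst$temp[c(5, 17, 80)] <- NA           # pretend three points have no temperature

ok <- complete.cases(sst)              # rows that can actually be clustered
cl <- clara(sst[ok, c("lat", "lon", "temp")], k = 3)

# clara() labels every row it was given; write the labels back by position
# so that the dropped rows stay visible as NA instead of silently vanishing.
sst$cluster <- NA_integer_
sst$cluster[ok] <- cl$clustering

n_input   <- sum(ok)
n_labeled <- length(cl$clustering)
```

If n_input and n_labeled agree but the map still has holes, the points were probably lost in an earlier or later step (filtering, merging, or plotting), not in clara itself.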




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


Re: [R] Clustering for Ordinal data

2009-10-15 Thread Dylan Beaudette

On Wednesday 14 October 2009, Paul Evans wrote:
 Hi,

 I just wanted to check whether there is a clustering package available for
 ordinal data. My data looks something like: #1 #2 #3 #4.
 A B C D...
 D B C A...
 D C A A...
 where each column represents a sample, and each row some ordinal values. I
 would like to cluster such that similar samples appear together. thanks!



Hi,

See the 'cluster' package. You will need to select a distance metric that can 
deal with factors. The 'Gower' metric is one that is commonly used.

Cheers,
Dylan
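
A minimal sketch of that suggestion, assuming the layout from the question (columns are samples, rows are ordinal variables with levels A < B < C < D; the level order and the sample names S1..S4 are assumptions). daisy() from the 'cluster' package computes Gower dissimilarities and treats ordered factors as ordinal:

```r
library(cluster)

lev <- c("A", "B", "C", "D")           # assumed ordering of the ordinal levels

# The three example rows from the question; columns are samples.
raw <- rbind(c("A", "B", "C", "D"),
             c("D", "B", "C", "A"),
             c("D", "C", "A", "A"))
colnames(raw) <- paste0("S", 1:4)

# daisy() compares rows, so transpose to get one row per sample.
samples <- as.data.frame(t(raw), stringsAsFactors = FALSE)
samples[] <- lapply(samples, factor, levels = lev, ordered = TRUE)

d  <- daisy(samples, metric = "gower") # dissimilarities between the 4 samples
hc <- hclust(d, method = "average")    # any hclust linkage could be used here
grp <- cutree(hc, k = 2)
```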


-- 
Dylan Beaudette
Soil Resource Laboratory
http://casoilresource.lawr.ucdavis.edu/
University of California at Davis
530.754.7341


[R] Clustering for Ordinal data

2009-10-14 Thread Paul Evans

Hi,

I just wanted to check whether there is a clustering package available for 
ordinal data. My data looks something like:
#1 #2 #3 #4.
A B C D...
D B C A...
D C A A...
where each column represents a sample, and each row some ordinal values. I 
would like to cluster such that similar samples appear together.
thanks!


  

[R] Clustering with R - efficient processing of large sparse data sets (text data)

2009-09-27 Thread dataguru

I checked the R procedure HCLUST (hierarchical clustering) but it
looks like it requires a full triangular n x n similarity matrix as
input, where n = number of observations. The number of variables is
200.

My data set has n = 50,000 observations (keywords), and I use ad-hoc
similarity measures, not available in R, to measure keyword
similarity. Here, the vast majority of the n x n similarities are
equal to zero.

So I am looking for a clustering procedure that would accept the
following alternate input:

x1, y1, s1
x2, y2, s2

...

xk, yk, sk

where xi, yi are 2 keywords with similarity si > 0 (1 <= i <= k). This
input would contain k = 10,000 rows, which is much smaller than n x n
= 50,000 x 50,000 elements when using the similarity matrix. The
HCLUST function would crash if it used the dissimilarity matrix as
input.

Do you know how to use my small data input in R, instead of a very
large sparse similarity matrix? Or in SAS? I need a simple solution,
otherwise I'll just write myself the code that does hierarchical
clustering, in C or Perl, or use a library. It would take me 2 hours
to write the hierarchical clustering code from scratch, so I'm looking
for a simple solution that will take less than 2 hours to implement.

Follow up at 
http://www.analyticbridge.com/group/R_Packages/forum/topics/clustering-with-r-efficient
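
One possible base-R sketch that works directly on the (x, y, s) triples without building the n x n matrix. This is not hierarchical clustering: it only yields the flat grouping that single linkage would give at one similarity cutoff, namely the connected components of the graph whose edges are the listed pairs. The function name components_from_pairs, the min_sim parameter, and the toy keywords are all hypothetical:

```r
# Label connected components of a sparse similarity graph with union-find.
components_from_pairs <- function(x, y, s, min_sim = 0) {
  keys   <- sort(unique(c(x, y)))      # every keyword that appears in a pair
  parent <- seq_along(keys)            # each keyword starts as its own root

  find_root <- function(i) {
    while (parent[i] != i) {
      parent[i] <<- parent[parent[i]]  # path halving keeps the trees shallow
      i <- parent[i]
    }
    i
  }

  keep <- s > min_sim                  # only pairs above the cutoff link keywords
  a <- match(x[keep], keys)
  b <- match(y[keep], keys)
  for (j in seq_along(a)) {
    ra <- find_root(a[j])
    rb <- find_root(b[j])
    if (ra != rb) parent[rb] <- ra     # merge the two components
  }

  roots <- vapply(seq_along(keys), find_root, integer(1))
  setNames(match(roots, unique(roots)), keys)
}

# Toy input in the x, y, s layout described above.
x <- c("cat", "dog", "car", "bus")
y <- c("dog", "pet", "bus", "train")
s <- c(0.9, 0.4, 0.7, 0.2)

cl_all    <- components_from_pairs(x, y, s)                 # every listed pair links
cl_strict <- components_from_pairs(x, y, s, min_sim = 0.5)  # only strong pairs link
```

Memory use grows with the number of keywords and listed pairs rather than with n squared, which is the reason for avoiding the full matrix here.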


[R] Clustering within part of a cluster result

2009-07-09 Thread Albert Vernon Smith

How can I cluster and order within part of a previous clustering result?

For example, I am clustering and ordering results as follows:

 rows <- 30
 cols <- 3
 x <- matrix(sample(-1:1,rows*cols,replace=T), nrow=rows,
 ncol=cols,dimnames=list(c(paste("R",1:rows,sep="")),c(paste("C",1:cols,sep=""))))
 x
    C1 C2 C3
R1   0  1  1
R2   0 -1  1
R3  -1  1  0
R4  -1  0  1
R5   0 -1  0
R6  -1 -1  1
R7  -1 -1 -1
R8  -1 -1  0
...
 hc <- hclust(dist(x, method = "binary"),method="single")
 hc <- as.dendrogram(hc)
 ord.hc <- order.dendrogram(hc)
 xc <- x[rev(ord.hc),]
 xc
    C1 C2 C3
R7  -1 -1 -1
R6  -1 -1  1
R9   1 -1  1
R11  1  1 -1
R12 -1 -1 -1
R16 -1 -1  1
R18 -1 -1 -1
R24  1 -1 -1
R27 -1 -1  1
R2   0 -1  1
R1   0  1  1
R10  0 -1 -1
...


Given the binary distance I am using, the first nine rows all have a
distance of 0 to one another.  How can I sort again within certain
nodes to bring together the rows that are closest?  In the example
above, I'd like some sort of clustering within the first 9 rows so that
R7, R12, and R18 are together, as they are all -1,-1,-1, and likewise
R6/R16/R27, etc.

How might I go about that?
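
A possible approach, sketched under assumptions rather than offered as the definitive answer: take the groups of rows that merge at height 0 (they are indistinguishable under the binary distance), keep those groups in dendrogram order, and sort the rows inside each group by their actual values. The seed and the names `grp`, `grp.rank` and `xs` are illustrative only; a second hclust with a different distance inside each group would be an alternative.

```r
set.seed(42)  # arbitrary seed so the example is repeatable
rows <- 30
cols <- 3
x <- matrix(sample(-1:1, rows * cols, replace = TRUE), nrow = rows, ncol = cols,
            dimnames = list(paste0("R", 1:rows), paste0("C", 1:cols)))

hc <- hclust(dist(x, method = "binary"), method = "single")
ord.hc <- order.dendrogram(as.dendrogram(hc))
xc <- x[rev(ord.hc), ]

# Rows that merge at height 0 are indistinguishable under the binary distance.
grp <- cutree(hc, h = 0)[rownames(xc)]

# Keep the groups in dendrogram order, but sort rows inside each group by
# their values so that identical rows end up next to each other.
grp.rank <- match(grp, unique(grp))
xs <- xc[order(grp.rank, xc[, 1], xc[, 2], xc[, 3]), ]
```

In `xs`, rows such as -1,-1,-1 sit together within their zero-distance group, while the order of the groups themselves still follows the dendrogram.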

Thanks,
-albert


Re: [R] clustering, don't understand this error

2009-04-16 Thread Christian Hennig


Hi there,

I'm travelling right now so I can't really check this but it seems that 
the problem is that cluster.stats needs a partition as input. hclust 
doesn't give you a partition but you can generate one from it using 
cutree.
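
A minimal sketch of that suggestion, with assumptions flagged: the built-in iris data stands in for the poster's bupa data, k = 2 is an arbitrary choice, and the cluster.stats call (from the fpc package) is only run if fpc is installed.

```r
# Stand-in data: the built-in iris measurements replace the poster's bupa data.
d <- dist(iris[, 1:4])
hc <- hclust(d)

# hclust returns a tree (a list), not cluster labels; cutree turns it into
# an integer vector with one label per observation.
part <- cutree(hc, k = 2)

# cluster.stats comes from the fpc package; the call is skipped if fpc is
# not installed.
if (requireNamespace("fpc", quietly = TRUE)) {
  dunn <- fpc::cluster.stats(d, part)$dunn
  print(dunn)
}
```

Passing `part` rather than the hclust object gives cluster.stats the integer partition it expects.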

BTW, rather use <- than =.

Best wishes,
Christian


On Wed, 15 Apr 2009, Ana M Aparicio Carrasco wrote:


Hello,
I am using the Dunn metric, but something is wrong and I don't understand
what it is or what this error means. Can you please help me with this?

The instructions are:

# Dunn index
disbupa=dist(bupa[,1:6])
a=hclust(disbupa)
cluster.stats(disbupa,a,bupa[,7])$dunn

And the error is:

Error in max(clustering) : invalid 'type' (list) of argument

thank you so much.

Ana Maria Aparicio.




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


[R] clustering, don't understand this error

2009-04-15 Thread Ana M Aparicio Carrasco

Hello,
I am using the Dunn metric, but something is wrong and I don't understand
what it is or what this error means. Can you please help me with this?

The instructions are:

 # Dunn index
 disbupa=dist(bupa[,1:6])
 a=hclust(disbupa)
 cluster.stats(disbupa,a,bupa[,7])$dunn

And the error is:

Error in max(clustering) : invalid 'type' (list) of argument

thank you so much.

Ana Maria Aparicio.


Re: [R] Clustering with Mahalanobis Distance

2008-12-10 Thread Wayne F


I don't have any experience with your particular problem, but the thing I
notice about mahalanobis is that by default you supply a covariance
matrix, and it uses solve to calculate its inverse. If you could supply the
inverse covariance matrix (and specify inverted=TRUE to mahalanobis), that
might save a lot of memory.

If you cannot calculate the inverse externally before bringing it into R,
perhaps you could read in only the covariance matrix and invert it first,
before doing anything else? Or perhaps someone else knows some matrix magic?
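
As a small illustration of the inverted=TRUE idea, assuming the built-in iris data as a stand-in (the names `S.inv`, `d2.default` and `d2.inverted` are illustrative): inverting the covariance matrix once and passing it in gives the same squared distances as the default call.

```r
X <- as.matrix(iris[, 1:4])     # stand-in data
center <- colMeans(X)
S <- cov(X)

d2.default <- mahalanobis(X, center, S)        # inverts S internally
S.inv <- solve(S)                              # invert once, up front
d2.inverted <- mahalanobis(X, center, S.inv, inverted = TRUE)
```

Whether this saves enough memory for a very large covariance matrix is not something the sketch can show; it only demonstrates that the two calls agree.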


[R] Clustering with Mahalanobis Distance

2008-12-08 Thread Richardson, Patrick

Dear R ExpeRts,

I'm having memory difficulties trying to cluster in R using Mahalanobis 
distance.  I was wondering if anyone has done it with a matrix of 6525x17 (or 
something similar to that size).  I have a matrix of 6525 genes and 17 samples. 
I have my R memory increased to the max and am still getting "cannot allocate 
vector of size" errors.  My matrix x is actually a transpose of the original 
matrix (as I want to cluster by samples and not genes). y is a vector of the 
mean gene expression levels and z is the covariance matrix of x (I think 
this is where the problem lies, as the covariance matrix is enormous).

I can't really provide a reproducible example as I would have to attach my data 
files, which I don't think anyone would appreciate.

rm(list=ls())  #removes everything from memory#
gc()  #collects garbage#
memory.limit(size = 4095)   #increases memory limit#
x <- as.matrix(read.table("x.txt", header=TRUE, row.names=1))
y <- as.matrix(read.table("y.txt", header=TRUE, row.names=1))
z <- as.matrix(read.table("z.txt", header=TRUE, row.names=1))
mal <- mahalanobis(x, y, z)

The ultimate goal is to run hclust with the mahalanobis distance matrix.
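
For reference, mahalanobis() returns squared distances from each row to a single center, not a pairwise distance matrix. One way to get pairwise Mahalanobis distances for hclust is to "whiten" the data with a Cholesky factor of the covariance matrix and then use ordinary Euclidean dist. The sketch below assumes the built-in iris data as a stand-in and a covariance matrix of full rank; with 17 samples and 6525 genes the sample covariance cannot be of full rank, so this would not apply to that case as it stands.

```r
X <- as.matrix(iris[, 1:4])     # stand-in data: 150 observations, 4 variables
S <- cov(X)

# With S = t(R) %*% R (Cholesky), the rows of X %*% solve(R) are "whitened":
# Euclidean distance between them equals Mahalanobis distance between rows of X.
R <- chol(S)
Z <- X %*% solve(R)
D <- dist(Z)                    # pairwise Mahalanobis distances as a dist object
hc <- hclust(D)
```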


If anyone knows where I could find a more memory-friendly function, or has any 
advice as to what I might try to optimize my code, I would appreciate it.  
sessionInfo() is below.

Many Thanks,

_
Patrick Richardson
Biostatistician - Program of Translational Medicine
Van Andel Research Institute - Webb Lab
333 Bostwick Avenue NE
Grand Rapids, MI  49503


R version 2.8.0 (2008-10-20)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United 
States.1252;LC_MONETARY=English_United 
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics  grDevices datasets  tcltk utils methods   base

other attached packages:
[1] svSocket_0.9-5 svIO_0.9-5     R2HTML_1.59    svMisc_0.9-5   svIDE_0.9-5

loaded via a namespace (and not attached):
[1] tools_2.8.0


[R] Clustering and functions

2008-11-08 Thread Bryan Richardson

I am new to R and have written a function that clusters on subsets of a big 
data set with 60,000 points.  I am not sure why, but I keep getting a 
run-time error.  Any suggestions would be greatly appreciated.

Here is the code:

library(cba)

d <- read.csv("data.csv", header=TRUE)

v <- c(53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,21,72,
73,74,75,76,77,78)

clusterMe <- function(d,v){
tempMat <- subset(d,d[,v[1]]==TRUE)
rc <- rockCluster(tempMat,n=5,theta=.2)
tempMat <- cbind(tempMat,rc$cl)
M <- tempMat
for (i in 2:26){
tempMat <- subset(d,d[,v[i]]==TRUE)
rc <- rockCluster(tempMat,n=5,theta=.2)
tempMat <- cbind(tempMat,rc$cl)
M <- rbind(M,tempMat)
}
M
}

clusters <- clusterMe(d,v)
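
Without the actual error message the cause can only be guessed at, but the loop can be written more defensively. The sketch below is an illustration under assumptions, not a diagnosis: `cluster_subsets`, `cluster_fun`, `flag.col`, the toy data and the stand-in clusterer are all hypothetical, and a wrapper around the rockCluster call from the post would take the place of `two_groups`. It skips columns with no flagged rows, ignores NA flags, and records which column each piece came from. Note also that the posted vector v has 21 between 70 and 72, which may or may not be intended.

```r
# Generic version of the loop: cluster the rows flagged TRUE in each column of v.
# cluster_fun is a hypothetical argument: any function that takes a data frame
# and returns one cluster label per row.
cluster_subsets <- function(d, v, cluster_fun) {
  pieces <- lapply(v, function(j) {
    sub <- d[d[[j]] %in% TRUE, , drop = FALSE]   # NA-safe row selection
    if (nrow(sub) == 0) return(NULL)             # nothing flagged in this column
    cbind(sub, cl = cluster_fun(sub), flag.col = j)
  })
  do.call(rbind, Filter(Negate(is.null), pieces))
}

# Toy data and a stand-in clusterer (hclust on two numeric columns).
toy <- data.frame(a = c(1, 1.1, 5, 5.2, 9, 9.1),
                  b = c(2, 2.1, 6, 6.1, 0, 0.2),
                  f1 = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
                  f2 = c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE))
two_groups <- function(sub) cutree(hclust(dist(sub[, c("a", "b")])), k = 2)
res <- cluster_subsets(toy, v = 3:4, cluster_fun = two_groups)
```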

Bryan


Re: [R] Clustering and functions

2008-11-08 Thread Sarah Goslee

It would help a lot if you told us what the error message was, and provided
some data to work with. As it is, we can't even run the function to find
out what goes wrong.

And also, OS, version of R - all that stuff that the posting guide requests.

Sarah

On Sat, Nov 8, 2008 at 10:31 AM, Bryan Richardson [EMAIL PROTECTED] wrote:
 I am new to R and have written a function that clusters on subsets of a big 
 data set with 60,000 points.  I am not sure why, but I keep getting a 
 run-time error.  Any suggestions would be greatly appreciated.

 Here is the code:

 library(cba)

 d <- read.csv("data.csv", header=TRUE)

 v <- c(53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,21,72,
 73,74,75,76,77,78)

 clusterMe <- function(d,v){
 tempMat <- subset(d,d[,v[1]]==TRUE)
 rc <- rockCluster(tempMat,n=5,theta=.2)
 tempMat <- cbind(tempMat,rc$cl)
 M <- tempMat
 for (i in 2:26){
 tempMat <- subset(d,d[,v[i]]==TRUE)
 rc <- rockCluster(tempMat,n=5,theta=.2)
 tempMat <- cbind(tempMat,rc$cl)
 M <- rbind(M,tempMat)
 }
 M
 }

 clusters <- clusterMe(d,v)

 Bryan
-- 
Sarah Goslee
http://www.functionaldiversity.org


[R] Clustering In R. (rookie)

2008-11-04 Thread paul murima

Hi all.
I have a large microarray data set that I would like to analyze using
hierarchical clustering. The problem is that when I use the command below,
 hc <- hclust(dist(array), "ave")
I get this feedback...
Error in as.vector(x, mode) :
cannot coerce type 'closure' to vector of type 'any'

Can someone help me with how to go about this operation? I am a
rookie in R and still learning to find my way around the R
environment. Your help will be most appreciated.
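
One possible reading of the error, assuming no data object named array had been created in the session: array is itself a function in base R, so dist(array) is handed a function (a "closure") rather than a data matrix. The sketch below reproduces the failure and then runs the same command on a small random matrix; `expr` and its dimensions are hypothetical stand-ins for the microarray data.

```r
# 'array' is a function in base R. If no data object with that name has been
# created, dist(array) receives the function itself and fails.
res <- tryCatch(dist(array), error = function(e) e)
inherits(res, "error")          # an error object, not a distance matrix

# Hypothetical stand-in for the microarray data, under a name that does not
# clash with a base function.
expr <- matrix(rnorm(40), nrow = 8,
               dimnames = list(paste0("gene", 1:8), paste0("s", 1:5)))
hc <- hclust(dist(expr), "ave")   # "ave" partially matches "average"
```

Reading the data into an object with a distinct name (for example with read.table) and passing that to dist would avoid the clash.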
--
BEST

Paul
