Re: [R] Question about PCA with prcomp
Hi James,

Have a look at Cadima et al.'s subselect package [Cadima worked with/was a student of Prof. Jolliffe, one of _the_ experts on PCA; Jolliffe devotes part of a chapter to this question in his text (Principal Component Analysis, pub. Springer)]. Then you should look at the psychometric literature: a good place to start would be Professor Revelle's psych package.

Best regards,
Mark.

James R. Graham wrote:

Hello All,

The basic premise of what I want to do is the following: I have 20 entities for which I have ~500 measurements each. So, I have a matrix of 20 rows by ~500 columns. The 20 entities fall into two classes: good and bad.

I eventually would like to derive a model that would then be able to classify new entities as being in good or bad territory based upon my existing data set. I know that not all ~500 measurements are meaningful, so I thought the best place to begin would be a PCA, in order to reduce the amount of data I have to work with. I did this using the prcomp function and found that nearly 90% of the variance in the data is explained by PC1 and PC2. So far, so good.

I would now like to find out which of the original ~500 measurements contribute to PC1 and PC2, and by how much. Any tips would be greatly appreciated! And apologies in advance if this turns out to be an idiotic question.

james

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
View this message in context: http://www.nabble.com/Question-about-PCA-with-prcomp-tf4012919.html#a11398608
Sent from the R help mailing list archive at Nabble.com.
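[Archive note: the direct answer to James's question lives in prcomp's rotation matrix. A minimal sketch on simulated data; the dimensions are shrunk and all variable names are invented for illustration.]

```r
# Sketch: where the per-variable contributions live in a prcomp fit.
# Simulated stand-in for James's data (dimensions and names invented).
set.seed(1)
X <- matrix(rnorm(20 * 50), nrow = 20)   # 20 entities x 50 measurements
colnames(X) <- paste0("m", seq_len(ncol(X)))

pca <- prcomp(X, scale. = TRUE)

# Variance explained per PC (what summary() prints)
summary(pca)$importance["Proportion of Variance", 1:5]

# pca$rotation is the (variables x PCs) matrix of loadings; its PC1/PC2
# columns answer "which measurements contribute, and by how much"
head(pca$rotation[, c("PC1", "PC2")])

# Measurements ranked by absolute contribution to PC1
head(sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE))
```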
Re: [R] Question about PCA with prcomp
Mark,

What you are referring to deals with the selection of covariates, since PCA doesn't do dimensionality reduction in the sense of covariate selection. But what James is asking for is to identify how much each data point contributes to the individual PCs. I don't think that James's query makes much sense, unless he meant to ask which individuals have high/low scores on PC1/PC2. Here are some comments that may be tangentially related to the question:

1. If one is worried about a few data points contributing heavily to the estimation of the PCs, then one can use robust PCA, for example using robust covariance matrices. MASS has some tools for this.
2. The biplot for the first 2 PCs can give some insights.
3. PCs, especially the last few PCs, can be used to identify outliers.

Hope this is helpful,
Ravi.

---
Ravi Varadhan, Ph.D.
Assistant Professor, The Center on Aging and Health
Division of Geriatric Medicine and Gerontology
Johns Hopkins University
Ph: (410) 502-2619  Fax: (410) 614-9625
Email: [EMAIL PROTECTED]
Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html
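[Archive note: Ravi's points 1 and 2 can be sketched as follows, on simulated data. The robust estimator used here, MASS::cov.rob, is one choice among several; the planted outlier and all dimensions are invented.]

```r
# Sketch of points 1 and 2: PCA on a robust covariance matrix, and a biplot.
library(MASS)
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)  # n must exceed p for cov.rob
X[1, ] <- X[1, ] + 10                    # plant a gross outlier in row 1

rc  <- cov.rob(X)                  # robust (MVE-type) covariance matrix
rpc <- princomp(covmat = rc$cov)   # PCA on the robust covariance
summary(rpc)                       # the outlier has little influence here

# Point 2: a biplot of the first two PCs shows scores and loadings together
biplot(prcomp(X, scale. = TRUE))
```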
Re: [R] Question about PCA with prcomp
On Mon, 02-Jul-2007 at 03:16PM -0400, Ravi Varadhan wrote:
| 3. PCs, especially, the last few PCs, can be used to identify outliers.

What is meant by last few PCs?

--
Patrick Connolly
Great minds discuss ideas
Middle minds discuss events
Small minds discuss people  -- Anon
Re: [R] Question about PCA with prcomp
Patrick Connolly wrote:
| What is meant by last few PCs?

The PCs that are associated with the smaller eigenvalues.

Ravi.
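[Archive note: in prcomp's output the components are already ordered by decreasing variance, so the "last few PCs" are simply the final columns of the score matrix. A small simulated illustration of Ravi's point 3, with the near-collinearity and the outlier planted by hand.]

```r
# The last PC captures the lowest-variance direction; observations that
# break the correlation structure stand out on it.
set.seed(1)
X <- matrix(rnorm(50 * 4), nrow = 50)
X[, 4] <- X[, 1] + rnorm(50, sd = 0.01)  # make column 4 near-collinear
X[7, 4] <- X[7, 4] + 3                   # row 7 violates that relation

pca <- prcomp(X, scale. = TRUE)
pca$sdev^2                       # eigenvalues, largest first

# Extreme scores on the last PC flag rows that break the structure
last <- pca$x[, ncol(pca$x)]
which.max(abs(last))             # flags row 7
```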
Re: [R] Question about PCA with prcomp
Hi James, Ravi,

James wrote: ... I have 20 entities for which I have ~500 measurements each. So, I have a matrix of 20 rows by ~500 columns. ...

Perhaps I misread James's question, but I don't think so. As James described it, we have ~500 measurements made on 20 objects. A PCA on this [20 rows/observations by ~500 columns/descriptors/variables] should return ~500 eigenvalues, and each of these columns/descriptors/variables will have a loading on each PC.

James wants to reduce his descriptors/measurements/variables to the most important ones (in terms of variance). A primitive way of doing this would be to examine the loadings on the first 2-3 PCs and choose those columns/descriptors/variables with the highest loadings, throwing away the rest. [He has already decided that he can throw away all but the first two PCs.]

In fact, it would be a very good idea to do a coinertia analysis on the pre- and post-selection sets and look at the RV coefficient. If this is above [thumbsuck] 0.9, then you're doing very well (there's a good plot method for this in ade4; see its coinertia function). But see Cadima et al. (and the references therein, for cautions) for more sophisticated methods of subsetting.

Regards,
Mark.
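[Archive note: the primitive loading-based selection Mark describes might look like this on simulated data. The top-10 cutoff is arbitrary; the coinertia/RV check he recommends would then compare X and X.sub using ade4.]

```r
# Keep the variables with the largest absolute loadings on PC1-PC2
# and drop the rest (simulated data; the cutoff of 10 is invented).
set.seed(1)
X <- matrix(rnorm(20 * 100), nrow = 20)
colnames(X) <- paste0("v", 1:100)

pca  <- prcomp(X, scale. = TRUE)
load <- pca$rotation[, 1:2]

# Score each variable by its largest absolute loading on PC1/PC2
score <- apply(abs(load), 1, max)
keep  <- names(sort(score, decreasing = TRUE))[1:10]
X.sub <- X[, keep]
dim(X.sub)   # 20 rows, 10 retained measurements
```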
Re: [R] Question about PCA with prcomp
Ravi Varadhan wrote:
| The PCs that are associated with the smaller eigenvalues.

...but with 500 variables and only 20 'entities' (observations) you will have 481 PCs with dead zero eigenvalues. How small is 'smaller', and how many is 'a few'?

Everyone who has responded to this seems to accept the idea that PCA is the way to go here, but that is not clear to me at all. There is a 2-sample structure in the 20 observations that you have. If you simply ignore that in doing your PCA, you are making strong assumptions about sampling that seem to me unlikely to be met. If you allow for the structure and project orthogonally to it, then you are probably throwing the baby out with the bathwater - you want to choose variables which maximise separation between the 2 samples (and now you are up to 482 zero principal variances, if that matters...).

I think this problem probably needs a bit of a re-think. Some variant on singular LDA, for example, may be a more useful way to think about it.

Bill Venables.
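[Archive note: Bill's count is easy to verify numerically on simulated data of the same shape. With n = 20 rows, centering leaves at most n - 1 = 19 nonzero principal variances out of 500; note that prcomp only reports min(n, p) = 20 of them, so 480 of Bill's 481 zero eigenvalues never even appear in the output.]

```r
# Rank check: 20 observations on 500 variables
set.seed(1)
X   <- matrix(rnorm(20 * 500), nrow = 20)
pca <- prcomp(X)

length(pca$sdev)        # 20: prcomp returns only min(n, p) components
sum(pca$sdev > 1e-8)    # 19: centering costs one more dimension
```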
Re: [R] Question about PCA with prcomp
To all,

Bill's lateral wisdom is almost certainly a better solution. So thanks for the advice (and everything else that went before it) [Bill: apropos of termplot, what happened to tplot?]. And I will [almost] desist from asking the obvious: and if there were 10,000 observations?

Best regards,
Mark.

Bill.Venables wrote: ...but with 500 variables and only 20 'entities' (observations) you will have 481 PCs with dead zero eigenvalues. ...

--
View this message in context: http://www.nabble.com/Question-about-PCA-with-prcomp-tf4012919.html#a11402204
Sent from the R help mailing list archive at Nabble.com.