Re: [R] Question about PCA with prcomp

2007-07-02 Thread Mark Difford

Hi James,

Have a look at Cadima et al.'s subselect package [Cadima worked with/was a
student of Prof Jolliffe, one of _the_ experts on PCA; Jolliffe devotes part
of a Chapter to this question in his text (Principal Component Analysis,
pub. Springer)].  Then you should look at psychometric stuff: a good place
to start would be Professor Revelle's psych package.

BestR,
Mark.


James R. Graham wrote:
 
 Hello All,
 
 The basic premise of what I want to do is the following:
 
 I have 20 entities for which I have ~500 measurements each. So, I  
 have a matrix of 20 rows by ~500 columns.
 
 The 20 entities fall into two classes: good and bad.
 
 I eventually would like to derive a model that would then be able to  
 classify new entities as being in good territory or bad territory  
 based upon my existing data set.
 
 I know that not all ~500 measurements are meaningful, so I thought  
 the best place to begin would be to do a PCA in order to reduce the  
 amount of data with which I have to work.
 
 I did this using the prcomp function and found that nearly 90% of the  
 variance in the data is explained by PC1 and 2.
 
 So far, so good.
 
 I would now like to find out which of the original ~500 measurements  
 contribute to PC1 and 2 and by how much.
 
 Any tips would be greatly appreciated! And apologies in advance if  
 this turns out to be an idiotic question.
 
 
 james
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 

-- 
View this message in context: 
http://www.nabble.com/Question-about-PCA-with-prcomp-tf4012919.html#a11398608
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Question about PCA with prcomp

2007-07-02 Thread Ravi Varadhan
Mark,

What you are referring to deals with the selection of covariates, since PCA
doesn't do dimensionality reduction in the sense of covariate selection.
But what James is asking for is to identify how much each data point
contributes to individual PCs.  I don't think that James's query makes much
sense, unless he meant to ask: which individuals have high/low scores on
PC1/PC2.  Here are some comments that may be tangentially related to James's
question:

1.  If one is worried about a few data points contributing heavily to the
estimation of PCs, then one can use robust PCA, for example, using robust
covariance matrices.  MASS has some tools for this.
2.  The biplot for the first 2 PCs can give some insights.
3.  PCs, especially the last few PCs, can be used to identify outliers.
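
The "scores" reading of the question can be sketched in a few lines of R. This is only an illustration on simulated data: the matrix X, its dimensions and the entity/measurement names are assumptions, not James's data.

```r
# Simulated stand-in for the data: 20 entities (rows) x 50 measurements (columns).
set.seed(1)
X <- matrix(rnorm(20 * 50), nrow = 20,
            dimnames = list(paste0("entity", 1:20), paste0("m", 1:50)))

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Scores: the coordinates of each entity on each PC (one row per entity).
scores <- pca$x

# Entities ordered from the lowest to the highest score on PC1.
pc1_order <- names(sort(scores[, "PC1"]))
head(pc1_order, 3)   # lowest scores on PC1
tail(pc1_order, 3)   # highest scores on PC1
```

Calling biplot(pca) on the same object draws the entities and the measurement directions for the first two PCs together. For point 1, note that robust covariance estimators of the kind offered in MASS generally need more observations than variables, so they would not apply directly to a 20 x 500 matrix.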
  
Hope this is helpful,
Ravi.


---

Ravi Varadhan, Ph.D.

Assistant Professor, The Center on Aging and Health

Division of Geriatric Medicine and Gerontology 

Johns Hopkins University

Ph: (410) 502-2619

Fax: (410) 614-9625

Email: [EMAIL PROTECTED]

Webpage:  http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html

 






Re: [R] Question about PCA with prcomp

2007-07-02 Thread Patrick Connolly
On Mon, 02-Jul-2007 at 03:16PM -0400, Ravi Varadhan wrote:

| 3. PCs, especially, the last few PCs, can be used to identify outliers.

What is meant by last few PCs?

-- 
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.   
   ___Patrick Connolly   
 {~._.~} Great minds discuss ideas
 _( Y )_Middle minds discuss events 
(:_~*~_:)Small minds discuss people  
 (_)-(_)   . Anon
  
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.



Re: [R] Question about PCA with prcomp

2007-07-02 Thread Ravi Varadhan
The PCs that are associated with the smaller eigenvalues. 




Re: [R] Question about PCA with prcomp

2007-07-02 Thread Mark Difford

Hi James, Ravi:

James wrote:
...
 I have 20 entities for which I have ~500 measurements each. So, I   
 have a matrix of 20 rows by ~500 columns.
...

Perhaps I misread James' question, but I don't think so.  As James described
it, we have ~500 measurements made on 20 objects.  A PCA on this [20
rows/observations by ~ 500 columns/descriptors/variables] should return ~
500 eigenvalues.  And each of these columns/descriptors/variables will have
a loading on each PC.

James wants to reduce his descriptors/measurements/variables to the most
important ones (in terms of variance).  A primitive way of doing this would be
to examine the loadings on the first 2-3 PCs, choose those
columns/descriptors/variables with the highest absolute loadings, and throw
away the rest.  [He has already decided that he can throw away all but the
first two PCs.]  In fact, it would be a very good idea to do a coinertia
analysis on the pre- and post-selected sets, and look at the RV value.  If this
is above [thumbsuck] 0.9, then you're doing very well (there's a good plot
method for this in ade4, cf. coinertia).

But see Cadima et al. (+refs for caution; and elsewhere) for more
sophisticated methods of subsetting.
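
A minimal sketch of the "primitive" loading-based approach, assuming a simulated 20 x 500 matrix in place of the real data (X and the measurement names m1..m500 are made up for illustration):

```r
# Simulated stand-in: 20 entities (rows) x 500 measurements (columns).
set.seed(42)
X <- matrix(rnorm(20 * 500), nrow = 20,
            dimnames = list(NULL, paste0("m", 1:500)))

pca <- prcomp(X)

# Proportion of the total variance explained by each PC.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Loadings: one row per original measurement, one column per PC.
loadings <- pca$rotation[, 1:2]

# Each column of the rotation matrix has unit length, so the squared
# loadings sum to 1 within a PC and can be read as shares of that PC.
contrib <- loadings^2

# The ten measurements with the largest absolute loading on PC1.
top_pc1 <- head(sort(abs(loadings[, "PC1"]), decreasing = TRUE), 10)
names(top_pc1)
```

On pure noise like this the ranking means nothing; the point is only where the numbers live in the prcomp object. If the measurements are on different scales, scale. = TRUE is usually worth considering, since otherwise high-variance columns dominate the loadings.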

Regards,
Mark.



Re: [R] Question about PCA with prcomp

2007-07-02 Thread Bill.Venables
...but with 500 variables and only 20 'entities' (observations) you will
have 481 PCs with dead zero eigenvalues.  How small is 'smaller' and how
many is a few?
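
The count of zero eigenvalues can be checked on simulated data of the same shape. This is a sketch under the assumption of a full-rank random 20 x 500 matrix; the tolerance used to call a variance "zero" is an arbitrary choice.

```r
set.seed(7)
n <- 20
p <- 500
X <- matrix(rnorm(n * p), nrow = n)

pca <- prcomp(X)

# prcomp returns min(n, p) components, not p of them.
length(pca$sdev)

# Centering removes one degree of freedom, so at most n - 1 = 19 principal
# variances are non-zero; the remaining one is zero up to rounding error.
tol <- 1e-8 * pca$sdev[1]
n_nonzero <- sum(pca$sdev > tol)

# Number of zero eigenvalues among the p eigenvalues of the covariance matrix.
n_zero <- p - n_nonzero
```

So the "last few PCs" in this setting are either numerically zero or, among the 19 that remain, estimated from very few observations.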

Everyone who has responded to this seems to accept the idea that PCA is
the way to go here, but that is not clear to me at all.  There is a
2-sample structure in the 20 observations that you have.  If you simply
ignore that in doing your PCA you are making strong assumptions about
sampling that would seem to me unlikely to be met.  If you allow for the
structure and project orthogonal to it then you are probably throwing
the baby out with the bathwater - you want to choose variables which
maximise separation between the 2 samples (and now you are up to 482
zero principal variances, if that matters...).

I think this problem probably needs a bit of a re-think.  Some variant
on singular LDA, for example, may be a more useful way to think about
it.

Bill Venables.  



Re: [R] Question about PCA with prcomp

2007-07-02 Thread Mark Difford

To all ...,

Bill's lateral wisdom is almost certainly a better solution.  So thanks
for the advice (and everything else that went before it [Bill: apropos of
termplot, what happened to tplot ?]).  And I will [almost] desist from
asking the obvious: and if there were 10 000 observations ?

BestR,
Mark.



-- 
View this message in context: 
http://www.nabble.com/Question-about-PCA-with-prcomp-tf4012919.html#a11402204
Sent from the R help mailing list archive at Nabble.com.
