Okay, let me make a few points:
(1) Consider getting a professional data analyst or
statistician if you are serious about this analysis. This
will cost money and/or a co-authorship, because there are
no simple answers to your data analysis problem: there is
much more examination of your data that needs to be done,
as well as use of statistical software well beyond SPSS.
There's a literature on what you want to do, and some people
have spent a lot of time thinking and working on these
issues. This is not a trivial enterprise.
(2) Have you run SPSS's RELIABILITY procedure on your 23 items?
This would be done not to get a measure of the reliability
of the 23 items you have (though that will fall out of the
analysis) but to get descriptive statistics about the correlations
you are analyzing. With 23 items, you are factor analyzing
a 23 by 23 matrix, or (23*22)/2 = 506/2 = 253 unique correlations.
What is the range of values for these correlations? Are there
any negative correlations? If all of these items are "measuring
the same thing" (i.e., a latent variable), there should be no
negative correlations -- if there are, why are they there? What is
the squared multiple correlation (SMC) of each item with the
other 22 items? These SMCs are the communality estimates
that will be used in the diagonal of the correlation matrix
that is factor analyzed (principal components analysis uses 1.00
in the diagonal because it analyzes the "total" variance instead
of the "common" variance [which excludes error/unique variance]).
What is the range of SMC values? A small SMC indicates that
an item has little "common variance" with the other items,
indicating that whatever you think you are measuring, the items
are not all measuring the same thing (metaphorically, you may
think you have 23 oranges and you want orange juice, but
you might actually have different fruits and you're going to
wind up with fruit salad).
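In case it helps, here is a minimal SPSS sketch of those checks
(item1 TO item23 are placeholder names -- substitute your own):

* Inter-item correlation diagnostics: /STATISTICS=CORR prints the
  full 23x23 correlation matrix and /SUMMARY=CORR gives the mean,
  minimum, maximum, and range of the 253 inter-item correlations.
RELIABILITY
  /VARIABLES=item1 TO item23
  /SCALE('ALL ITEMS') ALL
  /MODEL=ALPHA
  /STATISTICS=DESCRIPTIVE CORR
  /SUMMARY=TOTAL CORR.
* SMCs: with principal axis factoring, the "Initial" communalities
  that SPSS prints are the squared multiple correlations of each
  item with the other 22.
FACTOR
  /VARIABLES=item1 TO item23
  /PRINT=INITIAL EXTRACTION
  /EXTRACTION=PAF.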
(3) Conceptually, are these 23 items supposed to be measuring the
"same thing" (i.e., a single factor or a single latent variable) or are
there theoretical reasons to group the items together to reflect different
factors/latent variables? This is a theoretical issue as well as a
statistical issue, just like figuring out what effect size you want to
detect, how much statistical power you want, what sample size you need to
have, and so on. If the items represent different latent variables,
are the latent variables correlated or uncorrelated? Are the errors
and unique variances independent or are they correlated? Do you
expect method factors to be present? These are questions you should
try to answer *BEFORE* you analyze the data.
(4) Let's get clear on the difference between principal components
analysis and the numerous types of factor analysis. One can think of
principal components analysis as being a purely *descriptive* procedure
because its goal is to come up with a set of equations that account for
the total variance represented in a correlation matrix. Consider:
(a) If you have a correlation matrix that is just a diagonal matrix
(i.e., you have 1.00 [the standardized variance for a variable] in the
diagonal and 0.00 in all of the off-diagonal elements [none of the pairs
of variables are correlated]), you cannot reduce the matrix to a smaller
matrix. If you have 23 items, then the *RANK* of the matrix is 23.
You need all 23 items to explain the phenomenon you are measuring.
(b) If you have a correlation matrix with off-diagonal elements
significantly different from zero, then you can reduce the, say,
23x23 matrix to a smaller, say, FxF matrix, because the rows/columns
are not independent of each other. The correlated rows/columns can be
combined into a single equation that will retain the original
information but allow one to reduce the rank of the matrix to some
smaller size.
(c) If you believe that all 23 items are measures of the same thing
(a single latent variable), then the 23x23 correlation matrix will reduce
to a single factor. Imagine a regression equation that relates the values
of the 23 items to the single factor score -- the regression coefficients
are factor loadings, which indicate the contribution of each item to the
factor score. In the ideal case, all pairwise correlations are high,
errors and unique variances are uncorrelated, and a single factor accounts
for all of the variance among the items (in principal components analysis,
the first component will have some huge value for its eigenvalue [an
eigenvalue is a solution to the set of equations used in this analysis])
and the remaining 22 components will have much smaller eigenvalues (near
zero in the ideal case). This is the type of model assumed by researchers
who believe in "g", a general intelligence factor accounting for all of
the variance in the items (NOTE: this is strictly true only under ideal
conditions; in real life, well, that's another story).
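Here is a quick way to see the eigenvalue patterns in (a) and (c) for
yourself, using SPSS's MATRIX language (a minimal sketch, shrunk to 5
variables and an arbitrary uniform correlation of .60 to keep the
printout small):

MATRIX.
* Case (a): an identity (diagonal) correlation matrix -- every
  eigenvalue is 1.00, so the rank cannot be reduced.
COMPUTE r0 = IDENT(5).
CALL EIGEN(r0, vec0, val0).
PRINT val0 /TITLE="Eigenvalues of an identity matrix".
* Case (c): every pair of items correlates .60 -- one large
  eigenvalue (3.4) and four small ones (.40 each).
COMPUTE r1 = MAKE(5,5,.60).
CALL SETDIAG(r1, 1).
CALL EIGEN(r1, vec1, val1).
PRINT val1 /TITLE="Eigenvalues with uniform r = .60".
END MATRIX.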
(d) You may instead believe that the 23 items are measuring different
things, that is, different latent variables. Jim Clark's SPSS simulation
example created empirical variables based on three uncorrelated latent
variables, and the question he was asking was how well SPSS's factoring
program recovered the latent structure, an issue that we'll return to.
Annette expressed surprise that she obtained 10 components/factors, with
5 that appeared to be "good" and a few with only 1-3 items loading on
them. Referring back to point (3) above, this reaction suggests that
some thought about the latent structure for the items was given but not
really worked out. This raises questions about what the items are
supposed to measure and to what extent the individual items appear to
measure the "same thing" or "different things" -- these are not simple
questions, and their answers are also likely not to be simple, but they
will guide how the analyses should proceed.
(5) Some people use principal components analysis as a purely
*descriptive* tool; that is, under relatively simple assumptions, one
can ask whether a correlation matrix (in this case 23x23) can be reduced
to a smaller matrix (say FxF) that accounts for some large proportion of
the total variance among the items. If the 10 components obtained from
the analysis account for, say, 70% of the total variance, then
some people will be very happy with that. Instead of using 23 values, one
can compute 10 component scores and use them in additional analyses.
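Here is a minimal SPSS sketch of that descriptive use (again, item1 TO
item23 are placeholders, and FACTORS(10) simply mirrors the 10
components mentioned above):

* Principal components, keeping 10 components; /SAVE=REG(ALL) writes
  the 10 component scores (FAC1_1 to FAC10_1) to the working file
  for use in later analyses.
FACTOR
  /VARIABLES=item1 TO item23
  /PRINT=INITIAL EXTRACTION
  /CRITERIA=FACTORS(10)
  /EXTRACTION=PC
  /SAVE=REG(ALL).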
NOTE: It is important to understand that principal components analysis is
a procedure that attempts to allocate portions of the total variance to
combinations of correlated items -- it is concerned with explaining how
the total variance is to be divided, much as eta^2 does in ANOVA. It does
*NOT* attempt to explain why there are specific correlations between
variables or how many latent variables might be serving as the basis for
the correlations. Factor analysis is concerned with doing that, but one
then commits to specific models in order to be able to do so. So, since
principal components analysis can be thought of as simply a
variance-partitioning procedure, one can treat its results in the same
way one would treat other descriptive statistics.
(6) Everything I have said above technically applies to variables
measured on an interval or better scale (possibly for Likert-type scales
as well), but things become muddier (uglier?) when we are dealing with
dichotomous variables. The correlation matrix now contains phi
correlations, and though phi is supposed to range from -1 to +1, it can
do so only if the proportions in one variable match those in the other
variable (Jim Clark did what was essentially a median split for his
simulated data, but in real life this condition is unlikely to be met).
If the proportions are not equal, phi cannot reach +/- 1.00 (Guilford
and Fruchter's statistics textbook has coverage of this topic). The phi
correlations are likely to provide a distorted picture of what is going
on even when conditions are ideal (I'll provide an example shortly).
This is one reason why some people have suggested the use of the
tetrachoric correlation (which assumes that the dichotomous variable
reflects an underlying latent distribution such as the normal
distribution) but, in my opinion, one should think long and hard about
whether one's data meet this assumption. If one has a dichotomous item
such as "Are you pregnant?", the answer seems to be a true dichotomy --
what would the underlying normal distribution mean? One should be able
to justify why one is using tetrachoric correlations instead of the phi
coefficient, and explanations based on convenience are not likely to
work.
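To make the ceiling on phi concrete, here is the bound as I remember it
from Guilford & Fruchter (check the text before relying on it): if the
two items have "positive" response proportions p1 and p2, with p1 <= p2,
then

max phi = sqrt[ (p1*(1-p2)) / (p2*(1-p1)) ].

For example, with p1 = .50 and p2 = .80, max phi =
sqrt[(.50*.20)/(.80*.50)] = sqrt(.25) = .50, so even two perfectly
related items could not correlate above .50.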
(7) Jim Clark provided SPSS code that produces 24 "empirical" variables
that are based on three factors (A, B, and C). He also suggested some
simple
analyses. Let me suggest a couple more:
(A) The following SPSS code conducts a maximum likelihood (ML) factor
analysis with "promax" rotation (i.e., it assumes correlated factors,
calculates the correlations between the factors, and adjusts the rest of
the calculations accordingly). I also ask SPSS to "blank out" factor
loadings less than .25 in order to make the pattern of loadings of
variables on factors clear. Here is the code:
factor /vari = a1 to c8/
format=sort blank(.25)/
extraction=ml/rotation=promax.
Three factors are extracted but account for only about 54% of the common
variance. The pattern matrix and structure matrix cleanly identify which
items load on which factors -- they load as expected. One reason why
people use ML factor analysis is that it provides a goodness-of-fit
chi-square that tells one whether the factor model fits the observed
data. In this case, we want a *NONSIGNIFICANT* chi-square, because this
implies that the data are consistent with the obtained model:
X^2(df=207) = 208.65, p=.46. So far, so good. The Factor Correlation
Matrix has off-diagonal values close to zero, which is what we would
expect if we assume three uncorrelated factors as the basis for the
observed variables.
(B) The following SPSS code conducts an ML factor analysis on the
dichotomous variables that Jim Clark created.
factor /vari = na1 to nc8 /
format=sort blank(.25)/
extraction=ml/rotation=promax.
Now SPSS analyzes a PHI correlation matrix and, if dichotomization has
no effect, we should expect to get results very similar to those seen
with the original data. Unfortunately, this is not the case. Five factors
are extracted that account for about 49.70% of the common variance. After
rotation we see most of the empirical variables loading on the appropriate
factors, but there are two items that load on "singleton" factors (i.e.,
only one observed variable loads on the factor), these being Na2 and Na3.
However, the goodness-of-fit statistic indicates that this is a good
model: X^2(df=166) = 140.69, p=.92. Examination of the Factor Correlation
Matrix shows that it is the singleton factors (Factors 4 and 5) that are
correlated with the other factors (e.g., with Factor 2, and Factor 4 with
Factor 5). These seem like reasonable results, but we know they are wrong
because the data were generated from three independent latent variables.
The difference between the results here and those in 7(A) above seems to
be attributable to dichotomizing the data.
(8) Well, it seems that doing a factor analysis on a phi correlation
matrix is not such a good idea, and it might be better to do it on a
tetrachoric correlation matrix. There's just one problem: SPSS doesn't
calculate tetrachoric correlations. Let me correct that:
(A) *IF* you have the Python Essentials package as a
plug-in to SPSS *AND* you have the "R" statistics program on your
machine, then you can call these programs to do the calculations.
See the following for details:
http://www-01.ibm.com/support/docview.wss?uid=swg21475247
(B) Well, if you don't have Python and/or R, don't worry: you can
use the TETRA-COM SPSS program to calculate them for you. See:
http://link.springer.com/article/10.3758/s13428-012-0200-6
Of course, you'll have to download the files from here and
install them:
http://brm.psychonomic-journals.org/content/supplemental
Make sure you read and follow the instructions carefully.
(9) One can overcome some of these problems by using other
programs that will calculate a tetrachoric correlation matrix and
do a factor analysis/structural equation model (SEM) on it.
Programs like EQS, LISREL, Mplus, and others can do this (I don't
know if AMOS is up to date on this), but I would recommend using
Mplus because Bengt Muthen, who is responsible for the program,
is also a talented statistician who has been developing the underlying
statistical theory for these types of analyses; see:
http://www.statmodel.com/
(10) NYU colleague Pat Shrout, who teaches the graduate course
on structural equation modeling over in Arts & Science, has a nice
PowerPoint that goes over the issues of analyzing dichotomous data
in FA/SEM, which might be useful; see:
www.nyu.edu/classes/shrout/G89-2247/04Lect10.ppt
Good luck!
-Mike Palij
New York University
[email protected]
P.S. Remember that Zen koan about "those who know...", well nevermind. ;-)