Okay, let me make a few points:
(1) Consider getting a professional data analyst or
statistician if you are serious about this analysis. This
will cost money and/or a co-authorship, because there are
no simple answers to your data analysis problem: there is
much more examination of your data that needs to be done,
as well as use of statistical software well beyond SPSS.
There's a literature on what you want to do, and some people
have spent a lot of time thinking and working on these
issues. This is not a trivial enterprise.
(2) Have you run SPSS's RELIABILITY procedure on your 23 items?
This would be done not to get a measure of the reliability
of the 23 items you have (though that will fall out of the
analysis) but to get descriptive statistics about the correlations
you are analyzing. With 23 items, you are factor analyzing
a 23 by 23 matrix, or (23*22)/2 = 506/2 = 253 unique correlations.
What is the range of values for these correlations? Are there
any negative correlations? If all of these items are "measuring
the same thing" (i.e., a latent variable), there should be no
negative correlations -- if there are, why are they there? What is
the squared multiple correlation (SMC) of each item with the
other 22 items? These SMCs are the communality estimates
that will be used in the diagonal of the correlation matrix
that is factor analyzed (principal components analysis uses 1.00
in the diagonal because it analyzes the "total" variance instead
of the "common" variance [which excludes error/unique variance]).
What is the range of SMC values? A small SMC indicates that
an item has little "common variance" with the other items,
indicating that whatever you think you are measuring, the items
are not all measuring the same thing (metaphorically, you may
think you have 23 oranges and you want orange juice, but
you might actually have different fruits and you're going to
wind up with fruit salad).
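In case it helps, here is a minimal SPSS sketch of those checks
(item1 TO item23 are placeholder names -- substitute your own):

* Inter-item correlation diagnostics: /STATISTICS=CORR prints the
  full 23x23 correlation matrix and /SUMMARY=CORR gives the mean,
  minimum, maximum, and range of the 253 inter-item correlations.
RELIABILITY
  /VARIABLES=item1 TO item23
  /SCALE('ALL ITEMS') ALL
  /MODEL=ALPHA
  /STATISTICS=DESCRIPTIVE CORR
  /SUMMARY=TOTAL CORR.
* SMCs: with principal axis factoring, the "Initial" communalities
  that SPSS prints are the squared multiple correlations of each
  item with the other 22.
FACTOR
  /VARIABLES=item1 TO item23
  /PRINT=INITIAL EXTRACTION
  /EXTRACTION=PAF.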
(3) Conceptually, are these 23 items supposed to be measuring the
"same thing" (i.e., a single factor or a single latent variable) or are
there theoretical reasons to group the items together to reflect different
factors/latent variables? This is a theoretical issue as well as a
statistical issue, just like figuring out what effect size you want to
detect, how much statistical power you want, what sample size you need to
have, and so on. If the items represent different latent variables,
are the latent variables correlated or uncorrelated? Are the errors
and unique variances independent or are they correlated? Do you
expect method factors to be present? These are questions you should
try to answer *BEFORE* you analyze the data.
(4) Let's get clear on the difference between principal components
analysis and the numerous types of factor analysis. One can think of
principal components analysis as being a purely *descriptive* procedure
because its goal is to come up with a set of equations that account for
the total variance represented in a correlation matrix. Consider:
(a) If you have a correlation matrix that is just a diagonal matrix
(i.e., you have 1.00 [the standardized variance for a variable] in the
diagonal and 0.00 in all of the off-diagonal elements [none of the pairs
of variables are correlated]), you cannot reduce the matrix to a smaller
matrix. If you have 23 items, then the *RANK* of the matrix is 23.
You need all 23 items to explain the phenomenon you are measuring.
(b) If you have a correlation matrix with off-diagonal elements
significantly different from zero, then you can reduce the, say,
23x23 matrix to a smaller, say, FxF matrix, because the rows/columns
are not independent of each other. The correlated rows/columns can be
combined into a single equation that will retain the original
information but allow one to reduce the rank of the matrix to some
smaller size.
(c) If you believe that all 23 items are measures of the same thing
(a single latent variable), then the 23x23 correlation matrix will reduce
to a single factor. Imagine a regression equation that relates the values
of the 23 items to the single factor score -- the regression coefficients
are factor loadings, which indicate the contribution of each item to the
factor score. In the ideal case, all pairwise correlations are high,
errors and unique variances are uncorrelated, and a single factor accounts
for all of the variance among the items (in principal components analysis,
the first component will have some huge value for its eigenvalue [an
eigenvalue is a solution to the set of equations used in this analysis])
and the remaining 22 components will have much smaller eigenvalues (near
zero in the ideal case). This is the type of model assumed by researchers
who believe in "g", a general intelligence factor accounting for all of
the variance in the items (NOTE: this is strictly true only under ideal
conditions; in real life, well, that's another story).
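Here is a quick way to see the eigenvalue patterns in (a) and (c) for
yourself, using SPSS's MATRIX language (a minimal sketch, shrunk to 5
variables and an arbitrary uniform correlation of .60 to keep the
printout small):

MATRIX.
* Case (a): an identity (diagonal) correlation matrix -- every
  eigenvalue is 1.00, so the rank cannot be reduced.
COMPUTE r0 = IDENT(5).
CALL EIGEN(r0, vec0, val0).
PRINT val0 /TITLE="Eigenvalues of an identity matrix".
* Case (c): every pair of items correlates .60 -- one large
  eigenvalue (3.4) and four small ones (.40 each).
COMPUTE r1 = MAKE(5,5,.60).
CALL SETDIAG(r1, 1).
CALL EIGEN(r1, vec1, val1).
PRINT val1 /TITLE="Eigenvalues with uniform r = .60".
END MATRIX.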
(d) You may instead believe that the 23 items are measuring different
things, that is, different latent variables. Jim Clark's SPSS simulation
example created empirical variables based on three uncorrelated latent
variables, and the question he was asking was how well SPSS's factoring
program recovered the latent structure, an issue that we'll return to.
Annette expressed surprise that she obtained 10 components/factors, with
5 that appeared to be "good" and a few with only 1-3 items loading on
them. Referring back to point (3) above, this reaction suggests that
some thought about the latent structure for the items was given but not
really worked out. This raises questions about what the items are
supposed to measure and to what extent the individual items appear to
measure the "same thing" or "different things" -- these are not simple
questions, and their answers are also likely not to be simple, but they
will guide how the analyses should proceed.
(5) Some people use principal components analysis as a purely
*descriptive* tool; that is, under relatively simple assumptions, one
can ask whether a correlation matrix (in this case 23x23) can be reduced
to a smaller matrix (say FxF) that accounts for some large proportion of
the total variance among the items. If the 10 components obtained from
the analysis account for, say, 70% of the total variance, then
some people will be very happy with that. Instead of using 23 values, one
can compute 10 component scores and use them in additional analyses.
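Here is a minimal SPSS sketch of that descriptive use (again, item1 TO
item23 are placeholders, and FACTORS(10) simply mirrors the 10
components mentioned above):

* Principal components, keeping 10 components; /SAVE=REG(ALL) writes
  the 10 component scores (FAC1_1 to FAC10_1) to the working file
  for use in later analyses.
FACTOR
  /VARIABLES=item1 TO item23
  /PRINT=INITIAL EXTRACTION
  /CRITERIA=FACTORS(10)
  /EXTRACTION=PC
  /SAVE=REG(ALL).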
NOTE: It is important to understand that principal components analysis is
a procedure that attempts to allocate portions of the total variance to
combinations of correlated items -- it is concerned with explaining how
the total variance is to be divided, much as eta^2 does in ANOVA. It does
*NOT* attempt to explain why there are specific correlations between
variables or how many latent variables might be serving as the basis for
the correlations. Factor analysis is concerned with doing that, but one
then commits to specific models in order to be able to do so. So, since
principal components analysis can be thought of as simply a
variance-partitioning procedure, one can treat its results in the same
way one would treat other descriptive statistics.
(6) Everything I have said above technically applies to variables
measured on an interval or better scale (possibly for Likert-type scales
as well), but things become muddier (uglier?) when we are dealing with
dichotomous variables. The correlation matrix now contains phi
correlations, and though phi is supposed to range from -1 to +1, it can
do so only if the proportions in one variable match those in the other
variable (Jim Clark did what was essentially a median split for his
simulated data, but in real life this condition is unlikely to be met).
If the proportions are not equal, phi cannot reach +/- 1.00 (Guilford
and Fruchter's statistics textbook has coverage of this topic). The phi
correlations are likely to provide a distorted picture of what is going
on even when conditions are ideal (I'll provide an example shortly).
This is one reason why some people have suggested the use of the
tetrachoric correlation (which assumes that the dichotomous variable
reflects an underlying latent distribution such as the normal
distribution) but, in my opinion, one should think long and hard about
whether one's data meet this assumption. If one has a dichotomous item
such as "Are you pregnant?", the answer seems to be a true dichotomy --
what would the underlying normal distribution mean? One should be able
to justify why one is using tetrachoric correlations instead of the phi
coefficient, and explanations based on convenience are not likely to
work.
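To make the ceiling on phi concrete, here is the bound as I remember it
from Guilford & Fruchter (check the text before relying on it): if the
two items have "positive" response proportions p1 and p2, with p1 <= p2,
then

max phi = sqrt[ (p1*(1-p2)) / (p2*(1-p1)) ].

For example, with p1 = .50 and p2 = .80, max phi =
sqrt[(.50*.20)/(.80*.50)] = sqrt(.25) = .50, so even two perfectly
related items could not correlate above .50.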
(7) Jim Clark provided SPSS code that produces 24 "empirical" variables
that are based on three factors (A, B, and C). He also suggested some
simple
analyses. Let me suggest a couple more:
(A) The following SPSS code conducts a maximum likelihood (ML) factor
analysis with "promax" rotation (i.e., it assumes correlated factors,
calculates the correlations between the factors, and adjusts the rest of
the calculations accordingly). I also ask SPSS to "blank out" factor
loadings less than .25 in order to make the pattern of loadings of
variables on factors clear. Here is the code:
factor /vari = a1 to c8/
format=sort blank(.25)/
extraction=ml/rotation=promax.
Three factors are extracted but account for only about 54% of the common
variance. The pattern matrix and structure matrix cleanly identify which
items load on which factors -- they load as expected. One reason why
people use ML factor analysis is that it provides a goodness-of-fit
chi-square that tells one whether the factor model fits the observed
data. In this case, we want a *NONSIGNIFICANT* chi-square, because this
implies that the data are consistent with the obtained model:
X^2(df=207) = 208.65, p=.46. So far, so good. The Factor Correlation
Matrix has off-diagonal values close to zero, which is what we would
expect if we assume three uncorrelated factors as the basis for the
observed variables.
(B) The following SPSS code conducts an ML factor analysis on the
dichotomous variables that Jim Clark created.
factor /vari = na1 to nc8 /
format=sort blank(.25)/
extraction=ml/rotation=promax.
Now SPSS analyzes a PHI correlation matrix and, if dichotomization has
no effect, we should expect to get results very similar to those seen
with the original data. Unfortunately, this is not the case. Five factors
are extracted that account for about 49.70% of the common variance. After
rotation we see most of the empirical variables loading on the appropriate
factors, but there are two items that load on "singleton" factors (i.e.,
only one observed variable loads on the factor), these being Na2 and Na3.
However, the goodness-of-fit statistic indicates that this is a good
model: X^2(df=166) = 140.69, p=.92. Examination of the Factor Correlation
Matrix shows that it is the singleton factors (Factors 4 and 5) that are
correlated with the other factors (e.g., with Factor 2, and Factor 4 with
Factor 5). These seem like reasonable results, but we know they are wrong
because the data were generated from three independent latent variables.
The difference between the results here and those in 7(A) above seems to
be attributable to dichotomizing the data.
(8) Well, it seems that doing a factor analysis on a phi correlation
matrix is not such a good idea, and it might be better to do it on a
tetrachoric correlation matrix. There's just one problem: SPSS doesn't
calculate tetrachoric correlations. Let me correct that:
(A) *IF* you have the Python Essentials package as a
plug-in to SPSS *AND* you have the "R" statistics program on your
machine, then you can call these programs to do the calculations.
See the following for details:
http://www-01.ibm.com/support/docview.wss?uid=swg21475247
(B) Well, if you don't have Python and/or R, don't worry: you can
use the TETRA-COM SPSS program to calculate them for you. See:
http://link.springer.com/article/10.3758/s13428-012-0200-6
Of course, you'll have to download the files from here and
install them:
http://brm.psychonomic-journals.org/content/supplemental
Make sure you read and follow the instructions carefully.
(9) One can overcome some of these problems by using other
programs that will calculate a tetrachoric correlation matrix and
do a factor analysis/structural equation model (SEM) on it.
Programs like EQS, LISREL, Mplus, and others can do this (I don't
know if AMOS is up to date on this), but I would recommend using
Mplus because Bengt Muthen, who is responsible for the program,
is also a talented statistician who has been developing the underlying
statistical theory for these types of analyses; see:
http://www.statmodel.com/
(10) NYU colleague Pat Shrout, who teaches the graduate course
on structural equation modeling over in Arts & Science, has a nice
PowerPoint that goes over the issues of analyzing dichotomous data
in FA/SEM, which might be useful; see:
www.nyu.edu/classes/shrout/G89-2247/04Lect10.ppt
Good luck!
-Mike Palij
New York University
[email protected]
P.S. Remember that Zen koan about "those who know...", well nevermind. ;-)