Applied analysis question

2002-02-27 Thread Brad Anderson

I have a continuous response variable that ranges from 0 to 750.  I
only have 90 observations and 26 are at the lower limit of 0, which is
the modal category.  The mean is about 60 and the median is 3; the
distribution is highly skewed, extremely kurtotic, etc.  Obviously,
none of the power transformations are especially useful.  The product
moment correlation between the response and the primary covariate is
near zero; however, a rank-order correlation coefficient is about .3
and is significant.  We have 5 additional control variables.  I'm
convinced that any attempt to model the conditional mean response is
completely inappropriate, yet all of the alternatives appear flawed as
well.  Here's what I've done:

I've collapsed the outcome into 3- and 4-category ordered response
variables and estimated ordered logit models.  I dichotomized the
response (any vs none) and estimated binomial logit.  All of these
approaches yield substantively consistent results using both the model
based standard errors and the Huber-White sandwich robust standard
errors.  My concerns about this approach are 1) the somewhat arbitrary
classification restricts the observed variability, and 2) the
estimators assume large sample sizes.

I rank transformed the response variable and estimated a robust
regression (using the rreg procedure in Stata)--results were
consistent with those obtained for the ordered and binomial logit
models described above.  I know that Stokes, Davis, and Koch have
presented procedures to estimate analysis of covariance on ranks, but
I've not seen reference to the use of rank transformed response
variables in a regression context.

A plot of the rank-transformed response with the primary covariate
clearly suggests a meaningful pattern.  Contingency table analysis
with a collapsed covariate strongly suggests a meaningful pattern.  But
I'm at something of a loss to know the best way to analyze and report
the results.  Thanks in advance.
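The near-zero Pearson versus .3 Spearman contrast described above is easy to reproduce on simulated zero-inflated data; a minimal sketch (the data-generating process is invented for illustration, not taken from the study):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
n = 90
x = rng.normal(size=n)
# Invented data: ~30% structural zeros, otherwise a lognormal count
# that grows with x -- monotone but wildly nonlinear.
y = np.where(rng.random(n) > 0.3,
             np.rint(np.exp(1.5 + 0.6 * x + rng.normal(scale=1.5, size=n))), 0)

r_pearson, _ = pearsonr(x, y)
rho_spearman, _ = spearmanr(x, y)
print(f"Pearson r = {r_pearson:.2f}, Spearman rho = {rho_spearman:.2f}")
```

A monotone but highly nonlinear relationship with extreme values deflates the product-moment coefficient, while the rank-based coefficient still picks it up.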


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Applied analysis question

2002-02-28 Thread Brad Anderson

Rich Ulrich [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]...
 On 27 Feb 2002 11:59:53 -0800, [EMAIL PROTECTED] (Brad Anderson)
 wrote:
 
  I have a continuous response variable that ranges from 0 to 750.  I
  only have 90 observations and 26 are at the lower limit of 0, which is
  the modal category.  The mean is about 60 and the median is 3; the
  distribution is highly skewed, extremely kurtotic, etc.  Obviously,
  none of the power transformations are especially useful.  The product
 
 I guess it is 'continuous'  except for having 26 ties at 0.  
 I have to wonder how that set of scores arose, and also, 
 what should a person guess about the *error*  associated
 with those:   Are the numbers near 750  measured with
 as much accuracy as the numbers near 3?

I should have been more precise.  It's technically a count variable
representing the number of times respondents report using dirty
needles/syringes after someone else had used them during the past 90
days.  Subjects were first asked to report the number of days they had
injected drugs, then the average number of times they injected on
injection days, and finally, on how many of those total times they had
used dirty needles/syringes.  All of the subjects are injection drug
users, but not all use dirty needles.  The reliability of reports near
0 is likely much better than the reliability of estimates near 750. 
Indeed, substantively, the difference between a 0 and a 1 is much more
significant than the difference between a 749 and a 750: 0 represents
no risk, 1 represents at least some risk, and high values, regardless
of their precision, represent high risk.
 
 How do zero scores arise?  Is this truncation;  the limit of
 practical measurement;  or just what?

Zero scores are logical and represent no risk, negative values are not
logical.
 
 Extremely kurtotic, you say.  That huge lump at 0 and skew
 is not consistent with what I think of as kurtosis, but I guess
 I have not paid attention to kurtosis at all, once I know that
 skewness is extraordinary.

True, the kurtosis statistic exceeded 11, and a plot against the
normal indicates a huge lump in the low end of the tail, and also a
larger proportion of very high values than expected.
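The combination of a spike at zero and a long right tail produces exactly this signature, as a quick simulation illustrates (invented data; Fisher's definition of kurtosis, under which a normal distribution scores 0):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(3)
n = 90
# Invented data: a spike at zero plus a long lognormal right tail.
y = np.where(rng.random(n) > 0.3,
             np.rint(np.exp(1.5 + rng.normal(scale=1.5, size=n))), 0)

# Fisher's definition: a normal distribution has excess kurtosis 0.
print(f"skewness = {skew(y):.1f}, excess kurtosis = {kurtosis(y):.1f}")
```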
 
 Categorizing the values into a few categories labeled, 
 none, almost none,   is one way to convert your scores.  
 If those labels do make sense.

Makes sense at the low end: 0 risk.  And at the high end I used 90+
representing using a dirty needle/syringe once a day or more often. 
The 2 middle categories were pretty arbitrary.

If I analyze a contingency table using the 4-category response and a
3-category measure of the primary covariate (categories defined using
clinically meaningful cutoffs), the association is quite strong, and
the exact p-value associated with the CMH difference-in-row-means
test (using SAS) is significant.  I also used
the 3-category predictor and the procedures outlined by Stokes et al.
(2000) to estimate a rank analysis of covariance--again with
consistent results.
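Neither scipy nor statsmodels exposes the exact CMH row-mean-scores test, but the asymptotic chi-square test of independence gives a rough cross-check on a table of this shape (the counts below are invented, not the original data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented 3 (covariate) x 4 (response) table of 90 subjects -- NOT the
# original data, just the same shape of analysis.
table = np.array([[20, 6, 3, 1],
                  [ 8, 9, 7, 6],
                  [ 3, 5, 9, 13]])

# scipy has no exact CMH row-mean-scores test; the asymptotic
# chi-square test of independence is only a rough cross-check.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4g}")
```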

I've also run a few other analyses I didn't describe.  I used the
Box-Cox procedure to find a power transformation.  Although the
skewness statistic then looks great, the distribution is still not
approximately normal.  However, a regression using the transformed
variable is consistent with the ordered logit and the contingency
table analysis.
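In scipy that step looks roughly as follows.  Box-Cox needs strictly positive data, so the zeros force an arbitrary shift, and the large tie at the minimum survives any monotone transform, which is why the skewness statistic can look fine while normality still fails:

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(4)
n = 90
# Invented zero-inflated, heavily right-skewed data.
y = np.where(rng.random(n) > 0.3,
             np.rint(np.exp(1.5 + rng.normal(scale=1.5, size=n))), 0)

# Box-Cox requires strictly positive data, so the zeros force a shift;
# adding 1 is conventional but arbitrary.  The tied zeros all map to a
# single tied value, so the spike survives the transform.
y_t, lam = boxcox(y + 1)
print(f"lambda = {lam:.2f}, skewness before = {skew(y):.1f}, "
      f"after = {skew(y_t):.2f}")
```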

One of the other posters asked about the appropriate error term--I
guess that lies at the heart of my inquiry.  I have no idea what the
appropriate error term would be, or how best to model such data.  I often
deal with similar response variables that have distributions in which
observations are clustered at 1 or both ends of the continuum.  In
most cases, these distributions are not even approximately unimodal
and mildly skewed--the kind of variables for which normalizing power
transformations make sense.  Additionally, these typically aren't
outcomes that could be thought of as being generated by a gaussian
process.

In some cases I think it makes sense to consider Poisson and
generalizations of Poisson processes, although there is clearly much
greater between-subject heterogeneity than assumed by a Poisson
process.  I estimated Poisson and negative binomial regression
models--there was compelling evidence that the Poisson was
overdispersed.  I also used a Vuong statistic to compare NB regression
with zero-inflated NB regression--the results support the
zero-inflated model.  The model-based standard errors for a
zero-inflated model are wildly different from the Huber-White sandwich
robust standard errors.  The latter give results that are fairly
consistent with the ordered logit; the model-based standard errors are
huge--given that these are asymptotic statistics and I have a
relatively small sample, I don't really trust either.

I think a lot of folks just run standard analyses or arbitrarily apply
some normalizing transformation because that's what's done.

Re: Applied analysis question

2002-03-01 Thread Brad Anderson

[EMAIL PROTECTED] (Eric Bohlman) wrote in message 
news:a5o5b1$fi0$[EMAIL PROTECTED]...
 Rolf Dalin [EMAIL PROTECTED] wrote:
 
 IIRC, your example is exactly the sort of situation for which Tobit 
 modelling was invented.

Considered that (I actually estimated a couple of Tobit models, and if
I use a log-transformed or Box-Cox-transformed response the results
are consistent with the ordinal logit I originally described), but
Tobit assumes a normally distributed censored response -- the observed
distribution for the non-zero responses is not approximately normal
(even with transformations) and I don't think it's reasonable to
assume the errors are generated by an underlying gaussian process.  My
understanding of the Tobit model is that it's not especially robust to
violations of this assumption.
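For what it's worth, the Tobit likelihood is simple enough to write down directly, which also makes the normality assumption explicit; a minimal sketch on simulated (censored-normal, i.e. best-case) data, maximized with scipy:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(6)
n = 90
x = rng.normal(size=n)
y_star = 1.0 + 0.8 * x + rng.normal(size=n)   # latent response
y = np.maximum(y_star, 0.0)                   # left-censored at zero
X = np.column_stack([np.ones(n), x])

def neg_loglik(theta):
    beta, log_sigma = theta[:-1], theta[-1]
    sigma = np.exp(log_sigma)
    xb = X @ beta
    censored = y <= 0
    # Censored cases contribute P(y* <= 0) = Phi(-xb/sigma); uncensored
    # cases contribute the normal density.  This is exactly where the
    # gaussian assumption enters.
    ll = np.where(censored,
                  norm.logcdf(-xb / sigma),
                  norm.logpdf((y - xb) / sigma) - np.log(sigma))
    return -ll.sum()

res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print(res.x[:2])  # intercept and slope estimates
```

When the non-zero part of the distribution is far from normal, as described above, both pieces of this likelihood are misspecified, which is the robustness concern.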

