Re: [R] Is it possible to use glm() with 30 observations?

2005-07-02 Thread David Firth
On 2 Jul 2005, at 06:01, Spencer Graves wrote:

 The issue is not 30 observations but whether it is possible to
 perfectly separate the two possible outcomes.  Consider the following:

 tst.glm <- data.frame(x=1:3, y=c(0, 1, 0))
 glm(y~x, family=binomial, data=tst.glm)

 tst2.glm <- data.frame(x=1:1000,
                        y=rep(0:1, each=500))
 glm(y~x, family=binomial, data=tst2.glm)

 The algorithm fits y~x to tst.glm without complaining, but issues
 warnings for tst2.glm.  This is called the Hauck-Donner effect, and
 RSiteSearch("Hauck-Donner") just now produced 8 hits.  For more
 information, look for Hauck-Donner in the index of Venables, W. N.
 and Ripley, B. D. (2002) _Modern Applied Statistics with S._ New
 York: Springer.

Not exactly.  The phenomenon that causes the warning for tst2.glm above 
is more commonly known as complete separation.  For some comments on 
its implications you might look at another work by B D Ripley, the 1996 
book Pattern Recognition and Neural Networks.  There are some further 
references in the help files of the brlr package on CRAN.
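
(For illustration only, a minimal sketch of how the symptom shows up;
the data here are made up:)

## toy data with complete separation: y = 0 for x <= 5, y = 1 for x > 5
sep.dat <- data.frame(x = 1:10, y = rep(0:1, each = 5))
fit <- glm(y ~ x, family = binomial, data = sep.dat)

## fitted probabilities pile up numerically at 0 and 1 ...
range(fitted(fit))

## ... and the Wald standard errors are enormous
summary(fit)$coefficients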

The problem noted by Hauck and Donner (1977, JASA) is slightly related, 
but not the same.  See the aforementioned book by Venables and Ripley, 
for example.  The glm function does not routinely warn us about the 
Hauck-Donner effect, as far as I know.
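
(To see the Hauck-Donner effect itself, as distinct from separation:
with a large true effect the Wald statistic can stop growing and fall
back, while the likelihood ratio statistic keeps increasing.  A rough
sketch of my own, with made-up grouped data:)

## two groups of 50; group 1 fixed at 25/50 successes, group 2 varies
for (s2 in c(35, 45, 48, 49)) {
  dat <- data.frame(g = factor(rep(1:2, each = 50)),
                    y = c(rep(0:1, c(25, 25)), rep(0:1, c(50 - s2, s2))))
  fit <- glm(y ~ g, family = binomial, data = dat)
  z   <- summary(fit)$coefficients["g2", "z value"]
  lr  <- anova(glm(y ~ 1, family = binomial, data = dat), fit)$Deviance[2]
  cat("group 2 successes:", s2, " Wald z:", round(z, 2),
      " LR chisq:", round(lr, 2), "\n")
}
## the Wald z should peak and then decline as s2 approaches 49,
## while the LR statistic keeps growing: the Hauck-Donner effect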

The original poster did not say what the purpose of the logistic 
regression was, so it is hard to advise.  Depending on the purpose, the 
separation that was detected may or may not be a problem.

Regards,
David

 (If you don't already have this book, I recommend you
 give serious consideration to purchasing a copy.  It is excellent on
 many issues relating to statistical analysis and R.)

 Spencer Graves

 Kerry Bush wrote:

 I have a very simple problem. When using glm to fit
 a binary logistic regression model, sometimes I receive
 the following warning:

 Warning messages:
 1: fitted probabilities numerically 0 or 1 occurred
 in: glm.fit(x = X, y = Y, weights = weights, start =
 start, etastart = etastart,
 2: fitted probabilities numerically 0 or 1 occurred
 in: glm.fit(x = X, y = Y, weights = weights, start =
 start, etastart = etastart,

 What does this output tell me? Since I only have 30
 observations, I assume this is a small sample problem.
 Is it possible to fit this model in R with only 30
 observations? Could any expert provide suggestions to
 avoid the warning?






Re: [R] Is it possible to use glm() with 30 observations?

2005-07-02 Thread Ted Harding
On 02-Jul-05 Kerry Bush wrote:
 I have a very simple problem. When using glm to fit
 a binary logistic regression model, sometimes I receive
 the following warning:
 
 Warning messages:
 1: fitted probabilities numerically 0 or 1 occurred
 in: glm.fit(x = X, y = Y, weights = weights, start =
 start, etastart = etastart,  
 2: fitted probabilities numerically 0 or 1 occurred
 in: glm.fit(x = X, y = Y, weights = weights, start =
 start, etastart = etastart,  
 
 What does this output tell me? Since I only have 30
 observations, I assume this is a small sample problem.

It isn't. Spencer Graves has shown clearly with two examples
that you can get a fit with no warnings with 3 observations,
and a fit with warnings for 1000 observations. As he says,
it arises when you get perfect separation with respect to
the linear model.

However, it may be worth expanding Spencer's explanation.
With a single explanatory variable x (as in Spencer's examples),
perfect separation occurs when y = 0 for all x <= some x0,
and y = 1 for all x > x0.

One of the parameters in the linear model is the scale parameter
(i.e. the reciprocal of the slope). If you express the model
in the form

  logit(P(Y=1;x)) = (x - mu)/sigma

then sigma is the scale parameter in question.

As sigma -> 0, P(Y=1;x) -> 0 for x < mu, and -> 1 for x > mu.

Therefore, for any value of mu between x0 (at and below which
all y=0 in your data) and x1 (the next larger value of x, at
and above which all y=1), letting sigma -> 0 gives a fit
which perfectly predicts your y-values: it predicts P(Y=1) = 0,
i.e. P(Y=0) = 1, for x < mu, and predicts P(Y=1) = 1 for x > mu;
and this is exactly what you have observed in the data.

So it's not a disaster -- in model-fitting terms, it is a
resounding success!
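
(Numerically, you can watch this happen by plugging shrinking values
of sigma into the logistic curve directly; a small sketch, with
made-up x, y and mu:)

x  <- 1:10
y  <- rep(0:1, each = 5)   # y = 0 for x <= 5, y = 1 for x > 5
mu <- 5.5                  # any value between x0 = 5 and x1 = 6
for (sigma in c(2, 0.5, 0.1, 0.01)) {
  p <- plogis((x - mu)/sigma)   # P(Y=1;x) = inverse-logit((x - mu)/sigma)
  cat("sigma =", sigma, ":", round(p, 3), "\n")
}
## as sigma -> 0 the fitted probabilities converge to y itself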

However, in real life one does not expect to be dealing with
a situation where the outcomes are so perfectly predictable,
and therefore one views such a result with due mistrust.
One attributes the perfect separation not to perfect
predictability, but to the possibility that, by chance,
all the y=0 occur at lower values of x, and all the y=1
at higher values of x.

 Is it possible to fit this model in R with only 30
 observations? Could any expert provide suggestions to
 avoid the warning?

Yes! As Spencer showed, it is possible with 3 -- but of
course it depends on the outcomes y.

As to a suggestion to avoid the warning -- what you really
need to avoid is data where the x-values are so sparse in
the neighbourhood of the P(Y=1;x) = 0.5 area that it becomes
likely that you will get y-values showing perfect separation.

What that means in practice is that, over the range of x-values
such that P(Y=1;x) rises from (say) 0.2 to 0.8 (chosen for
illustration), you should have several x values in your data.
Then the phenomenon of perfect separation becomes unlikely.

But what that means in real life is that you need enough data,
over the relevant range of x-values, to enable you to obtain
(with high probability) a non-zero estimate of sigma (i.e. an
estimate of slope 1/sigma which is not infinite) -- i.e. that
you have enough data, and in the right places, to estimate the
rate at which the probability increases from low to high values.

(Theoretically, it is possible that you get perfect separation
even with well-distributed x-values and sigma > 0; but with
well-distributed x-values the chance that this would occur is so
small that it can be neglected).
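
(That is easy to check by simulation; a rough sketch, using a crude
monotonicity test for separation and made-up design points:)

## fraction of simulated datasets showing perfect separation
sep.rate <- function(x, mu = 5, sigma = 1, nsim = 1000) {
  mean(replicate(nsim, {
    y <- rbinom(length(x), 1, plogis((x - mu)/sigma))
    all(y[order(x)] == cummax(y[order(x)]))  # 0s then 1s => separated
  }))
}
sep.rate(rep(c(1, 9), each = 15))  # 30 obs, all at the two extremes
sep.rate(rep(3:7, each = 6))       # 30 obs spread through the middle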

So, to come back to your particular case, the really meaningful
suggestion for avoiding the warning is that you need better
data. If your study is such that you have to take the x-values
as they come (as opposed to a designed experiment where you
can decide what they are to be), then this suggestion boils
down to "get more data".

What that would mean depends on having information about the
smallest value of sigma (largest value of slope) that is
*plausible* in your context. Your data are not particularly
useful, since they positively encourage adopting sigma=0.
So objective information about this could only come from
independent knowledge.

However, as a rule of thumb, in such a situation I would try
to get more data until I had, say, 10 x-values roughly evenly
distributed between the largest for which y=0 and the smallest
for which y=1. If that didn't work first time, then repeat
using the extended data as starting-point.

Or simply sample more data until the phenomenon of perfect
separation is well avoided, and the S.D. of the x-coefficient
is distinctly smaller than the value of the x-coefficient.
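
(In R that check is just a glance at the coefficient table; a sketch,
where extended.dat is a hypothetical name for your enlarged dataset:)

fit <- glm(y ~ x, family = binomial, data = extended.dat)  # extended.dat: hypothetical
cf  <- summary(fit)$coefficients
cf["x", "Estimate"]     # the slope, i.e. 1/sigma
cf["x", "Std. Error"]   # want this distinctly smaller than the slope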

Hoping this helps,
Ted.



E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 02-Jul-05   Time: 10:45:04
-- XFMail --


Re: [R] Is it possible to use glm() with 30 observations?

2005-07-02 Thread Spencer Graves
  I agree with Ted:  "in model-fitting terms, it is a
resounding success!"  With any data set having at least one point with a 
binomial yield of 0 or 100%, you can get this phenomenon by adding a 
series of random numbers sequentially to a model.  Eventually, you will 
add enough variables that some linear combination of the predictors 
will provide perfect prediction for that case.  In such cases, the usual 
asymptotic normality of the parameter estimates breaks down, but you can 
still test using 2*log(likelihood ratio) being approximately chi-square 
with anova(fit0, fit1, test="Chisq"), as explained in Venables and 
Ripley and many other sources, e.g., ?anova.glm:

> tst2.glm <- data.frame(x=1:1000,
+                        y=rep(0:1, each=500))
> fit0 <- glm(y~1, family=binomial, data=tst2.glm)
> fit1 <- glm(y~x, family=binomial, data=tst2.glm)
Warning messages:
1: algorithm did not converge in: glm.fit(x = X, y = Y, weights = 
weights, start = start, etastart = etastart,
2: fitted probabilities numerically 0 or 1 occurred in: glm.fit(x = X, y 
= Y, weights = weights, start = start, etastart = etastart,
> anova(fit0, fit1, test="Chisq")
Analysis of Deviance Table

Model 1: y ~ 1
Model 2: y ~ x
  Resid. Df Resid. Dev  Df Deviance  P(>|Chi|)
1       999     1386.3
2       998  6.764e-05   1   1386.3 1.999e-303
 
  This effect exposes a limit to the traditional advice in 
experimental design to test each factor at only two levels, placed as 
far apart as possible without jeopardizing linearity.  Here, you want 
to estimate a priori the range over which the probability of success 
goes from, say, 20% to 80%, then double that range and make all sets 
of test conditions unique and spaced roughly evenly over that range.  
Designing experiments with a binary response is an extremely difficult 
problem.  This crude procedure should get you something moderately 
close to an optimal design.
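
(As a concrete sketch of that recipe, with mu and sigma standing in
for prior guesses at the logistic location and scale, not estimates
from data:)

mu <- 10; sigma <- 2                    # guessed: P = 0.5 at mu, slope 1/sigma
r  <- mu + sigma * qlogis(c(0.2, 0.8))  # x-range where P runs from 0.2 to 0.8
w  <- diff(r)                           # width of that range
design.x <- seq(mu - w, mu + w, length.out = 10)  # doubled range, evenly spaced
design.x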

  Best Wishes,
  spencer graves

(Ted Harding) wrote:
 On 02-Jul-05 Kerry Bush wrote:
 
I have a very simple problem. When using glm to fit
a binary logistic regression model, sometimes I receive
the following warning:

Warning messages:
1: fitted probabilities numerically 0 or 1 occurred
in: glm.fit(x = X, y = Y, weights = weights, start =
start, etastart = etastart,  
2: fitted probabilities numerically 0 or 1 occurred
in: glm.fit(x = X, y = Y, weights = weights, start =
start, etastart = etastart,  

What does this output tell me? Since I only have 30
observations, I assume this is a small sample problem.
 
 
 It isn't. Spencer Graves has shown clearly with two examples
 that you can get a fit with no warnings with 3 observations,
 and a fit with warnings for 1000 observations. As he says,
 it arises when you get perfect separation with respect to
 the linear model.
 
 However, it may be worth expanding Spencer's explanation.
 With a single explanatory variable x (as in Spencer's examples),
 perfect separation occurs when y = 0 for all x <= some x0,
 and y = 1 for all x > x0.
 
 One of the parameters in the linear model is the scale parameter
 (i.e. the reciprocal of the slope). If you express the model
 in the form
 
   logit(P(Y=1;x)) = (x - mu)/sigma
 
 then sigma is the scale parameter in question.
 
 As sigma -> 0, P(Y=1;x) -> 0 for x < mu, and -> 1 for x > mu.
 
 Therefore, for any value of mu between x0 (at and below which
 all y=0 in your data) and x1 (the next larger value of x, at
 and above which all y=1), letting sigma -> 0 gives a fit
 which perfectly predicts your y-values: it predicts P(Y=1) = 0,
 i.e. P(Y=0) = 1, for x < mu, and predicts P(Y=1) = 1 for x > mu;
 and this is exactly what you have observed in the data.
 
 So it's not a disaster -- in model-fitting terms, it is a
 resounding success!
 
 However, in real life one does not expect to be dealing with
 a situation where the outcomes are so perfectly predictable,
 and therefore one views such a result with due mistrust.
 One attributes the perfect separation not to perfect
 predictability, but to the possibility that, by chance,
 all the y=0 occur at lower values of x, and all the y=1
 at higher values of x.
 
 
Is it possible to fit this model in R with only 30
observations? Could any expert provide suggestions to
avoid the warning?
 
 
 Yes! As Spencer showed, it is possible with 3 -- but of
 course it depends on the outcomes y.
 
 As to a suggestion to avoid the warning -- what you really
 need to avoid is data where the x-values are so sparse in
 the neighbourhood of the P(Y=1;x) = 0.5 area that it becomes
 likely that you will get y-values showing perfect separation.
 
 What that means in practice is that, over the range of x-values
 such that P(Y=1;x) rises from (say) 0.2 to 0.8 (chosen for
 illustration), you should have several x values in your data.
 Then the phenomenon of perfect separation becomes unlikely.
 
 But what that means in real life is that you need enough data,
 over the relevant range of x-values, to enable you to obtain
 (with high probability) a non-zero estimate of sigma.

Re: [R] Is it possible to use glm() with 30 observations?

2005-07-01 Thread Spencer Graves
  The issue is not 30 observations but whether it is possible to 
perfectly separate the two possible outcomes.  Consider the following:

tst.glm <- data.frame(x=1:3, y=c(0, 1, 0))
glm(y~x, family=binomial, data=tst.glm)

tst2.glm <- data.frame(x=1:1000,
                       y=rep(0:1, each=500))
glm(y~x, family=binomial, data=tst2.glm)

  The algorithm fits y~x to tst.glm without complaining, but issues 
warnings for tst2.glm.  This is called the Hauck-Donner effect, and 
RSiteSearch("Hauck-Donner") just now produced 8 hits.  For more 
information, look for Hauck-Donner in the index of Venables, W. N. and 
Ripley, B. D. (2002) _Modern Applied Statistics with S._ New York: 
Springer.  (If you don't already have this book, I recommend you give 
serious consideration to purchasing a copy.  It is excellent on many 
issues relating to statistical analysis and R.)

  Spencer Graves

Kerry Bush wrote:

 I have a very simple problem. When using glm to fit
 a binary logistic regression model, sometimes I receive
 the following warning:
 
 Warning messages:
 1: fitted probabilities numerically 0 or 1 occurred
 in: glm.fit(x = X, y = Y, weights = weights, start =
 start, etastart = etastart,  
 2: fitted probabilities numerically 0 or 1 occurred
 in: glm.fit(x = X, y = Y, weights = weights, start =
 start, etastart = etastart,  
 
 What does this output tell me? Since I only have 30
 observations, I assume this is a small sample problem.
 Is it possible to fit this model in R with only 30
 observations? Could any expert provide suggestions to
 avoid the warning?
 

-- 
Spencer Graves, PhD
Senior Development Engineer
PDF Solutions, Inc.
333 West San Carlos Street Suite 700
San Jose, CA 95110, USA

[EMAIL PROTECTED]
www.pdf.com http://www.pdf.com
Tel:  408-938-4420
Fax: 408-280-7915

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html