Re: [R] naive collinear weighted linear regression

2009-11-15 Thread Mauricio O Calvao
Peter Dalgaard p.dalgaard at biostat.ku.dk writes:
 
 The point is that R (as well as almost all other mainstream statistical 
 software) assumes that a weight means that the variance of the 
 corresponding observation is the general variance divided by the weight 
 factor. The general variance is still determined from the residuals, and 
 if they are zero to machine precision, well, there you go. I suspect you 
   get closer to the mark with glm, which allows you to assume that the 
 dispersion is known:
 
   summary(glm(y~x,family=gaussian),dispersion=0.3^2)
 
 or
 
   summary(glm(y~x,family=gaussian,weights=1/error^2),dispersion=1)
 

Excellent; either of these commands provides Std. Errors which now coincide with
my naive expectation: even though the data fall perfectly on a straight line,
since there are (homoskedastic) uncertainties associated with the response
variables only, the estimated coefficients should have some nonvanishing
uncertainties as well, should they not?

Now, forgive me, but I did not get the explanation of the distinct meanings of
Std. Error between my earlier call, simply summary(lm(y~x,weights=1/error^2)),
and your suggested calls; could you rephrase and dwell a little more on this
point? What exactly does the dispersion option mean?
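
If I understand correctly, the difference can be seen in a minimal sketch (I am
using the toy data from my first post, y = c(2,4,6,8) and error = 0.3, and
taking x = 1:4 so that y = 2*x exactly):

x     <- 1:4
y     <- c(2, 4, 6, 8)            # falls exactly on y = 2*x
error <- rep(0.3, 4)              # known measurement standard deviations

## lm estimates the scale (residual variance) from the residuals themselves,
## which are numerically zero here, so the Std. Errors collapse to ~0:
summary(lm(y ~ x, weights = 1/error^2))$coefficients

## glm with dispersion = 1 takes the scale as known, so the Std. Errors
## come from the supplied weights alone and are nonzero:
summary(glm(y ~ x, family = gaussian, weights = 1/error^2),
        dispersion = 1)$coefficients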

Also, could you suggest some specific reference for me to read about this? I
have your excellent book Introductory Statistics with R, 1st edition, but was
not able (perhaps I have missed something) to find this kind of distinction
there... Is this precisely what statisticians call generalized linear models
(GLMs), as opposed to (ordinary) linear models? If so, which references would
you suggest? I thought of the following books and would be much obliged if you
gave me your impressions of them, or of any other relevant references:

1) Faraway, Linear models with R
2) Faraway, Extending the linear model with R: generalized linear...
3) Fox, An R and S-Plus companion..
4) Uusipaikka, Confidence intervals in generalized linear regression models

Thank you very much!!



Re: [R] naive collinear weighted linear regression

2009-11-15 Thread Mauricio O Calvao
David Winsemius dwinsemius at comcast.net writes:


 
 It's really not that difficult to get the variance covariance matrix.  
 What is not so clear is why you think differential weighting of a set  
 that has a perfect fit should give meaningfully different results than  
 a fit that has no weights.

Again, David, what I have in mind is: since there are errors or uncertainties in
the response variables (despite the perfect collinearity of the data), which I
assume are Gaussian, if I make a large enough number of simulations of four
response values, there will undoubtedly be a dispersion in the best fit
intercept and slope obtained from a usual unweighted least squares procedure,
right? Then, if I calculated the arithmetic means of these simulated intercepts
and slopes, they would certainly come out very close to 0 and 2, respectively.
However, and THAT IS THE POINT, there will also be a standard deviation
associated with each one of these two coefficients, right??, and that is what I
would assign as the measure of uncertainty in the estimation of the
coefficients. This is not, as Dalgaard has called attention to, what the simple
command summary(lm(y~x,weights=1/err^2)) provides in its Std. Error. However, as
Dalgaard also recalled, the command
summary(glm(y~x,family=gaussian,weights=1/err^2),dispersion=1) does provide Std.
Errors in the coefficients which look plausible (at least to me) and which, at
any rate, coincide with the results from other packages (Numerical Recipes,
ROOT and possibly GSL...).
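
Just to make the simulation I have in mind concrete, here is a rough sketch
(again taking x = 1:4 and assuming a common measurement standard deviation of
0.3; the names are of course arbitrary):

set.seed(1)
x     <- 1:4
mu    <- 2 * x                      # true, noise-free responses
sigma <- 0.3                        # assumed measurement standard deviation
nsim  <- 10000

sim_fit <- function() {
  ysim <- rnorm(length(x), mean = mu, sd = sigma)   # one simulated data set
  coef(lm(ysim ~ x))                                # ordinary (unweighted) LS fit
}
coefs <- t(replicate(nsim, sim_fit()))

colMeans(coefs)        # means: close to 0 (intercept) and 2 (slope)
apply(coefs, 2, sd)    # nonzero spread: the uncertainties I would assign

## For comparison, the analytic values from a fit with known dispersion:
summary(glm(mu ~ x, family = gaussian, weights = rep(1/sigma^2, length(x))),
        dispersion = 1)$coefficients[, "Std. Error"]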

 
 ?lm
 ?vcov
 
 y <- c(2,4,6,8)  # response vector
 fit_mod <- lm(y~x, weights=1/error^2)
 Error in eval(expr, envir, enclos) : object 'error' not found
 error <- c(0.3, 0.3, 0.3, 0.3)
 fit_mod <- lm(y~x, weights=1/error^2)
 vcov(fit_mod)
               (Intercept)             x
 (Intercept)  2.396165e-30 -7.987217e-31
 x           -7.987217e-31  3.194887e-31
 
 Numerically those are effectively zero.
 
 fit_mod <- lm(y~x)
 vcov(fit_mod)
             (Intercept) x
 (Intercept)           0 0
 x                     0 0
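
For completeness: if I understand the mechanics correctly, the Std. Errors
printed by summary() are just the square roots of the diagonal of this vcov()
matrix, which is why they are ~0 for the perfectly collinear data. A quick
check (again taking x = 1:4):

x       <- 1:4
y       <- c(2, 4, 6, 8)
error   <- rep(0.3, 4)
fit_mod <- lm(y ~ x, weights = 1/error^2)

sqrt(diag(vcov(fit_mod)))                      # ~1e-15: effectively zero
summary(fit_mod)$coefficients[, "Std. Error"]  # the same (tiny) numbers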




Re: [R] naive collinear weighted linear regression

2009-11-14 Thread Mauricio O Calvao
David Winsemius dwinsemius at comcast.net writes:


 
 Which means those x, y, and error figures did not come from an  
 experiment, but rather from theory???
 

The fact is I am trying to compare the results of:
(1) lm under R and 
(2) the Java applet at http://omnis.if.ufrj.br/~carlos/applets/reta/reta.html 
(3) the Fit method of the ROOT system used by CERN,
(4) the Numerical Recipes functions for weighted linear regression

The last three all provide, for the fake data set I furnished in my first post
in this thread, the same results; in particular they give errors or
uncertainties in the estimated intercept and slope which, as seems intuitive,
are not zero at all but of the order of 0.1 or 0.2, whereas lm under R reports
a Std. Error of zero. Independently of terminology, which surely is of utmost
importance, the data I provided should give rise to a best fit straight line
with intercept zero and slope 2, but also with non-vanishing errors associated
with those coefficients. How do I get this from lm?

What I want is, for instance, the so-called covariance matrix of the estimated
coefficients, as given in Equation (2.3.2) of the second edition of Draper and
Smith, Applied Regression Analysis; this is a standard statistical result,
right? So why does R not provide it directly in the summary of an lm object?
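
Concretely, what I expect (and what I believe Numerical Recipes and ROOT
compute when the measurement errors are taken as known) is the inverse of
X' W X with W = diag(1/error_i^2), not rescaled by a residual-based variance
estimate. A rough sketch, again taking x = 1:4 for my toy data:

x     <- 1:4
y     <- c(2, 4, 6, 8)
error <- rep(0.3, 4)

X <- cbind(1, x)                    # design matrix: intercept column and x
W <- diag(1 / error^2)              # weights from the known measurement errors
covbeta <- solve(t(X) %*% W %*% X)  # covariance of (intercept, slope)
sqrt(diag(covbeta))                 # roughly 0.37 and 0.13 -- nonzero errors

## The same numbers come out of glm with the dispersion fixed at 1:
summary(glm(y ~ x, family = gaussian, weights = 1/error^2),
        dispersion = 1)$coefficients[, "Std. Error"]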


 
  Of course the best fit coefficients should be 0 for the intercept  
  and 2 for the slope. Furthermore, it seems completely plausible (or  
  not?) that, since the y_i have associated non-vanishing  
  ``errors'' (dispersions), there should be corresponding non- 
  vanishing ``errors'' associated to the best fit coefficients, right?
 
  When I try:
 
   fit_mod <- lm(y~x,weights=1/error^2)
 
  I get
 
  Warning message:
  In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
   extra arguments weigths are just disregarded.
 
   (Actually the weights are for adjusting for sampling, and I do not  
 see any sampling in your design.)
 
 
  Keeping on, despite the warning message, which I did not quite  
  understand, when I type:
 
   summary(fit_mod)
 
  I get
 
  Call:
  lm(formula = y ~ x, weigths = 1/error^2)
 
  Residuals:
  1  2  3  4
  -5.067e-17  8.445e-17 -1.689e-17 -1.689e-17
 
  Coefficients:
                 Estimate Std. Error   t value Pr(>|t|)
   (Intercept) 0.000e+00  8.776e-17 0.000e+00        1
   x           2.000e+00  3.205e-17 6.241e+16   <2e-16 ***
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
  Residual standard error: 7.166e-17 on 2 degrees of freedom
  Multiple R-squared: 1,  Adjusted R-squared: 1
  F-statistic: 3.895e+33 on 1 and 2 DF,  p-value: < 2.2e-16
 
 
  Naively, should not the column Std. Error be different from zero??  
  What I have in mind, and sure is not what Std. Error means, is that  
  if I carried out a large simulation, assuming each response y_i a  
  Gaussian random variable with mean y_i and standard deviation  
  2*error=0.6, and then making an ordinary least squares fitting of  
  the slope and intercept, I would end up with a mean for these  
  simulated coefficients which should be 2 and 0, respectively,
 
 Well, not precisely 2 and 0, but rather something very close ... i.e.,
 within experimental error. Please note that numbers in the range of  
 10e-17 are effectively zero from a numerical analysis perspective.
 

http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
 
   .Machine$double.eps ^ 0.5
 [1] 1.490116e-08

I know this all too well and it is obviously a trivial supernewbie issue, which
I have already overcome a long time ago...

 
  and, that's the point, a non-vanishing standard deviation for these  
  fitted coefficients, right?? This somehow is what I expected should  
  be an estimate or, at least, a good indicator, of the degree of  
  uncertainty which I should assign to the fitted coefficients; it  
  seems to me these deviations, thus calculated as a result of the  
  simulation, will certainly not be zero (or < 3e-17, for that matter).
  So this Std. Error does not provide what I, naively, think should be  
  given as a measure of the uncertainties or errors in the fitted  
  coefficients...
 
 You are trying to impose an error structure on a data situation that  
 you constructed artificially to be perfect.
 
 
  What am I not getting right??
 
 That if you input perfection into R's linear regression program, you  
 get appropriate warnings?
 
 
  Thanks and sorry for the naive and non-expert question!
 
 You are a Professor of physics, right? You do experiments, right? You  
 replicate them. So perhaps I'm the one who should be puzzled.

Unfortunately you have eschewed answering any of my questions directly; I
insist they do make sense. Pointing out that the data are perfect does not
help me make any progress in understanding the choice of convenient