Re: [R] A Tip: lm, glm, and retained cases

2008-08-28 Thread Prof Brian Ripley

In R-devel


na.action(GLM)


will work as the extractor.  The problem with attr(GLM$model, "na.action")
is that the 'model' component is optional, and with 
model.frame(ModelObject) that if the 'model' component has been omitted it 
will try to recreate the model frame from the currently visible objects of 
the name originally used.  (Because that is error-prone, we switched to 
model=TRUE as the default.)


In earlier versions of R, GLM$na.action is the copy you want.

However, I think if you care about omitted rows, you should use 
na.action=na.exclude, for then most auxiliary functions will give you 
results for all the rows.
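
A minimal sketch of the extractor form, assuming a current version of R. The built-in airquality data and the object names (fit, omitted, kept, fit_nomodel) are stand-ins chosen for illustration and do not come from the original posts:

```r
# Illustration only: airquality ships with R and has NAs in Ozone and Solar.R
fit <- glm(Ozone ~ Solar.R + Wind, data = airquality)

# Extractor form: positions of the rows dropped because of NAs (class "omit")
omitted <- na.action(fit)

# Positions of the retained cases
kept <- setdiff(seq_len(nrow(airquality)), as.integer(omitted))

# With model = FALSE there is no 'model' component to read an attribute from,
# but the extractor still returns the same information
fit_nomodel <- glm(Ozone ~ Solar.R + Wind, data = airquality, model = FALSE)
is.null(fit_nomodel$model)
identical(na.action(fit_nomodel), omitted)
```

Here kept plays the role of 'ix' in the original question, without relying on row names or on the 'model' component being present.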



On Tue, 26 Aug 2008, Marc Schwartz wrote:


on 08/26/2008 07:31 PM (Ted Harding) wrote:

On 26-Aug-08 23:49:37, hadley wickham wrote:

On Tue, Aug 26, 2008 at 6:45 PM, Ted Harding
[EMAIL PROTECTED] wrote:

Hi Folks,
This tip is probably lurking somewhere already, but I've just
discovered it the hard way, so it is probably worth passing
on for the benefit of those who might otherwise hack their
way along the same path.

Say (for example) you want to do a logistic regression of a
binary response Y on variables X1, X2, X3, X4:

 GLM <- glm(Y ~ X1 + X2 + X3 + X4)

Say there are 1000 cases in the data. Because of missing values
(NAs) in the variables, the number of complete cases retained
for the regression is, say, 600. glm() does this automatically.

QUESTION: Which cases are they?

You can of course find out by hand on the lines of

 ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )

but one feels that GLM already knows -- so how to get it to talk?

ANSWER: (e.g.)

 ix <- as.integer(names(GLM$fit))


This is a partial match to 'fitted', and will only work if default row 
names were used.
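
A small sketch of that caveat, under stated assumptions: the built-in airquality data stands in for the poster's data, and the "day" row labels are made up for illustration.

```r
# Illustration only: give the data non-default row names
d <- airquality[1:10, ]
rownames(d) <- paste0("day", 1:10)

fit <- lm(Ozone ~ Solar.R, data = d)

# fitted() avoids relying on partial matching of fit$fit to 'fitted.values'
labels <- names(fitted(fit))

# Coercing the labels to integer no longer works: every element becomes NA
as_int <- suppressWarnings(as.integer(labels))

# Row positions can still be recovered by matching labels against the data
ix <- match(labels, rownames(d))
```

Matching on labels gives positions whatever the row names are, which is why it may be a safer habit than as.integer(names(...)).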



Alternatively, you can use:

attr(GLM$model, "na.action")

Hadley


Thanks! I can see that it works -- though understanding how
requires a deeper knowledge of R internals. However, since
you've approached it from that direction, simply

  GLM$model

is a dataframe of the retained cases (with corresponding
row-names), all variables at once, and that is possibly an
even simpler approach!


Or just use:

  model.frame(ModelObject)

as the extractor function...  :-)

Another 'a priori' approach would be to use na.omit() or one of its
brethren on the data frame before creating the model. Which function is
used depends upon how 'na.action' is set.

The returned value, or more specifically the 'na.action' attribute as
appropriate, would yield information similar to Hadley's approach
relative to which records were excluded.

For example, using the simple data frame in ?na.omit:

DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))

> DF
  x  y
1 1  0
2 2 10
3 3 NA

DF.na <- na.omit(DF)

> DF.na
  x  y
1 1  0
2 2 10

> attr(DF.na, "na.action")
3
3
attr(,"class")
[1] "omit"


So you can see that record 3 was removed from the original data frame
due to the NA for 'y'.

HTH,

Marc Schwartz

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [R] A Tip: lm, glm, and retained cases

2008-08-27 Thread Peter Dalgaard
Marc Schwartz wrote:
 on 08/26/2008 07:31 PM (Ted Harding) wrote:
   
 On 26-Aug-08 23:49:37, hadley wickham wrote:
 
 On Tue, Aug 26, 2008 at 6:45 PM, Ted Harding
 [EMAIL PROTECTED] wrote:
   
 Hi Folks,
 This tip is probably lurking somewhere already, but I've just
 discovered it the hard way, so it is probably worth passing
 on for the benefit of those who might otherwise hack their
 way along the same path.

 Say (for example) you want to do a logistic regression of a
 binary response Y on variables X1, X2, X3, X4:

  GLM <- glm(Y ~ X1 + X2 + X3 + X4)

 Say there are 1000 cases in the data. Because of missing values
 (NAs) in the variables, the number of complete cases retained
 for the regression is, say, 600. glm() does this automatically.

 QUESTION: Which cases are they?

 You can of course find out by hand on the lines of

  ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )

 but one feels that GLM already knows -- so how to get it to talk?

 ANSWER: (e.g.)

  ix <- as.integer(names(GLM$fit))
 
 Alternatively, you can use:

 attr(GLM$model, "na.action")

 Hadley
   
 Thanks! I can see that it works -- though understanding how
 requires a deeper knowledge of R internals. However, since
 you've approached it from that direction, simply

   GLM$model

 is a dataframe of the retained cases (with corresponding
 row-names), all variables at once, and that is possibly an
 even simpler approach!
 

 Or just use:

model.frame(ModelObject)

 as the extractor function...  :-)

 Another 'a priori' approach would be to use na.omit() or one of its
 brethren on the data frame before creating the model. Which function is
 used depends upon how 'na.action' is set.

 The returned value, or more specifically the 'na.action' attribute as
 appropriate, would yield information similar to Hadley's approach
 relative to which records were excluded.

 For example, using the simple data frame in ?na.omit:

 DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))

 > DF
   x  y
 1 1  0
 2 2 10
 3 3 NA

 DF.na <- na.omit(DF)

 > DF.na
   x  y
 1 1  0
 2 2 10

 > attr(DF.na, "na.action")
 3
 3
 attr(,"class")
 [1] "omit"


 So you can see that record 3 was removed from the original data frame
 due to the NA for 'y'.
   
Also notice the possibility of

(g)lm(., na.action=na.exclude)

as in

library(ISwR); attach(thuesen)
fit <- lm(short.velocity ~ blood.glucose, na.action=na.exclude)
which(is.na(fitted(fit))) # 16

This is often recommendable anyway, e.g. in case you want to plot
residuals against original predictors.
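
A rough sketch of why this helps, using the built-in airquality data rather than ISwR so that it runs without extra packages. The data set and the object names are assumptions made for illustration:

```r
# Same model, two ways of handling the NAs
fit_omit <- lm(Ozone ~ Solar.R, data = airquality, na.action = na.omit)
fit_excl <- lm(Ozone ~ Solar.R, data = airquality, na.action = na.exclude)

# na.omit: residuals only for the complete cases
n_omit <- length(residuals(fit_omit))

# na.exclude: one residual per original row, NA where the row was dropped
n_excl <- length(residuals(fit_excl))

# Rows that were left out of the fit
dropped <- which(is.na(residuals(fit_excl)))

# Because the lengths now agree, residuals line up with the original data,
# e.g. plot(airquality$Solar.R, residuals(fit_excl))
```

The coefficients are the same in both fits; only the padding done by residuals(), fitted() and predict() differs.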

-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - ([EMAIL PROTECTED])  FAX: (+45) 35327907



Re: [R] A Tip: lm, glm, and retained cases

2008-08-26 Thread hadley wickham
On Tue, Aug 26, 2008 at 6:45 PM, Ted Harding
[EMAIL PROTECTED] wrote:
 Hi Folks,
 This tip is probably lurking somewhere already, but I've just
 discovered it the hard way, so it is probably worth passing
 on for the benefit of those who might otherwise hack their
 way along the same path.

 Say (for example) you want to do a logistic regression of a
 binary response Y on variables X1, X2, X3, X4:

  GLM <- glm(Y ~ X1 + X2 + X3 + X4)

 Say there are 1000 cases in the data. Because of missing values
 (NAs) in the variables, the number of complete cases retained
 for the regression is, say, 600. glm() does this automatically.

 QUESTION: Which cases are they?

 You can of course find out by hand on the lines of

  ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )

 but one feels that GLM already knows -- so how to get it to talk?

 ANSWER: (e.g.)

  ix <- as.integer(names(GLM$fit))

Alternatively, you can use:

attr(GLM$model, "na.action")

Hadley

-- 
http://had.co.nz/



Re: [R] A Tip: lm, glm, and retained cases

2008-08-26 Thread Ted Harding
On 26-Aug-08 23:49:37, hadley wickham wrote:
 On Tue, Aug 26, 2008 at 6:45 PM, Ted Harding
 [EMAIL PROTECTED] wrote:
 Hi Folks,
 This tip is probably lurking somewhere already, but I've just
 discovered it the hard way, so it is probably worth passing
 on for the benefit of those who might otherwise hack their
 way along the same path.

 Say (for example) you want to do a logistic regression of a
 binary response Y on variables X1, X2, X3, X4:

  GLM <- glm(Y ~ X1 + X2 + X3 + X4)

 Say there are 1000 cases in the data. Because of missing values
 (NAs) in the variables, the number of complete cases retained
 for the regression is, say, 600. glm() does this automatically.

 QUESTION: Which cases are they?

 You can of course find out by hand on the lines of

  ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )

 but one feels that GLM already knows -- so how to get it to talk?

 ANSWER: (e.g.)

  ix <- as.integer(names(GLM$fit))
 
 Alternatively, you can use:
 
 attr(GLM$model, "na.action")
 
 Hadley

Thanks! I can see that it works -- though understanding how
requires a deeper knowledge of R internals. However, since
you've approached it from that direction, simply

  GLM$model

is a dataframe of the retained cases (with corresponding
row-names), all variables at once, and that is possibly an
even simpler approach!

Ted.


E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 27-Aug-08   Time: 01:31:46
-- XFMail --



Re: [R] A Tip: lm, glm, and retained cases

2008-08-26 Thread Marc Schwartz
on 08/26/2008 07:31 PM (Ted Harding) wrote:
 On 26-Aug-08 23:49:37, hadley wickham wrote:
 On Tue, Aug 26, 2008 at 6:45 PM, Ted Harding
 [EMAIL PROTECTED] wrote:
 Hi Folks,
 This tip is probably lurking somewhere already, but I've just
 discovered it the hard way, so it is probably worth passing
 on for the benefit of those who might otherwise hack their
 way along the same path.

 Say (for example) you want to do a logistic regression of a
 binary response Y on variables X1, X2, X3, X4:

  GLM <- glm(Y ~ X1 + X2 + X3 + X4)

 Say there are 1000 cases in the data. Because of missing values
 (NAs) in the variables, the number of complete cases retained
 for the regression is, say, 600. glm() does this automatically.

 QUESTION: Which cases are they?

 You can of course find out by hand on the lines of

  ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )

 but one feels that GLM already knows -- so how to get it to talk?

 ANSWER: (e.g.)

  ix <- as.integer(names(GLM$fit))

 Alternatively, you can use:

 attr(GLM$model, "na.action")

 Hadley
 
 Thanks! I can see that it works -- though understanding how
 requires a deeper knowledge of R internals. However, since
 you've approached it from that direction, simply
 
   GLM$model
 
 is a dataframe of the retained cases (with corresponding
 row-names), all variables at once, and that is possibly an
 even simpler approach!

Or just use:

   model.frame(ModelObject)

as the extractor function...  :-)
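
For instance, a minimal sketch (the built-in airquality data and the object names are assumptions made for illustration, not part of the original question):

```r
# Illustration only: fit on data with NAs in Ozone and Solar.R
fit <- lm(Ozone ~ Solar.R + Wind, data = airquality)

# Model frame of the retained cases; with the default model = TRUE this is
# the stored fit$model
mf <- model.frame(fit)

# Row labels of the cases used in the fit, and of those left out
retained <- rownames(mf)
dropped  <- setdiff(rownames(airquality), retained)
```

This relies on the model frame having been stored with the fit, which is the default.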

Another 'a priori' approach would be to use na.omit() or one of its
brethren on the data frame before creating the model. Which function is
used depends upon how 'na.action' is set.

The returned value, or more specifically the 'na.action' attribute as
appropriate, would yield information similar to Hadley's approach
relative to which records were excluded.

For example, using the simple data frame in ?na.omit:

DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))

> DF
  x  y
1 1  0
2 2 10
3 3 NA

DF.na <- na.omit(DF)

> DF.na
  x  y
1 1  0
2 2 10

> attr(DF.na, "na.action")
3
3
attr(,"class")
[1] "omit"


So you can see that record 3 was removed from the original data frame
due to the NA for 'y'.

HTH,

Marc Schwartz
