[R] specifying model terms when using predict

2009-01-16 Thread VanHezewijk, Brian
I've recently encountered an issue when trying to use the predict.glm
function.

 

I've gotten into the habit of using the dataframe$variablename method of
specifying terms in my model statements.  I thought this unambiguous
notation would be acceptable in all situations but it seems models
written this way are not accepted by the predict function.  Perhaps
others have encountered this problem as well.

 

The code below illustrates the issue.

 

 

##
## linear model example

# this works
 x <- 1:100
 y <- 2 * x

 lm1 <- glm(y ~ x)
 pred1 <- predict(lm1, newdata = data.frame(x = 101:150))

## so does this
 x <- 1:100
 y <- 2 * x
 orig.df <- data.frame(x1 = x, y1 = y)

 lm1 <- glm(y1 ~ x1, data = orig.df)
 pred1 <- predict(lm1, newdata = data.frame(x1 = 101:150))

## this does not run
 x <- 1:100
 y <- 2 * x
 orig.df <- data.frame(x1 = x, y1 = y)

 lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)
 pred1 <- predict(lm1, newdata = data.frame(x1 = 101:150))

 

 

The final statement generates the following warning:

 

Warning message:

'newdata' had 50 rows but variable(s) found have 100 rows

 

 

Hope this is of some help.

 

 

 

Brian Van Hezewijk 

 



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] specifying model terms when using predict

2009-01-16 Thread Marc Schwartz
on 01/16/2009 02:20 PM VanHezewijk, Brian wrote:
> I've recently encountered an issue when trying to use the predict.glm
> function.
>
> I've gotten into the habit of using the dataframe$variablename method of
> specifying terms in my model statements.  I thought this unambiguous
> notation would be acceptable in all situations but it seems models
> written this way are not accepted by the predict function.  Perhaps
> others have encountered this problem as well.

<snip>

The bottom line is don't do that.  :-)

When the predict.*() functions look for the variable names, they use the
names as specified in the formula that was used in the initial creation
of the model object.
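
For anyone reading along later, one way to see which names are stored in the model is to inspect its terms. The following is only a small sketch built on the example from this thread; the object names fit_bare and fit_dollar are made up for the illustration.

```r
# Data from the example in this thread
x <- 1:100
y <- 2 * x
orig.df <- data.frame(x1 = x, y1 = y)

# The same model specified two ways (fit_bare, fit_dollar are hypothetical names)
fit_bare   <- glm(y1 ~ x1, data = orig.df)
fit_dollar <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)

# Term labels as recorded in each fitted object
labels_bare   <- attr(terms(fit_bare), "term.labels")
labels_dollar <- attr(terms(fit_dollar), "term.labels")

print(labels_bare)    # "x1"
print(labels_dollar)  # "orig.df$x1"
```

A 'newdata' column called x1 matches the first label but not the second.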

As per ?predict.glm:

Note

Variables are first looked for in newdata and then searched for in the
usual way (which will include the environment of the formula used in the
fit). A warning will be given if the variables found are not of the same
length as those in newdata if it was supplied.


As per your example, using:

 x <- 1:100

 y <- 2 * x

 orig.df <- data.frame(x1 = x, y1 = y)

 lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)

 pred1 <- predict(lm1, newdata = data.frame(x1 = 101:150))


When predict.glm() tries to locate the variable orig.df$x1 in the data
frame passed to 'newdata', it cannot be found. The correct name in the
model is orig.df$x1, not x1 as you used above.

Thus, since it cannot find that variable in 'newdata', it begins to look
elsewhere for a variable called orig.df$x1. Guess what?  It finds it
in the global environment as a column of the original data frame 'orig.df'.

Since that column has a length of 100 and the data frame that you passed
to newdata only has 50, you get an error.

Warning message:

'newdata' had 50 rows but variable(s) found have 100 rows


There is a method to the madness and good reason why the modeling
functions and others that take a formula argument also have a 'data'
argument to specify the location of the variables to be used.
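
The pattern this points to can be sketched as follows: bare column names in the formula, with 'data' saying where to find them. This is a minimal illustration based on the example above, not code from the original post; new.df, fit_ok and pred_ok are names invented here.

```r
# Data from the example in this thread
x <- 1:100
y <- 2 * x
orig.df <- data.frame(x1 = x, y1 = y)

# New x values to predict for (new.df is a hypothetical name)
new.df <- data.frame(x1 = 101:150)

# Bare names in the formula; 'data' tells glm() where they live
fit_ok <- glm(y1 ~ x1, data = orig.df)

# 'newdata' has a column named x1, matching the name stored in the model
pred_ok <- predict(fit_ok, newdata = new.df)

print(length(pred_ok))  # one prediction per row of new.df
```

Because the formula refers to x1 rather than orig.df$x1, predict() can pick up x1 from whichever data frame is supplied as 'newdata'.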

HTH,

Marc Schwartz



Re: [R] specifying model terms when using predict

2009-01-16 Thread David Winsemius


On Jan 16, 2009, at 4:30 PM, Marc Schwartz wrote:


> on 01/16/2009 02:20 PM VanHezewijk, Brian wrote:
>> I've recently encountered an issue when trying to use the predict.glm
>> function.
>>
>> I've gotten into the habit of using the dataframe$variablename method of
>> specifying terms in my model statements.  I thought this unambiguous
>> notation would be acceptable in all situations but it seems models
>> written this way are not accepted by the predict function.  Perhaps
>> others have encountered this problem as well.
>
> <snip>
>
> The bottom line is don't do that.  :-)
>
> When the predict.*() functions look for the variable names, they use the
> names as specified in the formula that was used in the initial creation
> of the model object.
>
> As per ?predict.glm:
>
> Note
>
> Variables are first looked for in newdata and then searched for in the
> usual way (which will include the environment of the formula used in the
> fit). A warning will be given if the variables found are not of the same
> length as those in newdata if it was supplied.
>
> As per your example, using:
>
>  x <- 1:100
>  y <- 2 * x
>  orig.df <- data.frame(x1 = x, y1 = y)
>  lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)
>  pred1 <- predict(lm1, newdata = data.frame(x1 = 101:150))
>
> When predict.glm() tries to locate the variable orig.df$x1 in the data
> frame passed to 'newdata', it cannot be found. The correct name in the
> model is orig.df$x1, not x1 as you used above.
>
> Thus, since it cannot find that variable in 'newdata', it begins to look
> elsewhere for a variable called orig.df$x1. Guess what?  It finds it
> in the global environment as a column of the original data frame 'orig.df'.
>
> Since that column has a length of 100 and the data frame that you passed
> to newdata only has 50, you get an error.
>
> Warning message:
> 'newdata' had 50 rows but variable(s) found have 100 rows


Marc,

Knowing your skill level, which far exceeds mine, you probably do know
that it was not an error, only a warning, and the assignment to pred1
proceeded (as you described), just not the assignment that VanHezewijk
expected. newdata was ignored, orig.df$x1 was found and no
extrapolation occurred.
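
A minimal sketch of that distinction, assuming the same example data: the warning is caught with withCallingHandlers() so that predict() can still finish and return its value. The names fit_dollar, pred_dollar and warning_text are made up for this illustration.

```r
# Data and model from the example in this thread
x <- 1:100
y <- 2 * x
orig.df <- data.frame(x1 = x, y1 = y)
fit_dollar <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)

warning_text <- NULL

# Record the warning, then let predict() carry on to completion
pred_dollar <- withCallingHandlers(
  predict(fit_dollar, newdata = data.frame(x1 = 101:150)),
  warning = function(w) {
    warning_text <<- conditionMessage(w)  # keep the message for inspection
    invokeRestart("muffleWarning")        # resume instead of stopping
  }
)

print(warning_text)         # the row-count warning discussed above
print(length(pred_dollar))  # a value was still returned
```

Had predict() signalled an error instead, no value would have been assigned at all.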


--
David





> There is a method to the madness and good reason why the modeling
> functions and others that take a formula argument also have a 'data'
> argument to specify the location of the variables to be used.
>
> HTH,
>
> Marc Schwartz



Re: [R] specifying model terms when using predict

2009-01-16 Thread Marc Schwartz
on 01/16/2009 03:44 PM David Winsemius wrote:
 
> On Jan 16, 2009, at 4:30 PM, Marc Schwartz wrote:
>
>> on 01/16/2009 02:20 PM VanHezewijk, Brian wrote:
>>> I've recently encountered an issue when trying to use the predict.glm
>>> function.
>>>
>>> I've gotten into the habit of using the dataframe$variablename method of
>>> specifying terms in my model statements.  I thought this unambiguous
>>> notation would be acceptable in all situations but it seems models
>>> written this way are not accepted by the predict function.  Perhaps
>>> others have encountered this problem as well.
>>
>> <snip>
>>
>> The bottom line is don't do that.  :-)
>>
>> When the predict.*() functions look for the variable names, they use the
>> names as specified in the formula that was used in the initial creation
>> of the model object.
>>
>> As per ?predict.glm:
>>
>> Note
>>
>> Variables are first looked for in newdata and then searched for in the
>> usual way (which will include the environment of the formula used in the
>> fit). A warning will be given if the variables found are not of the same
>> length as those in newdata if it was supplied.
>>
>> As per your example, using:
>>
>>  x <- 1:100
>>  y <- 2 * x
>>  orig.df <- data.frame(x1 = x, y1 = y)
>>  lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)
>>  pred1 <- predict(lm1, newdata = data.frame(x1 = 101:150))
>>
>> When predict.glm() tries to locate the variable orig.df$x1 in the data
>> frame passed to 'newdata', it cannot be found. The correct name in the
>> model is orig.df$x1, not x1 as you used above.
>>
>> Thus, since it cannot find that variable in 'newdata', it begins to look
>> elsewhere for a variable called orig.df$x1. Guess what?  It finds it
>> in the global environment as a column of the original data frame 'orig.df'.
>>
>> Since that column has a length of 100 and the data frame that you passed
>> to newdata only has 50, you get an error.
>>
>> Warning message:
>> 'newdata' had 50 rows but variable(s) found have 100 rows
>
> Marc,
>
> Knowing your skill level, which far exceeds mine, you probably do know
> that it was not an error, only a warning, and the assignment to pred1
> proceeded (as you described), just not the assignment that VanHezewijk
> expected. newdata was ignored, orig.df$x1 was found and no
> extrapolation occurred.

David,

Excellent correction.

For additional clarification:

> str(fitted(lm1))
 Named num [1:100] 2 4 6 8 10 ...
 - attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...

> str(pred1)
 Named num [1:100] 2 4 6 8 10 ...
 - attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...

> all(fitted(lm1) == pred1)
[1] TRUE

which reinforces David's comment that the values in 'pred1' are the same
100 fitted values from the original model, covering x values 1:100.

This is reinforced in ?predict.glm, in the description of 'newdata':

optionally, a data frame in which to look for variables with which to
predict. If omitted, the fitted linear predictors are used.
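
As a side illustration of "the fitted linear predictors are used": with the identity link in the example above these coincide with the fitted values, but with another link they are on a different scale. The sketch below uses simulated Poisson data, which is an assumption of this illustration and not part of the original example; counts, fit_pois, eta and mu are made-up names.

```r
# Simulated count data (purely illustrative)
set.seed(1)
x <- 1:100
counts <- rpois(100, lambda = exp(0.02 * x))

fit_pois <- glm(counts ~ x, family = poisson)

# With 'newdata' omitted, predict() defaults to the link (log) scale
eta <- predict(fit_pois)

# type = "response" gives predictions on the scale of the data
mu <- predict(fit_pois, type = "response")

# Compare both with fitted(), which is on the response scale
link_matches     <- isTRUE(all.equal(eta, log(fitted(fit_pois))))
response_matches <- isTRUE(all.equal(mu, fitted(fit_pois)))

print(link_matches)
print(response_matches)
```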


Note that I can get away with using == above because the fitted values are
all whole numbers here; had they been arbitrary floating-point values, I
would have had to use all.equal() or another approach.
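
A small sketch of why that matters, using made-up numbers rather than anything from the model above: two values that are equal on paper can differ in the last bits of their floating-point representation, in which case == reports FALSE while all.equal() compares within a tolerance.

```r
# Two ways of writing "0.3" that differ slightly in floating point
a <- 0.1 + 0.2
b <- 0.3

exact_match  <- a == b                   # exact comparison
approx_match <- isTRUE(all.equal(a, b))  # comparison within a tolerance

print(exact_match)   # FALSE
print(approx_match)  # TRUE
```

Wrapping all.equal() in isTRUE() is the usual idiom, since all.equal() returns a character description rather than FALSE when the values differ.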

Thanks David for pointing out the distinction and my own error.

Marc
