Re: [R] Issue with predict() for glm models

2004-09-23 Thread Uwe Ligges
[EMAIL PROTECTED] wrote:
Hello everyone, 

I am having a problem using the predict (or the predict.glm) function in R.
Basically, I run the glm model on a training data set and try to obtain
predictions for a set of new predictors from a test data set (i.e., not the
predictors that were utilized to obtain the glm parameter estimates).
Unfortunately, every time that I attempt this, I obtain the predictions for the
predictors that were used to fit the glm model. I have looked at the R mailing
list archives and don't believe I am making the same mistakes that have been
made in the past and also have tried to closely follow the predict.glm example
in the help file. Here is an example of what I am trying to do: 


set.seed(545345)

# Necessary Variables # 


p <- 2
train.n <- 20
test.n <- 25
mean.vec.1 <- c(1, 1)
mean.vec.2 <- c(0, 0)

Sigma.1 <- matrix(c(1, .5, .5, 1), p, p)
Sigma.2 <- matrix(c(1, .5, .5, 1), p, p)
###
# Load MASS Library #
###
library(MASS)
###
# Data to Parameters for Logistic Regression Model #
###
train.data.1 <- mvrnorm(train.n, mu=mean.vec.1, Sigma=Sigma.1)
train.data.2 <- mvrnorm(train.n, mu=mean.vec.2, Sigma=Sigma.2)
train.class.var <- as.factor(c(rep(1,train.n), rep(2,train.n)))
predictors.train <- rbind(train.data.1, train.data.2)
##
# Test Data Where Predictions for Probabilities Using Logistic Reg.  #
# From Training Data are of Interest  #
## 

test.data.1 <- mvrnorm(test.n, mu=mean.vec.1, Sigma=Sigma.1)
test.data.2 <- mvrnorm(test.n, mu=mean.vec.2, Sigma=Sigma.2)
predictors.test <- rbind(test.data.1, test.data.2)
##
# Run Logistic Regression on Training Data #
##
log.reg <- glm(train.class.var ~ predictors.train,
               family = binomial(link = "logit"))
Well, you haven't specified the data argument, but have given the two 
variables directly. Exactly those variables will be used in the 
predict() step below! If you want the predict() step to work, use 
something like:

  train <- data.frame(class = train.class.var,
                      predictors = predictors.train)
  log.reg <- glm(class ~ ., data = train,
                 family = binomial(link = "logit"))

log.reg
# log.reg
#Call:  glm(formula = train.class.var ~ predictors.train, family =
#binomial(link = logit)) 
#
#Coefficients:
#  (Intercept)  predictors.train1  predictors.train2  
#       0.5105            -0.2945            -1.0811  
#
#Degrees of Freedom: 39 Total (i.e. Null);  37 Residual
#Null Deviance:      55.45 
#Residual Deviance: 41.67        AIC: 47.67 

###
# Predicted Probabilities for Test Data #
###
New.Data <- data.frame(predictors.train1 = predictors.test[,1],
                       predictors.train2 = predictors.test[,2])
logreg.pred.prob.test <- predict.glm(log.reg, New.Data, type = "response")
logreg.pred.prob.test
Instead, use:
  test <- data.frame(predictors = predictors.test)
  predict(log.reg, newdata = test, type = "response")
Note also: please call the generic predict() rather than its glm method.
Uwe Ligges
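
A complete, self-contained version of this fix (a sketch; variable names follow the example above, with the link function quoted for clarity):

```r
library(MASS)  # for mvrnorm
set.seed(545345)

# Simulate two-group training and test predictors
train.n <- 20; test.n <- 25
Sigma <- matrix(c(1, .5, .5, 1), 2, 2)
train <- data.frame(
  class = factor(rep(1:2, each = train.n)),
  predictors = rbind(mvrnorm(train.n, mu = c(1, 1), Sigma = Sigma),
                     mvrnorm(train.n, mu = c(0, 0), Sigma = Sigma)))

# Fit with the data argument so predict() can resolve variable names
log.reg <- glm(class ~ ., data = train, family = binomial(link = "logit"))

# The test frame gets the same column names (predictors.1, predictors.2)
test <- data.frame(
  predictors = rbind(mvrnorm(test.n, mu = c(1, 1), Sigma = Sigma),
                     mvrnorm(test.n, mu = c(0, 0), Sigma = Sigma)))
p.hat <- predict(log.reg, newdata = test, type = "response")
length(p.hat)  # 50: one probability per test-set row
```

Because both data frames were built the same way, the column names match and predict() scores the new rows rather than the training rows.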

#logreg.pred.prob.test
# [1] 0.51106406 0.15597423 0.04948404 0.03863875 0.35587589 0.71331091
# [7] 0.17320087 0.14176632 0.30966718 0.61878952 0.12525988 0.21271139
#[13] 0.70068113 0.18340723 0.10295501 0.44591568 0.72285161 0.31499339
#[19] 0.65789420 0.42750139 0.14435889 0.93008117 0.70798465 0.80109005
#[25] 0.89161472 0.47480625 0.56520952 0.63981834 0.57595189 0.60075882
#[31] 0.96493393 0.77015507 0.87643986 0.62973986 0.63043351 0.45398955
#[37] 0.80855782 0.90835588 0.54809117 0.11568637

Of course, notice that the vector of predicted probabilities has only 40
elements, while New.Data has 50 rows (since test.n is 25 per group
for 2 groups) and thus should yield 50 predicted probabilities. As it turns out,
the output is for the training data predictors and not for the New.Data as I
would like it to be. I should also note that I have made sure that the names
for the predictors in the New.Data are the same as the names for the
predictors within the glm object (i.e., within log.reg) as this is what is
done in the example for predict.glm() within the help files. 

Could someone help me understand either what I am doing incorrectly or what
problems there might be within the predict() function? I should mention that I
tried the same program using predict.glm() and obtained the same problematic
results. 

Thanks and take care, 

Joe 

Joe Rausch, M.A. 
Psychology Liaison 
Lab for Social Research 
917 Flanner Hall 
University of Notre Dame 
Notre Dame, IN 46556
(574) 631-3910

If we knew what it was we were doing, it would not be called research, would
it?
- Albert Einstein

RE: [R] Issue with predict() for glm models

2004-09-23 Thread John Fox
Dear Uwe,

Unless I've somehow messed this up, as I mentioned yesterday, what you
suggest doesn't seem to work when the predictor is a matrix. Here's a
simplified example:

 X <- matrix(rnorm(200), 100, 2)
 y <- (X %*% c(1,2) + rnorm(100)) > 0
 dat <- data.frame(y=y, X=X)
 mod <- glm(y ~ X, family=binomial, data=dat)
 new <- data.frame(X = matrix(rnorm(20), 2))
 predict(mod, new)
           1            2            3            4            5            6 
  1.81224443  -5.92955128   1.98718051 -10.05331521   2.6506  -2.50635812 
           7            8            9           10           11           12 
  5.63728698  -0.94845276  -3.61657377  -1.63141320   5.03417372   1.80400271 
          13           14           15           16           17           18 
  9.32876273  -5.32723406   5.29373023  -3.90822713 -10.95065186   4.90038016 

 . . .

   97   98   99  100 
 -6.92509812   0.59357486  -1.17205723   0.04209578 


Note that there are 100 rather than 10 predicted values.

But with individual predictors (rather than a matrix),

 x1 <- X[,1]
 x2 <- X[,2]
 dat.2 <- data.frame(y=y, x1=x1, x2=x2)
 mod.2 <- glm(y ~ x1 + x2, family=binomial, data=dat.2)
 new.2 <- data.frame(x1=rnorm(10), x2=rnorm(10))
 predict(mod.2, new.2)
         1          2          3          4          5          6          7 
 6.5723823  0.6356392  4.0291018 -4.7914650  2.1435485 -3.1738096 -2.8261585 
         8          9         10 
-1.5255329 -4.7087592  4.0619290 

works as expected (?).

Regards,
 John
 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Uwe Ligges
 Sent: Thursday, September 23, 2004 1:33 AM
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: Re: [R] Issue with predict() for glm models
 
 [EMAIL PROTECTED] wrote:
 
  Hello everyone,
  
  I am having a problem using the predict (or the 
 predict.glm) function in R.
  Basically, I run the glm model on a training data set and try to 
  obtain predictions for a set of new predictors from a 
 test data set 
  (i.e., not the predictors that were utilized to obtain the 
 glm parameter estimates).
  Unfortunately, every time that I attempt this, I obtain the 
  predictions for the predictors that were used to fit the 
 glm model. I 
  have looked at the R mailing list archives and don't believe I am 
  making the same mistakes that have been made in the past 
 and also have 
  tried to closely follow the predict.glm example in the help 
 file. Here is an example of what I am trying to do:
  
  
  set.seed(545345)
  
  
  # Necessary Variables #
  
  
  p - 2
  train.n - 20
  test.n - 25
  mean.vec.1 - c(1,1)
  mean.vec.2 - c(0,0)
  
  Sigma.1 - matrix(c(1,.5,.5,1),p,p)
  Sigma.2 - matrix(c(1,.5,.5,1),p,p)
  
  ###
  # Load MASS Library #
  ###
  
  library(MASS)
  
  ###
  # Data to Parameters for Logistic Regression Model # 
  ###
  
  train.data.1 - mvrnorm(train.n,mu=mean.vec.1,Sigma=Sigma.1)
  train.data.2 - mvrnorm(train.n,mu=mean.vec.2,Sigma=Sigma.2)
  train.class.var - as.factor(c(rep(1,train.n),rep(2,train.n)))
  predictors.train - rbind(train.data.1,train.data.2)
  
  ##
  # Test Data Where Predictions for Probabilities Using 
 Logistic Reg.  #
  # From Training Data are of Interest
   #
  ##
  
  test.data.1 - mvrnorm(test.n,mu=mean.vec.1,Sigma=Sigma.1)
  test.data.2 - mvrnorm(test.n,mu=mean.vec.2,Sigma=Sigma.2)
  predictors.test - rbind(test.data.1,test.data.2)
  
  ##
  # Run Logistic Regression on Training Data # 
  ##
  
  log.reg - glm(train.class.var~predictors.train,
  family=binomial(link=logit))
 
 Well, you haven't specified the data argument, but given 
 the two variables directly. Exactly those variables will be 
 used in the
 predict() step below! If you want the predict() step to work, 
 use something like:
 
train - data.frame(class = train.class.var,
predictors = predictors.train)
log.reg - glm(class ~ ., data = train,
   family=binomial(link=logit))
 
 
 
  log.reg
  
  # log.reg
  
  #Call:  glm(formula = train.class.var ~ predictors.train, family = 
  #binomial(link = logit)) #
  #Coefficients:
  #  (Intercept)  predictors.train1  predictors.train2  
  #   0.5105-0.2945-1.0811  
  #
  #Degrees of Freedom: 39 Total (i.e. Null);  37 Residual
  #Null Deviance:  55.45 
  #Residual Deviance: 41.67AIC: 47.67 
  
  ###
  # Predicted Probabilities for Test Data # 
 ###
  
  New.Data - data.frame(predictors.train1=predictors.test[,1],
  predictors.train2=predictors.test[,2])
  
  logreg.pred.prob.test

Re: [R] Issue with predict() for glm models

2004-09-23 Thread Uwe Ligges
John Fox wrote:
Dear Uwe,
Unless I've somehow messed this up, as I mentioned yesterday, what you
suggest doesn't seem to work when the predictor is a matrix. Here's a
simplified example:

X <- matrix(rnorm(200), 100, 2)
y <- (X %*% c(1,2) + rnorm(100)) > 0
dat <- data.frame(y=y, X=X)
mod <- glm(y ~ X, family=binomial, data=dat)
new <- data.frame(X = matrix(rnorm(20), 2))
predict(mod, new)
Dear John,
the questioner had a 2-column matrix with 40 observations and one with 
50 (not a 100-column matrix with 2 observations), and for those 
matrices it works ...

Best,
Uwe



   12345
6 
  1.81224443  -5.92955128   1.98718051 -10.05331521   2.6506
-2.50635812 
   789   10   11
12 
  5.63728698  -0.94845276  -3.61657377  -1.63141320   5.03417372
1.80400271 
  13   14   15   16   17
18 
  9.32876273  -5.32723406   5.29373023  -3.90822713 -10.95065186
4.90038016 

 . . .
   97   98   99  100 
 -6.92509812   0.59357486  -1.17205723   0.04209578 

Note that there are 100 rather than 10 predicted values.
But with individuals predictors (rather than a matrix),

x1 - X[,1]
x2 - X[,2]
dat.2 - data.frame(y=y, x1=x1, x2=x2)
mod.2 - glm(y ~ x1 + x2, family=binomial, data=dat.2)
new.2 - data.frame(x1=rnorm(10), x2=rnorm(10))
predict(mod.2, new.2)
 1  2  3  4  5  6  7
 6.5723823  0.6356392  4.0291018 -4.7914650  2.1435485 -3.1738096 -2.8261585
 8  9 10 
-1.5255329 -4.7087592  4.0619290 

works as expected (?).
Regards,
 John
 


-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of Uwe Ligges
Sent: Thursday, September 23, 2004 1:33 AM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: [R] Issue with predict() for glm models

[EMAIL PROTECTED] wrote:

Hello everyone,
I am having a problem using the predict (or the 
predict.glm) function in R.
Basically, I run the glm model on a training data set and try to 
obtain predictions for a set of new predictors from a 
test data set 

(i.e., not the predictors that were utilized to obtain the 
glm parameter estimates).
Unfortunately, every time that I attempt this, I obtain the 
predictions for the predictors that were used to fit the 
glm model. I 

have looked at the R mailing list archives and don't believe I am 
making the same mistakes that have been made in the past 
and also have 

tried to closely follow the predict.glm example in the help 
file. Here is an example of what I am trying to do:

set.seed(545345)

# Necessary Variables #

p - 2
train.n - 20
test.n - 25
mean.vec.1 - c(1,1)
mean.vec.2 - c(0,0)
Sigma.1 - matrix(c(1,.5,.5,1),p,p)
Sigma.2 - matrix(c(1,.5,.5,1),p,p)
###
# Load MASS Library #
###
library(MASS)
###
# Data to Parameters for Logistic Regression Model # 
###

train.data.1 - mvrnorm(train.n,mu=mean.vec.1,Sigma=Sigma.1)
train.data.2 - mvrnorm(train.n,mu=mean.vec.2,Sigma=Sigma.2)
train.class.var - as.factor(c(rep(1,train.n),rep(2,train.n)))
predictors.train - rbind(train.data.1,train.data.2)
##
# Test Data Where Predictions for Probabilities Using 
Logistic Reg.  #
# From Training Data are of Interest
 #
##
test.data.1 - mvrnorm(test.n,mu=mean.vec.1,Sigma=Sigma.1)
test.data.2 - mvrnorm(test.n,mu=mean.vec.2,Sigma=Sigma.2)
predictors.test - rbind(test.data.1,test.data.2)
##
# Run Logistic Regression on Training Data # 
##

log.reg - glm(train.class.var~predictors.train,
family=binomial(link=logit))
Well, you haven't specified the data argument, but given 
the two variables directly. Exactly those variables will be 
used in the
predict() step below! If you want the predict() step to work, 
use something like:

  train - data.frame(class = train.class.var,
  predictors = predictors.train)
  log.reg - glm(class ~ ., data = train,
 family=binomial(link=logit))


log.reg
# log.reg
#Call:  glm(formula = train.class.var ~ predictors.train, family = 
#binomial(link = logit)) #
#Coefficients:
#  (Intercept)  predictors.train1  predictors.train2  
#   0.5105-0.2945-1.0811  
#
#Degrees of Freedom: 39 Total (i.e. Null);  37 Residual
#Null Deviance:  55.45 
#Residual Deviance: 41.67AIC: 47.67 

###
# Predicted Probabilities for Test Data # 
###
New.Data - data.frame(predictors.train1=predictors.test[,1],
predictors.train2=predictors.test[,2])
logreg.pred.prob.test - 
predict.glm(log.reg,New.Data,type

RE: [R] Issue with predict() for glm models

2004-09-23 Thread John Fox
Dear Uwe, 

 -Original Message-
 From: Uwe Ligges [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, September 23, 2004 8:06 AM
 To: John Fox
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: Re: [R] Issue with predict() for glm models
 
 John Fox wrote:
 
  Dear Uwe,
  
  Unless I've somehow messed this up, as I mentioned 
 yesterday, what you 
  suggest doesn't seem to work when the predictor is a 
 matrix. Here's a 
  simplified example:
  
  
 X <- matrix(rnorm(200), 100, 2)
 y <- (X %*% c(1,2) + rnorm(100)) > 0
 dat <- data.frame(y=y, X=X)
 mod <- glm(y ~ X, family=binomial, data=dat)
 new <- data.frame(X = matrix(rnorm(20), 2))
 predict(mod, new)
 
 Dear John,
 
 the questioner had a 2 column matrix with 40 and one with 50 
 observations (not a 100 column matrix with 2 observation) and 
 for those matrices it works ...
 

Indeed, and in my example the matrix predictor X has 2 columns and 100 rows;
I did screw up the matrix for the new data to be used for predictions (in
the example I sent today but not yesterday), but even when this is done
right -- where the new data has 10 rows and 2 columns -- there are 100 (not
10) predicted values:

 X <- matrix(rnorm(200), 100, 2)  # original predictor matrix with 100 rows
 y <- (X %*% c(1,2) + rnorm(100)) > 0
 dat <- data.frame(y=y, X=X)
 mod <- glm(y ~ X, family=binomial, data=dat)
 new <- data.frame(X = matrix(rnorm(20), 10, 2)) # corrected -- note 10 rows
 predict(mod, new) # note 100 predicted values
           1            2            3            4            5            6 
  5.75238091   0.31874587  -3.00515893  -3.77282121  -1.97511221   0.54712914 
           7            8            9           10           11           12 
  1.85091226   4.38465524  -0.41028694  -1.53942869   0.57613555  -1.82761518 

 . . .

          91           92           93           94           95           96 
  0.36210780   1.71358713  -9.63612775  -4.54257576  -5.29740468   2.64363405 
          97           98           99          100 
 -4.45478627  -2.44973209   2.51587537  -4.09584837 

Actually, I now see the source of the problem:

The data frames dat and new don't contain a matrix named X; rather the
matrix is split columnwise:

 names(dat)
[1] "y"   "X.1" "X.2"
 names(new)
[1] "X.1" "X.2"

Consequently, both glm and predict pick up the X in the global environment
(since there is none in dat or new), which accounts for why there are 100
predicted values.

Using list() rather than data.frame() produces the originally expected
behaviour:

 new <- list(X = matrix(rnorm(20), 10, 2))
 predict(mod, new)
         1          2          3          4          5          6          7 
 5.9373064  0.3687360 -8.3793045  0.7645584 -2.6773842  2.4130547  0.7387318 
         8          9         10 
-0.4347916  8.4678728 -0.8976054 
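
The name difference driving this is easy to check directly (a minimal sketch):

```r
X <- matrix(rnorm(20), 10, 2)

# data.frame() splits a matrix argument into one column per matrix column
names(data.frame(X = X))  # "X.1" "X.2"

# list() keeps the matrix intact under the single name "X",
# which is what the formula y ~ X looks up at prediction time
names(list(X = X))        # "X"
```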

Regards,
 John

 Best,
 Uwe
 
 
 
 
 
 
 
 12345
  6 
1.81224443  -5.92955128   1.98718051 -10.05331521   2.6506
  -2.50635812 
 789   10   11
  12 
5.63728698  -0.94845276  -3.61657377  -1.63141320   5.03417372
  1.80400271 
13   14   15   16   17
  18 
9.32876273  -5.32723406   5.29373023  -3.90822713 -10.95065186
  4.90038016
  
   . . .
  
 97   98   99  100 
   -6.92509812   0.59357486  -1.17205723   0.04209578 
  
  
  Note that there are 100 rather than 10 predicted values.
  
  But with individuals predictors (rather than a matrix),
  
  
 x1 - X[,1]
 x2 - X[,2]
 dat.2 - data.frame(y=y, x1=x1, x2=x2)
 mod.2 - glm(y ~ x1 + x2, family=binomial, data=dat.2)
 new.2 - data.frame(x1=rnorm(10), x2=rnorm(10)) 
 predict(mod.2, new.2)
  
   1  2  3  4  5  
 6  7
  
   6.5723823  0.6356392  4.0291018 -4.7914650  2.1435485 -3.1738096 
  -2.8261585
  
   8  9 10 
  -1.5255329 -4.7087592  4.0619290
  
  works as expected (?).
  
  Regards,
   John
   
  
  
 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Uwe Ligges
 Sent: Thursday, September 23, 2004 1:33 AM
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: Re: [R] Issue with predict() for glm models
 
 [EMAIL PROTECTED] wrote:
 
 
 Hello everyone,
 
 I am having a problem using the predict (or the
 
 predict.glm) function in R.
 
 Basically, I run the glm model on a training data set and try to 
 obtain predictions for a set of new predictors from a
 
 test data set
 
 (i.e., not the predictors that were utilized to obtain the
 
 glm parameter estimates).
 
 Unfortunately, every time that I attempt this, I obtain the 
 predictions for the predictors that were used to fit the
 
 glm model. I
 
 have looked at the R mailing list archives and don't believe I am 
 making the same mistakes that have been made in the past
 
 and also have
 
 tried to closely follow the predict.glm example in the help
 
 file. Here is an example

RE: [R] Issue with predict() for glm models

2004-09-23 Thread jrausch

Thanks to John Fox, Andy Liaw, and Uwe Ligges for their help with my problem
regarding the use of the predict()  function to obtain predictions for a new
set of predictor values. It appears that the bottom line (at least for my
purposes) is that the names and the setup for the data of the predictors in the
glm and the new data need to be consistent. The safest way that I know to do
this from reading  John, Andy, and Uwe's responses is to label each predictor
separately and place them into the glm model separately. Then, when creating a
new data frame to utilize in the predict() function, ensure to consistently
name the predictors. For illustrative examples, see the reply emails of John,
Andy, and Uwe.  

Thanks again, 

Joe  




Joe Rausch, M.A. 
Psychology Liaison 
Lab for Social Research 
917 Flanner Hall 
University of Notre Dame 
Notre Dame, IN 46556
(574) 631-3910
www.nd.edu/~jrausch

If we knew what it was we were doing, it would not be called research, would
it?
- Albert Einstein

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Issue with predict() for glm models

2004-09-23 Thread Uwe Ligges
John Fox wrote:
Dear Uwe, 


-Original Message-
From: Uwe Ligges [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 23, 2004 8:06 AM
To: John Fox
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [R] Issue with predict() for glm models

John Fox wrote:

Dear Uwe,
Unless I've somehow messed this up, as I mentioned 
yesterday, what you 

suggest doesn't seem to work when the predictor is a 
matrix. Here's a 

simplified example:

X - matrix(rnorm(200), 100, 2)
y - (X %*% c(1,2) + rnorm(100))  0
dat - data.frame(y=y, X=X)
mod - glm(y ~ X, family=binomial, data=dat) new - data.frame(X = 
matrix(rnorm(20),2)) predict(mod, new)
Dear John,
the questioner had a 2 column matrix with 40 and one with 50 
observations (not a 100 column matrix with 2 observation) and 
for those matrices it works ...


Indeed, and in my example the matrix predictor X has 2 columns and 100 rows;
I did screw up the matrix for the new data to be used for predictions (in
the example I sent today but not yesterday), but even when this is done
right -- where the new data has 10 rows and 2 columns -- there are 100 (not
10) predicted values:

X - matrix(rnorm(200), 100, 2)  # original predictor matrix with 100 rows
y - (X %*% c(1,2) + rnorm(100))  0
dat - data.frame(y=y, X=X)
mod - glm(y ~ X, family=binomial, data=dat)
John,
note that I used glm(y ~ .) (the dot!),
because the names are automatically chosen to be X.1 and X.2, hence you 
cannot use X in the formula in this case ...

Best,
Uwe
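
Under that naming, the dot-formula version can be sketched end to end (illustrative, following the example in this thread):

```r
set.seed(1)
X <- matrix(rnorm(200), 100, 2)
y <- (X %*% c(1, 2) + rnorm(100)) > 0
dat <- data.frame(y = y, X = X)   # columns: y, X.1, X.2

# '.' expands to X.1 + X.2, matching the data frame's column names;
# a formula written as y ~ X would instead find X in the global environment
mod <- glm(y ~ ., data = dat, family = binomial)

new <- data.frame(X = matrix(rnorm(20), 10, 2))  # columns: X.1, X.2
length(predict(mod, newdata = new))              # 10, as intended
```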

new - data.frame(X = matrix(rnorm(20),10, 2)) # corrected -- note 10 rows
predict(mod, new) # note 100 predicted values


followup: Re: [R] Issue with predict() for glm models

2004-09-23 Thread Paul Johnson
I have a follow up question that fits with this thread.
Can you force an overlaid plot showing predicted values to follow the 
scaling of the axes of the plot over which it is laid?

Here is an example based on linear regression, just for clarity.  I have 
followed the procedure described below to create predictions and now 
want to plot the predicted values on top of a small section of the x-y 
scatterplot.

x <- rnorm(100, 10, 10)
e <- rnorm(100, 0, 5)
y <- 5 + 10 * x + e
myReg1 <- lm(y ~ x)
plot(x, y)
newX <- seq(1, 10, 1)
myPred <- predict(myReg1, data.frame(x = newX))
Now, if I do this, I get 2 graphs overlaid but their axes do not line 
up.

par(new=T)
plot(newX, myPred)
The problem is that the second one uses the whole width of the graph 
space, when I'd rather just have it go from the small subset where its x 
is defined, from 1 to 10.  It's stretching the range (1,10) for newX to 
use the same scale that goes from (-15, 35) where it plots x.

I know abline() can do this for lm, but for some other kinds of models, 
no  lines() method is provided, and so I am doing this the old fashioned 
way.

pj
John Fox wrote:
Dear Uwe, 


-Original Message-
From: Uwe Ligges [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 23, 2004 8:06 AM
To: John Fox
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [R] Issue with predict() for glm models

John Fox wrote:

Dear Uwe,
Unless I've somehow messed this up, as I mentioned 
yesterday, what you 

suggest doesn't seem to work when the predictor is a 
matrix. Here's a 

simplified example:

X - matrix(rnorm(200), 100, 2)
y - (X %*% c(1,2) + rnorm(100))  0
dat - data.frame(y=y, X=X)
mod - glm(y ~ X, family=binomial, data=dat) new - data.frame(X = 
matrix(rnorm(20),2)) predict(mod, new)
Dear John,
the questioner had a 2 column matrix with 40 and one with 50 
observations (not a 100 column matrix with 2 observation) and 
for those matrices it works ...


Indeed, and in my example the matrix predictor X has 2 columns and 100 rows;
I did screw up the matrix for the new data to be used for predictions (in
the example I sent today but not yesterday), but even when this is done
right -- where the new data has 10 rows and 2 columns -- there are 100 (not
10) predicted values:

X - matrix(rnorm(200), 100, 2)  # original predictor matrix with 100 rows
y - (X %*% c(1,2) + rnorm(100))  0
dat - data.frame(y=y, X=X)
mod - glm(y ~ X, family=binomial, data=dat)
new - data.frame(X = matrix(rnorm(20),10, 2)) # corrected -- note 10 rows
predict(mod, new) # note 100 predicted values
   12345
6 
  5.75238091   0.31874587  -3.00515893  -3.77282121  -1.97511221
0.54712914 
   789   10   11
12 
  1.85091226   4.38465524  -0.41028694  -1.53942869   0.57613555
-1.82761518 

 . . .
  91   92   93   94   95
96 
  0.36210780   1.71358713  -9.63612775  -4.54257576  -5.29740468
2.64363405 
  97   98   99  100 
 -4.45478627  -2.44973209   2.51587537  -4.09584837 

Actually, I now see the source of the problem:
The data frames dat and new don't contain a matrix named X; rather the
matrix is split columnwise:

names(dat)
[1] y   X.1 X.2
names(new)
[1] X.1 X.2
Consequently, both glm and predict pick up the X in the global environment
(since there is none in dat or new), which accounts for why there are 100
predicted values.
Using list() rather than data.frame() produces the originally expected
behaviour:

new - list(X = matrix(rnorm(20),10, 2))
predict(mod, new)
 1  2  3  4  5  6  7
 5.9373064  0.3687360 -8.3793045  0.7645584 -2.6773842  2.4130547  0.7387318
 8  9 10 
-0.4347916  8.4678728 -0.8976054 

Regards,
 John

Best,
Uwe




  12345
6 
 1.81224443  -5.92955128   1.98718051 -10.05331521   2.6506
-2.50635812 
  789   10   11
12 
 5.63728698  -0.94845276  -3.61657377  -1.63141320   5.03417372
1.80400271 
 13   14   15   16   17
18 
 9.32876273  -5.32723406   5.29373023  -3.90822713 -10.95065186
4.90038016

. . .
  97   98   99  100 
-6.92509812   0.59357486  -1.17205723   0.04209578 

Note that there are 100 rather than 10 predicted values.
But with individuals predictors (rather than a matrix),

x1 - X[,1]
x2 - X[,2]
dat.2 - data.frame(y=y, x1=x1, x2=x2)
mod.2 - glm(y ~ x1 + x2, family=binomial, data=dat.2)
new.2 - data.frame(x1=rnorm(10), x2=rnorm(10)) 
predict(mod.2, new.2)
1  2  3  4  5  
   6  7
6.5723823  0.6356392  4.0291018 -4.7914650  2.1435485 -3.1738096 
-2.8261585

8  9 10 
-1.5255329 -4.7087592  4.0619290

works as expected (?).
Regards,
John


-Original Message-
From: [EMAIL

RE: followup: Re: [R] Issue with predict() for glm models

2004-09-23 Thread Austin, Matt
Could you just use

lines(newX, myPred, col=2)

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] Behalf Of Paul Johnson
Sent: Thursday, September 23, 2004 10:3 AM
To: r help
Subject: followup: Re: [R] Issue with predict() for glm models


I have a follow up question that fits with this thread.

Can you force an overlaid plot showing predicted values to follow the 
scaling of the axes of the plot over which it is laid?

Here is an example based on linear regression, just for clarity.  I have 
followed the procedure described below to create predictions and now 
want to plot the predicted values on top of a small section of the x-y 
scatterplot.

x - rnorm(100, 10, 10)
e - rnorm(100, 0, 5)
y - 5 + 10 *x + e

myReg1 - lm (y~x)
plot(x,y)
newX - seq(1,10,1)
myPred - predict(myReg1,data.frame(x=newX))

Now, if I do this, I get 2 graphs overlaid but their axes do not line 
up.

par(new=T)
plot(newX,myPred$fit)

The problem is that the second one uses the whole width of the graph 
space, when I'd rather just have it go from the small subset where its x 
is defined, from 1 to 10.  Its stretching the range (1,10) for newX to 
use the same scale that goes from (-15, 35) where it plots x

I know abline() can do this for lm, but for some other kinds of models, 
no  lines() method is provided, and so I am doing this the old fashioned 
way.

pj

John Fox wrote:
 Dear Uwe, 
 
 
-Original Message-
From: Uwe Ligges [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 23, 2004 8:06 AM
To: John Fox
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [R] Issue with predict() for glm models

John Fox wrote:


Dear Uwe,

Unless I've somehow messed this up, as I mentioned 

yesterday, what you 

suggest doesn't seem to work when the predictor is a 

matrix. Here's a 

simplified example:



X - matrix(rnorm(200), 100, 2)
y - (X %*% c(1,2) + rnorm(100))  0
dat - data.frame(y=y, X=X)
mod - glm(y ~ X, family=binomial, data=dat) new - data.frame(X = 
matrix(rnorm(20),2)) predict(mod, new)

Dear John,

the questioner had a 2 column matrix with 40 and one with 50 
observations (not a 100 column matrix with 2 observation) and 
for those matrices it works ...

 
 
 Indeed, and in my example the matrix predictor X has 2 columns and 100
rows;
 I did screw up the matrix for the new data to be used for predictions
(in
 the example I sent today but not yesterday), but even when this is done
 right -- where the new data has 10 rows and 2 columns -- there are 100
(not
 10) predicted values:
 
 
X - matrix(rnorm(200), 100, 2)  # original predictor matrix with 100 rows
y - (X %*% c(1,2) + rnorm(100))  0
dat - data.frame(y=y, X=X)
mod - glm(y ~ X, family=binomial, data=dat)
new - data.frame(X = matrix(rnorm(20),10, 2)) # corrected -- note 10 rows
predict(mod, new) # note 100 predicted values
 
12345
 6 
   5.75238091   0.31874587  -3.00515893  -3.77282121  -1.97511221
 0.54712914 
789   10   11
 12 
   1.85091226   4.38465524  -0.41028694  -1.53942869   0.57613555
 -1.82761518 
 
  . . .
 
   91   92   93   94   95
 96 
   0.36210780   1.71358713  -9.63612775  -4.54257576  -5.29740468
 2.64363405 
   97   98   99  100 
  -4.45478627  -2.44973209   2.51587537  -4.09584837 
 
 Actually, I now see the source of the problem:
 
 The data frames dat and new don't contain a matrix named X; rather the
 matrix is split columnwise:
 
 
names(dat)
 
 [1] y   X.1 X.2
 
names(new)
 
 [1] X.1 X.2
 
 Consequently, both glm and predict pick up the X in the global environment
 (since there is none in dat or new), which accounts for why there are 100
 predicted values.
 
 Using list() rather than data.frame() produces the originally expected
 behaviour:
 
 
new - list(X = matrix(rnorm(20),10, 2))
predict(mod, new)
 
  1  2  3  4  5  6
7
 
  5.9373064  0.3687360 -8.3793045  0.7645584 -2.6773842  2.4130547
0.7387318
 
  8  9 10 
 -0.4347916  8.4678728 -0.8976054 
 
 Regards,
  John
 
 
Best,
Uwe








   12345
6 
  1.81224443  -5.92955128   1.98718051 -10.05331521   2.6506
-2.50635812 
   789   10   11
12 
  5.63728698  -0.94845276  -3.61657377  -1.63141320   5.03417372
1.80400271 
  13   14   15   16   17
18 
  9.32876273  -5.32723406   5.29373023  -3.90822713 -10.95065186
4.90038016

 . . .

   97   98   99  100 
 -6.92509812   0.59357486  -1.17205723   0.04209578 


Note that there are 100 rather than 10 predicted values.

But with individuals predictors (rather than a matrix),



x1 - X[,1]
x2 - X[,2]
dat.2 - data.frame(y=y, x1=x1, x2=x2)
mod.2 - glm(y ~ x1 + x2, family=binomial, data=dat.2)
new

Re: followup: Re: [R] Issue with predict() for glm models

2004-09-23 Thread Marc Schwartz
On Thu, 2004-09-23 at 12:02, Paul Johnson wrote:
 I have a follow up question that fits with this thread.
 
 Can you force an overlaid plot showing predicted values to follow the 
 scaling of the axes of the plot over which it is laid?
 
 Here is an example based on linear regression, just for clarity.  I have 
 followed the procedure described below to create predictions and now 
 want to plot the predicted values on top of a small section of the x-y 
 scatterplot.
 
 x - rnorm(100, 10, 10)
 e - rnorm(100, 0, 5)
 y - 5 + 10 *x + e
 
 myReg1 - lm (y~x)
 plot(x,y)
 newX - seq(1,10,1)
 myPred - predict(myReg1,data.frame(x=newX))
 
 Now, if I do this, I get 2 graphs overlaid but their axes do not line 
 up.
 
 par(new=T)
 plot(newX,myPred$fit)
 
 The problem is that the second one uses the whole width of the graph 
 space, when I'd rather just have it go from the small subset where its x 
 is defined, from 1 to 10.  Its stretching the range (1,10) for newX to 
 use the same scale that goes from (-15, 35) where it plots x
 
 I know abline() can do this for lm, but for some other kinds of models, 
 no  lines() method is provided, and so I am doing this the old fashioned 
 way.

Paul,

Instead of using plot() for the second set of points, use points():

x <- rnorm(100, 10, 10)
e <- rnorm(100, 0, 5)
y <- 5 + 10 * x + e

myReg1 <- lm(y ~ x)
plot(x, y)

newX <- seq(1, 10, 1)
myPred <- predict(myReg1, data.frame(x = newX))

points(newX, myPred, pch = 19)


This will preserve the axis scaling. If you use plot() without
explicitly indicating xlim and ylim, it will automatically scale the
axes based upon your new data, even if you indicated that the underlying
plot should not be cleared.

Alternatively, you could also use the lines() function, which will draw
point to point lines:

lines(newX, myPred, col = "red")

If you want fitted lines and prediction/confidence intervals, you could
use a function like matlines(), presuming that a predict method exists
for the model type you want to use.

There is an example of using this in the "R Help Desk" column in R News,
Vol. 3, No. 2 (October 2003), in the first example, with a standard linear
regression model.

HTH,

Marc Schwartz

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


RE: [R] Issue with predict() for glm models

2004-09-23 Thread John Fox
Dear Uwe, 

 -Original Message-
 From: Uwe Ligges [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, September 23, 2004 11:37 AM
 To: John Fox
 Cc: [EMAIL PROTECTED]
 Subject: Re: [R] Issue with predict() for glm models
 
. . .

 
 John,
 
 note that I used glm(y ~ .) (the dot!),
 because the names are automatically chosen to be X.1 and X.2, 
 hence you cannot use X in the formula in this case ...
 
 Best,
 Uwe

OK -- I see. I did notice that you used . in the formula but didn't make the
proper connection!

Thanks,
 John



Re: [R] Issue with predict() for glm models

2004-09-23 Thread Uwe Ligges
John Fox wrote:
Dear Uwe, 


-Original Message-
From: Uwe Ligges [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 23, 2004 11:37 AM
To: John Fox
Cc: [EMAIL PROTECTED]
Subject: Re: [R] Issue with predict() for glm models

. . .

John,
note that I used glm(y ~ .) (the dot!),
because the names are automatically chosen to be X.1 and X.2, 
hence you cannot use X in the formula in this case ...

Best,
Uwe

OK -- I see. I did notice that you used . in the formula but didn't make the
proper connection!
Sorry, my first reply was too short and imprecise.
Thank you for helping to clarify things.
Uwe

Thanks,
 John


Re: [R] Issue with predict() for glm models

2004-09-22 Thread John Fox
Dear Mark and Joe,

Actually, the problem here appears to be caused by the use of a matrix
on the RHS of the model formula. I'm not sure why this doesn't work (I
must be missing something -- perhaps someone else can say what), but
Joe can get the output he expects by specifying the columns of his
matrix as individual predictors in the model formula. BTW, it's better
form to call the generic predict() rather than the method predict.glm()
directly, though the latter will work here.

Editing the original input:

> x1 <- predictors.train[,1]
> x2 <- predictors.train[,2]

> log.reg <- glm(train.class.var ~ x1 + x2,
+ family = binomial(link = "logit"))
> log.reg

Call:  glm(formula = train.class.var ~ x1 + x2, family = binomial(link
= "logit")) 

Coefficients:
(Intercept)   x1   x2  
 0.5102  -0.6118  -0.3192  

Degrees of Freedom: 39 Total (i.e. Null);  37 Residual
Null Deviance:  55.45 
Residual Deviance: 46.49AIC: 52.49 
> New.Data <- data.frame(x1 = predictors.test[,1],
+                        x2 = predictors.test[,2])

> logreg.pred.prob.test <- predict(log.reg, New.Data, type = "response")
> logreg.pred.prob.test
 [1] 0.2160246 0.2706139 0.3536572 0.6206490 0.5218391 0.2363767
0.1072153
 [8] 0.6405459 0.443 0.6680043 0.3377492 0.5892127 0.3230353
0.7540425
[15] 0.2889855 0.5163141 0.6187335 0.1447511 0.5066670 0.4424428
0.4141701
[22] 0.3947212 0.4065674 0.6226195 0.5053101 0.4311552 0.4261810
0.4784102
[29] 0.5126050 0.6756437 0.6147516 0.7659146 0.5219031 0.3938457
0.6495470
[36] 0.5178400 0.8185613 0.7167129 0.5414552 0.8687371 0.5415976
0.8048741
[43] 0.7796451 0.5565636 0.6058371 0.7053130 0.1521769 0.7120320
0.4073465
[50] 0.6801101
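Equivalently, one can wrap the predictors in a data frame from the start, so that the names used at fit time are the same names predict() looks up in newdata. A self-contained sketch (variable names here are illustrative, not Joe's originals):

library(MASS)
set.seed(545345)
Sigma <- matrix(c(1, .5, .5, 1), 2, 2)
train <- rbind(mvrnorm(20, c(1, 1), Sigma), mvrnorm(20, c(0, 0), Sigma))
test  <- rbind(mvrnorm(25, c(1, 1), Sigma), mvrnorm(25, c(0, 0), Sigma))

train.df <- data.frame(y = factor(rep(1:2, each = 20)),
                       x1 = train[, 1], x2 = train[, 2])
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = train.df)

## predict() matches x1 and x2 by name in the new data frame
test.df <- data.frame(x1 = test[, 1], x2 = test[, 2])
p <- predict(fit, newdata = test.df, type = "response")
length(p)  # 50: one fitted probability per test case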


I hope this helps,
 John



On Wed, 22 Sep 2004 15:17:23 -0300
 Fowler, Mark [EMAIL PROTECTED] wrote:
 Perhaps your approach reflects a method of producing a prediction
 dataframe
 that is just unfamiliar to me, but it looks to me like you have
 created two
 predictor variables based on the names of the levels of the original
 predictor (predictors.train1, predictors.train2). I don't know how
 the glm
 function would know that predictors.train1 and predictors.train2 are
 two
 subs for predictors.train. Maybe try just using one prediction
 variable, and
 give it the original variable name (predictors.train). If this works,
 just
 repeat for your second set of values.
 
  Mark Fowler
  Marine Fish Division
  Bedford Inst of Oceanography
 Dept Fisheries & Oceans
  Dartmouth NS Canada
  [EMAIL PROTECTED]
 
 
 
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
 Sent: September 22, 2004 2:53 PM
 To: [EMAIL PROTECTED]
 Subject: [R] Issue with predict() for glm models
 
 
 
 Hello everyone, 
 
 I am having a problem using the predict (or the predict.glm) function
 in R.
 Basically, I run the glm model on a training data set and try to
 obtain
 predictions for a set of new predictors from a test data set (i.e.,
 not
 the predictors that were utilized to obtain the glm parameter
 estimates).
 Unfortunately, every time that I attempt this, I obtain the
 predictions for
 the predictors that were used to fit the glm model. I have looked at
 the R
 mailing list archives and don't believe I am making the same mistakes
 that
 have been made in the past and also have tried to closely follow the
 predict.glm example in the help file. Here is an example of what I am
 trying
 to do: 
 
 
 set.seed(545345)
 
 
 # Necessary Variables # 
 
 
 p <- 2
 train.n <- 20
 test.n <- 25 
 mean.vec.1 <- c(1,1)
 mean.vec.2 <- c(0,0)
 
 Sigma.1 <- matrix(c(1,.5,.5,1),p,p)
 Sigma.2 <- matrix(c(1,.5,.5,1),p,p)
 
 ###
 # Load MASS Library #
 ###
 
 library(MASS)
 
 ###
 # Data to Parameters for Logistic Regression Model #
 ###
 
 train.data.1 <- mvrnorm(train.n,mu=mean.vec.1,Sigma=Sigma.1)
 train.data.2 <- mvrnorm(train.n,mu=mean.vec.2,Sigma=Sigma.2)
 train.class.var <- as.factor(c(rep(1,train.n),rep(2,train.n)))
 predictors.train <- rbind(train.data.1,train.data.2)
 
 ##
 # Test Data Where Predictions for Probabilities Using Logistic Reg. #
 # From Training Data are of Interest                                #
 ##
 
 test.data.1 <- mvrnorm(test.n,mu=mean.vec.1,Sigma=Sigma.1)
 test.data.2 <- mvrnorm(test.n,mu=mean.vec.2,Sigma=Sigma.2)
 predictors.test <- rbind(test.data.1,test.data.2)
 
 ##
 # Run Logistic Regression on Training Data #
 ##
 
 log.reg <- glm(train.class.var ~ predictors.train,
 family = binomial(link = "logit"))
 log.reg
 
 # log.reg
 
 #Call:  glm(formula = train.class.var ~ predictors.train, family =
 #binomial(link = "logit")) 
 #
 #Coefficients:
 #  (Intercept)  predictors.train1  predictors.train2  
 #   0.5105-0.2945-1.0811  
 #