OK, I looked at what you sent me privately and saw your error. I'll reproduce and fix it using a trivial example with lm(), for which the predict() semantics are identical. Before I do, I note that your claim:

"The predict.glm documentation says a warning will be given if the length of newdata is not the same as the training set used to create the model."

is **completely wrong**. What predict.glm (and predict.lm) actually says is:

"Variables are first looked for in newdata and then searched for in the usual way (which will include the environment of the formula used in the fit). A warning will be given if the variables found are not of the same length as those in newdata if it was supplied."

This is *NOT AT ALL* what you claimed. The key point that you are missing is the phrase "searched for in the usual way." The details are a bit technical but in many ways fundamental: they concern how R finds the objects that variable names refer to. They can be found in any good tutorial, or by searching on "scoping in R" or "function environments in R". Section 10.7, "Scope", of the "An Introduction to R" manual shipped with R (and therefore available to you) gives a brief overview.

Anyway, here's the example that explains your error:

> train <- data.frame(y = runif(10), x = runif(10))  ## 10 rows
> test <- data.frame(x = runif(5))                   ## 5 rows

## The following line is the source of your error.
## You have specified your model incorrectly:
> mdl <- lm(train$y ~ train$x, data = train)

## The model is properly fitted because the variables in it, "train$y" and
## "train$x", are found "in the usual way" in the global environment, the
## "enclosing environment" of the formula. (This is the technical bit.)
## This leads to the sort of problem you saw with the predict call:

> predict(mdl, newdata = test)
        1         2         3         4         5         6         7
0.6089476 0.6385268 0.9075589 0.3403276 0.2709199 0.5876634 0.8668307
        8         9        10
0.4689961 0.2571259 0.3281054
Warning message:
'newdata' had 5 rows but variables found have 10 rows

## Explanation: predict() is looking for a variable 'train$x', but test only has a variable 'x', not 'train$x'.
Since it doesn't find it, it goes looking for 'train$x' "in the usual way" in the global environment and finds it -- all 10 values, as before. The prediction is done using that data (the data from the original fit), and the warning message is emitted as per the documentation. Predicting without the newdata argument does the same thing.

The correct syntax for fitting the original model is:

> mdl <- lm(y ~ x, data = train)

## and then the predict() call works fine using the newdata argument
## (as 'x' is found there):

> predict(mdl, newdata = test)
        1         2         3         4         5
0.5134899 0.4619013 0.2458162 0.0446871 0.3146897

All of this is documented, with examples, in ?glm or even ?lm, and in any tutorial on their use. Please spend the time to study these carefully. Trying to mimic examples you find, which seems to be what you are doing, is rarely sufficient.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Wed, Feb 16, 2022 at 7:24 AM Bert Gunter <bgunter.4...@gmail.com> wrote:
>
> You should (almost) always reply to the list to maximize your opportunity for
> useful help. Also, I don't do private consulting.
>
> See ?dput and ?str for ways to put code and data as plain text into a post
> via copying and pasting from the R Console. You can also just type the code
> directly, of course. The R-help server will strip most attachments (I think
> .png is OK for graphs, though; you can ask on list if necessary). I don't
> recall whether Word makes it through, but you really shouldn't need such
> attachments anyway.
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Wed, Feb 16, 2022 at 3:39 AM STEPHEN KAISLER <skaisl...@comcast.net> wrote:
>>
>> Bert:
>>
>> Please see the attached file which shows the approach I used.
>> Thanks for any assistance that you can offer.
>>
>> Steve Kaisler
>>
>> On 02/15/2022 4:05 PM Bert Gunter <bgunter.4...@gmail.com> wrote:
>>
>>
>> ??
>> Show us the error. Show us the call.
>>
>>
>> On Tue, Feb 15, 2022, 12:14 PM STEPHEN KAISLER <skaisl...@comcast.net> wrote:
>>
>> Folks:
>>
>> I have used glm/lm to build a model on a training set derived from auto_mpg data
>> of 274 records (70% sampling).
>>
>> The test data set has 118 records.
>>
>> I am trying to use predict.glm or predict.lm to predict the values of mpg
>> from disp, hp, weight, accel, and cyl.
>>
>> However I get the following message:
>>
>>
>> So, the resulting vector has 274 rows, when I believe it should have just
>> 118 rows - the size of the test data set.
>>
>> I would appreciate it if someone could explain if I am making the call
>> in error.
>>
>> Steve Kaisler
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.