[R] Cross validation tidyLPA
Is there a cross-validation method available for tidyLPA objects? Linda __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Cross validation multivariate kernel regression
> I am planning to implement Nadaraya-Watson regression model, with

I'm not sure what you mean by "implement". Write a package, fit a model, or something else... Reading your whole post, I get the impression you want mid-level "building blocks" so you can customize the model-fitting process in some way. But maybe I've got that wrong. If you want fine control over the model-fitting process (including the cross-validation), then you may have to write your own package, including your own building blocks. Otherwise, I think you should just use what's available. Also, I'm not familiar with every flavor of nonparametric regression available. If I wanted to fit a nonparametric regression model, I would start with the mgcv package, which is hard to beat.
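For anyone landing on this thread later, a minimal mgcv sketch (simulated data; x1, x2 and y are illustrative names, not from the original post) showing that smoothness selection is built in, so no hand-rolled bandwidth cross-validation is needed:

```r
# Minimal sketch: smooth multi-predictor regression with mgcv (ships with R).
# Data are simulated; variable names are illustrative only.
library(mgcv)

set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- sin(2 * pi * x1) + x2^2 + rnorm(n, sd = 0.2)

# Smoothness of each s() term is selected automatically (REML here; GCV is
# the other common choice), which plays the role of CV-tuned bandwidths.
fit  <- gam(y ~ s(x1) + s(x2), method = "REML")
pred <- predict(fit, newdata = data.frame(x1 = 0.5, x2 = 0.5))
```

The same call scales to three or four predictors by adding further s() terms.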
[R] Cross validation multivariate kernel regression
Hi, This question is general. I have a data set of n observations, consisting of a single response variable y and p regressor variables (n ~ 50, p ~ 3 or 4). I am planning to implement the Nadaraya-Watson regression model, with bandwidths optimized via cross-validation. For cross-validation, I will need to choose 10 outsample/test data sets of a given size (= n/10) for each choice of the bandwidth vector, and then choose the optimum bandwidth vector (in terms of MSE or any reasonable loss function; we can take it to be MSE, as an example). The difficulty is I can't find any code to do this under: A) multiple regressors (p > 1) AND B) I get to choose the outsample datasets. Thanks for any help/insight you can provide. Regards, Preetam
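Since no ready-made code seems to cover both requirements, here is a self-contained sketch (simulated data, illustrative names, one shared bandwidth for all dimensions for brevity) of a product-kernel Nadaraya-Watson estimator with explicitly constructed 10-fold test sets; for production use, the np package (npregbw with bwmethod = "cv.ls") automates least-squares CV bandwidth selection for p > 1:

```r
# Sketch: Nadaraya-Watson with a product Gaussian kernel, bandwidth chosen
# by 10-fold CV over a small grid. All data and names are illustrative.
set.seed(1)
n <- 50; p <- 3
X <- matrix(runif(n * p), n, p)
y <- sin(2 * pi * X[, 1]) + X[, 2] + rnorm(n, sd = 0.1)

nw_predict <- function(Xtr, ytr, Xte, h) {
  # h: vector of p bandwidths; weights from a product Gaussian kernel
  apply(Xte, 1, function(x0) {
    w <- exp(-0.5 * colSums(((t(Xtr) - x0) / h)^2))
    sum(w * ytr) / sum(w)
  })
}

fold <- sample(rep(1:10, length.out = n))  # 10 disjoint test sets, size ~ n/10
grid <- c(0.05, 0.1, 0.2, 0.4)             # candidate common bandwidths
cv_mse <- sapply(grid, function(h) {
  mean(sapply(1:10, function(k) {
    te <- fold == k
    mean((y[te] - nw_predict(X[!te, , drop = FALSE], y[!te],
                             X[te, , drop = FALSE], rep(h, p)))^2)
  }))
})
h_best <- grid[which.min(cv_mse)]
```

Replacing `grid` with a grid of bandwidth vectors (e.g. via expand.grid) gives per-dimension bandwidth optimization with the same loop.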
[R] Cross-validation : can't get the predicted response on the testing data
Dear R-experts, I am doing cross-validation for 2 robust regressions (HBR and fast Tau) and can't get the 2 error rates (RMSE and MAPE). The problem is predicting the response on the testing data: I get 2 error messages. Below is a reproducible (fictional) example.

# install.packages("MLmetrics")
# install.packages("robustbase")
# install.packages("MASS")
# install.packages("quantreg")
# install.packages("RobPer")
# install.packages("scatterplot3d")
# install.packages("devtools")
# library("devtools")
# install_github("kloke/hbrfit")
# install.packages('http://www.stat.wmich.edu/mckean/Stat666/Pkgs/npsmReg2_0.1.1.tar.gz')
library(robustbase)
library(MASS)
library(quantreg)
library(RobPer)
library(scatterplot3d)
library(hbrfit)
library(MLmetrics)
# numeric variables
A <- c(2,3,4,3,2,6,5,6,4,3,5,55,6,5,4,5,6,6,7,52)
B <- c(45,43,23,47,65,21,12,7,18,29,56,45,34,23,12,65,4,34,54,23)
C <- c(21,54,34,12,4,56,74,3,12,71,14,15,63,34,35,23,24,21,69,32)
# Create a data frame
BIO <- data.frame(A, B, C)
# randomize sampling seed
set.seed(1)
n <- dim(BIO)[1]
p <- 0.667  # sample size fraction
sam <- sample(1:n, floor(p*n), replace=FALSE)
# Sample training data
Training <- BIO[sam, ]
# Sample testing data
Testing <- BIO[-sam, ]
# Build the 2 models
fit <- FastTau(model.matrix(~Training$A+Training$B), Training$C)
HBR <- hbrfit(C ~ A + B)
# Predict the response on the testing data (these two lines error)
ypred <- predict(fit, newdata=Testing)
ypred <- predict(HBR, newdata=Testing)
# Get the true response from testing data
y <- BIO[-sam, ]$C
# Get error rates
RMSE <- sqrt(mean((y-ypred)^2))
RMSE
MAPE <- mean(abs((y-ypred)/y))
MAPE
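One way around fitters that lack a predict() method for new data is to build the test-set design matrix yourself and multiply by the fitted coefficients. A sketch, with lm standing in for FastTau/hbrfit so it runs anywhere, and with simulated data:

```r
# Sketch: manual prediction when a fitted object has no usable predict().
# lm stands in for the robust fitters; the idea is identical: coefficients
# times the design matrix built on the test data with the training formula.
set.seed(1)
BIO <- data.frame(A = rnorm(20, 10), B = rnorm(20, 30), C = rnorm(20, 25))
sam <- sample(1:20, 13)
Training <- BIO[sam, ]
Testing  <- BIO[-sam, ]

fit   <- lm(C ~ A + B, data = Training)
Xte   <- model.matrix(~ A + B, data = Testing)  # intercept + A + B columns
ypred <- drop(Xte %*% coef(fit))

y    <- Testing$C
RMSE <- sqrt(mean((y - ypred)^2))
MAPE <- mean(abs((y - ypred) / y))              # note the parentheses
```

For the robust fits, coef() on the returned object (or the coefficient slot the package documents) plays the role of coef(fit) here.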
Re: [R] cross validation in random forest using rfcv function
> On Aug 23, 2017, at 10:59 AM, Elahe chalabi via R-help wrote: > > Any responses?!

When I look at the original post I see a question about a function named `rfcv` but do not see a `library` call to load such a function. I also see a reference to a help page or vignette, perhaps, from that unidentified package. So it appears to me that you expect the rest of us to go searching for that function if we do not use it on a regular basis. You also apparently expect us to construct a dataset for testing. I'm not inclined to make all that effort, and from the crashing silence of the last 24 hours on this venue, it appears I am not alone in thinking you presume too much. Read the Posting Guide and try to better understand why your behavior might not be eliciting the level of interest you were hoping for. -- David.

> On Wednesday, August 23, 2017 5:50 AM, Elahe chalabi via R-help wrote:
> Hi all, I would like to do cross validation in random forest using the rfcv function. As the documentation for this package says:
> rfcv(trainx, trainy, cv.fold=5, scale="log", step=0.5, mtry=function(p) max(1, floor(sqrt(p))), recursive=FALSE, ...)
> however I don't know how to build trainx and trainy for my data set, and I could not understand the way trainx is built in the package documentation example for the iris data set. Here is my data set; I want to do cross validation to see accuracy in classifying the Alzheimer and Control groups:
> str(data)
> 'data.frame': 499 obs. of 606 variables:
>  $ Gender        : int 0 0 0 0 0 1 1 1 1 1 ...
>  $ NumOfWords    : num 157 111 163 176 100 124 201 100 76 101 ...
>  $ NumofLivings  : int 6 6 9 4 3 5 3 3 4 3 ...
>  $ NumofStopWords: num 77 45 87 91 46 64 104 37 32 41 ...
>  .
>  .
>  $ Group         : Factor w/ 2 levels "Alzheimer","Control": ...
> So basically trainy should be data$Group, but how about trainx? Could anyone help me with this?
> Thanks for any help!
> > Elahe

David Winsemius, Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.' -Gehm's Corollary to Clarke's Third Law
[R] cross validation in random forest using rfcv function
Any responses?! On Wednesday, August 23, 2017 5:50 AM, Elahe chalabi via R-help wrote:

> Hi all, I would like to do cross validation in random forest using the rfcv function. As the documentation for this package says: rfcv(trainx, trainy, cv.fold=5, scale="log", step=0.5, mtry=function(p) max(1, floor(sqrt(p))), recursive=FALSE, ...) however I don't know how to build trainx and trainy for my data set [...] So basically trainy should be data$Group, but how about trainx? Could anyone help me with this? Thanks for any help! Elahe
[R] cross validation in random forest using rfcv function
Hi all, I would like to do cross validation in random forest using the rfcv function. As the documentation for this package says:

rfcv(trainx, trainy, cv.fold=5, scale="log", step=0.5, mtry=function(p) max(1, floor(sqrt(p))), recursive=FALSE, ...)

however I don't know how to build trainx and trainy for my data set, and I could not understand the way trainx is built in the package documentation example for the iris data set. Here is my data set; I want to do cross validation to see accuracy in classifying the Alzheimer and Control groups:

str(data)
'data.frame': 499 obs. of 606 variables:
 $ Gender        : int 0 0 0 0 0 1 1 1 1 1 ...
 $ NumOfWords    : num 157 111 163 176 100 124 201 100 76 101 ...
 $ NumofLivings  : int 6 6 9 4 3 5 3 3 4 3 ...
 $ NumofStopWords: num 77 45 87 91 46 64 104 37 32 41 ...
 .
 .
 $ Group         : Factor w/ 2 levels "Alzheimer","Control": ...

So basically trainy should be data$Group, but how about trainx? Could anyone help me with this? Thanks for any help! Elahe
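For completeness, a sketch of the trainx/trainy split on a toy data frame standing in for the real 499 x 606 one (the rfcv call itself is guarded, since randomForest may not be installed):

```r
# Sketch: building trainx / trainy for randomForest::rfcv.
# trainx = all predictor columns, trainy = the class factor.
set.seed(1)
data <- data.frame(Gender     = rbinom(40, 1, 0.5),
                   NumOfWords = rpois(40, 120),
                   Group      = factor(sample(c("Alzheimer", "Control"),
                                              40, replace = TRUE)))

trainx <- data[, setdiff(names(data), "Group")]  # drop the response column
trainy <- data$Group

# Then, if randomForest is installed:
if (requireNamespace("randomForest", quietly = TRUE)) {
  res <- randomForest::rfcv(trainx, trainy, cv.fold = 5)
  res$error.cv  # CV error rate vs. number of predictors retained
}
```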
Re: [R] Cross-Validation for Zero-Inflated Models
1) Helpdesk implies people whose job it is to provide support. R-help is a mailing list in which users help each other when they have spare time. 2) You sent an email to the R-help mailing list, not to Lara, whoever that is. I suggest you figure out what her email address is and send your question to her directly, or read the Posting Guide mentioned below and then pose an entirely new question of your own to the list. There is a lot of existing research, and there are packages, related to cross-validation, but you are going to need to illustrate why you think the usual tools are not sufficient. Have you looked at the CRAN Task Views? 3) Email only has linkage to other email when it follows as a reply... you did not reply to her email, so no one reading your email (quite likely even Lara, if she is even still on the list) has any idea what question you are referring to. -- Sent from my phone. Please excuse my brevity.

On June 21, 2017 7:16:49 AM PDT, Eric Weine wrote:
>Lara:
>
>I see you sent this email to the R helpdesk a really long time ago, but I was just wondering if you ever got an answer to this question. I was just thinking that I would build my own cross validation function, but if you figured out a way to do this automatically, could you let me know?
>
>Thanks,
>
>Eric Weine.
Re: [R] Cross-Validation for Zero-Inflated Models
Lara: I see you sent this email to the R helpdesk a really long time ago, but I was just wondering if you ever got an answer to this question. I was just thinking that I would build my own cross validation function, but if you figured out a way to do this automatically, could you let me know? Thanks, Eric Weine.
[R] cross validation with variables which have one factor only
Dear R-team, I did a model selection by AIC which explains the habitat use of my animals in six different study sites (see attached files: cross_val_CORINE04032014.csv and cross_val_CORINE04032014.r). Sites were used as a random factor because they are distributed over the Alps and so are very different. I also removed variables which exist in one study area only before doing the model selection. Next, I tried to do a cross validation with the estimated best model for its prediction per site. That means I used the model of five sites together against the remaining site. In this step I received an error:

val_10_fold_minger <- cv.glm(data = minger, glmfit = best_model_year, K = 10)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels

So for some of the variables used in the model formula below there are actually not two factor levels (example: C324F, where absence: 153 but presence: 0):

best_model_year <- glm(dung1_b ~ C231F + C324F + C332F, family = binomial(logit), minger)

Does somebody know of a cross-validation method which can deal with variables that have one factor level only? Kindly, Maik
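One common workaround is to drop single-level predictors from the training data before each fold's refit. A sketch on simulated data (drop_constant and the column names are illustrative, modelled on the ones in the post):

```r
# Sketch: remove predictors with a single observed level before fitting,
# so glm()/cv.glm() no longer hit the "contrasts" error.
drop_constant <- function(df, response) {
  keep <- vapply(df, function(x) length(unique(x[!is.na(x)])) > 1, logical(1))
  keep[response] <- TRUE  # never drop the response itself
  df[, keep, drop = FALSE]
}

set.seed(1)
minger <- data.frame(
  dung1_b = rbinom(30, 1, 0.5),
  C231F   = factor(sample(c("absence", "presence"), 30, replace = TRUE)),
  C324F   = factor(rep("absence", 30))  # one level only, as in the post
)

clean <- drop_constant(minger, "dung1_b")  # C324F is gone
fit   <- glm(dung1_b ~ ., family = binomial, data = clean)
```

Inside a leave-one-site-out loop, the same cleaning is applied to each training subset, so the retained variable set can legitimately differ between folds.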
[R] Cross validation in R
Guys, I select 70% of my data and keep 30% of it for model validation.

mydata <- read.csv(file.choose(), header=TRUE)
select <- sample(nrow(mydata), nrow(mydata) * .7)
data70 <- mydata[select,]  # training
data30 <- mydata[-select,] # testing
temp.glm <- glm(Death ~ Temperature, data=data70, family=binomial(link=logit))
library(ROCR)
# ROC curve and assessment of my prediction
data30$pred <- predict(temp.glm, newdata=data30, type="response")
pred <- prediction(data30$pred, data30$Death)
perf <- performance(pred, "tpr", "fpr")
plot(perf); abline(0, 1, col="red")
attributes(performance(pred, 'auc'))$y.values[[1]]  # area under the ROC

How do I make a loop so that the process can be repeated several times, producing random ROC curves and area-under-ROC values?
Re: [R] Cross validation in R
This code is untested, since you did not provide any example data. But it may help you get started. Jean

mydata <- read.csv(file.choose(), header=TRUE)
library(ROCR)
# ROC curve and assessment of my prediction
plot(0:1, 0:1, type="n", xlab="False positive rate", ylab="True positive rate")
abline(0, 1, col="red")
nsim <- 5
auc <- rep(NA, nsim)
for(i in 1:nsim) {
  select <- sample(nrow(mydata), round(nrow(mydata)*0.7))
  data70 <- mydata[select, ]  # train
  data30 <- mydata[-select, ] # test
  temp.glm <- glm(Death ~ Temperature, data=data70, family=binomial)
  data30$pred <- predict(temp.glm, newdata=data30, type="response")
  pred <- prediction(data30$pred, data30$Death)
  perf <- performance(pred, "tpr", "fpr")
  plot(perf, add=TRUE)
  auc[i] <- attributes(performance(pred, "auc"))$y.values[[1]]  # area under the ROC
}
auc

On Tue, Jul 2, 2013 at 3:25 AM, Eddie Smith eddie...@gmail.com wrote:

> Guys, I select 70% of my data and keep 30% of it for model validation. [...] How do I make a loop so that the process can be repeated several times, producing random ROC curves and area-under-ROC values?
Re: [R] Cross validation in R
> How do I make a loop so that the process can be repeated several times, producing random ROC curves and area-under-ROC values?

Using the caret package http://caret.r-forge.r-project.org/ -- Max
Re: [R] Cross validation for Naive Bayes and Bayes Networks
Hi Guilherme, On Sun, Apr 14, 2013 at 11:48 PM, Guilherme Ferraz de Arruda gu...@yahoo.com.br wrote:

> Hi, I need to classify, using Naive Bayes and Bayes Networks, and estimate their performance using cross validation. How can I do this? I tried the bnlearn package for Bayes Networks, although I need to get more indexes, not only the error rate (precision, sensitivity, ...).

You can do that using the object returned by bn.cv(), because it contains the predicted values and the indexes of the corresponding observations in the original data, for each fold. It's just a matter of reassembling observed and predicted class labels and computing your metrics.

> I also tried the e1071 package, but I could not find a way to do cross-validation.

You might be able to trick the tune() function into doing it, but I am not sure. Marco -- Marco Scutari, Ph.D. Research Associate, Genetics Institute (UGI), University College London (UCL), United Kingdom
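Reassembling observed and predicted labels and computing the extra indexes is only a few lines of base R. A sketch with random labels (with bn.cv() output you would instead pull the observed/predicted pairs out of each fold and pool them):

```r
# Sketch: precision / sensitivity / specificity from pooled observed and
# predicted class labels. Labels here are random, purely for illustration.
set.seed(1)
obs  <- factor(sample(c("yes", "no"), 100, replace = TRUE))
pred <- factor(sample(c("yes", "no"), 100, replace = TRUE))

cm <- table(pred = pred, obs = obs)  # confusion matrix
TP <- cm["yes", "yes"]; FP <- cm["yes", "no"]
FN <- cm["no",  "yes"]; TN <- cm["no",  "no"]

precision   <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)  # recall
specificity <- TN / (TN + FP)
```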
[R] Cross validation for Naive Bayes and Bayes Networks
Hi, I need to classify, using Naive Bayes and Bayes Networks, and estimate their performance using cross validation. How can I do this? I tried the bnlearn package for Bayes Networks, although I need to get more indexes, not only the error rate (precision, sensitivity, ...). I also tried the e1071 package, but I could not find a way to do cross-validation. Thanks, everyone. Guilherme.
[R] Cross Validation with SVM
Good morning. I am using package e1071 to develop an SVM model. My code is:

x <- subset(dataset, select = -Score)
y <- dataset$Score
model <- svm(x, y, cross=10)
print(model)
summary(model)

As 10-fold CV produces 10 models, I need two things: 1) to have access to each model from the 10-fold CV, and 2) to predict new instances with each model, to know which one performs best. Can anyone help me? Thanks!
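svm(..., cross = 10) only reports per-fold performance, not the fitted fold models, so an explicit loop is the usual route when you need the models themselves. A sketch on simulated data, with lm standing in for svm so it runs without e1071 (substitute svm(Score ~ ., data = train) one-for-one):

```r
# Sketch: explicit 10-fold CV that keeps every fold's model so each can be
# inspected and reused for prediction. Data and names are illustrative.
set.seed(1)
dataset <- data.frame(Score = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
fold <- sample(rep(1:10, length.out = nrow(dataset)))  # disjoint folds

models <- vector("list", 10)
mse    <- numeric(10)
for (k in 1:10) {
  train <- dataset[fold != k, ]
  test  <- dataset[fold == k, ]
  models[[k]] <- lm(Score ~ ., data = train)  # e1071: svm(Score ~ ., data = train)
  p <- predict(models[[k]], newdata = test)
  mse[k] <- mean((test$Score - p)^2)
}
best <- which.min(mse)  # fold whose model did best on its own held-out fold
```

Note that picking "the best fold model" is itself a selection step; the mean of mse is the honest CV performance estimate.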
[R] Cross validation for nls function
Hi, I've written a logistic function using nls and I'd like to do cross validation for this. Is there a package for that? Below is an example of my data and the function. The N terms are presence/absence data and the response is successful/failed data.

y1 <- sample(0:1, 100, replace=T)
N1 <- sample(0:1, 100, replace=T)
N2 <- sample(0:1, 100, replace=T)
N3 <- sample(0:1, 100, replace=T)
N4 <- sample(0:1, 100, replace=T)
Sw <- function(y1, N1, N2, N3, N4) {
  SA <- nls(y1 ~ exp(c+(a1*N1)+(a2*N2)+(a3*N3)+(a4*N4)) /
              (1+exp(c+(a1*N1)+(a2*N2)+(a3*N3)+(a4*N4))),
            start=list(a1=-0.2, a2=-0.2, a3=-0.2, a4=-0.2, c=0.2))
  SA
}
model <- Sw(y1, N1, N2, N3, N4)
summary(model)

Thanks for any help! /Anna
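A note in passing: that nls call is exactly the logistic GLM, so glm(family = binomial) fits the same model, and boot::cv.glm (boot ships with R) then gives the cross-validation directly. A sketch with two of the N terms, on simulated data:

```r
# Sketch: the logistic curve in the post equals glm(..., family = binomial);
# boot::cv.glm then supplies K-fold cross-validation out of the box.
library(boot)

set.seed(1)
d <- data.frame(y1 = sample(0:1, 100, replace = TRUE),
                N1 = sample(0:1, 100, replace = TRUE),
                N2 = sample(0:1, 100, replace = TRUE))

fit <- glm(y1 ~ N1 + N2, family = binomial, data = d)
cv  <- cv.glm(d, fit, K = 10)
cv$delta[1]  # 10-fold CV prediction error (mean squared error by default)
```

A different loss can be passed via the cost argument of cv.glm, e.g. misclassification rate at a 0.5 cutoff.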
[R] cross validation in glmnet
I am using cv.glmnet from the glmnet package for logistic regression. My dataset is very imbalanced: 5% of samples are from one group, the rest from the other. I'm wondering: when doing cv.glmnet to choose lambda, does every fold have the same ratio of the two groups (in my case, 5% of samples from one group and the rest from the other in every fold), or is the split just random? Many thanks, yan
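By default cv.glmnet assigns observations to folds at random, without stratification; if the 5% class must appear in every fold in the same ratio, you can build a stratified fold assignment yourself and pass it via the foldid argument. A sketch (only the fold construction runs here; x and the final cv.glmnet call are assumed):

```r
# Sketch: stratified fold ids for cv.glmnet, balancing a 5% / 95% outcome.
set.seed(1)
y <- rep(c(1, 0), times = c(50, 950))  # 5% minority class

foldid <- integer(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  # balanced 1..10 labels within each class, then shuffled
  foldid[idx] <- sample(rep(1:10, length.out = length(idx)))
}
table(y, foldid)  # every fold: 5 minority, 95 majority
# then: cv.glmnet(x, y, family = "binomial", foldid = foldid)
```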
[R] cross validation in rvm not working? (kernlab package)
Hi, according to ?rvm, the relevance vector machine function as implemented in the kernlab package has an argument 'cross' with which you can perform k-fold cross validation. However, when I try to add 10-fold cross validation I get the following error message:

Error in match.arg(type, c("C-svc", "nu-svc", "kbb-svc", "spoc-svc", "C-bsvc", :
  'arg' should be one of "C-svc", "nu-svc", "kbb-svc", "spoc-svc", "C-bsvc", "one-svc", "eps-svr", "eps-bsvr", "nu-svr"

Code example:

# create data
x <- seq(-20, 20, 0.1)
y <- sin(x)/x + rnorm(401, sd=0.05)
# train relevance vector machine
foo <- rvm(x, y, cross=10)

So, does that mean that cross-validation is not working for rvm at the moment? (Since the type argument only allows support vector regression or classification.)
Re: [R] cross validation in rvm not working? (kernlab package)
Please report bugs in packages to the corresponding package maintainer (perhaps suggesting a fix if you have an idea how to do that). Uwe Ligges

On 14.02.2012 12:42, Martin Batholdy wrote:

> Hi, according to ?rvm, the relevance vector machine function as implemented in the kernlab package has an argument 'cross' with which you can perform k-fold cross validation. However, when I try to add 10-fold cross validation I get the following error message [...] So, does that mean that cross-validation is not working for rvm at the moment?
[R] Cross-validation error with tune and with rpart
Hello list, I'm trying to generate classifiers for a certain task using several methods, one of them being decision trees. The doubts come when I want to estimate the cross-validation error of the generated tree:

tree <- rpart(y ~ ., data=data.frame(xsel, y), cp=0.1)
ptree <- prune(tree, cp=tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"])
ptree$cptable
           CP nsplit rel error xerror       xstd
1  0.33120000      0    1.0000 1.0000 0.02856022
2  0.08640000      1    0.6688 0.6704 0.02683544
3  0.02986667      2    0.5824 0.5856 0.02584564
4  0.02880000      5    0.4928 0.5760 0.02571738
5  0.01920000      6    0.4640 0.5168 0.02484761
6  0.01440000      8    0.4256 0.5056 0.02466708
7  0.00960000     12    0.3552 0.5024 0.02461452
8  0.00880000     15    0.3264 0.4944 0.02448120
9  0.00800000     17    0.3088 0.4768 0.02417800
10 0.00480000     25    0.2448 0.4672 0.02400673

If I got it right, xerror stands for the cross-validation error (using 10-fold by default); this is pretty high (0.4672 over 1). However, if I do something similar using tune from e1071 I get a much lower error:

treetune <- tune(rpart, y ~ ., data=data.frame(xsel, y), predict.func = treeClassPrediction, cp=0.0048)
treetune$best.performance
[1] 0.2243049

I'm also assuming that the performance returned by tune is the cross-validation error (also 10-fold by default). So where does this enormous difference come from? What am I missing? Also, is "rel error" the relative error on the training set? The documentation is not very descriptive: cptable - the table of optimal prunings based on a complexity parameter. Thanks and happy pre-new year, -- israel
Re: [R] Cross-validation error with tune and with rpart
On 31/12/2011 12:34, Israel Saeta Pérez wrote:

> Hello list, I'm trying to generate classifiers for a certain task using several methods, one of them being decision trees. The doubts come when I want to estimate the cross-validation error of the generated tree [...] If I got it right, xerror stands for the cross-validation error (using 10-fold by default); this is pretty high (0.4672 over 1). [...]

You didn't get it right. Please read the documentation, or contemplate why the first line is exactly one. In any case, that table is not about error rates for the final tree: it is part of the model selection step (to cross-validate the final tree you would need to include the choice of pruning inside the cross-validation). Did you look up the rpart technical report or one of the books explaining its output? Google 'rpart technical report' if you need to find it. [...] -- Brian D. Ripley, rip...@stats.ox.ac.uk, Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/, University of Oxford, Tel: +44 1865 272861 (self), +44 1865 272866 (PA), Fax: +44 1865 272595, 1 South Parks Road, Oxford OX1 3TG, UK
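To make the two numbers comparable: rel error and xerror are scaled so the root node scores exactly 1, so multiplying by the root-node error rate gives absolute error rates. A sketch on iris (rpart ships with R; the original poster's data are not available):

```r
# Sketch: convert rpart's relative CV error column into absolute error rates.
library(rpart)

fit <- rpart(Species ~ ., data = iris)
cp  <- fit$cptable

# Root-node misclassification rate (majority-class baseline): 2/3 for iris.
root_error <- 1 - max(table(iris$Species)) / nrow(iris)

abs_xerror <- cp[, "xerror"] * root_error  # on the same scale as tune()'s error
```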
[R] cross-validation complex model AUC Nagelkerke R squared code
Hi there, I really tried hard to understand and find my own solution, but now I think I have to ask for your help. I have already developed some script code for my problem, but I doubt that it is correct. The problem is the following: imagine you develop a logistic regression model with a binary outcome Y (0/1) and possible predictors (X1, X2, X3, ...). The development of the final model is quite complex and involves several steps (stepwise forward selection with LR-test statistics, incorporating interaction effects, etc.). The final prediction at the end, however, is made through a glm object (called fit.glm). Then, I think, it is no problem to calculate a Nagelkerke R squared measure and an AUC value (for example with the pROC package) following this script:

BaseRate <- table(Data$Y)[[1]] / sum(table(Data$Y))
# L0 = log-likelihood of the null model
L0 <- (BaseRate*log(BaseRate) + (1-BaseRate)*log(1-BaseRate)) * sum(table(Data$Y))
LIKM <- predict(fit.glm, type="response")
# LM = log-likelihood of the fitted model
LM <- sum(Data$Y*log(LIKM) + (1-Data$Y)*log(1-LIKM))
R2 <- 1 - exp(2*(L0 - LM)/n)   # Cox-Snell
R2_max <- 1 - exp(2*L0/n)
R2_Nagelkerke <- R2/R2_max
library(pROC)
AUC <- auc(Data$Y, LIKM)

I checked this calculation of R2_Nagelkerke and the AUC value against the built-in calculation in package Design and got consistent results. Now I implement a cross-validation procedure, dividing the sample randomly into k subsamples of equal size. Afterwards I calculate the predicted probabilities for each k-th subsample with a model (fit.glm_s) developed by the same algorithm as for the whole-data model (stepwise forward selection etc.) but using all but the k-th subsample. I store the predicted probabilities and build up my LIKM vector (see above) in the following way:

LIKM[sub] <- predict(fit.glm_s, newdata=Data[sub, ], type="response")

Now I use the same formula/script as above; the only change therefore consists in the calculation of the LIKM vector.
BaseRate <- table(Data$Y)[[1]] / sum(table(Data$Y))
L0 <- (BaseRate*log(BaseRate) + (1-BaseRate)*log(1-BaseRate)) * sum(table(Data$Y))
# ... calculation of the cross-validated LIKM, see above ...
LM <- sum(Data$Y*log(LIKM) + (1-Data$Y)*log(1-LIKM))
R2 <- 1 - exp(2*(L0 - LM)/n)
R2_max <- 1 - exp(2*L0/n)
R2_Nagelkerke <- R2/R2_max
AUC <- auc(Data$Y, LIKM)

When I compare my results (using more simply developed models) with the validate method in package Design (method="cross", B=10), it seems to me that I consistently underestimate the true expected Nagelkerke R squared. Additionally, I'm very unsure about the way I try to calculate a cross-validated AUC. Do I have an error in my thinking about how to easily obtain a cross-validated AUC and R squared for a model developed to predict a binary outcome? I hope my problem is understandable and you can help me. Best regards, Jürgen -- Jürgen Biedermann, Bergmannstraße 3, 10961 Berlin-Kreuzberg, Mobil: +49 176 247 54 354, Home: +49 30 250 11 713, e-mail: juergen.biederm...@gmail.com
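For comparison, here is a self-contained sketch (simulated data; names illustrative) of the same calculation from pooled out-of-fold probabilities, with the AUC computed via the rank (Mann-Whitney) identity so no extra package is needed:

```r
# Sketch: cross-validated Nagelkerke R^2 and AUC from pooled out-of-fold
# predicted probabilities, base R only. Data are simulated with a real signal.
set.seed(1)
n <- 200
Data <- data.frame(X1 = rnorm(n))
Data$Y <- rbinom(n, 1, plogis(-0.5 + 1.2 * Data$X1))

fold <- sample(rep(1:10, length.out = n))
phat <- numeric(n)
for (k in 1:10) {
  fit <- glm(Y ~ X1, family = binomial, data = Data[fold != k, ])
  phat[fold == k] <- predict(fit, newdata = Data[fold == k, ], type = "response")
}

y  <- Data$Y
l0 <- sum(y) * log(mean(y)) + sum(1 - y) * log(1 - mean(y))  # null log-lik
lM <- sum(y * log(phat) + (1 - y) * log(1 - phat))           # out-of-fold log-lik
R2_CoxSnell   <- 1 - exp(2 * (l0 - lM) / n)
R2_Nagelkerke <- R2_CoxSnell / (1 - exp(2 * l0 / n))

# AUC = P(phat for a 1 exceeds phat for a 0), via the rank identity:
r   <- rank(phat)
n1  <- sum(y == 1); n0 <- sum(y == 0)
AUC <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
```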
[R] cross-validation in rpart
I am trying to find out what type of sampling scheme is used to select the 10 subsets in the 10-fold cross-validation process used in rpart to choose the best tree. Is it simple random sampling? Is there any documentation available on this? Thanks, Penny.
Re: [R] cross-validation in rpart
I assume you mean rpart::xpred.rpart? The beauty of R means that you can look at the source. For the simple case (where xval is a single number) the code does indeed do simple random sampling:

xgroups <- sample(rep(1:xval, length = nobs), nobs, replace = FALSE)

If you want another sampling scheme, then you simply pass a vector as the xval parameter, as the documentation says: "This may also be an explicit list of integers that define the cross-validation groups". Hope this helps a little.

Allan

On 19/03/11 09:21, Penny B wrote:
> I am trying to find out what type of sampling scheme is used to select the 10 subsets in the 10-fold cross-validation process used in rpart to choose the best tree. Is it simple random sampling? Is there any documentation available on this? Thanks, Penny.
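Supplying an explicit group vector works as the documentation quoted above describes. A minimal sketch, using the kyphosis data shipped with rpart (the model formula and fold count are illustrative):

```r
## Sketch: explicit cross-validation groups for xpred.rpart.
library(rpart)

set.seed(1)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
nobs <- nrow(kyphosis)

## the same simple random partition the rpart source uses by default
xgroups <- sample(rep(1:10, length = nobs), nobs, replace = FALSE)

## pass the group vector instead of a fold count
xp <- xpred.rpart(fit, xval = xgroups)
dim(xp)   # one row per observation, one column per cp value
```

Replacing `xgroups` with any other integer vector of fold labels (e.g. a stratified assignment) changes the sampling scheme without touching rpart internals.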
Re: [R] cross-validation in rpart
On Sat, 19 Mar 2011, Penny B wrote:
> I am trying to find out what type of sampling scheme is used to select the 10 subsets in the 10-fold cross-validation process used in rpart to choose the best tree. Is it simple random sampling? Is there any documentation available on this?

Not SRS (at least in its conventional meaning), as it is partitioning: the 10 folds are disjoint. Note that this happens in two places, in rpart() and in xpred.rpart(), but the (default) method is the same. I presume you asked about the first, but it wasn't clear.

There is a lot of documentation on the meaning of '10-fold cross-validation', e.g. in my 1996 book. There are a few slightly different ways to do it, and you can read the rpart sources if you want to know the details.

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
[R] cross validation? when rlm, lmrob or lmRob
Dear community, I have fitted a model using the commands above (rlm, lmrob or lmRob). I don't have new data to validate the models obtained. I was wondering whether something similar to CVlm exists for robust regression. In case there isn't, any suggestion for validation would be appreciated. Thanks, u...@host.com
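In the absence of a ready-made CVlm analogue, a hand-rolled k-fold loop is straightforward. A minimal sketch for MASS::rlm, using the built-in stackloss data (formula, data set and k are illustrative; swapping in robustbase::lmrob is a one-line change):

```r
## Sketch: k-fold cross-validation for a robust regression fit (MASS::rlm).
library(MASS)

cv_rlm <- function(formula, data, k = 10) {
  n <- nrow(data)
  fold <- sample(rep(1:k, length.out = n))
  pred <- numeric(n)
  for (i in 1:k) {
    test <- which(fold == i)
    fit <- rlm(formula, data = data[-test, ])
    pred[test] <- predict(fit, newdata = data[test, ])
  }
  ## report a robust summary alongside the usual squared-error one
  resp <- model.response(model.frame(formula, data))
  c(MSE = mean((resp - pred)^2), MAE = median(abs(resp - pred)))
}

set.seed(1)
out <- cv_rlm(stack.loss ~ ., data = stackloss)
out
```

Since the fit is robust, a robust loss (here the median absolute error) is arguably the more natural cross-validation criterion than MSE.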
Re: [R] Cross validation for Ordinary Kriging
Pearl,

The error suggests that there is something wrong with x2, and that there is a difference between the row names of the coordinates and the data. If you call str(x2), see whether the first element of @coords is different from NULL, as this can cause some problems when cross-validating. If it is, try to figure out why. You can also set the row.names equal to NULL directly:

row.names(x...@coords) = NULL

although I don't think such manipulation of the slots of an object is usually recommended.

Cheers, Jon

BTW, you will usually get more responses to questions about spatial data handling on the r-sig-geo list (https://stat.ethz.ch/mailman/listinfo/r-sig-geo)

On 1/6/2011 4:00 PM, pearl may dela cruz wrote:
> Dear ALL, The last part of my thesis analysis is the cross validation. Right now I am having difficulty using the cross validation of gstat. [...]
Re: [R] Cross validation for Ordinary Kriging
On 1/7/2011 12:40 PM, Jon Olav Skoien wrote:
> The error suggests that there is something wrong with x2, and that there is a difference between the row names of the coordinates and the data. [...] You can also set the row.names equal to NULL directly: row.names(x...@coords) = NULL [...]

Pearl,

It seems the problem was caused by a recent change in sp without a corresponding update of gstat; the maintainer has fixed it and submitted a new version of gstat to CRAN. So you should be able to use your original script after downloading the new version, probably available in a couple of days. In the meantime the suggestion above should still work.

Cheers, Jon
[R] Cross validation for Ordinary Kriging
Dear ALL,

The last part of my thesis analysis is the cross validation. Right now I am having difficulty using the cross validation of gstat. Below are my commands, with tsport_ace as the variable:

nfold <- 3
part <- sample(1:nfold, 69, replace = TRUE)
sel <- (part != 1)
m.model <- x2[sel, ]
m.valid <- x2[!sel, ]
t <- fit.variogram(v, vgm(0.0437, "Exp", 26, 0))
cv69 <- krige.cv(tsport_ace ~ 1, x2, t, nfold = nrow(x2))

The last line gives an error saying:

Error in SpatialPointsDataFrame(coordinates(data), data.frame(matrix(as.numeric(NA), : row.names of data and coords do not match

I don't know what is wrong. The x2 data is a SpatialPointsDataFrame, which is why I did not specify the location (it will be taken from the data). Here is the usage of the function krige.cv:

krige.cv(formula, locations, data, model = NULL, beta = NULL, nmax = Inf, nmin = 0, maxdist = Inf, nfold = nrow(data), verbose = TRUE, ...)

I hope you can help me on this. Thanks a lot.

Best regards, Pearl
Re: [R] cross validation using e1071:SVM
Thank you so much for your help. If I am not wrong, createDataPartition can be used to create stratified random splits of a data set. Is there another way to do that? Thank you
[R] cross validation using e1071:SVM
Hi everyone,

I am trying to do cross-validation (10-fold CV) using the e1071::svm method. I know that there is an option ("cross") for cross-validation, but I still wanted to make a function that generates cross-validation indices using pls::cvsegments. The code (at the end) is working fine, but sometimes caret::confusionMatrix gives the following error:

stat_result <- confusionMatrix(pred_true1, species_test)
Error in confusionMatrix.default(pred_true1, species_test) :
  The data and reference factors must have the same number of levels

My data: total number = 260, classes = 6.

Sorry if I missed some previous discussion about this problem. It would be nice if anyone could explain or point out the mistake I am making in the following code. Is there another way to do this? I want to check my results based on the Accuracy and Kappa values generated by caret::confusionMatrix.

## Code
x <- NULL
index <- cvsegments(nrow(data), 10)
for (i in 1:length(index)) {
  x <- matrix(index[i])
  testset <- data[x[[1]], ]
  trainset <- data[-x[[1]], ]
  species <- as.factor(trainset[, ncol(trainset)])
  train1 <- trainset[, -ncol(trainset)]
  train1 <- train1[, -(1)]
  test_t <- testset[, -ncol(testset)]
  species_test <- as.factor(testset[, ncol(testset)])
  test_t <- test_t[, -(1)]
  model_true1 <- svm(train1, species)
  pred_true1 <- predict(model_true1, test_t)
  stat_result <- confusionMatrix(pred_true1, species_test)
  stat_true[[i]] <- as.matrix(stat_result, what = "overall")
  kappa_true[i] <- stat_true[[i]][2, 1]
  accuracy_true[i] <- stat_true[[i]][1, 1]
}
Re: [R] cross validation using e1071:SVM
Hi everyone, can you help me to plot Gamma(x/h+1) and Beta(x/h+1, (1-x)/h+1)? I want to write x <- seq(0, 3, 0.1). Thanks.

2010/11/23 Neeti nikkiha...@gmail.com wrote:
> Hi everyone, I am trying to do cross validation (10 fold CV) by using the e1071::svm method. I know that there is an option ("cross") for cross validation but still I wanted to make a function to generate cross-validation indices using pls::cvsegments. [...]
--
Francial Giscard LIBENGUE
PhD student in Applied Mathematics; specialization: Statistics
Université de Franche-Comté - UFR Sciences et Techniques
Laboratoire de Mathématiques de Besançon UMR 6623 CNRS
16, route de Gray - 25030 Besançon cedex, France.
Tel. +333.81.66.63.98; Fax +33 381 666 623; Office B 328.
Re: [R] cross validation using e1071:SVM
@Francial Giscard LIBENGUE: please post your query again as a new message with a different subject.
Re: [R] cross validation using e1071:SVM
Could anyone help me with my last problem? If the question is not clear, please let me know. Thank you.

> Hi everyone, I am trying to do cross validation (10 fold CV) by using the e1071::svm method. I know that there is an option ("cross") for cross validation but still I wanted to make a function to generate cross-validation indices using pls::cvsegments. The code is working fine but sometimes caret::confusionMatrix gives the error "The data and reference factors must have the same number of levels". [...]
Re: [R] cross validation using e1071:SVM
Neeti,

I'm pretty sure that the error is related to the confusionMatrix call, which is in the caret package, not e1071. The error message is pretty clear: you need to pass in two factor objects that have the same levels. You can check by running the commands:

str(pred_true1)
str(species_test)

Also, caret can do the resampling for you instead of you writing the loop yourself.

Max
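The mismatch typically arises when a fold's test set happens to be missing some classes, so `as.factor()` on that subset produces fewer levels than the predictions carry. Building both factors over the full level set avoids it. A toy sketch (the labels and predictions below are stand-ins for the poster's species_test and pred_true1):

```r
## Sketch: forcing predictions and reference onto the same level set
## before calling caret::confusionMatrix.
library(caret)

all_levels <- c("a", "b", "c")

## "c" is absent from this fold's test set, but kept as a level anyway
truth <- factor(c("a", "a", "b"), levels = all_levels)
preds <- factor(c("a", "b", "b"), levels = all_levels)

cm <- confusionMatrix(preds, truth)
cm$overall[c("Accuracy", "Kappa")]
```

In the poster's loop, the equivalent fix is to compute the level set once from the whole data (before splitting) and pass it as the `levels` argument when constructing species_test, and to the factor wrapping of the predictions.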
Re: [R] cross-validation for choosing regression trees
Forgive me if I misunderstand your goals, but I have no idea what you are trying to determine or what your data is. I can say, however, that setting mindev to 0 has always overfit data for me, and that you are more than likely looking at a situation in which that 1-node tree is more accurate. Also, if you look at ?cv.tree, the default function to use is prune.tree(). Perhaps prune.tree() is trimming down to that terminal node? If you want alternative CART methods that may account for some of your issues, I would recommend the packages 'rpart' and 'party', as they may be more informative.

--
Jonathan P. Daily
Technician - USGS Leetown Science Center
11649 Leetown Road
Kearneysville WV, 25430
(304) 724-4480
"Is the room still a room when its empty? Does the room, the thing itself have purpose? Or do we, what's the word... imbue it." - Jubal Early, Firefly

From: Shiyao Liu lsy...@iastate.edu
To: r-help@r-project.org
Date: 11/03/2010 09:04 PM
Subject: [R] cross-validation for choosing regression trees

> Dear All, We came across a problem when using the tree package to analyze our data set. First, in the tree function, if we use the default value mindev=0.01, the resulting regression tree has a single node. So, we set mindev=0, and obtain a tree with 931 terminal nodes. [...]
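For reference, the grow-large-then-cross-validate-then-prune workflow with the tree package runs cleanly when the grown tree has more than one node. A sketch on the cpus data from MASS (the data set and the small mindev are illustrative, not the poster's):

```r
## Sketch: cv.tree + prune.tree workflow with the tree package.
library(tree)

data(cpus, package = "MASS")
fit <- tree(log10(perf) ~ syct + mmin + mmax + cach + chmin + chmax,
            data = cpus, mindev = 0.001)   # deliberately large tree

set.seed(1)
cv <- cv.tree(fit)                 # 10-fold CV, default FUN = prune.tree
best <- cv$size[which.min(cv$dev)] # size minimizing CV deviance
pruned <- prune.tree(fit, best = best)
```

If the initial fit already has a single node (as with the poster's mindev=0.01), cv.tree has nothing to prune and fails with exactly the "can not prune singlenode tree" error quoted below.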
[R] cross-validation for choosing regression trees
Dear All,

We came across a problem when using the tree package to analyze our data set. First, in the tree function, if we use the default value mindev=0.01, the resulting regression tree has a single node. So we set mindev=0 and obtain a tree with 931 terminal nodes. However, when we then use the cv.tree function to run a 10-fold cross-validation, the error message is:

Error in prune.tree(list(frame = list(var = 1L, n = 6676, dev = 3.28220789569792, : can not prune singlenode tree.

Is the cv.tree function respecting the mindev chosen in the tree function, or what else might be wrong?

Thanks, Shiyao
Re: [R] cross validation of SVM
From ?svm:

cross: if an integer value k > 0 is specified, a k-fold cross validation on the training data is performed to assess the quality of the model: the accuracy rate for classification and the Mean Squared Error for regression

Uwe Ligges

On 15.06.2010 23:14, Amy Hessen wrote:
> hi, could you please tell me what kind of cross validation that SVM of e1071 uses? Cheers, Amy
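In practice the built-in option looks like this; the per-fold results are stored on the fitted object (the iris data and fold count are illustrative):

```r
## Sketch: built-in k-fold cross-validation in e1071::svm.
library(e1071)

set.seed(1)
fit <- svm(Species ~ ., data = iris, cross = 10)

fit$accuracies    # per-fold classification accuracy (length 10)
fit$tot.accuracy  # overall cross-validated accuracy
```

For a regression response the same call populates `fit$MSE` and `fit$tot.MSE` instead.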
[R] cross validation of SVM
Hi, could you please tell me what kind of cross-validation the SVM of e1071 uses? Cheers, Amy
[R] cross-validation
Hi, I want to do leave-one-out cross-validation for multinomial logistic regression in R. I did the multinomial logistic regression with the nnet package. How do I do the validation, and with which function? The response variable has 7 levels. Please help me. Thanks a lot, Azam
Re: [R] cross-validation
As far as my knowledge goes, nnet doesn't have a built-in function for cross-validation. Coding it yourself is not hard though. nnet is used in this book: http://www.stats.ox.ac.uk/pub/MASS4/ , which contains enough examples on how to do so. See also the crossval function in the bootstrap package: http://sekhon.berkeley.edu/library/bootstrap/html/crossval.html

Cheers, Joris

On Tue, Jun 8, 2010 at 11:34 AM, azam jaafari azamjaaf...@yahoo.com wrote:
> Hi, I want to do leave-one-out cross-validation for multinomial logistic regression in R. I did multinomial logistic regression with the nnet package. How do I do the validation, and with which function? The response variable has 7 levels. [...]

--
Joris Meys
Statistical consultant
Ghent University, Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
joris.m...@ugent.be
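Coding the leave-one-out loop by hand is indeed short. A sketch with nnet::multinom, using iris as a stand-in for the poster's 7-level outcome (the principle is identical for any number of levels):

```r
## Sketch: leave-one-out cross-validation for nnet::multinom.
library(nnet)

n <- nrow(iris)
pred <- character(n)
for (i in 1:n) {
  ## refit on all but observation i, predict observation i
  fit <- multinom(Species ~ ., data = iris[-i, ], trace = FALSE)
  pred[i] <- as.character(predict(fit, newdata = iris[i, ]))
}
acc <- mean(pred == as.character(iris$Species))  # LOO accuracy estimate
acc
```

With n refits this is the most expensive form of CV; for a larger data set, k-fold (the same loop over fold indices instead of single rows) is the usual compromise.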
Re: [R] cross-validation
Install the caret package and see ?train. There is also:

http://cran.r-project.org/web/packages/caret/vignettes/caretTrain.pdf
http://www.jstatsoft.org/v28/i05/paper

Max

On Tue, Jun 8, 2010 at 5:34 AM, azam jaafari azamjaaf...@yahoo.com wrote:
> Hi, I want to do leave-one-out cross-validation for multinomial logistic regression in R. [...]
[R] Cross-validation for parameter selection (glm/logit)
If my aim is to select a good subset of parameters for my final logit model built using glm(), what is the best way to cross-validate the results so that they are reliable?

Let's say that I have a large dataset of thousands of observations. I split this data into two groups, one that I use for training and another for validation. First I use the training set to build a model, and then stepAIC() with a forward-backward search. BUT, if I base my parameter selection purely on this result, I suppose it will be somewhat skewed due to the one-time data split (I use only one training dataset). What is the correct way to perform this variable selection, and are there readily available packages for this?

Similarly, when I have my final parameter set, how should I go about making the final assessment of the model's predictive ability? CV? Which package?

Thank you in advance, Jay
Re: [R] Cross-validation for parameter selection (glm/logit)
Jay,

Unless I have misunderstood some statistical subtleties, you can use the AIC in place of actual cross-validation, as the AIC is asymptotically equivalent to leave-one-out cross-validation under maximum likelihood estimation.

Joe

Stone, M. "An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion." Journal of the Royal Statistical Society, Series B (Methodological), 1977, 39, 44-47. Abstract: A logarithmic assessment of the performance of a predicting density is found to lead to asymptotic equivalence of choice of model by cross-validation and Akaike's criterion, when maximum likelihood estimation is used within each model.

Jay josip.2...@gmail.com wrote:
> If my aim is to select a good subset of parameters for my final logit model built using glm(), what is the best way to cross-validate the results so that they are reliable? [...]
Re: [R] Cross-validation for parameter selection (glm/logit)
Hi,

On Fri, Apr 2, 2010 at 9:14 AM, Jay josip.2...@gmail.com wrote:
> If my aim is to select a good subset of parameters for my final logit model built using glm(), what is the best way to cross-validate the results so that they are reliable? [...]

Another approach would be to use penalized regression models. The glmnet package has lasso and elastic-net models for both logistic and normal regression. Intuitively: in addition to minimizing (say) the squared loss, the model has to pay some cost (lambda) for including a non-zero parameter, which in turn yields sparse models. You can use CV to fine-tune the value of lambda. If you're not familiar with these penalized models, the glmnet package has a few references to get you started.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
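The CV tuning of lambda is built in via cv.glmnet. A minimal sketch for the logistic case (the data here is simulated; alpha = 1 selects the lasso penalty):

```r
## Sketch: cross-validated lambda selection for penalized logistic regression.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(1000 * 20), ncol = 20)
y <- rbinom(1000, 1, plogis(x[, 1] - 2 * x[, 2]))

cvfit <- cv.glmnet(x, y, family = "binomial",
                   type.measure = "auc", alpha = 1)

plot(cvfit)                    # CV curve over log(lambda)
coef(cvfit, s = "lambda.1se")  # sparse coefficients at the chosen lambda
```

Because the penalty does the variable selection and CV picks its strength, this sidesteps the instability of a single train/validation split combined with stepwise search.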
Re: [R] Cross-validation for parameter selection (glm/logit)
Inline below:

Bert Gunter
Genentech Nonclinical Statistics

-----Original Message-----
From: Steve Lianoglou, on behalf of the thread started by Jay
Subject: Re: [R] Cross-validation for parameter selection (glm/logit)

> If my aim is to select a good subset of parameters for my final logit model built using glm().

-- Define "good".

> What is the best way to cross-validate the

-- Define "best".

> results so that they are reliable?

-- Define "reliable".

Answers depend on what you mean by these terms. I suggest you consult a statistician to work with you. These are huge issues for which you would profit by some guidance.

Cheers, Bert
Re: [R] cross-validation in plsr package
Peter Tillmann peter.tillm...@t-online.de writes:

> Can anyone give an example how to use cross-validation in the plsr package?

There are examples in the references cited on http://mevik.net/work/software/pls.html

> I am unable to find the number of factors proposed by cross-validation as optimum.

The cross-validation in the pls package does not propose a number of factors as the optimum; you have to select this yourself. (The reason for this is that there is AFAIK no theoretically founded and widely accepted way of doing this automatically. I'd be happy to learn otherwise.)

--
Regards, Bjørn-Helge Mevik
Re: [R] cross-validation in plsr package
Dear Bjørn-Helge,

> > Can anyone give an example how to use cross-validation in the plsr package?
> There are examples in the references cited on http://mevik.net/work/software/pls.html
> > I am unable to find the number of factors proposed by cross-validation as optimum.
> The cross-validation in the pls package does not propose a number of factors as the optimum; you have to select this yourself. (The reason for this is that there is AFAIK no theoretically founded and widely accepted way of doing this automatically. I'd be happy to learn otherwise.)

Thank you very much. In NIRS we use CV to determine the number of factors in PLS, which is why I was hoping for a suggestion from the CV. But of course we are just users, not statisticians, when it comes to PLS.

Regards, Peter

* Espenauer Str. 28, D-34246 Vellmar, Deutschland
Re: [R] cross-validation in plsr package
> The cross-validation in the pls package does not propose a number of factors as the optimum; you have to select this yourself. (The reason for this is that there is AFAIK no theoretically founded and widely accepted way of doing this automatically. I'd be happy to learn otherwise.)

The caret package has a wrapper for pls and multiple resampling methods (CV, bootstrap, repeated test/train splits, etc.). There are a few modules that can be used for automatically determining the optimal number of components. I agree that there is no uniformly best technique. The only thing that I know of that is widely accepted is the one-standard-error rule in CART. In this case, that would mean that you find the value of ncomp with the smallest error and choose the final ncomp as the smallest value whose error is within one standard error of that optimum. caret can do this, or use any other rule that you think is appropriate.

Thanks, Max
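A sketch of the caret approach described above, assuming caret's `selectionFunction = "oneSE"` option and simulated data (none of this is from the original thread):

```r
# Not from the original post: tuning ncomp for PLS with caret, applying the
# one-standard-error rule.
library(caret)

set.seed(42)
X <- matrix(rnorm(100 * 20), 100, 20)
y <- as.numeric(X %*% rnorm(20) + rnorm(100))

fit <- train(x = X, y = y,
             method = "pls",
             tuneLength = 10,    # try ncomp = 1..10
             trControl = trainControl(method = "cv", number = 10,
                                      selectionFunction = "oneSE"))
fit$bestTune  # smallest ncomp within one SE of the best CV error
```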
[R] cross-validation in plsr package
Dear readers,

Can anyone give an example how to use cross-validation in the plsr package? I am unable to find the number of factors proposed by cross-validation as the optimum.

Thank you, Peter
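The pls package's built-in cross-validation can be invoked like this (a minimal sketch on simulated data, not from the original thread); as the replies in this thread note, the package reports CV error per number of components but does not pick one for you:

```r
# Not from the original post: minimal cross-validated PLS fit with the pls
# package.
library(pls)

set.seed(1)
X <- matrix(rnorm(60 * 15), 60, 15)
y <- as.numeric(X %*% rnorm(15) + rnorm(60))
d <- data.frame(y = y, X = I(X))  # keep X as a matrix column

fit <- plsr(y ~ X, ncomp = 10, data = d, validation = "CV")  # 10-fold CV
RMSEP(fit)        # CV estimate of RMSEP for 0..10 components
plot(RMSEP(fit))  # inspect the curve and choose ncomp yourself
```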
[R] cross validation function translated from stata
Hi, everyone:

I am asking for help translating a Stata program into R. The program performs cross-validation, as follows:

#1. Randomly divide the data set into 10 sets of equal size, ensuring equal numbers of events in each set.
#2. Fit the model leaving out the 1st set.
#3. Apply the fitted model in (2) to the 1st set to obtain the predicted probability of a prostate cancer diagnosis.
#4. Repeat steps (2) to (3) leaving out and then applying the fitted model to the ith group, i = 2, 3, ... 10. Every subject now has a predicted probability of a prostate cancer diagnosis.
#5. Using the predicted probabilities, compute the net benefit at various threshold probabilities.
#6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each threshold probability is the mean across the 200 replications.

=
First is the Stata code.

forvalues i=1(1)200 {
  local event=cancer
  local predictors1 = total_psa
  local predictors2 = total_psa free_psa
  local prediction1 = base
  local prediction2 = full
  g `prediction1'=.
  g `prediction2'=.
  quietly g u = uniform()
  sort `event' u
  g set = mod(_n, 10) + 1
  forvalues j=1(1)10{
    quietly logit `event' `predictors1' if set~=`j'
    quietly predict ptemp if set==`j'
    quietly replace `prediction1' = ptemp if set==`j'
    drop ptemp
    quietly logit `event' `predictors2' if set~=`j'
    quietly predict ptemp if set==`j'
    quietly replace `prediction2' = ptemp if set==`j'
    drop ptemp
  }
  tempfile dca`i'
  quietly dca `event' `prediction1' `prediction2', graphoff saving(`dca`i'')
  drop u set `prediction1' `prediction2'
}
use `dca1', clear
forvalues i=2(1)200 {
  append using `dca`i''
}
collapse all none modelp1 modelp2, by(threshold)
save "cross validation dca output.dta", replace
twoway (line none all modelp1 modelp2 threshold, sort)

=
Here is my draft of R code. cMain is my dataset.
predca <- rep(0, 200 * 200)
dim(predca) <- c(200, 200)
for (i in 1:200) {
  cvgroup <- rep(1:10, length = 110)
  cvgroup <- sample(cvgroup)
  cvpre <- rep(0, length = 110)
  cvMain <- cbind(cMain, cvgroup, cvpre)
  for (j in 1:10) {
    cvdev <- cvMain[cvMain$cvgroup != j, ]
    cvval <- cvMain[cvMain$cvgroup == j, ]
    cvfit <- lrm(Y ~ X, data = cvdev, x = TRUE, y = TRUE)
    cvprej <- predict(cvfit, cvval, type = "fitted")
    # put the fitted values back into the dataset
    cvMain[cvMain$cvgroup == j, ]$cvpre <- cvprej
  }
  cvdcaop <- dca(cvMain$Y, cvMain$cvpre, prob = "Y")
  cvnb <- 100 * (cvdcaop[, 1] - cvdcaop[, 2])
  cvtpthres <- cvdcaop[, 4] / (100 - cvdcaop[, 4])
  cvnr <- cvnb / cvtpthres
  predca[i, 1:99] <- cvnb
  predca[i, 101:199] <- cvnr
}

=
My questions are:

1. How do I ensure equal numbers of events in each set in R?
2. Part of the Stata code is:

forvalues j=1(1)10{
  quietly logit `event' `predictors1' if set~=`j'
  quietly predict ptemp if set==`j'
  quietly replace `prediction1' = ptemp if set==`j'
  drop ptemp
  quietly logit `event' `predictors2' if set~=`j'
  quietly predict ptemp if set==`j'
  quietly replace `prediction2' = ptemp if set==`j'
  drop ptemp
}

I don't understand the difference between prediction1 and prediction2.
3. Is my code right?

Thanks!

Yao Zhu
Department of Urology, Fudan University Shanghai Cancer Center, Shanghai, China
Re: [R] cross validation function translated from stata
Hi,

On Thu, Jan 21, 2010 at 8:55 AM, zhu yao mailzhu...@gmail.com wrote:
> Hi, everyone: I ask for help about translating a stata program into R.
> The program perform cross validation as it stated. [...]
> My questions are
> 1. How to ensure equal numbers of events in each set in R?

I just wanted to point you to the createFolds and createDataPartition functions in the caret package ... they try to do something similar, so perhaps you can see how others have tried to solve this problem: http://cran.r-project.org/web/packages/caret/index.html

For example, from their help page: "For other data splitting, the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits. For numeric y, the sample is split into groups sections based on quantiles and sampling is done within these subgroups. Also, for very small class sizes (<= 3) the classes may not show up in both the training and test data."

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
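A minimal sketch of the createFolds suggestion (simulated data, not from the original thread):

```r
# Not from the original post: createFolds() stratifies folds by the outcome,
# which keeps the number of events roughly equal across folds (question 1).
library(caret)

set.seed(1)
y <- factor(rep(c("event", "no_event"), times = c(30, 80)))
folds <- createFolds(y, k = 10)  # list of 10 test-set index vectors

# events per fold -- roughly 3 in each
sapply(folds, function(idx) sum(y[idx] == "event"))
```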
Re: [R] cross validation function translated from stata
Take a look at the validate.lrm function in the rms package. Note that the use of threshold probabilities results in an improper scoring rule which will mislead you. Also note that you need to repeat 10-fold CV 50-100 times for precision, and that at each repeat you have to start from zero in analyzing associations.

Frank

zhu yao wrote:
> Hi, everyone: I ask for help about translating a stata program into R.
> The program perform cross validation as it stated. [...]
> Yao Zhu
> Department of Urology, Fudan University Shanghai Cancer Center, Shanghai, China

--
Frank E Harrell Jr
Professor and Chairman, Department of Biostatistics
School of Medicine, Vanderbilt University
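A sketch of the validate.lrm suggestion above, on simulated data (the formula and fold count are illustrative, not from the thread):

```r
# Not from the original post: repeated cross-validation of an lrm() fit via
# rms::validate().
library(rms)

set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x1))

fit <- lrm(y ~ x1 + x2, data = d, x = TRUE, y = TRUE)

# 10-fold cross-validation; B is the number of folds for this method.
# Repeat and average (e.g. 50-100 times) for precision, as advised above.
val <- validate(fit, method = "crossvalidation", B = 10)
val  # optimism-corrected indexes (Dxy, R2, calibration slope, ...)
```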
Re: [R] cross validation function translated from stata
Thanks Frank and Steve. I rewrote the R code as follows.

# m is the number of folds to split the sample; n is the number of cross-validation repeats
library(caret)
calcvnb <- function(formula, dat, m, n) {
  cvnb <- rep(0, 200 * 100)
  dim(cvnb) <- c(200, 100)
  for (i in 1:n) {
    group <- rep(0, length = 110)
    sg <- createFolds(dat$LN, k = m)
    for (k in 1:m) {
      group[sg[[k]]] <- k
    }
    pre <- rep(0, length = 110)
    data1 <- cbind(dat, group, pre)
    for (j in 1:m) {
      dev <- data1[data1$group != j, ]
      val <- data1[data1$group == j, ]
      fit <- lrm(formula, data = dev, x = TRUE, y = TRUE)
      pre1 <- predict(fit, val, type = "fitted")
      data1[data1$group == j, ]$pre <- pre1
    }
    dcaop <- dca(data1$LN, data1$pre, prob = "Y")
    nb <- 100 * (dcaop[, 1] - dcaop[, 2])
    cvnb[i, 1:99] <- nb
  }
  mcvnb <- colMeans(cvnb)
  return(mcvnb)
}

# apply the function in the main code
optnb1 <- calcvnb(formula = LN ~ factor(MTSTAGE) + factor(GRADE) + LVINVAS + P53, dat = cMain, m = 10, n = 200)

However, applied to my data, an error occurred after several loops:

Error in `contrasts<-`(`*tmp*`, value = contr.treatment) :
  contrasts can be applied only to factors with 2 or more levels

What's wrong with my code and how do I handle it?

Yao Zhu
Department of Urology, Fudan University Shanghai Cancer Center, Shanghai, China

2010/1/21 zhu yao mailzhu...@gmail.com
> Hi, everyone: I ask for help about translating a stata program into R. The program perform cross validation as it stated.
> #1. Randomly divide the data set into 10 sets of equal size, ensuring equal numbers of events in each set
> #2. Fit the model leaving out the 1st set
> #3. Apply the fitted model in (2) to the 1st set to obtain the predicted probability of a prostate cancer diagnosis.
> #4. Repeat steps (2) to (3) leaving out and then applying the fitted model to the ith group, i = 2, 3... 10. Every subject now has a predicted probability of a prostate cancer diagnosis.
> #5. Using the predicted probabilities, compute the net benefit at various threshold probabilities.
> #6. Repeat steps (1) to (5) 200 times.
> The corrected net benefit for each threshold probability is the mean across the 200 replications. [...]
>
> Yao Zhu
> Department of Urology, Fudan University Shanghai Cancer Center, Shanghai, China
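For archive readers, the "contrasts can be applied only to factors with 2 or more levels" error above is what base R produces when a factor predictor has only one observed level in a fitting subset; because the formula wraps predictors in factor(), the levels are recomputed within each training fold, so a fold that loses a rare level entirely can trigger it. A minimal reproduction (simulated data, not from the thread):

```r
# Not from the original post: reproducing the reported error. A CV training
# fold that contains only one level of a factor predictor cannot build
# contrasts for it.
set.seed(1)
d <- data.frame(y = rbinom(10, 1, 0.5),
                g = c(rep("A", 9), "B"))  # rare level "B"

train <- d[d$g == "A", ]  # a fold that misses level "B" entirely
res <- try(glm(y ~ factor(g), family = binomial, data = train),
           silent = TRUE)
# fails: contrasts can be applied only to factors with 2 or more levels.
# Stratifying folds on rare factor levels, or collapsing rare levels
# before cross-validation, avoids this.
```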
Re: [R] cross validation for species distribution
Dear Max,

Thanks for the warm help on New Year's Eve.

Cross-validation is used to validate the predictive quality of the training data against testing data. As for the amount, the cross-validation (CV) is supposed to be k-fold: k-1 parts for training and 1 for testing, repeated k times. Is it the same with the functions inside the caret, ipred, and e1071 packages?

Elaine

On Fri, Jan 1, 2010 at 4:02 AM, Max Kuhn mxk...@gmail.com wrote:
> You might want to be more specific about what you (exactly) intend to do. Reading the posting guide might help you get better answers. There are a few packages and functions to do what (I think) you desire. There is the train function in the caret package, the errorest function in ipred and a few in e1071.
> Max [...]
Re: [R] cross validation for species distribution
Elaine,

That's a fair answer, but completely not what I meant. I was hoping that you would elaborate on "the species data of species distribution models": what types of inputs and output for this particular modeling application, etc.

> Is it the same with the function inside caret, ipred, and e1071 package?

Yes, and there are other resampling options besides k-fold CV. For caret, you might start with this paper: www.jstatsoft.org/v28/i05/ That should tell you most of what you need to know.

Max
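A sketch of the train/trainControl interface referred to above, showing a few of caret's resampling options on simulated presence/absence data (illustrative only, not Elaine's species data):

```r
# Not from the original post: caret resampling options for a simple
# presence/absence classifier.
library(caret)

set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$present <- factor(ifelse(d$x1 + rnorm(100) > 0, "yes", "no"))

ctrl_cv    <- trainControl(method = "cv", number = 10)    # 10-fold CV
ctrl_boot  <- trainControl(method = "boot", number = 25)  # bootstrap
ctrl_split <- trainControl(method = "LGOCV", p = 0.8)     # repeated splits

fit <- train(present ~ x1 + x2, data = d, method = "glm",
             family = binomial, trControl = ctrl_cv)
fit$results  # resampled accuracy/kappa estimates
```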
Re: [R] cross validation for species distribution
You might want to be more specific about what you (exactly) intend to do. Reading the posting guide might help you get better answers. There are a few packages and functions to do what (I think) you desire. There is the train function in the caret package, the errorest function in ipred, and a few in e1071.

Max

On Dec 31, 2009, at 12:13 AM, elaine kuo elaine.kuo...@gmail.com wrote:
> Dear, I want to cross-validate the species data of species distribution models. Please kindly suggest any package containing cross-validation suiting the purpose. Thank you. Elaine
[R] cross validation for species distribution
Dear all,

I want to perform cross-validation for the species data of species distribution models. Please kindly suggest any package containing cross-validation suited to this purpose.

Thank you, Elaine
[R] cross validation/GAM/package Daim
Dear r-helpers,

I estimated a generalized additive model (GAM) using Hastie's package gam. Example:

gam1 <- gam(vegetation ~ s(slope), family = binomial, data = aufnahmen_0708, trace = TRUE)
pred <- predict(gam1, type = "response")

vegetation is a categorical variable, slope a numerical one. Now I want to assess the accuracy of the model using k-fold cross-validation. I found the package Daim, with function Daim for estimation of prediction error based on cross-validation (CV) or various bootstrap techniques, but I am not able to run it properly. I tried the following 3 versions:

1. accuracy <- Daim(vegetation ~ s(slope), model = gam1, data = aufnahmen_0708, labpos = "alpine mats")
   -- error: could not find function "model"
2. accuracy <- Daim(vegetation ~ s(slope), model = gam, data = aufnahmen_0708, labpos = "alpine mats")
   -- error in model(formula, train, test): 'family' not recognized
3. accuracy <- Daim(vegetation ~ s(slope), model = gam(family = binomial), data = aufnahmen_0708, labpos = "alpine mats")
   -- error in environment(formula): element 1 is empty; the part of the argument list '.Internal' that was evaluated was: (fun)

Can anybody help me? Any advice is greatly appreciated!

Thanks, Kim
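Judging from the second error message, Daim appears to call model(formula, train, test), i.e. it expects model to be a function that fits on the training fold and returns predicted probabilities for the test fold. A sketch under that assumption, on simulated stand-ins for the poster's data (untested against Daim; the data, formula, and label are placeholders):

```r
# Not from the original post: an assumed Daim model-function wrapper,
# inferred from the error "error in model(formula, train, test)".
library(gam)
library(Daim)

set.seed(1)
d <- data.frame(slope = runif(100, 0, 60))
d$vegetation <- factor(ifelse(plogis((d$slope - 30) / 10) > runif(100),
                              "alpine mats", "other"))

# fit the GAM on the training fold, return test-fold probabilities
fit_gam <- function(formula, train, test) {
  fit <- gam(formula, family = binomial, data = train)
  predict(fit, newdata = test, type = "response")
}

accuracy <- Daim(vegetation ~ s(slope), model = fit_gam,
                 data = d, labpos = "alpine mats")
```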
[R] Cross-Validation for Zero-Inflated Models
Hi all,

I have developed a zero-inflated negative binomial model using the zeroinfl function from the pscl package. I have carried out model selection based on AIC and used likelihood ratio tests (lrtest from the lmtest package) to compare the nested models. [My end model contains 2 factors and 4 continuous variables in the count model, plus one continuous variable in the zero-inflated portion.]

For model assessment I would like to carry out some form of internal cross-validation, along the lines of leave-one-out CV etc., to gauge the predictive ability of my final model. I am wondering if there is any technique within R for doing this with zero-inflated models/negative binomial models. N.B. my data set is not large enough to split at the start and fit the model to only a subset.

I am using R 2.8.1.

Many thanks in advance, Lara
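A hand-rolled leave-one-out loop is one way to do this when no packaged CV routine supports the model class. A sketch on simulated data (the formula and error measure are placeholders, not Lara's actual model):

```r
# Not from the original post: leave-one-out CV for a zero-inflated negative
# binomial model, refitting n times and predicting the held-out observation.
library(pscl)

set.seed(1)
n <- 80
d <- data.frame(x = rnorm(n), z = rnorm(n))
mu <- exp(0.5 + 0.6 * d$x)
d$y <- ifelse(runif(n) < 0.3, 0, rnbinom(n, size = 1, mu = mu))

pred <- numeric(n)
for (i in seq_len(n)) {
  fit <- zeroinfl(y ~ x | z, data = d[-i, ], dist = "negbin")
  pred[i] <- predict(fit, newdata = d[i, , drop = FALSE], type = "response")
}
mean((d$y - pred)^2)  # cross-validated mean squared prediction error
```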
[R] cross-validation
I have reviewed all the scripts that appear at http://cran.es.r-project.org/ and I can't find any suitable for cross-validation with a model of the form y = a*X^b * exp(c*Z). Please, can someone help me? Thanks a lot!
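For archive readers: one way to cross-validate such a model is to fit it with nls() inside a k-fold loop. A sketch on simulated data (the coefficients, starting values, and loss are illustrative assumptions):

```r
# Not from the original post: k-fold CV for the stated model
# y = a * X^b * exp(c * Z), fit by nls().
set.seed(1)
n <- 100
d <- data.frame(X = runif(n, 1, 10), Z = runif(n))
d$y <- 2 * d$X^0.7 * exp(0.5 * d$Z) * exp(rnorm(n, sd = 0.1))

k <- 10
fold <- sample(rep(1:k, length.out = n))
pred <- numeric(n)
for (j in 1:k) {
  fit <- nls(y ~ a * X^b * exp(c * Z), data = d[fold != j, ],
             start = list(a = 1, b = 1, c = 0))
  pred[fold == j] <- predict(fit, newdata = d[fold == j, ])
}
mean((d$y - pred)^2)  # cross-validated MSE
```

Alternatively, since log(y) = log(a) + b*log(X) + c*Z is linear in the parameters, the same scheme works with lm() on the log scale.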
Re: [R] Cross-validation - lift curve
This may be somewhat useful, but I might have more later. http://florence.acadiau.ca/collab/hugh_public/index.php?title=R:CheckBinFit (the code below is copied from the URL above)

CheckBinFit <- function(y, phat, nq = 20, new = TRUE, ...) {
  if (is.factor(y)) y <- as.double(y)
  y <- y - mean(y)
  y[y > 0] <- 1
  y[y <= 0] <- 0
  quants <- quantile(phat, probs = (1:nq) / (nq + 1))
  names(quants) <- NULL
  quants <- c(0, quants, 1)
  phatD <- rep(0, nq + 1)
  phatF <- rep(0, nq + 1)
  for (i in 1:(nq + 1)) {
    which <- ((phat <= quants[i + 1]) & (phat > quants[i]))
    phatF[i] <- mean(phat[which])
    phatD[i] <- mean(y[which])
  }
  if (new) plot(phatF, phatD, xlab = "phat", ylab = "data",
                main = paste("R^2=", cor(phatF, phatD)^2), ...)
  else points(phatF, phatD, ...)
  abline(0, 1)
  return(invisible(list(phat = phatF, data = phatD)))
}

On Thu, Mar 12, 2009 at 1:30 PM, Eric Siegel e...@predictionimpact.com wrote:
> Hi all, I'd like to do cross-validation on lm and get the resulting lift curve/table (or, alternatively, the estimates on 100% of my data with which I can get lift). If such a thing doesn't exist, could it be derived using cv.lm, or would we need to start from scratch? Thanks!
[R] Cross-validation - lift curve
Hi all,

I'd like to do cross-validation on lm and get the resulting lift curve/table (or, alternatively, the estimates on 100% of my data with which I can get lift). If such a thing doesn't exist, could it be derived using cv.lm, or would we need to start from scratch?

Thanks!

--
Eric Siegel, Ph.D.
President, Prediction Impact, Inc.
Predictive Analytics World Conference
More info: www.predictiveanalyticsworld.com
LinkedIn Group: www.linkedin.com/e/gis/1005097
[R] Cross-validation question
Hello everyone,

I have a data set that looks like the following:

Year  Days to the beginning of Year  Value
1     30                             100
1     60                             200
1     ...                            ...
1     360                            ...
2     30                             ...
2     60                             ...
2     ...                            ...
2     360                            ...
...

Then I used a linear regression to fit Value ~ Days to the beginning of the year with a polynomial. Now I want to use cross-validation to detect over-fitting, but I am not sure whether I should leave out 1/k of the random data points or leave out 1/k of the random years. What do you think?

Thanks, Geoffrey
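For archive readers, the "leave out years" option amounts to grouped cross-validation, which tests how the fitted curve generalizes to an unseen year and is usually the more honest check when observations within a year are correlated. A sketch on simulated data (the polynomial degree and data are illustrative):

```r
# Not from the original post: leave-one-year-out CV for a polynomial fit.
set.seed(1)
d <- expand.grid(year = 1:10, days = seq(30, 360, by = 30))
d$value <- 100 + 50 * sin(d$days / 60) + rnorm(nrow(d), sd = 10)

# hold out all rows of one year at a time
errs <- sapply(unique(d$year), function(yr) {
  fit <- lm(value ~ poly(days, 3), data = d[d$year != yr, ])
  mean((d$value[d$year == yr] - predict(fit, d[d$year == yr, ]))^2)
})
mean(errs)  # cross-validated MSE over held-out years
```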
[R] Cross-validation
Hi,

I was trying to do cross-validation using the crossval function (bootstrap package), with the following code:

-----
theta.fit <- function(x, y) {
  model <- svm(x, y, kernel = "linear")
}
theta.predict <- function(fit, x) {
  prediction <- predict(fit, x)
  return(prediction)
}
x <- matrix(rnorm(5100), 102, 50)
rownames(x) <- paste("a", 1:102, sep = "")
colnames(x) <- paste("b", 1:50, sep = "")
y <- factor(sample(1:2, 102, replace = TRUE))
results <- crossval(x, y, theta.fit, theta.predict)  # LOOCV
-----

I get the following error:

Error in scale(newdata[, object$scaled, drop = FALSE], center = object$x.scale$"scaled:center", :
  (subscript) logical subscript too long

It seems to work all right if I use 10-fold cross-validation (e.g. results <- crossval(x, y, theta.fit, theta.predict, ngroup = 10)), but gives the error for LOOCV. What am I doing wrong? Thanks!

My session info is:

sessionInfo()
R version 2.7.1 (2008-06-23)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rpart_3.1-41 lattice_0.17-8 ROCR_1.0-2 gplots_2.6.0
[5] gdata_2.4.2 gtools_2.5.0 e1071_1.5-18 class_7.2-42
[9] bootstrap_1.0-21

loaded via a namespace (and not attached):
[1] grid_2.7.1 tools_2.7.1
Re: [R] Cross Validation output
Good Day All, I have a negative binomial model that I created using the function glm.nb() from the MASS library, and I am performing a cross-validation using the function cv.glm() from the boot library. I am really interested in determining the performance of this model so I can have confidence (or not) when it might be applied elsewhere.

If I understand the cv.glm() procedure correctly, the default cost function is the average squared error, and by running cv.glm() in a loop many times I understand that I can calculate PRESS (PRedictive Error Sum of Squares) = 1/n * Sum(all PEs) from the default output. When I run a loop 10 times, my PRESS is ~25. I have a few questions:

1) I must now confess my ignorance: how does one interpret my PRESS of 25? Are there some internet resources someone could point me to that would help with the interpretation? I've spent most of yesterday studying up on things but feel like I am chasing my tail. Most of the resources are either so heavy in theory that I can't puzzle them out, or are a couple of paragraphs long and don't have an example with data. Is my PRESS in essence saying that my model performance is ~75%? (I suspect not, but I don't know, thus I ask.)

2) All my observations are spatial in nature, and thus I would like to plot out spatially where the model is performing well and where it is not. This would be somewhat akin to inspecting residuals in OLS. Is there a way to output from cv.glm() the PEs for individual data points?

3) My previous idea was to look at AIC, BIC, McFadden R2 and pseudo-R2 as goodness-of-fit measures for each subset model. It appears that I can modify the cost function of cv.glm(), but I am not too confident in my ability to write the correct cost function. Are there other valid measures of GOF for my negative binomial model that I can substitute into the cost function of cv.glm()? Would anyone care to recommend one (or many)? Thanks in advance for your patience!
-Don

PS - if you've seen my previous posts, I've abandoned my 80/20 split validation scheme.
--
Don Catanzaro, PhD, Landscape Ecologist, [EMAIL PROTECTED], 16144 Sigmond Lane, Lowell, AR 72745, 479-751-3616
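On questions 2) and 3): the cost argument of cv.glm() is just a function of the held-out responses and predictions, so swapping in another loss is a one-liner; but cv.glm() only returns the aggregate delta, so per-observation PEs need an explicit fold loop. A hedged sketch on simulated negative binomial data (the data and all names here are illustrative, not Don's):

```r
library(boot)
library(MASS)

set.seed(1)
# Illustrative negative binomial data
d <- data.frame(x = runif(300))
d$y <- rnbinom(300, mu = exp(1 + 2 * d$x), size = 2)
fit <- glm.nb(y ~ x, data = d)

# Custom cost: mean absolute error instead of the default average squared error
mae <- function(y, yhat) mean(abs(y - yhat))
cv.mae <- cv.glm(d, fit, cost = mae, K = 10)$delta[1]

# Per-observation prediction errors via an explicit fold loop
folds <- sample(rep(1:10, length.out = nrow(d)))
pe <- numeric(nrow(d))
for (f in 1:10) {
  g <- glm.nb(y ~ x, data = d[folds != f, ])
  pe[folds == f] <- d$y[folds == f] -
    predict(g, d[folds == f, ], type = "response")
}
# pe can now be mapped spatially, akin to inspecting residuals in OLS
```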
[R] cross validation for lme
Hello, We would like to perform a cross validation on a linear mixed model (lme) and wonder if anyone has found something analogous to cv.glm for such models? Thanks, Mark
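As far as I know there is no drop-in cv.glm analogue for lme objects, but a grouped cross-validation is short to hand-roll. A sketch using the Orthodont example data from nlme, leaving out whole subjects and scoring population-level (level = 0) predictions:

```r
library(nlme)

data(Orthodont)
subjects <- unique(Orthodont$Subject)
set.seed(1)
folds <- sample(rep(1:5, length.out = length(subjects)))

errs <- sapply(1:5, function(f) {
  hold  <- subjects[folds == f]
  train <- subset(Orthodont, !(Subject %in% hold))
  test  <- subset(Orthodont, Subject %in% hold)
  fit <- lme(distance ~ age, random = ~ 1 | Subject, data = train)
  # level = 0 uses only the fixed effects, which is what applies
  # to subjects the model has never seen
  mean((test$distance - predict(fit, test, level = 0))^2)
})
cv.err <- mean(errs)
```

Leaving out whole groups (rather than rows) respects the dependence structure the random effects were there to model in the first place.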
Re: [R] cross-validation in rpart
-- begin included message --
I'm having a problem with custom functions in rpart, and before I tear my hair out trying to fix it, I want to make sure it's actually a problem. It seems that, when you write custom functions for rpart (init, split and eval), rpart no longer cross-validates the resulting tree to return errors. A simple test is to use the usersplits.R function to get a simple, custom rpart function, and then change fit1 and fit2 so that they both have xvals of 10. The problem is that the cptable for fit1 doesn't have xerror or xstd, despite the fact that cross-validation is set to 10-fold. I guess I just need confirmation that cross-validation doesn't work with custom functions, and if someone could explain to me why that is the case it would be greatly appreciated. Thanks, Sam Stewart
-- end included message --

You are right: cross-validation does not happen automatically with user-written split functions. We can think of cross-validation as having two steps:

1. Get the predicted values for each observation when that observation (or a group) is left out of the data set. There is actually a vector of predicted values, one for each level of model complexity. This step can be done using xpred.rpart, which does work for user-defined splits. It returns a matrix with n rows (one per observation) and one column for each of the target cp values. Call this matrix yhat.

2. Summarize each column of the matrix yhat into a single goodness value. For anova fitting, for instance, this is just colMeans((y - yhat)^2). For classification models it is a bit more complex: we have to add up the expected loss L(y, yhat) for each column using the loss matrix and the priors.

The reason that rpart does not do this second step for a user-written function is that rpart does not know what summary is appropriate. For some splitting rules, e.g. survival data split using a log-rank test, I'm not sure that *I* know what summation is appropriate.
Terry Therneau
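For a standard anova tree the two steps described above read roughly as follows (a sketch; car.test.frame ships with rpart):

```r
library(rpart)

fit <- rpart(Mileage ~ Weight, data = car.test.frame, method = "anova")

# Step 1: cross-validated predictions, one column per target cp value
yhat <- xpred.rpart(fit, xval = 10)

# Step 2: summarize each column into a single goodness value (anova loss)
xerr <- colMeans((car.test.frame$Mileage - yhat)^2)
# xerr is the cross-validated MSE at each complexity level
```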
[R] cross-validation in rpart
Hello list, I'm having a problem with custom functions in rpart, and before I tear my hair out trying to fix it, I want to make sure it's actually a problem. It seems that, when you write custom functions for rpart (init, split and eval), rpart no longer cross-validates the resulting tree to return errors. A simple test is to use the usersplits.R function to get a simple, custom rpart function, and then change fit1 and fit2 so that they both have xvals of 10. The problem is that the cptable for fit1 doesn't have xerror or xstd, despite the fact that cross-validation is set to 10-fold. I guess I just need confirmation that cross-validation doesn't work with custom functions, and if someone could explain to me why that is the case it would be greatly appreciated. Thanks, Sam Stewart
--
Sam Stewart, MMath, Research Statistician, Diagnostic Imaging, Rm 3016, 3 South Victoria Building, VG Site, QEII Health Sciences Centre, 1278 Tower Rd, Halifax, NS, B3H 2Y9
Re: [R] Cross-validation in R
1) cv.glm is not 'in R', it is part of contributed package 'boot'. Please give credit where it is due.

2) There is nothing 'cross' about your 'home-made cross validation'. cv.glm is support software for a book, so please consult it for the definition used of cross-validation, or MASS (the book: see the posting guide) or another reputable source.

3) If you want to know how a function works please consult a) its help page and b) its code. Here a) answers at least your first question, and your fundamental misunderstanding of 'cross-validation' answers the other two.

On Mon, 9 Jun 2008, Luis Orlindo Tedeschi wrote:

Folks; I am having a problem with cv.glm and would appreciate someone shedding some light here. It seems obvious but I cannot get it. I did read the manual, but I could not get more insight. This is a database containing 3363 records and I am trying a cross-validation to understand the process. When using cv.glm, code below, I get a mean of perr1 of 0.2336 and SD of 0.000139. When using a home-made cross validation, code below, I get a mean of perr2 of 0.2338 and SD of 0.02184. The means are similar but the SDs are different.

You are comparing apples and oranges.

Questions are: (1) how is the $delta computed in cv.glm? In the home-made version, I simply use ((Yobs - Ypred)^2)/n. The equation might be correct because the mean is similar. (2) in cv.glm, I have the impression the system is using glm0.dmi, which was generated using all the data points, whereas in my home-made version I only use the test database. I am confused whether cv.glm generates new glm models for each simulation or uses the one provided. (3) does cv.glm sample with replacement (replace = TRUE) or not? Thanks in advance. LOT

* cv.glm method

glm0.dmi <- glm(DMI_kg ~ Sex + DOF + Avg_Nem + In_Wt)
# Simulation for 50 re-samplings...
perr1.vect <- vector()
for (j in 1:50) {
  print(j)
  cv.dmi <- cv.glm(data.dmi, glm0.dmi, K = 10)
  perr1 <- cv.dmi$delta[2]
  perr1.vect <- c(perr1.vect, perr1)
}
x11()
hist(perr1.vect)
mean(perr1.vect)
sd(perr1.vect)

* homemade method

# Brute-force cross-validation. This should be similar to the cv.glm
perr2.vect <- vector()
for (j in 1:50) {
  print(j)
  select.dmi <- sample(1:nrow(data.dmi), 0.9 * nrow(data.dmi))
  train.dmi <- data.dmi[select.dmi, ]   # 90% of the data for training
  test.dmi <- data.dmi[-select.dmi, ]   # remaining 10% for testing
  glm1.dmi <- glm(DMI_kg ~ Sex + DOF + Avg_Nem + In_Wt,
                  na.action = na.omit, data = train.dmi)
  # Create fitted values using test.dmi data
  dmi_pred <- predict.glm(glm1.dmi, test.dmi)
  dmi_obs <- test.dmi[, "DMI_kg"]
  # Get the prediction error = MSE
  perr2 <- t(dmi_obs - dmi_pred) %*% (dmi_obs - dmi_pred) / nrow(test.dmi)
  perr2.vect <- c(perr2.vect, perr2)
}
x11()
hist(perr2.vect)
mean(perr2.vect)
sd(perr2.vect)

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
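To see why the SDs differ by two orders of magnitude: each cv.glm run already averages over 10 disjoint folds (and refits the model on each training fold), while each homemade iteration scores a single random 90/10 split. A homemade loop that actually partitions the data behaves like cv.glm; a sketch, with an illustrative stand-in for data.dmi:

```r
set.seed(1)
# Illustrative stand-in for data.dmi
data.dmi <- data.frame(DMI_kg = rnorm(300), Sex = gl(2, 150),
                       DOF = runif(300), Avg_Nem = runif(300),
                       In_Wt = runif(300))

# A disjoint 10-fold partition: sampling WITHOUT replacement
folds <- sample(rep(1:10, length.out = nrow(data.dmi)))

fold.err <- sapply(1:10, function(f) {
  train <- data.dmi[folds != f, ]
  test  <- data.dmi[folds == f, ]
  g <- glm(DMI_kg ~ Sex + DOF + Avg_Nem + In_Wt, data = train)
  mean((test$DMI_kg - predict(g, test))^2)
})
cv.est <- mean(fold.err)  # comparable to cv.glm(..., K = 10)$delta
```

Note also that cv.glm refits a fresh model on each training fold rather than reusing the coefficients of the supplied fit, which answers question (2).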
[R] Cross-validation in R
Folks; I am having a problem with cv.glm and would appreciate someone shedding some light here. It seems obvious but I cannot get it. I did read the manual, but I could not get more insight. This is a database containing 3363 records and I am trying a cross-validation to understand the process. When using cv.glm, code below, I get a mean of perr1 of 0.2336 and SD of 0.000139. When using a home-made cross validation, code below, I get a mean of perr2 of 0.2338 and SD of 0.02184. The means are similar but the SDs are different. Questions are: (1) how is the $delta computed in cv.glm? In the home-made version, I simply use ((Yobs - Ypred)^2)/n. The equation might be correct because the mean is similar. (2) in cv.glm, I have the impression the system is using glm0.dmi, which was generated using all the data points, whereas in my home-made version I only use the test database. I am confused whether cv.glm generates new glm models for each simulation or uses the one provided. (3) does cv.glm sample with replacement (replace = TRUE) or not? Thanks in advance. LOT

* cv.glm method

glm0.dmi <- glm(DMI_kg ~ Sex + DOF + Avg_Nem + In_Wt)
# Simulation for 50 re-samplings...
perr1.vect <- vector()
for (j in 1:50) {
  print(j)
  cv.dmi <- cv.glm(data.dmi, glm0.dmi, K = 10)
  perr1 <- cv.dmi$delta[2]
  perr1.vect <- c(perr1.vect, perr1)
}
x11()
hist(perr1.vect)
mean(perr1.vect)
sd(perr1.vect)

* homemade method

# Brute-force cross-validation. This should be similar to the cv.glm
perr2.vect <- vector()
for (j in 1:50) {
  print(j)
  select.dmi <- sample(1:nrow(data.dmi), 0.9 * nrow(data.dmi))
  train.dmi <- data.dmi[select.dmi, ]   # 90% of the data for training
  test.dmi <- data.dmi[-select.dmi, ]   # remaining 10% for testing
  glm1.dmi <- glm(DMI_kg ~ Sex + DOF + Avg_Nem + In_Wt,
                  na.action = na.omit, data = train.dmi)
  # Create fitted values using test.dmi data
  dmi_pred <- predict.glm(glm1.dmi, test.dmi)
  dmi_obs <- test.dmi[, "DMI_kg"]
  # Get the prediction error = MSE
  perr2 <- t(dmi_obs - dmi_pred) %*% (dmi_obs - dmi_pred) / nrow(test.dmi)
  perr2.vect <- c(perr2.vect, perr2)
}
x11()
hist(perr2.vect)
mean(perr2.vect)
sd(perr2.vect)
[R] Cross Validation
Hi, I am trying to find out the best way to calculate the average LOOCV in R for several classifiers: KNN, centroid classification, DLDA and SVM. I have four types of diseases and 62 samples. Is there R code available to do this? -- View this message in context: http://www.nabble.com/Cross-Validation-tp15912818p15912818.html Sent from the R help mailing list archive at Nabble.com.
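For the KNN piece, class::knn.cv performs leave-one-out directly; the other classifiers can be handled with an explicit leave-one-out loop of the same shape. A sketch on illustrative data matching the description (62 samples, 4 classes; the features are made up):

```r
library(class)

set.seed(1)
x  <- matrix(rnorm(62 * 20), 62, 20)  # 62 samples, 20 illustrative features
cl <- factor(sample(paste0("disease", 1:4), 62, replace = TRUE))

pred <- knn.cv(x, cl, k = 3)   # leave-one-out is built in
loocv_acc <- mean(pred == cl)  # average LOOCV accuracy
```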
Re: [R] Cross Validation
An example from my R table, calculating the average LOOCV for two treatments, ALL and AML:

   ALL   AML
1  1.2   .3
2  .87   .3
3  1.1   .5
4  1.2   .7
5  3.2   1.2
6  1.1   1.1
7  .90   .99
8  1.1   .32
9  2.1   1.2

JStainer wrote: Hi, I am trying to find out the best way to calculate the average LOOCV in R for several classifiers: KNN, centroid classification, DLDA and SVM. I have four types of diseases and 62 samples. Is there R code available to do this?
[R] cross validation
Hi, I must have accidentally deleted my previous post. I am having a really difficult time calculating the LOOCV (leave-one-out cross-validation). The table in Excel:

genes  ALL  AML  p.value
1      1.2  .3   .01
2      .87  .3   .03
3      1.1  .5   .05
4      1.2  .7   .01
5      3.2  1.2  .02
6      1.1  1.1  .5

Do I need to import them into R as a matrix? Is there any script available where I can calculate the LOOCV? thanks, John
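A data frame is fine as the import target; exported from Excel as CSV the table could be read with read.csv (the file name below is hypothetical), or entered directly as here. The numeric columns then become a matrix ready for distance-based LOOCV:

```r
# The table from the post, entered directly; read.csv("genes.csv") on an
# Excel CSV export with the same columns would give the same data frame
dat <- data.frame(genes   = 1:6,
                  ALL     = c(1.2, 0.87, 1.1, 1.2, 3.2, 1.1),
                  AML     = c(0.3, 0.3, 0.5, 0.7, 1.2, 1.1),
                  p.value = c(0.01, 0.03, 0.05, 0.01, 0.02, 0.5))

# Numeric matrix of expression values, one row per gene
expr <- as.matrix(dat[, c("ALL", "AML")])
```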
[R] Cross Validation
Hello, How can I do a cross validation in R? Thank You!
Re: [R] Cross Validation
http://www.burns-stat.com/pages/Tutor/bootstrap_resampling.html may be of some use to you.

Patrick Burns [EMAIL PROTECTED] +44 (0)20 8525 0696 http://www.burns-stat.com (home of S Poetry and A Guide for the Unwilling S User)

Carla Rebelo wrote: Hello, How can I do a cross validation in R? Thank You!
[R] Cross Validation in rpart
Hello All, I'm writing a custom rpart function, and I'm wondering about cross-validation. Specifically, why isn't my splitting function being called more often when xval is increased? One would expect that, with xval=10 compared to xval=1, the former would call the splitting function more often, but both produce exactly the same thing. Is there something I'm missing about the cross-validation process for rpart? Thanks, Sam Stewart