[R] Cross-validation for logistic regression with lasso2
Hello, I am trying to shrink the coefficients of a logistic regression for a sparse dataset. I am using the lasso (package lasso2) and I am trying to determine the shrinkage factor by cross-validation. I would like some of the experts here to tell me whether I'm doing it correctly or not. Below are my dataset and the code I use.

w =
  a b c d e   P   A
  0 0 0 0 0   1 879
  1 0 0 0 0   1   3
  0 1 0 0 0   7   7
  0 0 1 0 0 230   2
  0 0 0 1 0 450   7
  0 0 0 0 1   4

# The GLM output shows that the coefficients of c and d are larger than 10:
resp = cbind(w$P, w$A)
summary(glm(resp ~ a + b + c + d + e, data = w, family = binomial))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -6.779      1.001  -6.775 1.24e-11 ***
a              5.680      1.528   3.718 0.000201 ***
b              6.779      1.134   5.976 2.29e-09 ***
c             11.524      1.227   9.392  < 2e-16 ***
d             10.942      1.071  10.220  < 2e-16 ***
e              3.688      1.124   3.282 0.001031 **

# So I wrote the loop below, using the lasso2 package, to determine the best
# shrinkage factor by GCV cross-validation:
for (i in seq(1, 40, 1)) {
    glmba = gl1ce(resp ~ a + b + c + d + e, data = w,
                  family = binomial(), bound = i)
    ecco = round(gcv(glmba, type = "Tibshirani", gen.inverse.diag = 1e11),
                 digits = 3)
    print(ecco)
}
# It gives me 21 as the bound with the lowest GCV.

# Then I determine the shrunken coefficients:
gl1ce(resp ~ a + b + c + d + e, data = w, family = binomial(), bound = 21)

Coefficients:
(Intercept)        a        b        c        d        e
  -4.749816 2.776215 4.342661 8.956583 8.661593 1.264660

Family: binomial
Link function: logit
The absolute L1 bound was: 21
The Lagrangian for the bound is: 1.843283

Thanks
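For reference, a minimal sketch of the same search written so the GCV values are stored and the minimising bound is picked programmatically rather than read off printed output. It assumes lasso2 is loaded and w and resp are defined as above; the "gcv" component name is an assumption about lasso2's gcv() return value and should be checked against your installed version:

    library(lasso2)
    bounds <- 1:40
    gcvs <- sapply(bounds, function(b) {
        fit <- gl1ce(resp ~ a + b + c + d + e, data = w,
                     family = binomial(), bound = b)
        # extract the GCV score; the "gcv" component name is an assumption
        gcv(fit, type = "Tibshirani", gen.inverse.diag = 1e11)[["gcv"]]
    })
    best <- bounds[which.min(gcvs)]   # should reproduce the bound of 21
    gl1ce(resp ~ a + b + c + d + e, data = w, family = binomial(), bound = best)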
[R] cross-validation for count data
Hi everybody, I'm trying to use cross-validation (cv.glm) for count data. Does someone know which is the appropriate cost function for the Poisson distribution? Thank you in advance. Valerio. Conservation Biology Unit, Department of Environmental and Territory Sciences, University of Milano-Bicocca, Piazza della Scienza 1, 20126 Milano, Italy.
Re: [R] cross-validation for count data
On Wed, 15 Nov 2006, [EMAIL PROTECTED] wrote:

> I'm trying to use cross-validation (cv.glm) for count data. Does someone
> know which is the appropriate cost function for the Poisson distribution?

It depends on the scientific problem, not the distribution. You could use the deviance, but it may well not be appropriate for your context, so please seek statistical advice.

BTW, this is off-topic (see the posting guide), which is why your previous post, https://stat.ethz.ch/pipermail/r-help/2006-November/116948.html, went unanswered. Please don't clog the list with repeats like this. And cv.glm is part of package boot (I presume), which you did not mention; if so, it is support software for a book that may help you.

-- Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA); Fax: +44 1865 272595
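As one concrete possibility along the lines of the reply above, a minimal sketch of cv.glm with a mean-unit-Poisson-deviance cost on simulated data; whether this cost is appropriate still depends on the scientific problem:

    library(boot)
    set.seed(1)
    d <- data.frame(x = rnorm(50))
    d$y <- rpois(50, exp(0.5 + 0.3 * d$x))
    fit <- glm(y ~ x, data = d, family = poisson)
    # cost(observed, fitted): mean unit Poisson deviance;
    # the y == 0 branch is the limit of y*log(y/mu) - (y - mu) as y -> 0
    pois.dev.cost <- function(y, mu)
        2 * mean(ifelse(y == 0, mu, y * log(y / mu) - (y - mu)))
    cv.glm(d, fit, cost = pois.dev.cost, K = 10)$delta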
[R] Cross-validation in SVM
Dear David, dear R users, the calculation of cross-validation for an SVM with time series that include negative and positive values (for example, the returns of a stock exchange index) must be different from the calculation of cross-validation with time series that include just absolute (positive) values (for example, a stock exchange index). How is it calculated for a return time series? Thank you very much for any help. Amir
Re: [R] Cross-validation in SVM
On Thu, 23 Feb 2006, Amir Safari wrote:

> The calculation of cross-validation for an SVM with time series that
> include negative and positive values (for example, the returns of a stock
> exchange index) must be different from the calculation with time series
> that include just absolute values (for example, a stock exchange index).

Not necessarily; it depends on the type of data.

> How is it calculated for a return time series?

From the man page of svm():

  cross: if an integer value k > 0 is specified, a k-fold cross validation
  on the training data is performed to assess the quality of the model:
  the accuracy rate for classification and the Mean Squared Error for
  regression

i.e., the MSE will be used. Z
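For a concrete illustration, a hedged sketch of svm(..., cross = k) on a signed toy series; the lagged-predictor setup is illustrative only, the MSE/tot.MSE component names are taken from e1071's svm object and may differ across versions, and note that this k-fold scheme shuffles cases and so ignores the time ordering:

    library(e1071)
    set.seed(1)
    ret <- rnorm(200, 0, 0.01)   # toy signed "return" series
    X <- embed(ret, 4)           # column 1 = current value, columns 2:4 = lags
    fit <- svm(X[, -1], X[, 1], type = "eps-regression", cross = 5)
    fit$MSE                      # per-fold mean squared errors
    fit$tot.MSE                  # overall cross-validated MSE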
[R] Cross-validation
Dear R-help, I was wondering if somebody has a strong opinion on the following matter: would you see it as appropriate to apply the leave-one-out cross-validation technique in time series modelling? Thanks in advance, Tom
Re: [R] Cross-validation
I would hesitate long before doing that. People do similar things, but:

  Cross-validation and bootstrapping become considerably more complicated
  for time series data; see Hjorth (1994) and Snijders (1988).
  http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html

I just tried www.r-project.org -> search -> R site search for "time series cross validation", "jackknife time series" and "bootstrap time series". I found the above using Google for the same terms.

spencer graves

Werner Bier wrote:
> Would you see it as appropriate to apply the leave-one-out
> cross-validation technique in time series modelling?

-- Spencer Graves, PhD, Senior Development Engineer, PDF Solutions, Inc., 333 West San Carlos Street Suite 700, San Jose, CA 95110, USA. [EMAIL PROTECTED] www.pdf.com Tel: 408-938-4420 Fax: 408-280-7915
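A common alternative to leave-one-out for time series, in the spirit of the references above, is rolling-origin (forecast-origin) evaluation; a minimal sketch with an AR(1) fit on simulated data:

    set.seed(1)
    y <- as.numeric(arima.sim(list(ar = 0.6), n = 120))
    h <- 1                           # forecast horizon
    origins <- 100:(length(y) - h)   # forecast origins
    errs <- sapply(origins, function(t) {
        fit <- arima(y[1:t], order = c(1, 0, 0))   # fit on data up to time t
        y[t + h] - predict(fit, n.ahead = h)$pred[h]
    })
    mean(errs^2)                     # out-of-sample one-step MSE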
Re: [R] cross validation and parameter determination
On Wednesday 20 April 2005 00:17, array chip wrote:

> Can I use the same single cross-validation to both choose the value of the
> parameter (the amount of shrinkage in this case) and estimate the
> performance based on the value of the parameter chosen by that same
> cross-validation? I kind of feel awkward about getting both from a single
> cross-validation, because it seems like I used the dataset in a
> training-set manner. Am I wrong/right?

That error rate is probably optimistic because, as you say, the dataset was used in a training-set manner. However, you can easily wrap the whole pam procedure within an outer loop of cross-validation or the bootstrap. (This problem is not that different from, say, using knn and selecting k by cross-validation, or selecting the number of genes to use by cross-validation, etc. You should then assess the error rate of your whole procedure.)

R.

-- Ramón Díaz-Uriarte, Bioinformatics Unit, Centro Nacional de Investigaciones Oncológicas (CNIO) (Spanish National Cancer Center), Melchor Fernández Almagro 3, 28029 Madrid (Spain). Phone: +34-91-224-6900; Fax: +34-91-224-6972. http://ligarto.org/rdiaz PGP KeyID: 0xE89B3462 (http://ligarto.org/rdiaz/0xE89B3462.asc)
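A minimal sketch of that outer-loop (nested cross-validation) idea, using knn from package class as a stand-in for pam, with the tuning grid and fold counts as placeholders:

    library(class)
    set.seed(1)
    n <- 100
    x <- matrix(rnorm(n * 10), n, 10)
    cls <- factor(rep(0:1, length.out = n))
    outer.fold <- sample(rep(1:5, length.out = n))
    outer.err <- sapply(1:5, function(f) {
        tr <- outer.fold != f
        # inner loop: choose k on the training part only (here by LOO CV)
        inner.err <- sapply(1:5, function(k)
            mean(knn.cv(x[tr, ], cls[tr], k = k) != cls[tr]))
        best.k <- which.min(inner.err)
        # outer loop: test the tuned procedure on the held-out fold
        mean(knn(x[tr, ], x[!tr, ], cls[tr], k = best.k) != cls[!tr])
    })
    mean(outer.err)   # error estimate for the whole tune-and-fit procedure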
[R] cross validation and parameter determination
Hi all, in Tibshirani's PNAS paper about nearest shrunken centroid analysis of microarrays (PNAS vol. 99:6567), they used cross-validation to choose the amount of shrinkage used in the model, and then tested the performance of the model, with the cross-validated shrinkage, on a separate independent testing set. If I don't have the luxury of an independent testing set, can I just use the cross-validation performance as the performance estimate? In other words, can I use the same single cross-validation both to choose the value of the parameter (the amount of shrinkage in this case) and to estimate the performance based on the value of the parameter chosen by that same cross-validation? I kind of feel awkward about getting both from a single cross-validation, because it seems like I used the dataset in a training-set manner. Am I wrong/right? Thanks!
RE: [R] cross validation and parameter determination
In all likelihood, you'll get an overly optimistic estimate of performance that way.

Andy

From: array chip
> Can I use the same single cross-validation to both choose the value of the
> parameter (the amount of shrinkage in this case) and estimate the
> performance [...]?
[R] cross validation and CART
Hello, I would like to know whether the classification trees I built with my data are predictive or not. Could you explain to me how to do that? Thanks, Laure Maton
RE: [R] cross validation and CART
From: Laure Maton
> I would like to know whether the classification trees I built with my data
> are predictive or not. Could you explain to me how to do that?

If you are talking about the particular tree models that you built from the data, you will need an independent test set to evaluate prediction performance. If you want to know whether the _algorithm_ can produce models that are predictive, you can use something like cross-validation. See the errorest() function in the `ipred' package, for example.

Andy
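A minimal sketch of errorest() for a tree, assuming packages ipred and rpart and using iris as a stand-in dataset:

    library(ipred)
    library(rpart)
    # 10-fold CV estimate of the misclassification error of the
    # tree-building procedure (not of one particular fitted tree)
    errorest(Species ~ ., data = iris, model = rpart, estimator = "cv",
             predict = function(object, newdata)
                 predict(object, newdata, type = "class"))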
[R] Cross validation, one more time (hopefully the last)
I apologize for posting on this question again, but unfortunately I don't have, and can't get, access to MASS for at least three weeks. I have found some code on the web, however, which implements the prediction-error algorithm of cv.glm: http://www.bioconductor.org/workshops/NGFN03/modelsel-exercise.pdf

I've tried to adapt it to my purposes, but since I'm not deeply familiar with R programming, I don't know why it doesn't work. Checking the r-help list FAQ, it seems this is an appropriate question. I've included my attempted function below. The error I get is:

  logcv(basp.data, form, 'basp', 'recordyear')
  Error in order(na.last, decreasing, ...) : Argument 1 is not a vector

My questions are: why doesn't this work, and how do I fix it? I'm using the formula() function to create the formula that I'm sending to my function, and mdata is a data.frame. I assumed that if I passed the column names as strings (response variable rvar, fold variable fvar) this would work; apparently, however, it doesn't. Lastly, since I don't have access to MASS and there are apparently many examples of doing this kind of thing in MASS, could someone tell me if this function looks approximately correct? Thanks, T

logcv <- function(mdata, formula, rvar, fvar)
{
    require(Hmisc)
    # sort by fold variable
    sorted <- mdata[order(mdata$fvar), ]
    # get fold values and count for each group
    vardesc <- describe(sorted$fvar)$values
    fvarlist <- as.integer(dimnames(vardesc)[[2]])
    k <- length(fvarlist)
    countlist <- vardesc[1, 1]
    for (i in 2:k) {
        countlist[i] <- vardesc[1, i]
    }
    n <- length(sorted$fvar)
    # fit to all the mdata
    fit.all <- glm(formula, sorted, family = binomial)
    pred.all <- ifelse(predict(fit.all, type = "response") < 0.5, 0, 1)
    # setup
    pred.c <- list()
    error.i <- vector(length = k)
    for (i in 1:k) {
        fit.i <- glm(formula, subset(sorted, sorted$fvar != fvarlist[i]),
                     family = binomial)
        pred.i <- ifelse(predict(fit.i,
                                 newdata = subset(sorted,
                                                  sorted$fvar == fvarlist[i]),
                                 type = "response") < 0.5, 0, 1)
        pred.c[[i]] <- pred.i
        pred.all.i <- ifelse(predict(fit.i, newdata = sorted,
                                     type = "response") < 0.5, 0, 1)
        error.i[i] <- sum(sorted$rvar != pred.all.i) / n
    }
    pred.cc <- unlist(pred.c)
    delta.cv.k <- sum(sorted$rvar != pred.cc) / n
    p.k <- countlist / n
    delta.app <- mean(sorted$rvar != pred.all) / n
    delta.acv.k <- delta.cv.k + delta.app - sum(p.k * error.i)
    print(delta.acv.k)
}

-- Trevor Wiens [EMAIL PROTECTED]
Re: [R] Cross validation, one more time (hopefully the last)
On Wed, 16 Mar 2005 17:59:01 -0700, Trevor Wiens [EMAIL PROTECTED] wrote:

> I apologize for posting on this question again, but unfortunately I don't
> have, and can't get, access to MASS for at least three weeks. I have found
> some code on the web, however, which implements the prediction-error
> algorithm of cv.glm:
> http://www.bioconductor.org/workshops/NGFN03/modelsel-exercise.pdf
> I've tried to adapt it to my purposes, but since I'm not deeply familiar
> with R programming, I don't know why it doesn't work.

OK. I've determined why that didn't work, but I'm still unsure whether I've implemented the algorithm correctly. Any suggestions for testing would be appreciated. The corrected function is attached. Thanks for your assistance.

logcv <- function(mdata, formula, rvar, fvar)
{
    require(Hmisc)
    # determine index of variables
    rpos <- match(rvar, names(mdata))
    fpos <- match(fvar, names(mdata))
    # sort by fold variable
    sorted <- mdata[order(mdata[[fpos]]), ]
    # get fold values and count for each group
    vardesc <- describe(sorted[[fpos]])$values
    fvarlist <- as.integer(dimnames(vardesc)[[2]])
    k <- length(fvarlist)
    countlist <- vardesc[1, 1]
    for (i in 2:k) {
        countlist[i] <- vardesc[1, i]
    }
    n <- length(sorted[[fpos]])
    # fit to all the mdata
    fit.all <- glm(formula, sorted, family = binomial)
    pred.all <- ifelse(predict(fit.all, type = "response") < 0.5, 0, 1)
    # setup
    pred.c <- list()
    error.i <- vector(length = k)
    for (i in 1:k) {
        fit.i <- glm(formula, subset(sorted, sorted[[fpos]] != fvarlist[i]),
                     family = binomial)
        pred.i <- ifelse(predict(fit.i,
                                 newdata = subset(sorted,
                                                  sorted[[fpos]] == fvarlist[i]),
                                 type = "response") < 0.5, 0, 1)
        pred.c[[i]] <- pred.i
        pred.all.i <- ifelse(predict(fit.i, newdata = sorted,
                                     type = "response") < 0.5, 0, 1)
        error.i[i] <- sum(sorted[[rpos]] != pred.all.i) / n
    }
    pred.cc <- unlist(pred.c)
    delta.cv.k <- sum(sorted[[rpos]] != pred.cc) / n
    p.k <- countlist / n
    delta.app <- mean(sorted[[rpos]] != pred.all) / n
    delta.acv.k <- delta.cv.k + delta.app - sum(p.k * error.i)
    return(delta.acv.k)
}

-- Trevor Wiens [EMAIL PROTECTED]
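One way to test it, as a hedged sketch: simulate data where the truth is known and compare against boot::cv.glm with a 0/1 cost; the two estimates differ in detail (logcv returns an adjusted estimate) but should be in the same ballpark:

    library(boot)
    set.seed(1)
    d <- data.frame(x = rnorm(200))
    d$y <- rbinom(200, 1, plogis(d$x))
    d$fold <- sample(rep(1:10, length = 200))   # hypothetical fold variable
    cost01 <- function(y, p) mean(y != (p > 0.5))
    fit <- glm(y ~ x, data = d, family = binomial)
    cv.glm(d, fit, cost = cost01, K = 10)$delta   # reference estimate
    logcv(d, y ~ x, "y", "fold")                  # should be comparable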
[R] cross-validation
I've been looking at the base and Design libraries, and it is unclear to me what the best way is to approach cross-validation. I'm interested in temporal cross-validation (I have five years of data), spatial cross-validation (I've divided my data set into 5 blocks that make sense and have a block variable attached to my data), and I was also thinking of doing a random cross-validation to look at general model stability. For the third option I could use cross-validation or bootstrapping. If someone could type out a code example, that would be very helpful to me. Thanks in advance. T

-- Trevor Wiens [EMAIL PROTECTED]
The significant problems that we face cannot be solved at the same level of thinking we were at when we created them. (Albert Einstein)
Re: [R] cross-validation
On Sun, 13 Mar 2005 15:28:54 -0700, Trevor Wiens [EMAIL PROTECTED] wrote:

> I've been looking at the stats and Design libraries, and it is unclear to
> me what the best way is to approach cross-validation. I'm interested in
> temporal, spatial, and random cross-validation. For the third option I
> think either cross-validation or bootstrapping would be appropriate. If
> someone can type out a really simple code example (or point me to one),
> that would be very helpful.

I realized I should have mentioned that this is for logistic regression using either glm or the Design lrm models. Thanks, T
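For the Design route, a minimal sketch (assuming the Design package; this functionality lives in package rms in later R versions) of lrm() plus validate() with cross-validation, on simulated stand-in data:

    library(Design)
    set.seed(1)
    d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    d$y <- rbinom(100, 1, plogis(d$x1))
    # keep the design matrix and response so validate() can refit
    fit <- lrm(y ~ x1 + x2, data = d, x = TRUE, y = TRUE)
    validate(fit, method = "crossvalidation", B = 10)   # 10-fold CV indexes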
[R] cross validation
How do I select a training data set and a test data set from the original data for performing cross-validation?
Re: [R] cross validation
You could do something like this (based on VR's S Programming, p. 175):

dat <- data.frame(matrix(rnorm(100 * 6), 100, 6))
n <- nrow(dat)
V <- 10   # number of folds
samps <- sample(rep(1:V, length = n), n, replace = FALSE)
# Using the first fold:
train <- dat[samps != 1, ]   # fit the model
test  <- dat[samps == 1, ]   # predict

I hope it helps. Best, Dimitris

Dimitris Rizopoulos, Ph.D. Student, Biostatistical Centre, School of Public Health, Catholic University of Leuven. Address: Kapucijnenvoer 35, Leuven, Belgium. Tel: +32/16/336899; Fax: +32/16/337015. Web: http://www.med.kuleuven.ac.be/biostat http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm

- Original Message - From: kolluru ramesh
> How do I select a training data set and a test data set from the original
> data for performing cross-validation?
RE: [R] cross validation
One way is to create an indicator vector that indicates which `fold' a case should belong to. Something like:

fold <- 10
idx <- sample(fold, n, replace = TRUE)   # n = number of cases
for (i in 1:fold) {
    train.dat <- dat[idx != i, ]
    test.dat  <- dat[idx == i, ]
    ...
}

Also see the errorest() function in the ipred package. It is more careful in making sure the folds are as close in size as possible, and can do stratified splits.

Andy

From: kolluru ramesh
> How do I select a training data set and a test data set from the original
> data for performing cross-validation?
Re: [R] cross validation
Dimitris Rizopoulos wrote:
> You could do something like this (based on VR's S Programming, p. 175):
> [snip]

Or see ?errorest in the ipred package.

Uwe Ligges
[R] Cross-validation accuracy in SVM
Hi all - I am trying to tune an SVM model by optimizing the cross-validation accuracy. Maximizing this value doesn't necessarily seem to minimize the number of misclassifications. Can anyone tell me how the cross-validation accuracy is defined? In the output below, for example, the cross-validation accuracy is 92.2%, while the proportion of correctly classified samples is (1476 + 170)/(1476 + 170 + 4) = 99.7%!? Thanks for any help. Regards - Ton

Parameters:
   SVM-Type:   C-classification
   SVM-Kernel: radial
   cost:       8
   gamma:      0.007

Number of Support Vectors: 1015 (148 867)
Number of Classes: 2
Levels: false true

5-fold cross-validation on training data:
Total Accuracy: 92.24242
Single Accuracies: 90 93.3 94.84848 92.72727 90.30303

Contingency Table:
            predclasses
origclasses false true
      false  1476    0
      true      4  170
RE: [R] Cross-validation accuracy in SVM
The 99.7% accuracy you quoted, I take it, is the accuracy on the training set. If so, that number hardly means anything (other than, perhaps, self-fulfilling prophecy). Usually what one would want is for the model to be able to predict data that weren't used to train the model with high accuracy. That's what cross-validation tries to emulate: it gives you an estimate of how well you can expect your model to do on data that the model has not seen.

Andy

From: Ton van Daelen
> I am trying to tune an SVM model by optimizing the cross-validation
> accuracy. [...] In the output below, for example, the cross-validation
> accuracy is 92.2%, while the proportion of correctly classified samples
> is (1476 + 170)/(1476 + 170 + 4) = 99.7%!?
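To see the distinction in code, a hedged sketch using e1071 with iris as a stand-in dataset; the cost and gamma values are borrowed from the post above, and the tot.accuracy component name is taken from e1071's svm object:

    library(e1071)
    set.seed(1)
    fit <- svm(Species ~ ., data = iris, cost = 8, gamma = 0.007, cross = 5)
    mean(predict(fit, iris) == iris$Species)   # resubstitution accuracy
    fit$tot.accuracy                           # 5-fold CV accuracy, usually lower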
Re: [R] Cross-validation accuracy in SVM
Ton van Daelen wrote:
> Can anyone tell me how the cross-validation accuracy is defined? In the
> output below, for example, the cross-validation accuracy is 92.2%, while
> the proportion of correctly classified samples is 99.7%!?

Percent correctly classified is an improper scoring rule: the percentage can be maximized by a bogus set of predicted values. In addition, one can add a very important predictor and have the percentage actually decrease.

Frank Harrell

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
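For contrast, a minimal sketch of one proper scoring rule, the Brier score, which is computed from predicted probabilities rather than hard 0/1 classifications (the probabilities here are placeholders):

    set.seed(1)
    y <- rbinom(100, 1, 0.3)   # observed 0/1 outcomes
    p <- runif(100)            # placeholder predicted probabilities
    mean((p - y)^2)            # Brier score: lower is better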
[R] Cross-validation for Linear Discriminant Analysis
Hello: I am new to R and statistics and I have two questions. First, I need help interpreting the cross-validation result from the R linear discriminant analysis function lda. I did the following:

  lda(group ~ Var1 + Var2, CV = T)

where CV = T tells lda to do cross-validation. The output of lda includes the posterior probabilities among other things, but I can't find an error term (like the delta returned by cv.glm). My question is how to get such an error term from the output. Can I just calculate the prediction accuracy using the posterior probabilities from the cross-validation, and use that to measure the quality of the model? My other question is more basic: how do I determine whether an lda model is significant? (There is no p-value.) Thanks, Yu Shao, Wadsworth Research Center, Department of Health of New York State, Albany, NY 12208
Re: [R] Cross-validation for Linear Discriminant Analysis
On Wed, 15 Sep 2004, Yu Shao wrote:

> I am new to R and statistics and I have two questions.

Perhaps then you need to start by explaining why you are using LDA. Please take a good look at the posting guide.

> First I need help to interpret the cross-validation result from the R
> linear discriminant analysis function lda.

You mean Professor Ripley's function lda in package MASS, I guess.

> I did the following: lda(group ~ Var1 + Var2, CV = T)

R allows you to use meaningful names, so please do so.

> where CV = T tells lda to do cross-validation. The output of lda includes
> the posterior probabilities among other things, but I can't find an error
> term (like the delta returned by cv.glm). My question is how to get such
> an error term from the output. Can I just calculate the prediction
> accuracy using the posterior probabilities from the cross-validation, and
> use that to measure the quality of the model?

cv.glm as in Dr Canty's package boot? If you are trying to predict classifications, LDA is not the right tool, and LOO CV probably is not either. There is no unique definition of `error term' (true for cv.glm as well), and people have written whole books about how to assess classifiers. LDA is about `discrimination', not `allocation', in the jargon used ca 1960.

> Another question is more basic: how do I determine whether an lda model
> is significant? (There is no p-value.)

Please do read the references on the ?lda page. It's not a useful question, as LDA is about discriminating between populations and makes the unrealistic assumption of multivariate normality. (Analogously for linear regression, there are ways to test whether a fit is (statistically) `significant', but knowledgeable users almost never do so.) Perhaps more realistic advice is to suggest you seek some statistical consultancy.

-- Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA); Fax: +44 1865 272595
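For what it is worth, a minimal sketch of what lda(..., CV = TRUE) in MASS returns, using iris as a stand-in; a raw error rate like this is only one possible summary and, per the reply above, may not be the right one for a given problem:

    library(MASS)
    fit.cv <- lda(Species ~ Sepal.Length + Sepal.Width, data = iris, CV = TRUE)
    head(fit.cv$posterior)               # leave-one-out posterior probabilities
    mean(fit.cv$class != iris$Species)   # a simple LOO misclassification rate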
[R] cross-validation with count-data
Hello, possibly this is a stupid question, but after a few hours of trying and searching (perhaps I used the wrong keywords) I decided to post it. I have the output of a glm() fit of count data (Poisson). I would like to get the prediction error (by cross-validation). cv.glm() does not work with Poisson-error-family data, or else I have to transform the output prediction error in some way. Are there methods, especially for glm() fits with different error families, for assessing the model goodness (or badness)? Thanks in advance, Martin