[R] Cross-validation for logistic regression with lasso2

2007-05-18 Thread francogrex

Hello, I am trying to shrink the coefficients of a logistic regression for a
sparse dataset. I am using the lasso (lasso2) and I am trying to determine
the shrinkage factor by cross-validation. I would like some of the experts
here to tell me whether I'm doing it correctly or not. Below are my dataset
and the functions I use.

w =
a  b  c  d  e    P    A
0  0  0  0  0    1  879
1  0  0  0  0    1    3
0  1  0  0  0    7    7
0  0  1  0  0  230    2
0  0  0  1  0  450    7
0  0  0  0  1    4

#The GLM output shows that the coefficients c and d are larger than 10:
resp=cbind(w$P,w$A)
summary(glm(resp~a+b+c+d+e,data=w,family=binomial))
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -6.779      1.001  -6.775 1.24e-11 ***
a              5.680      1.528   3.718 0.000201 ***
b              6.779      1.134   5.976 2.29e-09 ***
c             11.524      1.227   9.392  < 2e-16 ***
d             10.942      1.071  10.220  < 2e-16 ***
e              3.688      1.124   3.282 0.001031 **

# So I wrote this below, using the lasso2 package, to determine the best
# shrinkage factor by gcv cross-validation:

library(lasso2)
for (i in seq(1, 40, 1)) {
    glmba <- gl1ce(resp ~ a + b + c + d + e, data = w, family = binomial(),
                   bound = i)
    ecco <- round(gcv(glmba, type = "Tibshirani", gen.inverse.diag = 1e11),
                  digits = 3)
    print(ecco)
}
#and it gives me 21 with the lowest gcv.

#then I determine the shrunken coefficients:
gl1ce(resp ~ a + b + c + d + e, data = w, family = binomial(), bound = 21)
Coefficients:
(Intercept)           a           b           c           d           e
  -4.749816    2.776215    4.342661    8.956583    8.661593    1.264660
Family: binomial
Link function: logit
The absolute L1 bound was       :  21
The Lagrangian for the bound is :  1.843283

Thanks




[R] cross-validation for count data

2006-11-15 Thread [EMAIL PROTECTED]
Hi everybody,
I'm trying to use cross-validation (cv.glm) for count data. Does someone know 
which is the appropriate cost function for Poisson distribution?
Thank you in advance.

Valerio. 
Conservation Biology Unit
Department of Environmental and Territory Sciences
University of Milano-Bicocca
Piazza della Scienza,1
20126 Milano, Italy.





Re: [R] cross-validation for count data

2006-11-15 Thread Brian Ripley
On Wed, 15 Nov 2006, [EMAIL PROTECTED] wrote:

 I'm trying to use cross-validation (cv.glm) for count data. Does someone 
 know which is the appropriate cost function for Poisson distribution?

It depends on the scientific problem, not the distribution.
You could use the deviance but it may well not be appropriate for your 
context, so please seek statistical advice.
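
If the deviance were judged appropriate, a minimal sketch of a deviance-based
cost function for cv.glm() could look like the following (the data frame d and
the model are invented purely for illustration):

library(boot)

d <- data.frame(x = runif(100), y = rpois(100, lambda = 3))
fit <- glm(y ~ x, data = d, family = poisson)

# cost(y, mu) must return a single number; here the mean Poisson deviance,
# with the y*log(y/mu) term taken as 0 when y == 0
pois.dev.cost <- function(y, mu) {
    2 * mean(ifelse(y == 0, mu, y * log(y / mu) - (y - mu)))
}

cv.glm(d, fit, cost = pois.dev.cost, K = 10)$delta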

BTW, this is off-topic (see the posting guide) which is why your previous

https://stat.ethz.ch/pipermail/r-help/2006-November/116948.html

went unanswered.  Please don't clog the list with repeats like this.

And cv.glm is part of package boot (I presume), which you did not mention;
if so, it is support software for a book that may help you.

 Thank you in advance.

 Valerio.
 Conservation Biology Unit
 Department of Environmental and Territory Sciences
 University of Milano-Bicocca
 Piazza della Scienza,1
 20126 Milano, Italy.



-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



[R] Cross-validation in SVM

2006-02-23 Thread Amir Safari
 
   
 Dear David, Dear R Users,
   
  Calculation of cross-validation for SVM with those time series which
include negative and positive values (for example, returns of a stock exchange
index) must be different from a calculation of cross-validation with time
series which includes just absolute values (for example, a stock exchange
index itself).
  How is it calculated for a return time series? 
  Thank you very much for any help.
  Amir 
   
   
   




Re: [R] Cross-validation in SVM

2006-02-23 Thread Achim Zeileis
On Thu, 23 Feb 2006, Amir Safari wrote:

 Calculation of cross-validation for SVM with those time series which
  include negative and positive values (for example, returns of a stock
  exchange index) must be different from a calculation of cross-validation
  with time series which includes just absolute values (for example, a
  stock exchange index itself).

Not necessarily, depends on the type of data.

 How is it calculated for a return time series?

From the man page of svm():

   cross: if a integer value k>0 is specified, a k-fold cross
          validation on the training data is performed to assess the
          quality of the model: the accuracy rate for classification
          and the Mean Squared Error for regression

i.e., MSE will be used.
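
For concreteness, a minimal sketch of what this looks like with e1071 for a
regression SVM on a return-like series (the data are simulated and all names
are made up for illustration):

library(e1071)

set.seed(1)
ret <- rnorm(200, sd = 0.01)                    # stand-in for daily returns
dat <- data.frame(y = ret[-1], lag1 = head(ret, -1))

fit <- svm(y ~ lag1, data = dat, cross = 10)    # eps-regression by default
fit$tot.MSE                                     # overall 10-fold CV MSE
fit$MSE                                         # per-fold MSE values
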
Z



[R] Cross-validation

2005-06-25 Thread Werner Bier
Dear R-help,
 
I was wondering if somebody has a strong opinion on the following matter:
 
Would you consider it appropriate to apply the leave-one-out cross-validation
technique in time series modelling?
 
Thanks in advance,
Tom



Re: [R] Cross-validation

2005-06-25 Thread Spencer Graves
  I would hesitate long before doing that.  People do similar things, 
but:

Cross-validation and bootstrapping become considerably more complicated 
for time series data; see Hjorth (1994) and Snijders (1988).
http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html

  I just tried www.r-project.org -> search -> R site search for "time
series cross validation", "jackknife time series" and "bootstrap time
series".  I found the above using Google for the same terms.
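
One alternative to plain leave-one-out for time series is rolling-origin
(forecast-origin) evaluation; a minimal sketch on a toy AR(1) series (all
names and settings are invented for illustration):

set.seed(1)
y <- arima.sim(model = list(ar = 0.6), n = 120)

# fit on y[1:k], predict y[k+1], then roll the origin forward
origins <- 80:119
errs <- sapply(origins, function(k) {
    fit <- arima(y[1:k], order = c(1, 0, 0))
    as.numeric(y[k + 1] - predict(fit, n.ahead = 1)$pred)
})
mean(errs^2)    # out-of-sample one-step-ahead MSE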

  spencer graves

Werner Bier wrote:

 Dear R-help,
  
 I was wondering if somebody has a strong opinion on the following matter:
  
 Would you consider it appropriate to apply the leave-one-out cross-validation
 technique in time series modelling?
  
 Thanks in advance,
 Tom
 

-- 
Spencer Graves, PhD
Senior Development Engineer
PDF Solutions, Inc.
333 West San Carlos Street Suite 700
San Jose, CA 95110, USA

[EMAIL PROTECTED]
www.pdf.com http://www.pdf.com
Tel:  408-938-4420
Fax: 408-280-7915



Re: [R] cross validation and parameter determination

2005-04-20 Thread Ramon Diaz-Uriarte
On Wednesday 20 April 2005 00:17, array chip wrote:
 Hi all,

 In Tibshirani's PNAS paper about nearest shrunken
 centroid analysis of microarrays (PNAS vol 99:6567),
 they used cross validation to choose the amount of
 shrinkage used in the model, and then test the
 performance of the model with the cross-validated
 shrinkage in separate independent testing set. If I
 don't have the luxury of having independent testing
 set, can I just use the cross validation performance
 as the performance estimate? In other words, can I use
 the same single cross-validation to both choose the
 value of the parameter (amount of shrinkage in this
 case) and estimate the performance which was based on
 the value of the parameter chosen by the same
 cross-validation? I kind of feel awkward by getting
 both on a single cross validation, because it seems
 like I used the dataset in training set manner. Am I
 wrong/right?


That error rate is probably optimistic because, as you say, you have used the
data set in a training-set manner.

However, you can easily wrap the whole pam procedure within an outer loop of
cross-validation or the bootstrap. (This problem is not that different from,
say, using knn and selecting k by cross-validation, or selecting the number of
genes to use by cross-validation, etc. You should then assess the error
rate of your whole procedure.)
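
As a minimal sketch of such an outer loop (knn stands in here for the inner
selection step; the data and every name are invented for illustration, not
taken from pam/pamr):

library(class)

set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- factor(rep(c("A", "B"), each = 50))

outer <- sample(rep(1:5, length.out = nrow(x)))      # 5 outer folds
outer.err <- sapply(1:5, function(f) {
    tr <- outer != f
    # inner loop: pick k by leave-one-out CV on the training part only
    ks <- c(1, 3, 5, 7)
    inner.err <- sapply(ks, function(k)
        mean(knn.cv(x[tr, ], y[tr], k = k) != y[tr]))
    best.k <- ks[which.min(inner.err)]
    # assess that choice on the held-out outer fold
    mean(knn(x[tr, ], x[!tr, ], y[tr], k = best.k) != y[!tr])
})
mean(outer.err)    # error estimate for the whole selection procedure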

R.

 Thanks!


-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

http://ligarto.org/rdiaz
PGP KeyID: 0xE89B3462
(http://ligarto.org/rdiaz/0xE89B3462.asc)






[R] cross validation and parameter determination

2005-04-19 Thread array chip
Hi all,

In Tibshirani's PNAS paper about nearest shrunken
centroid analysis of microarrays (PNAS vol 99:6567),
they used cross validation to choose the amount of
shrinkage used in the model, and then test the
performance of the model with the cross-validated
shrinkage in separate independent testing set. If I
don't have the luxury of having independent testing
set, can I just use the cross validation performance
as the performance estimate? In other words, can I use
the same single cross-validation to both choose the
value of the parameter (amount of shrinkage in this
case) and estimate the performance which was based on
the value of the parameter chosen by the same
cross-validation? I kind of feel awkward by getting
both on a single cross validation, because it seems
like I used the dataset in training set manner. Am I
wrong/right?

Thanks!



RE: [R] cross validation and parameter determination

2005-04-19 Thread Liaw, Andy
In all likelihood, you'll get an overly optimistic estimate of performance
that way.

Andy

 From: array chip
 
 Hi all,
 
 In Tibshirani's PNAS paper about nearest shrunken
 centroid analysis of microarrays (PNAS vol 99:6567),
 they used cross validation to choose the amount of
 shrinkage used in the model, and then test the
 performance of the model with the cross-validated
 shrinkage in separate independent testing set. If I
 don't have the luxury of having independent testing
 set, can I just use the cross validation performance
 as the performance estimate? In other words, can I use
 the same single cross-validation to both choose the
 value of the parameter (amount of shrinkage in this
 case) and estimate the performance which was based on
 the value of the parameter chosen by the same
 cross-validation? I kind of feel awkward by getting
 both on a single cross validation, because it seems
 like I used the dataset in training set manner. Am I
 wrong/right?
 
 Thanks!
 


[R] cross validation and CART

2005-04-15 Thread Laure Maton
Hello,
I would like to know whether the classification trees I built with my data are
predictive or not.
Could you explain to me how to do that?
Thanks
Laure Maton



RE: [R] cross validation and CART

2005-04-15 Thread Liaw, Andy
 From: Laure Maton
 
 Hello,
 I would like to know if the classification trees i built with 
 my data are 
 predictive or not.
 Could you explain me how to do that?
 Thanks
 Laure Maton

If you are talking about the particular tree models that you built from the
data, you will need an independent test set to evaluate prediction performance.
If you want to know whether the _algorithm_ can produce models that are
predictive, you can use something like cross-validation.  See the errorest()
function in the `ipred' package, for example.
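
For instance, a minimal sketch with errorest() and rpart (iris is used purely
as a stand-in data set):

library(ipred)
library(rpart)

# 10-fold CV estimate of the error of the tree-building procedure,
# as opposed to testing one fixed tree
err <- errorest(Species ~ ., data = iris,
                model = rpart,
                predict = function(object, newdata)
                    predict(object, newdata, type = "class"),
                est.para = control.errorest(k = 10))
err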

Andy



[R] Cross validation, one more time (hopefully the last)

2005-03-16 Thread Trevor Wiens
I apologize for posting on this question again, but unfortunately, I don't have 
and can't get access to MASS for at least three weeks. I have found some code 
on the web however which implements the prediction error algorithm in cv.glm.

http://www.bioconductor.org/workshops/NGFN03/modelsel-exercise.pdf

Now I've tried to adapt it to my purposes, but since I'm not deeply familiar 
with R programming, I don't know why it doesn't work. Now checking the r-help 
list faq it seems this is an appropriate question. 

I've included my attempted function below. The error I get is:

logcv(basp.data, form, 'basp', 'recordyear')
Error in order(na.last, decreasing, ...) : 
Argument 1 is not a vector

My questions are: why doesn't this work, and how do I fix it?

I'm using the formula() function to create the formula that I'm passing to my
function, and mdata is a data.frame. I assumed that if I passed the
column names as strings (response variable - rvar, fold variable - fvar) this
would work. Apparently, however, it doesn't.

Lastly, since I don't have access to MASS and there are apparently many 
examples of doing this kind of thing in MASS, could someone tell me if this 
function looks approximately correct?

Thanks

T



logcv <- function(mdata, formula, rvar, fvar) {
    require(Hmisc)

    # sort by fold variable
    sorted <- mdata[order(mdata$fvar), ]

    # get fold values and count for each group
    vardesc <- describe(sorted$fvar)$values
    fvarlist <- as.integer(dimnames(vardesc)[[2]])
    k <- length(fvarlist)
    countlist <- vardesc[1,1]
    for (i in 2:k)
    {
        countlist[i] <- vardesc[1,i]
    }
    n <- length(sorted$fvar)

    # fit to all the mdata
    fit.all <- glm(formula, sorted, family=binomial)
    pred.all <- ifelse(predict(fit.all, type="response") > 0.5, 0, 1)

    # setup
    pred.c <- list()
    error.i <- vector(length=k)

    for (i in 1:k)
    {
        fit.i <- glm(formula, subset(sorted, sorted$fvar != fvarlist[i]),
                     family=binomial)
        pred.i <- ifelse(predict(fit.i, newdata=subset(sorted, sorted$fvar ==
                         fvarlist[i]), type="response") > 0.5, 0, 1)
        pred.c[[i]] <- pred.i
        pred.all.i <- ifelse(predict(fit.i, newdata=sorted,
                             type="response") > 0.5, 0, 1)
        error.i[i] <- sum(sorted$rvar != pred.all.i)/n
    }
    pred.cc <- unlist(pred.c)
    delta.cv.k <- sum(sorted$rvar != pred.cc)/n
    p.k <- countlist/n
    delta.app <- mean(sorted$rvar != pred.all)/n

    delta.acv.k <- delta.cv.k + delta.app - sum(p.k*error.i)

    print(delta.acv.k)
}


-- 
Trevor Wiens 
[EMAIL PROTECTED]



Re: [R] Cross validation, one more time (hopefully the last)

2005-03-16 Thread Trevor Wiens
On Wed, 16 Mar 2005 17:59:01 -0700
Trevor Wiens [EMAIL PROTECTED] wrote:

 I apologize for posting on this question again, but unfortunately, I don't 
 have and can't get access to MASS for at least three weeks. I have found some 
 code on the web however which implements the prediction error algorithm in 
 cv.glm.
 
 http://www.bioconductor.org/workshops/NGFN03/modelsel-exercise.pdf
 
 Now I've tried to adapt it to my purposes, but since I'm not deeply familiar 
 with R programming, I don't know why it doesn't work. Now checking the r-help 
 list faq it seems this is an appropriate question. 
 

OK. I've determined why that didn't work. But I'm still unsure if I've 
implemented the algorithm correctly. Any suggestions for testing would be 
appreciated. The corrected function is attached.

Thanks for your assistance.


logcv <- function(mdata, formula, rvar, fvar) {
    require(Hmisc)

    # determine index of variables
    rpos <- match(rvar, names(mdata))
    fpos <- match(fvar, names(mdata))

    # sort by fold variable
    sorted <- mdata[order(mdata[[fpos]]), ]

    # get fold values and count for each group
    vardesc <- describe(sorted[[fpos]])$values
    fvarlist <- as.integer(dimnames(vardesc)[[2]])
    k <- length(fvarlist)
    countlist <- vardesc[1,1]
    for (i in 2:k)
    {
        countlist[i] <- vardesc[1,i]
    }
    n <- length(sorted[[fpos]])

    # fit to all the mdata
    fit.all <- glm(formula, sorted, family=binomial)
    pred.all <- ifelse(predict(fit.all, type="response") > 0.5, 0, 1)

    # setup
    pred.c <- list()
    error.i <- vector(length=k)

    for (i in 1:k)
    {
        fit.i <- glm(formula, subset(sorted, sorted[[fpos]] != fvarlist[i]),
                     family=binomial)
        pred.i <- ifelse(predict(fit.i, newdata=subset(sorted, sorted[[fpos]] ==
                         fvarlist[i]), type="response") > 0.5, 0, 1)
        pred.c[[i]] <- pred.i
        pred.all.i <- ifelse(predict(fit.i, newdata=sorted,
                             type="response") > 0.5, 0, 1)
        error.i[i] <- sum(sorted[[rpos]] != pred.all.i)/n
    }
    pred.cc <- unlist(pred.c)
    delta.cv.k <- sum(sorted[[rpos]] != pred.cc)/n
    p.k <- countlist/n
    delta.app <- mean(sorted[[rpos]] != pred.all)/n

    delta.acv.k <- delta.cv.k + delta.app - sum(p.k*error.i)

    return(delta.acv.k)
}

--

T
-- 
Trevor Wiens 
[EMAIL PROTECTED]



[R] cross-validation

2005-03-13 Thread Trevor Wiens
I've been looking at the base and Design libraries and it is unclear to me what
the best way is to approach cross-validation. I'm interested in temporal
cross-validation (I have five years of data), spatial cross-validation (I've
divided my data set up into 5 blocks that make sense and have a block variable
attached to my data), and I was also thinking of doing a random
cross-validation to look at general model stability. For the third option I can
use cross-validation or bootstrapping.

If someone can type out a code example, that would be very helpful to me. 

Thanks in advance.

T
-- 
Trevor Wiens 
[EMAIL PROTECTED]

The significant problems that we face cannot be solved at the same 
level of thinking we were at when we created them. 
(Albert Einstein)



Re: [R] cross-validation

2005-03-13 Thread Trevor Wiens
On Sun, 13 Mar 2005 15:28:54 -0700
Trevor Wiens [EMAIL PROTECTED] wrote:

 I've been looking at the stats and Design libraries and it is unclear to me 
 the best way to approach doing cross-validation. I'm interested in using 
 temporal (I have five years of data), spatial (I've divided my data set up 
 into 5 blocks that make sense and have a block variable attached to my data) 
 and I was also thinking of doing a random cross-validation to look at general 
 model stability. For the third option I think either cross-validation or 
 bootstrapping would be appropriate
 
 If someone can type out a really simple code example ( or point me to one), 
 that would be very helpful. 
 
I realized I should have mentioned this is for logistic regression using either 
glm or the Design lrm models.
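
A minimal sketch of what such block (grouped) cross-validation could look like
for a logistic glm; the data frame, the formula and the column names (pres as a
0/1 response, year, block) are all made up for illustration:

block.cv <- function(dat, formula, group, rvar) {
    sapply(unique(dat[[group]]), function(g) {
        train <- dat[dat[[group]] != g, ]
        test  <- dat[dat[[group]] == g, ]
        fit <- glm(formula, data = train, family = binomial)
        p <- predict(fit, newdata = test, type = "response")
        mean((p > 0.5) != test[[rvar]])   # misclassification on held-out block
    })
}

# e.g. block.cv(mydata, pres ~ elev + cover, "year", "pres")    # temporal folds
#      block.cv(mydata, pres ~ elev + cover, "block", "pres")   # spatial folds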

Thanks

T
-- 
Trevor Wiens 
[EMAIL PROTECTED]

The significant problems that we face cannot be solved at the same 
level of thinking we were at when we created them. 
(Albert Einstein)



[R] cross validation

2005-01-21 Thread kolluru ramesh
How do I select the training data set and the test data set from the original
data for performing cross-validation?




Re: [R] cross validation

2005-01-21 Thread Dimitris Rizopoulos
you could do something like this (based on VR's S Programming, pp. 175):
dat <- data.frame(matrix(rnorm(100*6), 100, 6))
#
n <- nrow(dat)
V <- 10 # number of folds
samps <- sample(rep(1:V, length=n), n, replace=FALSE)
#
# Using the first fold:
train <- dat[samps!=1,] # fit the model
test <- dat[samps==1,] # predict
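
To run all V folds rather than just the first, the same samps vector can be
looped over; a small sketch continuing the toy example above (the lm() line is
only a placeholder for whatever model is being assessed):

err <- numeric(V)
for (v in 1:V) {
    train <- dat[samps != v, ]
    test  <- dat[samps == v, ]
    fit <- lm(X1 ~ ., data = train)                    # placeholder model
    err[v] <- mean((test$X1 - predict(fit, test))^2)   # held-out MSE
}
mean(err)
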
I hope it helps.
Best,
Dimitris

Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven
Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/16/336899
Fax: +32/16/337015
Web: http://www.med.kuleuven.ac.be/biostat
http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
- Original Message - 
From: kolluru ramesh [EMAIL PROTECTED]
To: Rpackage help r-help@stat.math.ethz.ch
Sent: Friday, January 21, 2005 11:19 AM
Subject: [R] cross validation


How to select training data set and test data set from the original 
data for performing cross-validation



RE: [R] cross validation

2005-01-21 Thread Liaw, Andy
One way is to create an indicator vector that indicates which `fold' a case
should belong to.  Something like:

fold <- 10
idx <- sample(fold, n, replace=TRUE)   # n = number of cases
for (i in 1:fold) {
    train.dat <- dat[idx != i,]
    test.dat <- dat[idx == i,]
    ...
}

Also see the errorest() function in the ipred package.  It is more careful
in making sure the folds are as close in size as possible, and can do
stratified splits.

Andy


 From: kolluru ramesh
 
 How to select training data set and test data set from the 
 original data for performing cross-validation
 
   


Re: [R] cross validation

2005-01-21 Thread Uwe Ligges
Dimitris Rizopoulos wrote:
you could do something like this (based on VR's S Programming, pp. 175):
dat <- data.frame(matrix(rnorm(100*6), 100, 6))
#
n <- nrow(dat)
V <- 10 # number of folds
samps <- sample(rep(1:V, length=n), n, replace=FALSE)
#
# Using the first fold:
train <- dat[samps!=1,] # fit the model
test <- dat[samps==1,] # predict

Or see ?errorest in the ipred package.
Uwe Ligges

I hope it helps.
Best,
Dimitris

Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven
Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/16/336899
Fax: +32/16/337015
Web: http://www.med.kuleuven.ac.be/biostat
http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
- Original Message - From: kolluru ramesh [EMAIL PROTECTED]
To: Rpackage help r-help@stat.math.ethz.ch
Sent: Friday, January 21, 2005 11:19 AM
Subject: [R] cross validation

How to select training data set and test data set from the original 
data for performing cross-validation




[R] Cross-validation accuracy in SVM

2005-01-20 Thread Ton van Daelen
Hi all -

I am trying to tune an SVM model by optimizing the cross-validation
accuracy. Maximizing this value doesn't necessarily seem to minimize the
number of misclassifications. Can anyone tell me how the
cross-validation accuracy is defined? In the output below, for example,
cross-validation accuracy is 92.2%, while the number of correctly
classified samples is (1476+170)/(1476+170+4) = 99.7% !?

Thanks for any help.

Regards - Ton

---
Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
   cost:  8 
  gamma:  0.007 

Number of Support Vectors:  1015

 ( 148 867 )

Number of Classes:  2 

Levels: 
 false true

5-fold cross-validation on training data:

Total Accuracy: 92.24242 
Single Accuracies:
 90 93.3 94.84848 92.72727 90.30303 

Contingency Table
   predclasses
origclasses false true
  false 1476 0
  true 4   170



RE: [R] Cross-validation accuracy in SVM

2005-01-20 Thread Liaw, Andy
The 99.7% accuracy you quoted, I take it, is the accuracy on the training
set.  If so, that number hardly means anything (other than, perhaps,
self-fulfilling prophecy).  Usually what one would want is for the model to
be able to predict data that weren't used to train the model with high
accuracy.  That's what cross-validation tries to emulate.  It gives you an
estimate of how well you can expect your model to do on data that the model
has not seen.

Andy

 From: Ton van Daelen
 
 Hi all -
 
 I am trying to tune an SVM model by optimizing the cross-validation
 accuracy. Maximizing this value doesn't necessarily seem to 
 minimize the
 number of misclassifications. Can anyone tell me how the
 cross-validation accuracy is defined? In the output below, 
 for example,
 cross-validation accuracy is 92.2%, while the number of correctly
 classified samples is (1476+170)/(1476+170+4) = 99.7% !?
 
 Thanks for any help.
 
 Regards - Ton
 
 ---
 Parameters:
SVM-Type:  C-classification 
  SVM-Kernel:  radial 
cost:  8 
   gamma:  0.007 
 
 Number of Support Vectors:  1015
 
  ( 148 867 )
 
 Number of Classes:  2 
 
 Levels: 
  false true
 
 5-fold cross-validation on training data:
 
 Total Accuracy: 92.24242 
 Single Accuracies:
  90 93.3 94.84848 92.72727 90.30303 
 
 Contingency Table
predclasses
 origclasses false true
   false 1476 0
   true 4   170
 


Re: [R] Cross-validation accuracy in SVM

2005-01-20 Thread Frank E Harrell Jr
Ton van Daelen wrote:
Hi all -
I am trying to tune an SVM model by optimizing the cross-validation
accuracy. Maximizing this value doesn't necessarily seem to minimize the
number of misclassifications. Can anyone tell me how the
cross-validation accuracy is defined? In the output below, for example,
cross-validation accuracy is 92.2%, while the number of correctly
classified samples is (1476+170)/(1476+170+4) = 99.7% !?
Thanks for any help.
Regards - Ton
Percent correctly classified is an improper scoring rule.  The percent 
is maximized when the predicted values are bogus.  In addition, one can 
add a very important predictor and have the % actually decrease.
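
For comparison, a proper scoring rule such as the Brier score is computed from
the predicted probabilities rather than from hard class labels; a tiny made-up
example:

p <- c(0.9, 0.2, 0.7, 0.4)     # predicted P(class == true) for four cases
y <- c(1,   0,   0,   1)       # observed outcomes coded 0/1
mean((p - y)^2)                # Brier score; smaller is better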

Frank Harrell
---
Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
   cost:  8 
  gamma:  0.007 

Number of Support Vectors:  1015
 ( 148 867 )
Number of Classes:  2 

Levels: 
 false true

5-fold cross-validation on training data:
Total Accuracy: 92.24242 
Single Accuracies:
 90 93.3 94.84848 92.72727 90.30303 

Contingency Table
   predclasses
origclasses false true
  false 1476 0
  true 4   170

--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University


[R] Cross-validation for Linear Discrimitant Analysis

2004-09-15 Thread Yu Shao
Hello:

I am new to R and statistics and I have two questions.

First I need help to interpret the cross-validation result from the R
linear discriminant analysis function lda. I did the following:

lda (group ~ Var1 + Var2, CV=T)

where CV=T tells the lda to do cross-validation. The output of lda are
the posterior probabilities among other things, but I can't find an error
term (like delta returned by cv.glm). My question is how to get such an
error term from the output? Can I just simply calculate the prediction
accuracy using the posterior probabilities from the cross-validation, and
use that to measure the quality of the model?
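
For what it is worth, a sketch of that calculation (iris is used purely as a
stand-in for the real data): with CV = TRUE, lda() returns the leave-one-out
predicted classes in the $class component, so

library(MASS)
fit <- lda(Species ~ Sepal.Length + Sepal.Width, data = iris, CV = TRUE)
mean(fit$class != iris$Species)    # leave-one-out misclassification rate
table(iris$Species, fit$class)     # confusion matrix of the LOO predictions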

Another question is more basic: how to determine if a lda model is
significant? (There is no p-value.) Thanks,

Yu Shao

Wadsworth Research Center
Department of Health of New York State
Albany, NY 12208



Re: [R] Cross-validation for Linear Discrimitant Analysis

2004-09-15 Thread Prof Brian Ripley
On Wed, 15 Sep 2004, Yu Shao wrote:

 I am new to R and statistics and I have two questions.

Perhaps then you need to start by explaining why you are using LDA.
Please take a good look at the posting guide.

 First I need help to interpret the cross-validation result from the R
 linear discriminant analysis function lda. 

You mean Professor Ripley's function lda in package MASS, I guess.

 I did the following:
 
 lda (group ~ Var1 + Var2, CV=T)

R allows you to use meaningful names, so please do so.

 where CV=T tells the lda to do cross-validation. The output of lda are
 the posterior probabilities among other things, but I can't find an error
 term (like delta returned by cv.glm). My question is how to get such an
 error term from the output? Can I just simply calculate the prediction
 accuracy using the posterior probabilities from the cross-validation, and
 use that to measure the quality of the model?

cv.glm as in Dr Canty's package boot?  If you are trying to predict
classifications, LDA is not the right tool, and LOO CV probably is not
either.  There is no unique definition of `error term' (true for cv.glm as
well), and people have written whole books about how to assess
classifiers.  LDA is about `discrimination' not `allocation' in the jargon 
used ca 1960.

 Another question is more basic: how to determine if a lda model is
 significant? (There is no p-value.) Thanks,

Please do read the references on the ?lda page.  It's not a useful
question, as LDA is about discriminating between populations and makes the
unrealistic assumption of multivariate normality.  (Analogously for linear
regression, there are ways to test if that is (statistically)
`significant', but knowledgable users almost never do so.)

Perhaps more realistic advice is to suggest you seek some statistical 
consultancy.

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



[R] cross-validation with count-data

2003-11-21 Thread Martin Wegmann
Hello, 

Possibly it is a stupid question, but after a few hours of trying and searching
(perhaps I used the wrong key words) I decided to post it.

I have the output of a glm() fit to count data (Poisson), and I would like to
get the prediction error (cross-validation).

cv.glm() does not work with Poisson error-family data, or else I have to
transform the prediction-error output in some way.

Are there methods, especially for glm() with different error families, for
assessing the model goodness (or badness) of fit?

thanks in advance, Martin
