Re: [R] logistic regression model with non-integer weights
Michael Dewey wrote:
> At 17:12 09/04/06, Ramón Casero Cañas wrote:
> I am not sure what the problem you really want to solve is, but it seems that a) abnormality is rare, and b) the logistic regression predicts it to be rare. If you want a prediction system, why not try different cut-offs (other than 0.5 on the probability scale) and perhaps plot sensitivity and specificity to help to choose a cut-off?

Thanks for your suggestions, Michael. It took me some time to figure out how to do this in R (as trivial as it may be for others). Some comments about what I've done follow, in case anyone is interested.

The problem is that a) abnormality is rare (prevalence = 14%), and b) there is not much difference in the independent variable between abnormal and normal cases. So the logistic regression model predicts at most P(abnormal) = 0.4. I got confused by this, as I expected a cut-off point of P = 0.5 to decide between normal and abnormal. But you are right that another cut-off point can be chosen. For a cut-off of e.g. P(abnormal) = 0.15, Sensitivity = 65% and Specificity = 52%. They are pretty bad, although for clinical purposes I would say that the Positive/Negative Predictive Values are more interesting. But then PPV = 19% and NPV = 90%, which isn't great either.

As an overall test of how good the model is for classification, I have computed the area under the ROC curve, following your suggestion of using Sensitivity and Specificity. I couldn't find how to do this directly in R, so I implemented it myself (it's not difficult, but I'm new here). I tried package ROCR, but apparently it doesn't cover binary outcomes. The area under the ROC is 0.64, so I would say that even though the model seems to fit the data, it just doesn't allow acceptable discrimination, no matter what the cut-off point.

I have also studied the effect of low prevalence. For this, I used the ran.gen option of the boot function (package boot) to define a function that resamples the data so that abnormal and normal cases are balanced. A logistic regression model is fitted to each replicate, as a parametric bootstrap, to compute the bias of the estimates of the model coefficients, beta0 and beta1. This shows a very small bias for beta1, but a rather large bias for beta0. So I would say that prevalence has an effect on beta0, but not on beta1. This is good, because a common measure like the odds ratio depends only on beta1.

Cheers,

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
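For the archives, a minimal sketch of the cut-off sweep and trapezoidal area under the ROC described above. The variable names are just for illustration: p would be the fitted probabilities from predict(fit, type = "response"), and abn the true logical labels.

```r
# Sweep all cut-offs, compute sensitivity and 1-specificity at each,
# and integrate the ROC curve with the trapezoidal rule.
roc_auc <- function(p, abn) {
  cuts <- sort(unique(c(0, p, 1)), decreasing = TRUE)
  sens <- sapply(cuts, function(k) mean(p[abn] >= k))   # sensitivity (TPR)
  fpr  <- sapply(cuts, function(k) mean(p[!abn] >= k))  # 1 - specificity (FPR)
  sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)
}
```

Plotting sens against fpr for the same grid of cut-offs gives the ROC curve itself, which also helps choose a clinically sensible cut-off.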
Re: [R] logistic regression model with non-integer weights
Frank E Harrell Jr wrote:
> This makes me think you are trying to go against maximum likelihood to optimize an improper criterion. Forcing a single cutpoint to be chosen seems to be at the heart of your problem. There's nothing wrong with using probabilities and letting the utility possessor make the final decision.

I agree, and in fact I was thinking along those lines, but I also needed a way of evaluating how good the model is at discriminating between abnormal and normal cases, as opposed to e.g. goodness of fit. The only way I know of is the area under the ROC curve (thus setting cut-off points), which also followed neatly from Michael Dewey's comments. Any alternatives would be welcome :)

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog
Re: [R] permutation of rows of a matrix
javargas wrote:
> How can I generate a random permutation between rows of a matrix M (of m rows and n columns)?

Maybe

M[ sample(1:m), ]

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog
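A quick illustration with a toy 4x3 matrix (names invented): sample(1:m) returns a random reordering of the row indices, so indexing with it shuffles the rows while keeping each row intact.

```r
m <- 4; n <- 3
M <- matrix(1:(m * n), nrow = m, ncol = n)  # toy matrix, rows 1..4
P <- M[sample(1:m), ]                       # rows of M in random order
```

Each row of M appears exactly once in P; only the order changes.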
Re: [R] editor for Ubuntu
camille wrote:
> I am new on Ubuntu. I would like to use R, but I tried Kate and SciTE. The first one keeps trying to use KDE applications, while the other does not understand the language. I have searched for another editor for hours, in vain. Which editor should work with Ubuntu?

I use Emacs (which is a general-purpose editor and much more) and ESS (Emacs Speaks Statistics, a mode supporting R, S and others). To install ESS in Ubuntu do

$ sudo apt-get install ess

There's a bit of a learning curve to Emacs, but it's a good tool, especially if you are writing documents in LaTeX too.

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog
Re: [R] how to count the columns of a data.frame
giacomo moro wrote:
> I would like to count the columns of a data.frame. I know how to count the rows, but not the columns. Can someone tell me how to do it?

ncol(data.frame)

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog
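For completeness, a tiny sketch (data frame invented for illustration) of the related dimension functions:

```r
d <- data.frame(x = 1:5, y = letters[1:5])
ncol(d)   # number of columns: 2
nrow(d)   # number of rows: 5
dim(d)    # both at once: 5 2
```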
Re: [R] Bootstrap and Jackknife Bias using Survey Package
Carlos Creva Singano (M2004078) wrote:
> 1. How to compute Bootstrap and Jackknife Bias of estimates, like mean?

Have you had a look at packages boot and bootstrap? E.g. you can compute the bias and s.e. of an estimate theta using

library(boot)
a <- boot(data, theta, R = 1000)

where boot is a function of the boot package, and theta is a function that you can define yourself. In your case, theta can simply be a small wrapper around mean (boot passes the data and an index vector to the statistic).

> 2. How to see each replicate estimate of the parameter?

In the example above, they can be found in the boot object a, in:

a$t

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog
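A minimal sketch of the bias computation for the mean, on simulated data. The statistic handed to boot() must accept the data and an index vector, hence the small wrapper:

```r
library(boot)

set.seed(1)
x <- rnorm(100)                        # invented example data
theta <- function(d, i) mean(d[i])     # statistic: mean of the resampled data
a <- boot(x, theta, R = 1000)          # 1000 bootstrap replicates
bias <- mean(a$t) - a$t0               # bootstrap estimate of the bias
# a$t holds the R replicate estimates; a$t0 is the estimate on the original data
```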
[R] logistic regression model with non-integer weights
When fitting a logistic regression model using weights I get the following warning:

data.model.w <- glm(ABN ~ TR, family=binomial(logit), weights=WEIGHT)
Warning message:
non-integer #successes in a binomial glm! in: eval(expr, envir, enclos)

Details follow.

I have a binary dependent variable of abnormality

ABN = T, F, T, T, F, F, F, ...

and a continuous predictor

TR = 1.962752, 1.871123, 1.893543, 1.685001, 2.121500, ...

As the number of abnormal cases (ABN == T) is only 14%, and there is large overlap between abnormal and normal cases, the logistic regression curve found by glm is always much closer to the normal cases than to the abnormal cases. In particular, the predicted probability of abnormal is at most 0.4.

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.7607     0.7196   1.057   0.2905
TR2          -1.4853     0.4328  -3.432   0.0006 ***
---

I would like to compensate for the fact that the a priori probability of abnormal cases is so low. I have created a weight vector

WEIGHT <- ABN
WEIGHT[ ABN == TRUE ]  <- 1 / na / 2
WEIGHT[ ABN == FALSE ] <- 1 / nn / 2

so that all weights add up to 1, where ``na'' is the number of abnormal cases and ``nn'' is the number of normal cases. That is, normal cases carry less weight in the model fitting because there are so many of them.

But then I get the warning message at the beginning of this email, and I suspect that I'm doing something wrong. Must weights be integers, or at least greater than one?

Regards,

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog
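For reference, a toy reproduction of the warning (data and names invented): with family = binomial, glm interprets weights as binomial sample sizes, i.e. numbers of trials per observation, so weights times successes must be whole numbers, and non-integer values are flagged.

```r
abn <- c(TRUE, FALSE, TRUE, FALSE, FALSE)   # invented binary outcome
tr  <- c(1.9, 1.8, 2.1, 1.7, 2.0)           # invented continuous predictor
w   <- ifelse(abn, 0.25, 1/6)               # non-integer case weights
fit <- glm(abn ~ tr, family = binomial(logit), weights = w)
# Warning: non-integer #successes in a binomial glm!
```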
Re: [R] More Logistic Regression Tools?
Frank E Harrell Jr wrote:
> Eric Rescorla [EMAIL PROTECTED] wrote:
>> (2) I'd like to compute goodness-of-fit statistics for my fit (Hosmer-Lemeshow, Pearson, etc.). I didn't see a package that did this. Have I missed one?
> Hosmer-Lemeshow has low power and relies on arbitrary binning of predicted probabilities. The Hosmer-Le Cessie omnibus test is unique and usually has more power. To get it:
> f <- update(f, x=TRUE, y=TRUE)
> resid(f, 'gof') # uses residuals.lrm

The documentation of the Design package, section residuals.lrm, says:

  "Details: For the goodness-of-fit test, the le Cessie-van Houwelingen normal test statistic for the unweighted sum of squared errors (Brier score times n) is used."

It is not clear to me whether the test implemented is for the statistic with a constant kernel described in

  S. le Cessie and J.C. van Houwelingen. A goodness-of-fit test for binary regression models, based on smoothing methods. Biometrics, 47:1267-1282, Dec 1991.

or the variation in

  D.W. Hosmer, T. Hosmer, S. le Cessie, and S. Lemeshow. A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 16:965-980, 1997.

Sorry, I couldn't get this from looking at the code either (I'm quite new to R). Is the statistic tested the T defined by le Cessie and van Houwelingen?

Regards,

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog
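For anyone finding this in the archives, a sketch of the full recipe Frank describes, assuming the Design package (nowadays superseded by rms, where the same calls work) and invented variable names ABN and TR:

```r
library(Design)                 # or library(rms) in current R

# Fit with lrm(), keeping the design matrix (x) and response (y) so
# that residuals.lrm() can compute the le Cessie-van Houwelingen
# goodness-of-fit statistic.
f <- lrm(ABN ~ TR, x = TRUE, y = TRUE)
resid(f, type = "gof")          # omnibus goodness-of-fit test
```

Fitting with x = TRUE, y = TRUE from the start is equivalent to the update(f, x=TRUE, y=TRUE) step in the quoted reply.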
[R] rate instead of scale in ?ks.test
I am not sure whether I'm doing something wrong or there is a bug in the documentation of ks.test. Following the posting guide: as I'm not sure, I haven't found this in the bug tracker, and the R FAQ says that stats is an add-on package of R, I think this is the place to send it.

?ks.test provides the example:

  # Does x come from a shifted gamma distribution with shape 3 and scale 2?
  ks.test(x+2, "pgamma", 3, 2) # two-sided

Maybe it should say ``with shape 3 and rate 2''? If I do

  x <- rgamma(1e6, shape = 3, scale = 2)
  ks.test( x, 'pgamma', 3, 2 )

          One-sample Kolmogorov-Smirnov test
  data:  x
  D = 0.7513, p-value < 2.2e-16
  alternative hypothesis: two.sided

whereas with

  y <- rgamma(1e6, shape = 3, rate = 2)
  ks.test( y, 'pgamma', 3, 2 )

          One-sample Kolmogorov-Smirnov test
  data:  y
  D = 5e-04, p-value = 0.9469
  alternative hypothesis: two.sided

If forcing the parameter meaning:

  ks.test( x, 'pgamma', shape = 3, scale = 2 )

          One-sample Kolmogorov-Smirnov test
  data:  x
  D = 9e-04, p-value = 0.3645
  alternative hypothesis: two.sided

> version
platform i386-pc-linux-gnu
arch     i386
os       linux-gnu
system   i386, linux-gnu
status
major    2
minor    1.1
year     2005
month    06
day      20
language R

> packageDescription("stats")
Package: stats
Version: 2.1.1
Priority: base
Title: The R Stats Package
Author: R Development Core Team and contributors worldwide
Maintainer: R Core Team [EMAIL PROTECTED]
Description: R statistical functions
License: GPL Version 2 or later.
Built: R 2.1.1; i386-pc-linux-gnu; 2005-06-29 23:30:55; unix

-- File: /usr/lib/R/library/stats/DESCRIPTION

Cheers,

--
Ramón Casero Cañas
web: http://www.robots.ox.ac.uk/~rcasero/
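The point can be checked on pgamma directly: its third positional argument is rate (the signature is pgamma(q, shape, rate = 1, scale = 1/rate, ...)), so the unnamed 2 in the help example is taken as a rate, not a scale:

```r
pgamma(1, 3, 2)                    # third positional argument...
pgamma(1, shape = 3, rate = 2)     # ...matches rate = 2
pgamma(1, shape = 3, scale = 2)    # and differs from scale = 2
```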
Re: [R] update MASS
Prof Brian Ripley wrote:
> On Sat, 1 Oct 2005, Ramón Casero Cañas wrote:
>> I'd like to update MASS from version 7.2-11 to version 7.2-19. I am running R 2.0.1 on Ubuntu, which installs MASS with the package r-cran-vr.
> First, MASS is part of VR, so it is VR you update.

Thanks Prof Ripley, and Kjetil too, for your help. I am sure this was obvious to anyone but me, but I found the wording in http://www.stats.ox.ac.uk/pub/MASS4/Software.html#RUnix a bit confusing. May I suggest that where it says ``To get the very latest version, use update.packages() from an R session'', it could be added: ``Note also that the current version of your R base system may limit the update, i.e. if the R base system you are running is too old, you will not be able to get the latest version of MASS. You can check the dependencies of MASS in the Description file of VR at CRAN (http://cran.r-project.org/src/contrib/Descriptions/VR.html)''. And maybe right before ``As the VR bundle is now recommended'' it could say ``MASS does not have its own package, but comes bundled with other libraries in a package called VR. Thus if you want to use/update MASS you need to install/update VR''.

> Your R is seriously out of date. You might want to wait a week and then install 2.2.0.

Searching a bit more I have found that this problem has been brought up in the Ubuntu forums, http://ubuntuforums.org/archive/index.php/t-49178.html, i.e. that the GNU R version shipped there is too old. As a temporary fix, it is suggested to install the sarge .deb packages from http://cran.r-project.org/ under Linux/debian/stable, overlooking the dependencies error with libc6. I'll update to 2.2.0 in a week, but for the moment I did the former and it seems to work with the code I was writing.

Cheers,

--
Ramón Casero Cañas
web: http://www.robots.ox.ac.uk/~rcasero/
[R] update MASS
I'd like to update MASS from version 7.2-11 to version 7.2-19. I am running R 2.0.1 on Ubuntu, which installs MASS with the package r-cran-vr.

I have tried doing update.packages(), and I get a list of packages that I could update, but none is called MASS. I have tried the following too, right after launching emacs, but to no avail:

> update.packages(MASS)
trying URL `http://cran.r-project.org/src/contrib/PACKAGES'
Content type `text/plain; charset=iso-8859-1' length 65650 bytes
opened URL
==
downloaded 64Kb
Error in old.packages(lib.loc = lib.loc, contriburl = contriburl, method = method, :
        no installed.packages for (invalid?) lib.loc=MASS

I have the package installed in /usr/lib/R/library/MASS. I can do

> library(MASS)
> search()
 [1] ".GlobalEnv"        "package:MASS"      "package:methods"
 [4] "package:stats"     "package:graphics"  "package:grDevices"
 [7] "package:utils"     "package:datasets"  "Autoloads"
[10] "package:base"

How could I update the package?

Thanks,

--
Ramón Casero Cañas
web: http://www.robots.ox.ac.uk/~rcasero/
[R] logistic regression with nominal predictors
(Sorry for obvious mistakes, as I am quite a newbie with no Statistics background.)

My question is going to be: what is the gain of logistic regression over odds ratios when none of the input variables is continuous?

My experiment:

Outcome: ordinal scale, ``quality'' (QUA = 1, 2, 3)
Predictors: ``segment'' (SEG) and ``stress'' (STR). SEG is nominal scale with 24 levels, and STR is dichotomous (0, 1).

Considering the outcome continuous, two-way ANOVA with

aov(as.integer(QUA) ~ SEG * STR)

doesn't find evidence of interaction between SEG and STR, and they are significant on their own. This is the result that we would expect from clinical knowledge.

I use

xtabs(~QUA+SEG, data=data2.df, subset=STR==0)
xtabs(~QUA+SEG, data=data2.df, subset=STR==1)

for the contingency tables. There are zero cells, and for some values of SEG there is only one non-zero cell, i.e. some values of SEG determine the outcome with certainty. So initially I was thinking of a proportional odds logistic regression model, but following Hosmer and Lemeshow [1], zero cells are problematic. So I take the deterministic values of SEG out of the data table, and I pool QUA=2 and QUA=3, and now I have a dichotomous outcome (QUA = Good/Bad) and no zero cells. The following model doesn't find evidence of interaction

glm(QUA ~ STR * SEG, data=data3.df, family=binomial)

so I go for

glm(QUA ~ STR + SEG, data=data3.df, family=binomial)

(I suppose that what glm does is to create design variables for SEG, where 0 0 ... 0 is for the first value of SEG, 1 0 ... 0 for the second value, 0 1 0 ... 0 for the third, etc.)

Coefficients:
              Estimate Std. Error   z value Pr(>|z|)
(Intercept) -1.085e+00  1.933e-01    -5.614 1.98e-08 ***
STR.L        2.112e-01  6.373e-02     3.314 0.000921 ***
SEGP2C.MI   -9.869e-01  3.286e-01    -3.004 0.002669 **
SEGP2C.AI   -1.306e+00  3.585e-01    -3.644 0.000269 ***
SEGP2C.AA   -1.743e+00  4.123e-01    -4.227 2.37e-05 ***
[shortened]
SEGP4C.ML   -5.657e-01  2.990e-01    -1.892 0.058485 .
SEGP4C.BL   -2.908e-16  2.734e-01 -1.06e-15 1.00
SEGSAX.MS    1.092e-01  2.700e-01     0.405 0.685772
SEGSAX.MAS  -5.441e-16  2.734e-01 -1.99e-15 1.00
SEGSAX.MA    7.130e-01  2.582e-01     2.761 0.005758 **
SEGSAX.ML    1.199e+00  2.565e-01     4.674 2.96e-06 ***
SEGSAX.MP    1.313e+00  2.570e-01     5.108 3.26e-07 ***
SEGSAX.MI    8.865e-01  2.569e-01     3.451 0.000558 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3462.0 on 3123 degrees of freedom
Residual deviance: 3012.6 on 3101 degrees of freedom
AIC: 3058.6

Number of Fisher Scoring iterations: 6

Even though some coefficients show no evidence of statistical significance, the model requires them from a clinical point of view.

At this point, the question would be how to interpret these results, and what advantage they offer over odds ratios. From [1] I can understand that in the case of a dichotomous and a continuous predictor, you can adjust for the continuous variable. But when all predictors are dichotomous (due to the design variables), I don't quite see the effect of adjustment. Wouldn't it be better just to split the data into two groups (STR=0 and STR=1), and instead of using logistic regression, use odds ratios for each value of SEG?

Cheers,
Ramón.

[1] D.W. Hosmer and S. Lemeshow. ``Applied Logistic Regression''. John Wiley. 2000.

--
Ramón Casero Cañas
web: http://www.robots.ox.ac.uk/~rcasero/
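On the design-variable guess above: the dummy coding can be inspected with model.matrix(), which under R's default treatment contrasts expands a k-level factor into k-1 indicator columns, with the first level as the all-zeros reference (toy 3-level factor invented for illustration):

```r
seg <- factor(c("A", "B", "C", "A"))   # toy nominal predictor
model.matrix(~ seg)
# Columns: (Intercept), segB, segC.
# Level "A" (the reference) is coded 0 0; "B" is 1 0; "C" is 0 1.
```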