Re: [R] logistic regression model with non-integer weights

2006-04-16 Thread Ramón Casero Cañas
Michael Dewey wrote:
 At 17:12 09/04/06, Ramón Casero Cañas wrote:
 
 I am not sure what the problem you really want to solve is but it seems
 that
 a) abnormality is rare
 b) the logistic regression predicts it to be rare.
 If you want a prediction system why not try different cut-offs (other
 than 0.5 on the probability scale) and perhaps plot sensitivity and
 specificity to help to choose a cut-off?

Thanks for your suggestions, Michael. It took me some time to figure out
how to do this in R (as trivial as it may be for others). Some comments
about what I've done follow, in case anyone is interested.

The problem is that a) abnormality is rare (prevalence = 14%) and b) there is
not much difference in the independent variable between abnormal and
normal cases. So the logistic regression model predicts P(abnormal) of at
most about 0.4. I got confused by this, as I expected a cut-off point of
P=0.5 to decide between normal/abnormal. But you are right that another
cut-off point can be chosen.

For a cut-off of e.g. P(abnormal)=0.15, Sensitivity=65% and
Specificity=52%. These are pretty bad, although for clinical purposes I
would say that the Positive/Negative Predictive Values are more
interesting. But then PPV=19% and NPV=90%, which isn't great either. As an
overall measure of how good the model is for classification I have
computed the area under the ROC curve, following your suggestion of using
Sensitivity and Specificity.

I couldn't find how to do this directly with R, so I implemented it
myself (it's not difficult but I'm new here). I tried with package ROCR,
but apparently it doesn't cover binary outcomes.

The area under the ROC curve is 0.64, so I would say that even though the
model seems to fit the data, it just doesn't allow acceptable
discrimination, no matter what the cut-off point.
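For reference, this is roughly the calculation I mean (a minimal sketch,
not my exact code; data.model stands for the fitted glm object and ABN for
the logical outcome vector):

p <- predict(data.model, type = "response")   # fitted P(abnormal)
cutoffs <- sort(unique(c(0, p, 1)))
sens <- sapply(cutoffs, function(k) sum(p >= k & ABN) / sum(ABN))
spec <- sapply(cutoffs, function(k) sum(p <  k & !ABN) / sum(!ABN))
fpr <- 1 - spec
# area under the ROC curve by the trapezoidal rule
o <- order(fpr)
auc <- sum(diff(fpr[o]) * (head(sens[o], -1) + tail(sens[o], -1)) / 2)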


I have also studied the effect of the low prevalence. For this, I used the
ran.gen argument of the boot function (package boot) to define a function
that resamples the data so that abnormal and normal cases are balanced.

A logistic regression model is fitted to each replicate (a parametric
bootstrap), and from the replicates I compute the bias of the estimates of
the model coefficients, beta0 and beta1. This shows a very small bias for
beta1, but a rather large bias for beta0.
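In case anyone is curious, a rough sketch of the resampling scheme (not my
exact code; the data frame name dat is made up, with logical ABN and
continuous TR):

library(boot)

# generate a balanced replicate: resample abnormal and normal cases separately
balanced.gen <- function(data, mle) {
  abn <- data[data$ABN, ]
  nor <- data[!data$ABN, ]
  n2  <- nrow(data) %/% 2
  rbind(abn[sample(nrow(abn), n2, replace = TRUE), ],
        nor[sample(nrow(nor), n2, replace = TRUE), ])
}

# refit the logistic regression on each replicate and keep the coefficients
coef.stat <- function(data) {
  coef(glm(ABN ~ TR, family = binomial(logit), data = data))
}

b <- boot(dat, coef.stat, R = 1000, sim = "parametric", ran.gen = balanced.gen)
colMeans(b$t) - b$t0     # bias of (beta0, beta1)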

So I would say that prevalence has an effect on beta0, but not beta1.
This is good, because a common measure like the odds ratio depends only
on beta1.

Cheers,

-- 
Ramón Casero Cañas

http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog


Re: [R] logistic regression model with non-integer weights

2006-04-16 Thread Ramón Casero Cañas
Frank E Harrell Jr wrote:
 
 This makes me think you are trying to go against maximum likelihood to
 optimize an improper criterion.  Forcing a single cutpoint to be chosen
 seems to be at the heart of your problem.  There's nothing wrong with
 using probabilities and letting the utility possessor make the final
 decision.

I agree, and in fact I was thinking along those lines, but I also needed
a way of evaluating how good the model is at discriminating between
abnormal and normal cases, as opposed to e.g. goodness of fit. The only
way I know of is using the area under the ROC curve (thus setting cut-off
points), which also followed neatly from Michael Dewey's comments. Any
alternatives would be welcome :)

-- 
Ramón Casero Cañas

http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog


Re: [R] permutation of rows of a matrix

2006-04-15 Thread Ramón Casero Cañas
javargas wrote:
 How can I generate a random permutation between rows of a matrix M (of m rows 
 and n columns)?

maybe

M[ sample(1:m), ]
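
For example, with a toy matrix:

m <- 4; n <- 3
M <- matrix(1:(m * n), nrow = m, ncol = n)
M[ sample(1:m), ]    # the rows of M in a random order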

-- 
Ramón Casero Cañas

http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog


Re: [R] editor for Ubuntu

2006-04-14 Thread Ramón Casero Cañas
camille wrote:
 
 I am new on Ubuntu. I would like to use R, but I tried Kate and Scite. 
 The first one keeps trying to use KDE applications,while the other does 
 not understand the language. I have searched for another editor for 
 hours, in vain. Which editor should work with Ubuntu?

I use Emacs (which is a general purpose editor and much more) and ESS
(Emacs Speaks Statistics, a mode supporting R, S and others).

To install ESS in Ubuntu do

$ sudo apt-get install ess

There's a bit of a learning curve with Emacs, but it's a good tool,
especially if you are writing documents in LaTeX too.

-- 
Ramón Casero Cañas

http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog


Re: [R] how to count the columns of a data.frame

2006-04-14 Thread Ramón Casero Cañas
giacomo moro wrote:
 
 I would like to count the columns of a data.frame. I know how to
 count the rows, but not the columns. Can someone tell me how to do
 it?

ncol(data.frame)
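
For example:

df <- data.frame(a = 1:5, b = letters[1:5])
ncol(df)    # 2 columns
nrow(df)    # 5 rows
dim(df)     # both at once: 5 2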

-- 
Ramón Casero Cañas

http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog


Re: [R] Bootstrap and Jackknife Bias using Survey Package

2006-04-11 Thread Ramón Casero Cañas
Carlos Creva Singano (M2004078) wrote:
 
 1. How to compute Bootstrap and Jackknife Bias of estimates, like mean?

Have you had a look at packages boot and bootstrap? E.g. you can
compute the bias and s.e. of an estimate theta using the bootstrap:

library(boot)
a <- boot( data, theta, R=1000 )

where boot is a function of the boot package, and theta is a function that
you define yourself (with package boot it must accept the data and a vector
of resampling indices). In your case theta can simply return the mean of
the resampled data.

 2. How to see each replicate estimate of the parameter?

In the example above, they can be found in the boot object a in:

a$t
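
For example, a minimal sketch bootstrapping the mean of some toy data (the
bootstrap estimate of the bias is the mean of the replicates minus the
original estimate):

library(boot)
x <- rnorm(100)                       # toy data
theta <- function(d, i) mean(d[i])    # statistic taking data and indices
a <- boot(x, theta, R = 1000)
a$t                                   # the R replicate estimates
mean(a$t) - a$t0                      # bootstrap estimate of the bias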

-- 
Ramón Casero Cañas

http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog


[R] logistic regression model with non-integer weights

2006-04-09 Thread Ramón Casero Cañas

When fitting a logistic regression model using weights I get the
following warning

> data.model.w <- glm(ABN ~ TR, family=binomial(logit), weights=WEIGHT)
Warning message:
non-integer #successes in a binomial glm! in: eval(expr, envir, enclos)

Details follow

***

I have a binary dependent variable of abnormality

ABN = T, F, T, T, F, F, F...

and a continous predictor

TR = 1.962752 1.871123 1.893543 1.685001 2.121500, ...



As the number of abnormal cases (ABN==T) is only 14%, and there is a large
overlap between abnormal and normal cases, the logistic regression fitted
by glm is always much closer to the normal cases than to the abnormal
cases. In particular, the predicted probability of abnormal is at most 0.4.

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.7607 0.7196   1.057   0.2905
TR2  -1.4853 0.4328  -3.432   0.0006 ***
---

I would like to compensate for the fact that the a priori probability of
abnormal cases is so low. I have created a weight vector

> WEIGHT <- ABN
> WEIGHT[ ABN == TRUE ] <-  1 / na / 2
> WEIGHT[ ABN == FALSE ] <-  1 / nn / 2

so that all weights add up to 1, where ``na'' is the number of abnormal
cases, and ``nn'' is the number of normal cases. That is, normal cases
have less weight in the model fitting because there are so many.
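
(For reference, na and nn are simply

na <- sum(ABN)     # number of abnormal cases
nn <- sum(!ABN)    # number of normal cases

assuming ABN is a logical vector.)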

But then I get the warning message at the beginning of this email, and I
suspect that I'm doing something wrong. Must weights be integers, or at
least greater than one?

Regards,

-- 
Ramón Casero Cañas

http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog


Re: [R] More Logistic Regression Tools?

2006-04-06 Thread Ramón Casero Cañas
Frank E Harrell Jr wrote:
 Eric Rescorla [EMAIL PROTECTED] wrote:
 
 (2) I'd like to compute goodness-of-fit statistics for my fit
 (Hosmer-Lemeshow, Pearson, etc.). I didn't see a package that
 did this. Have I missed one?
 
 Hosmer-Lemeshow has low power and relies on arbitrary binning of
 predicted probabilities. The Hosmer-Le Cessie omnibus test is unique
 and has more power usually. To get it:
 
 f <- update(f, x=T, y=T)
 resid(f, 'gof') # uses residuals.lrm 


The documentation of the Design package (section residuals.lrm, Details)
says:

  For the goodness-of-fit test, the le Cessie-van Houwelingen normal
  test statistic for the unweighted sum of squared errors (Brier score
  times n) is used.

It is not clear to me whether the test implemented is for the statistic
with a constant kernel described in

S. le Cessie and J.C. van Houwelingen. A goodness-of-fit test for binary
regression models, based on smoothing methods. Biometrics, 47:1267-1282,
Dec 1991.

or the variation in

D.W. Hosmer, T. Hosmer, S. Le Cessie, and S. Lemeshow. A comparison of
goodness-of-fit tests for the logistic regression model. Statistics in
Medicine, 16:965-980, 1997.

Sorry, I couldn't get this from looking at the code either (I'm quite
new to R). Is the statistic tested the T defined by le Cessie and van
Houwelingen?

Regards,

-- 
Ramón Casero Cañas

http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog


[R] rate instead of scale in ?ks.test

2005-10-02 Thread Ramón Casero Cañas

I am not sure whether I'm doing something wrong or whether there is a bug
in the documentation of ks.test. Following the posting guide: as I'm not
sure, I haven't found this in the bug tracker, and the R FAQ says that
stats is an add-on package of R, I think this is the place to send it.


?ks.test provides the example

# Does x come from a shifted gamma distribution with shape 3 and scale 2?
ks.test(x+2, pgamma, 3, 2) # two-sided


Maybe it should say ``with shape 3 and rate 2''?


If I do

> x <- rgamma(1e6, shape = 3, scale = 2)
> ks.test( x, 'pgamma', 3, 2 )

One-sample Kolmogorov-Smirnov test

data:  x
D = 0.7513, p-value < 2.2e-16
alternative hypothesis: two.sided


whereas with


> y <- rgamma(1e6, shape = 3, rate = 2)
> ks.test( y, 'pgamma', 3, 2 )

One-sample Kolmogorov-Smirnov test

data:  y
D = 5e-04, p-value = 0.9469
alternative hypothesis: two.sided


If I force the parameter meaning:

> ks.test( x, 'pgamma', shape = 3, scale = 2 )

One-sample Kolmogorov-Smirnov test

data:  x
D = 9e-04, p-value = 0.3645
alternative hypothesis: two.sided



> version
platform i386-pc-linux-gnu
arch i386
os   linux-gnu
system   i386, linux-gnu
status
major2
minor1.1
year 2005
month06
day  20
language R

> packageDescription( "stats" )
Package: stats
Version: 2.1.1
Priority: base
Title: The R Stats Package
Author: R Development Core Team and contributors worldwide
Maintainer: R Core Team [EMAIL PROTECTED]
Description: R statistical functions
License: GPL Version 2 or later.
Built: R 2.1.1; i386-pc-linux-gnu; 2005-06-29 23:30:55; unix

-- File: /usr/lib/R/library/stats/DESCRIPTION

Cheers,

-- 
Ramón Casero Cañas

web:http://www.robots.ox.ac.uk/~rcasero/



Re: [R] update MASS

2005-10-01 Thread Ramón Casero Cañas
Prof Brian Ripley wrote:
 On Sat, 1 Oct 2005, Ramón Casero Cañas wrote:
 
 I'd like to update MASS from version 7.2-11 to version 7.2-19. I am
 running R 2.0.1 on Ubuntu, which installs MASS with the package r-cran-vr.
 
 
 First, MASS is part of VR, so it is VR you update.

Thanks Prof Ripley, and Kjetil too, for your help. I am sure this was
obvious to anyone but me, but I found the wording in

http://www.stats.ox.ac.uk/pub/MASS4/Software.html#RUnix

a bit confusing. May I suggest that where it says ``To get the very
latest version, use update.packages() from an R session'', it could be
added: ``Note also that the current version of your R base system may
limit the update, i.e. if the R base system you are running is too old,
you will not be able to get the latest version of MASS. You can check
the dependencies of MASS from the Description file of VR at CRAN
(http://cran.r-project.org/src/contrib/Descriptions/VR.html)''.

And maybe right before ``As the VR bundle is now recommended'' it could
say ``MASS does not have its own package, but comes bundled with other
libraries in a package called VR. Thus if you want to use/update MASS
you need to install/update VR''.

 Your R is seriously out of date.  You might want to wait a week and
 then install 2.2.0.

Searching a bit more I found that this problem has been brought up in
the Ubuntu forums

http://ubuntuforums.org/archive/index.php/t-49178.html

i.e. that the GNU R version is too old. As a temporary fix, it is
suggested to install the sarge .deb packages from
http://cran.r-project.org/ (under Linux/debian/stable), overlooking the
dependency error with libc6.

I'll update to 2.2.0 in a week, but for the moment I did the above and
it seems to work with the code I was writing.

Cheers,

-- 
Ramón Casero Cañas

web:http://www.robots.ox.ac.uk/~rcasero/



[R] update MASS

2005-09-30 Thread Ramón Casero Cañas

I'd like to update MASS from version 7.2-11 to version 7.2-19. I am
running R 2.0.1 on Ubuntu, which installs MASS with the package r-cran-vr.

I have tried doing update.packages(), and I get a list of packages that
I could update, but none is called MASS.

I have tried the following too, right after launching emacs, but to no
avail.

> update.packages(MASS)
trying URL `http://cran.r-project.org/src/contrib/PACKAGES'
Content type `text/plain; charset=iso-8859-1' length 65650 bytes
opened URL
==
downloaded 64Kb

Error in old.packages(lib.loc = lib.loc, contriburl = contriburl, method
= method,  :
no installed.packages for (invalid?) lib.loc=MASS


I have the package installed in /usr/lib/R/library/MASS. I can do

> library(MASS)
> search()
 [1] ".GlobalEnv"        "package:MASS"      "package:methods"
 [4] "package:stats"     "package:graphics"  "package:grDevices"
 [7] "package:utils"     "package:datasets"  "Autoloads"
[10] "package:base"


How could I update the package?

Thanks,

-- 
Ramón Casero Cañas

web:http://www.robots.ox.ac.uk/~rcasero/



[R] logistic regression with nominal predictors

2005-09-13 Thread Ramón Casero Cañas

(Sorry for obvious mistakes, as I am quite a newbie with no Statistics
background.)

My question is: what is the gain of logistic regression over odds ratios
when none of the input variables is continuous?


My experiment:
 Outcome: ordinal scale, ``quality'' (QUA=1,2,3)
 Predictors: ``segment'' (SEG) and ``stress'' (STR). SEG is
 nominal scale with 24 levels, and STR is dichotomous (0,1).



Considering the outcome as continuous, a two-way ANOVA with

aov(as.integer(QUA) ~ SEG * STR)

doesn't find evidence of interaction between SEG and STR, and they are
significant on their own. This is the result that we would expect from
clinical knowledge.



I use

xtabs(~QUA+SEG, data=data2.df, subset=STR==0)
xtabs(~QUA+SEG, data=data2.df, subset=STR==1)

for the contingency tables. There are zero cells, and for some values of
SEG there is only one non-zero cell, i.e. some values of SEG determine the
outcome with certainty.

So initially I was thinking of a proportional odds logistic regression
model, but following Hosmer and Lemeshow [1], zero cells are problematic.
So I take the deterministic values of SEG out of the data table, and I
pool QUA=2 and QUA=3, so that now I have a dichotomous outcome
(QUA = Good/Bad) and no zero cells.

The following model doesn't find evidence of interaction

glm(QUA ~ STR * SEG, data=data3.df, family=binomial)

so I go for

glm(QUA ~ STR + SEG, data=data3.df, family=binomial)


(I suppose that what glm does is to create design variables for SEG,
where 0 0 ... 0 is for the first value of SEG, 1 0 ... 0 for the second
value, 0 1 0 ... 0 for the third, etc).
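
A quick way to check this (just a sketch, assuming SEG is stored as a
factor in data3.df):

head(model.matrix(~ STR + SEG, data = data3.df))  # the design (dummy) variables
contrasts(data3.df$SEG)   # treatment contrasts: the first level is the reference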

Coefficients:
  Estimate Std. Error   z value Pr(>|z|)
(Intercept) -1.085e+00  1.933e-01-5.614 1.98e-08 ***
STR.L2.112e-01  6.373e-02 3.314 0.000921 ***
SEGP2C.MI   -9.869e-01  3.286e-01-3.004 0.002669 **
SEGP2C.AI   -1.306e+00  3.585e-01-3.644 0.000269 ***
SEGP2C.AA   -1.743e+00  4.123e-01-4.227 2.37e-05 ***
[shortened]
SEGP4C.ML   -5.657e-01  2.990e-01-1.892 0.058485 .
SEGP4C.BL   -2.908e-16  2.734e-01 -1.06e-15 1.00
SEGSAX.MS1.092e-01  2.700e-01 0.405 0.685772
SEGSAX.MAS  -5.441e-16  2.734e-01 -1.99e-15 1.00
SEGSAX.MA7.130e-01  2.582e-01 2.761 0.005758 **
SEGSAX.ML1.199e+00  2.565e-01 4.674 2.96e-06 ***
SEGSAX.MP1.313e+00  2.570e-01 5.108 3.26e-07 ***
SEGSAX.MI8.865e-01  2.569e-01 3.451 0.000558 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 3462.0  on 3123  degrees of freedom
Residual deviance: 3012.6  on 3101  degrees of freedom
AIC: 3058.6

Number of Fisher Scoring iterations: 6


Even though some coefficients have no evidence of statistical
significance, the model requires them from a clinical point of view.

At this point, the question would be how to interpret these results, and
what advantage they offer over odds ratios. From [1] I can understand
that in the case of a dichotomous and a continuous predictor, you can
adjust for the continuous variable.

But when all the predictors are dichotomous (due to the design variables),
I don't quite see the effect of adjustment. Wouldn't it be better just to
split the data into two groups (STR=0 and STR=1) and, instead of using
logistic regression, use odds ratios for each value of SEG?
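
To make the comparison concrete, this is the kind of thing I mean (a
sketch with made-up object names; "P2C.MI" is just one of the SEG levels):

# crude odds ratio of quality vs. stress within one segment, no adjustment
tab <- xtabs(~ STR + QUA, data = data3.df, subset = SEG == "P2C.MI")
(tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

whereas the logistic regression gives a single STR effect adjusted for all
the segments.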

Cheers,

Ramón.

[1] D.W. Hosmer and S. Lemeshow. ``Applied Logistic Regression''.
John-Wiley. 2000.

-- 
Ramón Casero Cañas

web:http://www.robots.ox.ac.uk/~rcasero/
