Re: [R] FW: variable format

2007-09-07 Thread Frank E Harrell Jr
Martin Becker wrote:
> Dear Cory,
> 
> I am not familiar with SAS, but is this what you are looking for?
> 
> divisionTable <- matrix(c(1, "New England",
>   2, "Middle Atlantic",
>   3, "East North Central",
>   4, "West North Central",
>   5, "South Atlantic",
>   6, "East South Central",
>   7, "West South Central",
>   8, "Mountain",
>   9, "Pacific"),
> ncol=2, byrow=T)

How about just divisionTable <- c('New England', 'Middle Atlantic', ...) 
then factor(old, 1:9, divisionTable) ?
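
For concreteness, a minimal sketch of that suggestion (labels taken from Martin's table above; values outside 1:9, such as 0, simply become NA):

divisionTable <- c('New England', 'Middle Atlantic', 'East North Central',
                   'West North Central', 'South Atlantic', 'East South Central',
                   'West South Central', 'Mountain', 'Pacific')
old <- c(0, 1, 2, 3, 4, 5)
as.character(factor(old, 1:9, divisionTable))  # 0 has no label and maps to NA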

Frank

> a <- NULL
> a$divisionOld <- c(0,1,2,3,4,5)
> a$divisionNew <- 
> as.character(factor(a$divisionOld,levels=divisionTable[,1],labels=divisionTable[,2]))
> a$divisionNew
> 
> [1] NA                   "New England"        "Middle Atlantic"   
> [4] "East North Central" "West North Central" "South Atlantic" 
> 
> 
> Kind regards,
> 
>   Martin
> 
> 
> Cory Nissen schrieb:
>>   
>>
>> Anybody?  
>>
>>
>> 
>>
>> From: Cory Nissen
>> Sent: Tue 9/4/2007 9:30 AM
>> To: r-help@stat.math.ethz.ch
>> Subject: variable format
>>
>>
>> Okay, I want to do something similar to SAS proc format.
>>
>> I usually do this...
>>
>> a <- NULL
>> a$divisionOld <- c(1,2,3,4,5)
>> divisionTable <- matrix(c(1, "New England",
>>   2, "Middle Atlantic",
>>   3, "East North Central",
>>   4, "West North Central",
>>   5, "South Atlantic"),
>> ncol=2, byrow=T)
>> a$divisionNew[match(a$divisionOld, divisionTable[,1])] <- divisionTable[,2]
>>
>> But how do I handle the case where...
>> a$divisionOld <- c(0,1,2,3,4,5)   #no format available for 0, this throws an 
>> error.
>> OR
>> divisionTable <- matrix(c(1, "New England",
>>   2, "Middle Atlantic",
>>   3, "East North Central",
>>   4, "West North Central",
>>   5, "South Atlantic",
>>   6, "East South Central",
>>   7, "West South Central",
>>   8, "Mountain",
>>   9, "Pacific"),
>> ncol=2, byrow=T)   
>> There are extra formats available... this throws a warning.
>>
>> Thanks
>>
>> Cory
>>
>>  [[alternative HTML version deleted]]
>>
>> __
>> R-help@stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Excel

2007-08-29 Thread Frank E Harrell Jr
Rolf Turner wrote:
> On 30/08/2007, at 8:49 AM, Greg Snow wrote:
> 
>> Erich Neuwirth said:
>>
>>> -Original Message-
>>> From: [EMAIL PROTECTED]
>>> [mailto:[EMAIL PROTECTED] On Behalf Of Erich Neuwirth
>>> Sent: Wednesday, August 29, 2007 12:43 PM
>>> To: r-help
>>> Subject: Re: [R] Excel
>>>
>>> Excel bashing can be fun but also can be dangerous because
> >>> you are making your life harder than necessary.
>> My experience differs; so far, using Excel (other than as a table layout
>> program) has made my life harder more times than it has made it easier.
> 
>   
> 
>   Bravo!!!  Very well and very cogently expressed!
> 
>   cheers,
> 
>           Rolf Turner

Yes!  In addition I'd like to stress that the Excel model represents 
non-reproducible research.

Frank Harrell


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Stratified weighted statistics

2007-08-29 Thread Frank E Harrell Jr
Vikas Rawal wrote:
> I need to compute weighted descriptive statistics (mean, median,
> variance) for a vector of data disaggregated by a set of index
> variables. The weights to be used are a vector in the same data frame
> and have to be disaggregated by the same index variables.
> 
> Package Hmisc has functions for computing weighted statistics. But
> both by/aggregate and summarise (in Hmisc) allow a function to have
> only a single argument. In my case, I need to specify both the variable
> and the weights as arguments to the function.
> 
> Could anyone please advise how to do it.

Please look again at the help files for summarize and wtd.mean which 
show how to do this.
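
For instance, a rough sketch with simulated data (split() is used here only for brevity; ?summarize and ?wtd.mean in Hmisc document the approach referred to above):

library(Hmisc)
set.seed(1)
d <- data.frame(town   = sample(c('A', 'B'), 100, replace = TRUE),
                income = rlnorm(100, 10),
                wt     = runif(100, 1, 5))
sapply(split(d, d$town), function(z)
       c(mean   = wtd.mean(z$income, weights = z$wt),
         median = unname(wtd.quantile(z$income, weights = z$wt, probs = 0.5)),
         var    = wtd.var(z$income, weights = z$wt)))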

Frank

> 
> As an example, say I have a data.frame giving incomes of persons in
> different towns along with weights to be given for each sampled
> person. How will I compute weighted statistics for each town?
> 
> Vikas Rawal
> JNU, New Delhi.
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] validate (package Design): error message "subscript out of bounds"

2007-08-28 Thread Frank E Harrell Jr
Wentzel-Larsen, Tore wrote:
> Thanks,
> I use Design version 2.1-1 (and as provided, R 2.5.1 on Windows XP).

Sorry I missed the latter.

> The redundancies in object names were for my own convenience, as part of 
> a larger command file, and the validation of this univariate model was
> only included to ease comparison with the main multivariate model. I have 

A minor point: adding more variables to the right hand side makes the 
model a multivariable model.  It is still univariate as it has only one 
dependent variable.

> tried to access Design version 2.0_12, but only managed to access version
> 2.1_1 of Design in my Windows implementation of R2.5.1 (the choice of 
> operative system is made by my institution and I am only entitled to use 
> Windows).

My mistake again.  I forgot that the Debian repositories for CRAN are 
behind.  I updated to latest Design in CRAN and found the bug.  I will 
send a separate note with a new version of the function in question for 
you to source( ) to override the current function.

Frank

> Best, Tore
> 
> -Opprinnelig melding-
> Fra: Frank E Harrell Jr [mailto:[EMAIL PROTECTED] 
> Sendt: 28. august 2007 03:17
> Til: Wentzel-Larsen, Tore
> Kopi: r-help@stat.math.ethz.ch
> Emne: Re: [R] validate (package Design): error message "subscript out of 
> bounds"
> 
> Wentzel-Larsen, Tore wrote:
>> Dear R users 
>>
>> I use Windows XP, R2.5.1 (I have read the posting guide, I have 
>> contacted the package maintainer first, it is not homework).
>>
>> In a research project on renal cell carcinoma we want to compute 
>> Harrell's c index, with optimism correction, for a multivariate 
>> Cox regression and also for some univariate Cox models.
>> For some of these univariate models I have encountered an error
>> message (and no result produced) from the function validate in 
>> Frank Harrell's Design package:
>>
>> Error in Xb(x[, xcol, drop = FALSE], coef, non.slopes, non.slopes.in.x,  : 
>> subscript out of bounds
>>
>> The following is an artificial example wherein I have been able to 
>> reproduce this error message (actual data has been changed to preserve
>> confidentiality):
> 
> I could not reproduce the error on R 2.5.1 on linux using version 2.0-12 
> of Design (you did not provide this information).
> 
> Your code involved a good deal of extra typing.  Here is a streamlined 
> version:
> 
> bc <- data.frame(time1 = c(9,24,28,43,58,62,66,107,116,118,123,
>   127,129,131,137,138,139,140,148,169,176,179,188,196,210,218,
> 
> bc
> 
> library(Design)
> 
> dd <- with(bc, datadist(bc1, age, adjto.cat='first'))
> options(datadist = 'dd')
> 
> f <- cph(Surv(time1,status1) ~ bc1,
>   data = bc, x=TRUE, y=TRUE, surv=TRUE)
> anova(f)
> f
> summary(f)
> 
> val <- validate(f, B=200, dxy=TRUE)
> 
> I don't get much value of putting the type of an object as part of the 
> object's name, as information within objects defines the object type/class.
> 
> There is little reason to validate a one degree of freedom model.
> 
> Frank
> 
>> library(Design)
>>
>> # an example data frame:
>> frame.bc <- data.frame(time1 = c(9,24,28,43,58,62,66,107,116,118,123,
>>  127,129,131,137,138,139,140,148,169,176,179,188,196,210,218,
>>  1,1,1,2,2,3,4,8,23,32,33,34,43,44,48,51,52,54,59,59,60,60,62,
>>  65,65,68,70,72,73,74,81,84,88,98,99,106,107,115,115,117,119,
>>  120,122,122,122,122,126,128,130,135,136,136,138,149,151,154,
>>  157,159,161,164,164,164,166,172,172,176,179,180,183,183,184,
>>  187,190,197,201,201,203,203,203,209,210,214,219,227,233,4,18,
>>  49,113,147,1,1,2,2,2,2,2,3,4,6,6,6,6,6,6,6,6,9,9,9,9,9,10,10,
>>  10,11,12,12,12,13,14,14,17,18,18,19,19,20,20,21,21,21,21,22,23,
>>  23,24,28,28,29,29,32,34,35,38,38,48,48,52,52,54,54,56,64,67,67,
>>  69,70,70,72,84,88,90,114,115,140,142,154,171,195),
>>  status1 = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>>  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>>  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>>  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
>>  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
>>  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
>>  1,1,1,1,1),
>>  bc1 = factor(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
>>  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
>>  2,2,2,2,2,2,2,2,2,

Re: [R] validate (package Design): error message "subscript out of bounds"

2007-08-27 Thread Frank E Harrell Jr
> Dxy   -0.124360 -0.1423409
> R2 1.00  1.000
> Slope  1.00  0.7919584
> D  0.016791  0.0147536
> U -0.002395  0.0006448
> Q  0.019186  0.0141088
>training   test
> Dxy   -0.191875 -0.1423409
> R2 1.00  1.000
> Slope  1.00  0.8936724
> D  0.022397  0.0147536
> U -0.002339  0.0001367
> Q  0.024736  0.0146169
>training   test
> Dxy   -0.199514 -0.1423409
> R2 1.00  1.000
> Slope  1.00  0.8075246
> D  0.025717  0.0147536
> U -0.002447  0.0005348
> Q  0.028163  0.0142188
> Error in Xb(x[, xcol, drop = FALSE], coef, non.slopes, non.slopes.in.x,  : 
> subscript out of bounds
> 
> 
> Any help/suggestions will be highly appreciated.
> 
> 
> Sincerely,
> Tore Wentzel-Larsen
> statistician
> Centre for Clinical research
> Armauer Hansen house 
> Haukeland University Hospital
> N-5021 Bergen
> tlf   +47 55 97 55 39 (a)
> faks  +47 55 97 60 88 (a)
> email [EMAIL PROTECTED]
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] grouping scat1d/rug and plotting to 2 axes

2007-08-27 Thread Frank E Harrell Jr
Mike wrote:
> Hi,
> 
> I'm wondering if anybody can offer a bit of  guidance on how to add a 
> couple of features to a plot. 
> 
> I'm using Frank Harrell's Design library to model some survival data in 
> R (2.3.1, windows platform).  I'm fairly comfortable with the survival 
> modeling in Design, but am still at a frustratingly low level of 
> competence when it comes to creating anything beyond simple plots in R.
> 
> A simplified version of the model is:
> 
> fit <- cph(Surv(survtime,deceased) ~ rcs(smw,4), 
> data=survdata,x=T,y=T,surv=T )
> 
> And the basic plot is:
> 
> plot(fit,smw=NA, fun=function(x) 1/(1+exp(-x)))

or plot(fit, smw=NA, fun=plogis).  But what does the logistic model have 
to do with the Cox model you fitted?  You can instead do plot(fit, 
smw=NA, time=1) to plot estimated 1-year survival prob.
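
A minimal sketch of that alternative, reusing the names from the original post (survdata, survtime, deceased, and smw are assumed to exist):

dd <- datadist(survdata); options(datadist = 'dd')
fit <- cph(Surv(survtime, deceased) ~ rcs(smw, 4),
           data = survdata, x = TRUE, y = TRUE, surv = TRUE)
plot(fit, smw = NA, time = 1)  # predicted 1-year survival probability vs. smw
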

> 
> I know that if I add
> 
> scat1d(smw)
> 
> I get a nice jittered rug plot of all values of the predictor smw on the 
> top axis.
> 
> What I'd like to do, however, is to plot on bottom axis the values of 
> smw for only those participants who are alive, and then on the top axis, 
> plot the values of smw for those who are deceased.  I'd appreciate any 
> tips as to how I might approach this.

That isn't so well defined because of variable follow-up time.  I would 
not get very much out of such a plot.

Frank

> 
> Thanks,
> 
> Mike Babyak
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to include bar values in a barplot?

2007-08-27 Thread Frank E Harrell Jr
Donatas G. wrote:
> On Tuesday 07 August 2007 22:09:52 Donatas G. wrote:
>> How do I include bar values in a barplot (or other R graphics, where this
>> could be applicable)?
>>
>> To make sure I am clear I am attaching a barplot created with
>> OpenOffice.org which has barplot values written on top of each barplot.
> 
> Here is the barplot mentioned above:
> http://dg.lapas.info/wp-content/barplot-with-values.jpg
> 
> it appears that this list does not allow attachments...
> 
That is a TERRIBLE graphic.  Can't we finally leave this subject alone?

Frank Harrell

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Random number expert needed

2007-08-24 Thread Frank E Harrell Jr
Colleagues:  An agency contacted me that has need of a true expert in 
pseudo random number generation who can perform an audit of a system 
currently in use.  If you live within say 400 miles of Nashville TN and 
have in-depth expertise, please contact me and I will put you in touch 
with the agency.

Thanks
Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Regulatory Compliance and Validation Issues

2007-08-20 Thread Frank E Harrell Jr
Cody Hamilton wrote:
> Dear Thomas,
> 
> Thank you for your reply.  You are of course quite right - the R Foundation 
> couldn't be responsible for any individually contributed package.
> 
> I am curious as to how an orgainization operating in a regulated environment 
> could safely use a contributed package.  What if the author/maintainer 
> retires or loses interest in maintaining the package?  The organization would 
> then find itself in the awkward position of being reliant on software for 
> which there is no technical support and which may not be compatible with 
> future versions of the base R software.  I suppose the organization could 
> take responsibility for maintaining the individual functions within a package 
> on its own (one option made possible by the open source nature of R), but 
> this would require outstanding programming resources which the company may 
> not have (Thomas Lumleys are sadly rare).  In addition, as the organization 
> is claiming the functions as their own (and not as out-of-the-box software), 
> the level of required validation would be truly extraordinary.  I also wonder 
> if an everyone-maintain-their-own-copy approach could lead to multiple 
> mutated versions of a package's functions across the R universe (e.g. Edwards' version of 
> sas.get() vs. Company X's version of sas.get(), etc.).
> 
> Regards,
>-Cody

Cody,

I think of this issue as not unlike an organization using its own code 
written by its own analysts or SAS programmers.  Code is reused all the 
time.

Frank

> 
> As always, I am speaking for myself and not necessarily for Edwards 
> Lifesciences.
> 
> -Original Message-
> From: Thomas Lumley [mailto:[EMAIL PROTECTED]
> Sent: Sunday, August 19, 2007 8:50 AM
> To: Cody Hamilton
> Cc: r-help@stat.math.ethz.ch
> Subject: Re: [R] Regulatory Compliance and Validation Issues
> 
> On Fri, 17 Aug 2007, Cody Hamilton wrote:
> 
> 
>> I have a few specific comments/questions that I would like to present to
>> the R help list.
> 
>> 2. While the document's scope is limited to base R plus recommended
>> packages, I believe most companies will need access to functionalities
>> provided by packages not included in the base or recommended packages.
>> (For example, I don't think I could survive without the sas.get()
>> function from the Design library.)  How can a company address the issues
>> covered in the document for packages outside its scope?  For example,
>> what if a package's author does not maintain historical archive versions
>> of the package?  What if the author no longer maintains the package?
>> Is the solution to add more packages to the recommended list (I'm fairly
>> certain that this would not be a simple process) or is there another
>> solution?
> 
> This will have to be taken up with the package maintainer.  The R
> Foundation doesn't have any definitive knowledge about, eg, Frank
> Harrell's development practices and I don't think the FDA would regard our
> opinions as relevant.
> 
> Archiving, at least, is addressed by CRAN: all the previously released
> versions of packages are available
> 
>> 3. At least at my company, each new version must undergo basically the
>> same IQ/OQ/PQ as the first installation.  As new versions of R seem to
>> come at least once a year, the ongoing validation effort would be
>> painful if the most up-to-date version of R is to be maintained within
>> the company.  Is there any danger in delaying the updates (say updating
>> R within the company every two years or so)?
> 
> It's worse than that: there are typically 4 releases of R per year (the
> document you are commenting on actually gives dates).  The ongoing
> validation effort may indeed be painful, and this was mentioned as an
> issue in the talk by David James & Tony Rossini.
> 
> The question of what is missed by delaying updates can be answered by
> looking at the NEWS file. The question of whether it is dangerous is
> really an internal risk management issue for you.
> 
> -thomas
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Ask for functions to obtain partial R-square (squared partial correlation coefficients)

2007-08-20 Thread Frank E Harrell Jr
Ye Xingwang wrote:
> The partial R-square (or coefficient of partial determination, or
> squared partial correlation coefficients) measures the marginal
> contribution of one explanatory variable when all others are already
> included in multiple linear regression model.
> 
> The following link has very clear explanations on partial and
> semi-partial correlation:
> http://www.psy.jhu.edu/~ashelton/courses/stats315/week2.pdf
> 
> In SAS, the options is PCORR2 and SCORR2.
> For example(from 
> http://www.ats.ucla.edu/stat/sas/examples/alsm/alsmsasch7.htm)
> 
> data ch7tab01;
>   input X1 X2 X3 Y;
>   label x1 = 'Triceps'
> x2 = 'Thigh cir.'
> x3 = 'Midarm cir.'
>  y = 'body fat';
>   cards;
>   19.5  43.1  29.1  11.9
>   24.7  49.8  28.2  22.8
>   30.7  51.9  37.0  18.7
>   29.8  54.3  31.1  20.1
>   19.1  42.2  30.9  12.9
>   25.6  53.9  23.7  21.7
>   31.4  58.5  27.6  27.1
>   27.9  52.1  30.6  25.4
>   22.1  49.9  23.2  21.3
>   25.5  53.5  24.8  19.3
>   31.1  56.6  30.0  25.4
>   30.4  56.7  28.3  27.2
>   18.7  46.5  23.0  11.7
>   19.7  44.2  28.6  17.8
>   14.6  42.7  21.3  12.8
>   29.5  54.4  30.1  23.9
>   27.7  55.3  25.7  22.6
>   30.2  58.6  24.6  25.4
>   22.7  48.2  27.1  14.8
>   25.2  51.0  27.5  21.1
> ;
> run;
> 
> proc reg data = ch7tab01;
>   model y = x1 x2 / pcorr2 SCORR2;
>   model y = x1-x3 / pcorr2 SCORR2;
> run;
> quit;
> 
> There has been a post in
> http://tolstoy.newcastle.edu.au/R/help/05/03/0437.html
> 
> It will be great appreciated if someone could write a general function
> to work with class lm or glm to obtain the
> pcorr2 (squared partial correlation coefficients using Type II sums of 
> squares)
> and scorr2 (squared semi-partial correlation coefficients using Type
> II sums of squares)
> for all independent variables (>3 variables) simultaneously?
> 
> Thank you.
> Xingwang Ye
> 

library(Design)  # requires Hmisc
f <- ols(y ~ x1 + x2)
p <- plot(anova(f), what='partial R2')
p

The anova.Design function called above handles pooling related degrees 
of freedom and pools main effects with related interaction effects to 
get the total partial effect.

Frank


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Subject: Re: how to include bar values in a barplot?

2007-08-09 Thread Frank E Harrell Jr
Gabor Grothendieck wrote:
> You could put the numbers inside the bars in which
> case it would not add to the height of the bar:

I think the Cleveland/Tufte prescription would be much different: 
horizontal dot charts with the numbers in the right margin.  I do this 
frequently with great effect.  The Hmisc dotchart2 function makes this easy.
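
A rough sketch, reusing the toy values from Gabor's example quoted below (the auxdata/auxtitle arguments are as documented in ?dotchart2):

library(Hmisc)
x <- 1:5
names(x) <- letters[1:5]
dotchart2(x, auxdata = x, auxtitle = 'value')  # horizontal dot chart, values in right margin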

Frank

> 
> x <- 1:5
> names(x) <- letters[1:5]
> bp <- barplot(x)
> text(bp, x - .02 * diff(par("usr")[3:4]), x)
> 
> On 8/9/07, Frank E Harrell Jr <[EMAIL PROTECTED]> wrote:
>> [EMAIL PROTECTED] wrote:
>>> Greg, I'm going to join issue with your here! Not that I'll go near
>>> advocating "Excel-style" graphics (abominable, and the Patrick Burns
>>> URL which you cite is remarkable in its restraint). Also, I'm aware
>>> that this is potential flame-war territory --  again, I want to avoid
>>> that too.
>>>
>>> However, this is the second time you have intervened on this theme
>>> (previously Mon 6 August), along with John Kane on Wed 1 August and
>>> again today on similar lines, and I think it's time an alternative
>>> point of view was presented, to counteract (I hope usefully) what
>>> seems to be a draconianly prescriptive approach to the presentation
>>> of information.
>>
>> ---snip---
>>
>> Ted,
>>
>> You make many excellent points and provide much food for thought.  I
>> still think that Greg's points are valid too, and in this particular
>> case, bar plots are a bad choice and adding numbers at variable heights
>> causes a perception error as I wrote previously.
>>
>> Thanks for your elaboration on this important subject.
>>
>> Frank
>>
>>> On 07-Aug-07 21:37:50, Greg Snow wrote:
>>>> Generally adding the numbers to a graph accomplishes 2 things:
>>>>
>>>> 1) it acts as an admission that your graph is a failure
>>> Generally, I disagree. Different elements in a display serve different
>>> purposes, according to the psychological aspects of visual perception.
>> . . .


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Subject: Re: how to include bar values in a barplot?

2007-08-09 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
> Greg, I'm going to join issue with your here! Not that I'll go near
> advocating "Excel-style" graphics (abominable, and the Patrick Burns
> URL which you cite is remarkable in its restraint). Also, I'm aware
> that this is potential flame-war territory --  again, I want to avoid
> that too.
> 
> However, this is the second time you have intervened on this theme
> (previously Mon 6 August), along with John Kane on Wed 1 August and
> again today on similar lines, and I think it's time an alternative
> point of view was presented, to counteract (I hope usefully) what
> seems to be a draconianly prescriptive approach to the presentation
> of information.


---snip---

Ted,

You make many excellent points and provide much food for thought.  I 
still think that Greg's points are valid too, and in this particular 
case, bar plots are a bad choice and adding numbers at variable heights 
causes a perception error as I wrote previously.

Thanks for your elaboration on this important subject.

Frank

> 
> On 07-Aug-07 21:37:50, Greg Snow wrote:
>> Generally adding the numbers to a graph accomplishes 2 things:
>>
>> 1) it acts as an admission that your graph is a failure
> 
> Generally, I disagree. Different elements in a display serve different
> purposes, according to the psychological aspects of visual perception.

. . .

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to include bar values in a barplot?

2007-08-07 Thread Frank E Harrell Jr
Donatas G. wrote:
> How do I include bar values in a barplot (or other R graphics, where this 
> could be applicable)? 
> 
> To make sure I am clear I am attaching a barplot created with OpenOffice.org 
> which has barplot values written on top of each barplot. 
> 

This causes an optical illusion in which the viewer adds some of the 
heights of the numbers to the heights of the bars.  And in general dot 
charts are greatly preferred to bar plots, not the least because of 
their reduced ink:information ratio.

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] (Censboot, Z-score, Cox) How to use Z-score as the statistic within censboot?

2007-08-06 Thread Frank E Harrell Jr
David Lloyd wrote:
> Dear R Help list,
> 
> My question is regarding extracting the standard error or Z-score from a
> cph or coxph call. My Cox model is: -
> 
> modz=cph(Surv(TSURV,STATUS)~RAGE+DAGE+REG_WTIME_M+CLD_ISCH+POLY_VS,
> data=kidneyT,method="breslow", x=T, y=T)

You are assuming linearity for all predictors, which is unlikely.  When 
you expand predictors into multiple parameters to allow for 
nonlinearity, or when you have factors with >2 levels, individual 
P-values will be meaningless.  You can get all Wald multiple d.f. test 
stats and P-values by saving the result of anova(modz) using 
anova.Design, which creates a matrix.
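
For instance (modz is the cph fit from the original post; the column name 'P' is assumed from recent Design versions):

a <- anova(modz)  # anova.Design: matrix of Wald chi-square, d.f., P per predictor
a                 # one row per (pooled) predictor plus TOTAL
a[, 'P']          # extract the multiple-d.f. P-values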

Frank

> 
> I've used names(modz) but can't see anything that will let me extract
> the Z scores for each coefficient or the standard errors in the same way
> one can use coef (modz).
> 
> I need to get either the se or Zscore as I am hoping to be able to use
> the Zscore within a statistic function used in censboot. I wish to find
> the proportion of times each of the variables in the cox model are
> significant out of the R=1000 bootstrap resamples.
> 
> My current function for us with the bootstrap only gives me the
> coefficients: - 
> 
> func=function(kidneyT,indices)
> {
> 
> kidneyT=kidneyT[indices,]
> modz=coxph(Surv(TSURV,STATUS)~RAGE+DAGE+REG_WTIME_M+CLD_ISCH+POLY_VS,dat
> a=kidneyT,method="breslow")
> 
> coefficients(modz)
> 
> }
> 
> 
> I would like to be able to modify the following code so modz.cens.boot$t
> below gives me an array of the Z-scores for each variable and each
> resample - not the coefficients
> 
> modz.cens.boot=censboot(kidneyT, func,R=1000,sim="ordinary")
> modz.cens.boot
> 
> 
> I hope this is clear enough? I'm still getting to terms with the
> Bootstrap, having only just started trying to use it a few days ago.
> 
> Any help on this would clear a massive headache for me as I've been
> going around in circles for hours and hours on this.
> 
> Thx 
> 
> DaveL
> 
> 
> 
>   [[alternative HTML version deleted]]
> 
> ______________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Normality tests

2007-08-04 Thread Frank E Harrell Jr
Alexandre Christie wrote:
> Hello All,
> 
> I am new to R, and I am writing to seek your advice on how best to use it to 
> run
> R's various normality tests in an automated way.
> 
> In a nutshell, my situation is as follows. I work in an investment bank, and 
> my
> team and I are concerned that the assumption we make in our models that the
> returns of assets are normally distributed may not be justified for certain
> asset classes. We are keen to check this statistically.
> 
> To this end, we have an Excel document which contains historical data on the
> returns of the asset classes we want to investigate, and we would like to run
> R's multiple normality tests on these data to check whether any asset classes
> are flagged up as being statistically non-normal.
> 
> I see from the R documentation that there are several R commands to test for
> this, but is it possible to progamme a tool which can (i) convert the Excel 
> data
> into a format which R can read, then (ii) run all the relevant tests from R,
> then (iii) compare the results (such as the p-values) with a user-defined
> benchmark, and (iv) output a file which shows for each asset class, which 
> tests
> reveal that the null hypothesis of normality is rejected?
> 
> My team and I would be very grateful for your advice on this.
> 
> Yours sincerely,
> 
> Alex.

Alex it would be good to work closely with a local statistician.  Models 
do not require raw data to be normally distributed, only residuals or 
conditional distributions.  And there are many models that don't require 
strong distributional assumptions (e.g. Cox regression, proportional 
odds model, transform-both-sides generalized additive models).

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ROC curve in R

2007-08-04 Thread Frank E Harrell Jr
Dylan Beaudette wrote:
> On Thursday 26 July 2007 10:45, Frank E Harrell Jr wrote:
>> Dylan Beaudette wrote:
>>> On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote:
>>>> Note that even though the ROC curve as a whole is an interesting
>>>> 'statistic' (its area is a linear translation of the
>>>> Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation
>>>> statistics), each individual point on it is an improper scoring rule,
>>>> i.e., a rule that is optimized by fitting an inappropriate model.  Using
>>>> curves to select cutoffs is a low-precision and arbitrary operation, and
>>>> the cutoffs do not replicate from study to study.  Probably the worst
>>>> problem with drawing an ROC curve is that it tempts analysts to try to
>>>> find cutoffs where none really exist, and it makes analysts ignore the
>>>> whole field of decision theory.
>>>>
>>>> Frank Harrell
>>> Frank,
>>>
>>> This thread has caught my attention for a couple reasons, possibly
>>> related to my novice-level experience.
>>>
>>> 1. in a logistic regression study, where i am predicting the probability
>>> of the response being 1 (for example) - there exists a continuum of
>>> probability values - and a finite number of {1,0} realities when i either
>>> look within the original data set, or with a new 'verification' data set.
>>> I understand that drawing a line through the probabilities returned from
>>> the logistic regression is a loss of information, but there are times
>>> when a 'hard' decision requiring prediction of {1,0} is required. I have
>>> found that the ROCR package (not necessarily the ROC Curve) can be useful
>>> in identifying the probability cutoff where accuracy is maximized. Is
>>> this an unreasonable way of using logistic regression as a predictor?
> 
> Thanks for the detailed response Frank. My follow-up questions are below:
> 
>> Logistic regression (with suitable attention to not assuming linearity
>> and to avoiding overfitting) is a great way to estimate P[Y=1].  Given
>> good predicted P[Y=1] and utilities (losses, costs) for incorrect
>> positive and negative decisions, an optimal decision is one that
>> optimizes expected utility.  The ROC curve does not play a direct role
>> in this regard.  
> 
> Ok.
> 
>> If per-subject utilities are not available, the analyst 
>> may make various assumptions about utilities (including the unreasonable
>> but often used assumption that utilities do not vary over subjects) to
>> find a cutoff on P[Y=1]. 
> 
> Can you elaborate on what exactly a "per-subject utility" is? In my case, I 
> am 
> trying to predict the occurance of specific soil features based on two 
> predictor variables: 1 continuous, the other categorical.  Thus far my 
> evaluation of how well this method works is based on how often I can 
> correctly predict (a categorical) quality.

This could be called a per-unit utility in your case.  It is the 
consequence of decisions at the point in which you decide Y=0 or Y=1. 
If consequences are the same over all units, you just have to deal with 
the single ratio of cost of false positive to cost of false negative.

One way to limit bad consequences is to not make any decision when the 
predicted probability is in the middle, i.e., the decision is 'obtain 
more data'.  That is a real advantage of having a continuous risk estimate.
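
A hedged sketch of that decision rule (the costs and the gray-zone limits here are made up purely for illustration):

p   <- c(0.02, 0.45, 0.90)                    # predicted P[Y=1] for three units
cFN <- 5; cFP <- 1                            # assumed costs of false negative / false positive
act <- ifelse(p * cFN > (1 - p) * cFP, 'act as if Y=1', 'act as if Y=0')
act[p > 0.3 & p < 0.7] <- 'obtain more data'  # optional gray zone
data.frame(p, expected.cost = pmin(p * cFN, (1 - p) * cFP), action = act)
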

> 
> 
>> A very nice feature of P[Y=1] is that error 
>> probabilities are self-contained.  For example if P[Y=1] = .02 for a
>> single subject and you predict Y=0, the probability of an error is .02
>> by definition.  One doesn't need to compute an overall error probability
>> over the whole distribution of subjects' risks.  If the cost of a false
>> negative is C, the expected cost is .02*C in this example.
> 
> Interesting. The hang-up that I am having is that I need to predict from 
> {0,1}, as the direct users of this information are not currently interested 
> in in raw probabilities. As far as I know, in order to predict a class from a 
> probability I need use a cutoff... How else can I accomplish this without 
> imposing a cutoff on the entire dataset? One thought, identify a cutoff for 
> each level of the categorical predictor term in the model... (?)

You're right you have to ultimately use a cutoff (or better still, 
educate the users about the meaning of probabilities and let them make 
the decision without exposing the cutoff).  And see the comment 
regarding gray zones above.

> 
>>>

Re: [R] proportional odds model

2007-08-02 Thread Frank E Harrell Jr
Ramon Martínez Coscollà wrote:
> Hi all!!
> 
> I am using a proportinal odds model to study some ordered categorical
> data. I am trying to predict one ordered categorical variable taking
> into account only another categorical variable.
> 
> I am using polr from the R MASS library. It seems to work ok, but I'm
> still getting familiar and I don't know how to assess goodness of fit.
> I have this output, when using response ~ independent variable:
> 
> Residual Deviance: 327.0956
> AIC: 333.0956
>> polr.out$df.residual
> [1] 278
>> polr.out$edf
> [1] 3
> 
> When taking out every variable... (i.e., making formula: response ~ 1), I 
> have:
> 
> Residual Deviance: 368.2387
> AIC: 372.2387
> 
> How can I test if the model fits well? How can I check that the
> independent variable effectively explains the model? Is there any
> test?
> 
> Moreover, sendig summary(polr.out) I get this error:
> 
> 
> Error in optim(start, fmin, gmin, method = "BFGS", hessian = Hess, ...) :
> initial value in 'vmmin' is not finite
> 
> Something to do with the optimitation procedure... but, how can I fix
> it? Any help would be greatly appreciated.
> 
> Thanks.

You might also look at lrm and residuals.lrm in the Design package, 
which provides partial and score residual plots to check PO model 
assumptions.
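
A minimal self-contained sketch (simulated ordinal data, not the poster's; ?residuals.lrm documents the available residual types):

library(Design)
set.seed(1)
x <- rnorm(200)
y <- cut2(x + rnorm(200), g = 4)         # 4-level ordered response (cut2 is in Hmisc)
f <- lrm(y ~ x, x = TRUE, y = TRUE)
resid(f, 'score.binary', pl = TRUE)      # graphical check of the PO assumption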

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ROC curve in R

2007-07-26 Thread Frank E Harrell Jr
Dylan Beaudette wrote:
> On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote:
>> Note that even though the ROC curve as a whole is an interesting
>> 'statistic' (its area is a linear translation of the
>> Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation
>> statistics), each individual point on it is an improper scoring rule,
>> i.e., a rule that is optimized by fitting an inappropriate model.  Using
>> curves to select cutoffs is a low-precision and arbitrary operation, and
>> the cutoffs do not replicate from study to study.  Probably the worst
>> problem with drawing an ROC curve is that it tempts analysts to try to
>> find cutoffs where none really exist, and it makes analysts ignore the
>> whole field of decision theory.
>>
>> Frank Harrell
> 
> Frank,
> 
> This thread has caught my attention for a couple reasons, possibly related 
> to 
> my novice-level experience. 
> 
> 1. in a logistic regression study, where i am predicting the probability of 
> the response being 1 (for example) - there exists a continuum of probability 
> values - and a finite number of {1,0} realities when i either look within the 
> original data set, or with a new 'verification' data set. I understand that 
> drawing a line through the probabilities returned from the logistic 
> regression is a loss of information, but there are times when a 'hard' 
> decision requiring prediction of {1,0} is required. I have found that the 
> ROCR package (not necessarily the ROC Curve) can be useful in identifying the 
> probability cutoff where accuracy is maximized. Is this an unreasonable way 
> of using logistic regression as a predictor? 

Logistic regression (with suitable attention to not assuming linearity 
and to avoiding overfitting) is a great way to estimate P[Y=1].  Given 
good predicted P[Y=1] and utilities (losses, costs) for incorrect 
positive and negative decisions, an optimal decision is one that 
optimizes expected utility.  The ROC curve does not play a direct role 
in this regard.  If per-subject utilities are not available, the analyst 
may make various assumptions about utilities (including the unreasonable 
but often used assumption that utilities do not vary over subjects) to 
find a cutoff on P[Y=1].  A very nice feature of P[Y=1] is that error 
probabilities are self-contained.  For example if P[Y=1] = .02 for a 
single subject and you predict Y=0, the probability of an error is .02 
by definition.  One doesn't need to compute an overall error probability 
over the whole distribution of subjects' risks.  If the cost of a false 
negative is C, the expected cost is .02*C in this example.

> 
> 2. The ROC curve can be a helpful way of communicating false positives / 
> false 
> negatives to other users who are less familiar with the output and 
> interpretation of logistic regression. 

What is more useful than that is a rigorous calibration curve estimate 
to demonstrate the faithfulness of predicted P[Y=1] and a histogram 
showing the distribution of predicted P[Y=1].  Models that put a lot of 
predictions near 0 or 1 are the most discriminating.  Calibration curves 
and risk distributions are easier to explain than ROC curves.  Too often 
a statistician will solve for a cutoff on P[Y=1], imposing her own 
utility function without querying any subjects.
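
A hedged sketch of both displays with simulated data (not the poster's):

library(Design)
set.seed(1)
x <- rnorm(300); y <- rbinom(300, 1, plogis(x))
f <- lrm(y ~ x, x = TRUE, y = TRUE)
plot(calibrate(f, B = 200))                 # overfitting-corrected calibration curve
hist(predict(f, type = 'fitted'),
     xlab = 'Predicted P[Y=1]', main = '')  # distribution of predicted risks
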

> 
> 
> 3. I have been using the area under the ROC Curve, kendall's tau, and cohen's 
> kappa to evaluate the accuracy of a logistic regression based prediction, the 
> last two statistics based on a some probability cutoff identified before 
> hand. 

ROC area (equiv. to Wilcoxon-Mann-Whitney and Somers' Dxy rank 
correlation between pred. P[Y=1] and Y) is a measure of pure 
discrimination, not a measure of accuracy per se.  Rank correlation 
(concordance) measures do not require the use of cutoffs.

> 
> 
> How does the topic of decision theory relate to some of the circumstances 
> described above? Is there a better way to do some of these things?

See above re: expected loses/utilities.

Good questions.

Frank
> 
> Cheers,
> 
> Dylan
> 
> 
> 
>> [EMAIL PROTECTED] wrote:
>>> http://search.r-project.org/cgi-bin/namazu.cgi?query=ROC&max=20&result=normal&sort=score&idxname=Rhelp02a&idxname=functions&idxname=docs
>>>
>>> there is a lot of help try help.search("ROC curve") gave
>>> Help files with alias or concept or title matching 'ROC curve' using
>>> fuzzy matching:
>>>
>>>
>>>
>>> granulo(ade4) Granulometric Curves
>>> plot.roc(analogue)Plot ROC curves and

Re: [R] logistic regression

2007-07-26 Thread Frank E Harrell Jr
Mary,

The 10-group approach results in a low-resolution and fairly arbitrary 
calibration curve.  Also, it is the basis of the original 
Hosmer-Lemeshow goodness of fit statistic which has been superceded by 
the Hosmer et al single degree of freedom GOF test that does not require 
any binning.  The Design package handles both.  Do ?calibrate.lrm, 
?residuals.lrm, ?lrm for details.
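
A minimal self-contained sketch of the binning-free approach (simulated data, not the poster's):

library(Design)
set.seed(1)
x <- rnorm(200); y <- rbinom(200, 1, plogis(x))
f <- lrm(y ~ x, x = TRUE, y = TRUE)
resid(f, 'gof')              # single-d.f. goodness-of-fit test, no grouping needed
plot(calibrate(f, B = 200))  # smooth calibration curve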

Frank Harrell


Sullivan, Mary M wrote:
> Greetings,
>  
> 
> I am working on a logistic regression model in R and I am struggling with the 
> code, as it is a relatively new program for me.  In searching Google for 
> 'logistic regression diagnostics' I came Elizabeth Brown's Lecture 14 from 
> her Winter 2004 Biostatistics 515 course  
> (http://courses.washington.edu/b515/l14.pdf) .  I found most of the code to 
> be very helpful, but I am struggling with the lines on to calculate the 
> observed and expected values in the 10 groups created by the cut function.  I 
> get error messages in trying to create the E and O matrices:  R won't accept 
> assignment of "fi1c==j" and it won't calculate the sum.  
> 
>  
> 
> I am wondering whether someone might be able to offer me some assistance...my 
> search of the archives was not fruitful.
> 
>  
> 
> Here is the code that I adapted from the lecture notes:
> 
>  
> 
> fit <- fitted(glm.lyme)
> 
> fitc <- cut(fit, br = c(0, quantile(fit, p = seq(.1, .9, .1)),1))  
> 
> t<-table(fitc)
> 
> fitc <- cut(fit, br = c(0, quantile(fit, p = seq(.1, .9, .1)), 1), labels = F)
> 
> t<-table(fitc)
> 
>  
> 
> #Calculate observed and expected values in ea group
> 
> E <- matrix(0, nrow=10, ncol = 2)
> 
> O <- matrix(0, nrow=10, ncol=2)
> 
> for (j in 1:10) {
> 
>   E[j, 2] = sum(fit[fitc==j])
> 
>   E[j, 1] = sum((1- fit)[fitc==j])
> 
>   O[j, 2] = sum(pcdata$lymdis[fitc==j])
> 
>   O[j, 1] = sum((1-pcdata$lymdis)[fitc==j])
> 
> 
> 
> }
> 
>  
> 
> Here is the error message:  Error in Summary.factor(..., na.rm = na.rm) : 
> sum not meaningful for factors
> 
>  
> 
>  
> 
> I understand what it means; I just can't figure out how to get around it or 
> how to get the output printed in table form.  Thank you in advance for any 
> assistance.
> 
>  
> 
> Mary Sullivan
> 
> __________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ROC curve in R

2007-07-26 Thread Frank E Harrell Jr
Note that even though the ROC curve as a whole is an interesting 
'statistic' (its area is a linear translation of the 
Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation 
statistics), each individual point on it is an improper scoring rule, 
i.e., a rule that is optimized by fitting an inappropriate model.  Using 
curves to select cutoffs is a low-precision and arbitrary operation, and 
the cutoffs do not replicate from study to study.  Probably the worst 
problem with drawing an ROC curve is that it tempts analysts to try to 
find cutoffs where none really exist, and it makes analysts ignore the 
whole field of decision theory.

Frank Harrell


[EMAIL PROTECTED] wrote:
> http://search.r-project.org/cgi-bin/namazu.cgi?query=ROC&max=20&result=normal&sort=score&idxname=Rhelp02a&idxname=functions&idxname=docs
> 
> there is a lot of help try help.search("ROC curve") gave
> Help files with alias or concept or title matching 'ROC curve' using fuzzy 
> matching:
> 
> 
> 
> granulo(ade4)             Granulometric Curves
> plot.roc(analogue)        Plot ROC curves and associated diagnostics
> roc(analogue)             ROC curve analysis
> colAUC(caTools)           Column-wise Area Under ROC Curve (AUC)
> DProc(DPpackage)          Semiparametric Bayesian ROC curve analysis
> cv.enet(elasticnet)       Computes K-fold cross-validated error curve for elastic net
> ROC(Epi)                  Function to compute and draw ROC-curves.
> lroc(epicalc)             ROC curve
> cv.lars(lars)             Computes K-fold cross-validated error curve for lars
> roc.demo(TeachingDemos)   Demonstrate ROC curves by interactively building one
> 
> HTH
> see the help and examples those will suffice
> 
> Type 'help(FOO, package = PKG)' to inspect entry 'FOO(PKG) TITLE'.
> 
> 
> 
> Regards,
> 
> Gaurav Yadav
> +++
> Assistant Manager, CCIL, Mumbai (India)
> Mob: +919821286118 Email: [EMAIL PROTECTED]
> Bhagavad Gita:  Man is made by his Belief, as He believes, so He is
> 
> 
> 
> "Rithesh M. Mohan" <[EMAIL PROTECTED]> 
> Sent by: [EMAIL PROTECTED]
> 07/26/2007 11:26 AM
> 
> To
> 
> cc
> 
> Subject
> [R] ROC curve in R
> 
> 
> 
> 
> 
> 
> Hi,
> 
>  
> 
> I need to build ROC curve in R, can you please provide data steps / code
> or guide me through it.
> 
>  
> 
> Thanks and Regards
> 
> Rithesh M Mohan
> 
> 
>  [[alternative HTML version deleted]]
> 
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Subscript out of bounds when using datadist() from Design library

2007-07-25 Thread Frank E Harrell Jr
Cody Hamilton wrote:
> I am running R version 2.4.1 on Windows XP.  I have a question regarding the 
> datadist() function from the Design library.  I have a data.frame (call it 
> my.data) with 4 columns.  When I submit the code
> 
> datadist(data=my.data)

You can just use
dd <- datadist(my.data); options(datadist='dd')

If that doesn't fix it, generate a test dataset that fails or do 
save(..., compress=TRUE) and send me your dataset so I can debug.

Frank

> 
> I get the following error message:
> 
> Error in X[[1]] : subscript out of bounds
> 
> I suspect there may be something wrong with my data.frame (I'm certain there 
> is nothing wrong with datadist()), but I don't know what.  Has anyone 
> experienced the mistake I seem to be making?
> 
> Regards,
> Cody Hamilton
> Edwards Lifesciences
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Constructing bar charts with standard error bars

2007-07-25 Thread Frank E Harrell Jr
John Zabroski wrote:
> I am new to R.
> 
> I want to graph group data using a "Traditional Bar Chart with Standard
> Error Bar", like the kind shown here:
> http://samiam.colorado.edu/~mcclella/ftep/twoGroups/twoGroupGraphs.html

There are severe problems with dynamite plots such as these.  See 
http://biostat.mc.vanderbilt.edu/DynamitePlots for a list of problems 
and solutions.

Frank

> 
> Is there a simple way to do this?
> 
> So far, I have only figured out how to plot the bars using barplot.
> 
> testdata <- scan(, list(group=0,xbar=0,se=0))
> 400 0.36038 0.02154
> 200 0.35927 0.02167
> 100 0.35925 0.02341
> 50 0.35712 0.01968
> 25 0.35396 0.01931
> 
> barplot(testdata$xbar, names.arg=as.character(testdata$group), main="a=4.0",
> xlab="Group", ylab="xbar")
> xvalues <- c(0.7, 1.9, 3.1, 4.3, 5.5)
> arrows(xvalues, testdata$xbar, xvalues, testdata$xbar+testdata$se, length=
> 0.4, angle=90, code=3)
> 
> 
> The best clue I have so far is Rtips #5.9:
> http://pj.freefaculty.org/R/Rtips.html#5.9 which is what I based my present
> solution off of.
> 
> However, I do not understand how this works.  It seems like there is no
> concrete way to determine the arrow drawing parameters x0 and x1 for a
> barplot.  Moreover, the bars seem to be "cut off".
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Off-topic: simulate time dependent covariates

2007-07-22 Thread Frank E Harrell Jr
Jin Lo wrote:
> Dear R friends,
> 
> this is an off-topic question. Could you please point
> me to ways (e.g., references or even R code) for
> simulating time varying covariates in a survival
> analysis setting.
> 
> Thanks in advance for any responses.
> 
> yours sincerely,
> 
> Jin

I'm not sure if this article addresses time-dependent covariates but it 
may be worth a look:

@Article{ben05gen,
  author  = {Bender, Ralf and Augustin, Thomas and Blettner, Maria},
  title   = {Generating survival times to simulate {Cox} proportional
             hazards models},
  journal = {Statistics in Medicine},
  year    = 2005,
  volume  = 24,
  pages   = {1713-1723},
  annote  = {simulation setup; simulating survival times; censoring;
             Cox model; errata 25:1978-1979}
}

The spower function in the Hmisc package can simulate data for a 
non-constant-hazard ratio treatment effect.  The user specifies the 
hazard ratio as a function of time.

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] exclude1 in summary.formula from Hmisc

2007-07-20 Thread Frank E Harrell Jr
david dav wrote:
> Here is a peace of the data and code :
> 
> sex <-c(2,2,1,1,1,1,1,1,1,2,1,1,1,1,1,2,2,1,1,2,2)
> AGN <- 
> c("C","C","C","C","C","A","A","C","B","B","C","C","C","C","C","C","B","B","C","C","C")
>  
> 
> X <- c(2,2,2,2,1,2,2,2,2,2,1,1,1,1,1,1,1,1,1,2,2)
> 
> varqual <- data.frame(sex,AGN,X)
> 
> desqual <- summary.formula(varqual$X ~.,  data = subset( varqual, select 
> = -X), method = "reverse",  overall = T, test = T, long = T, exclude1 = F)
> 
> desqual doesn't show the results for sex ==1 as it is redundant.
> I also tried long =T wich didn't change anything here.

Oh yes.  exclude1 is not an argument to summary.formula but is an 
argument to the print, plot, and latex methods.  So do print(desqual, 
exclude1=FALSE, long=TRUE).  Thanks for the reproducible example.

Note that you say summary( ) and don't need to type out summary.formula

> 
> Bonus question if I may :
> This function is a little complex for me and I couldn't figure out
> how to get either Yates' continuity correction or Fisher's exact test instead of
> the chi-square test. Does it require much additional coding?

Neither of those is recommended so they are not automatically supported. 
  But users can add their own test functions - see the help file for the 
catTest argument to summary.formula.  Fisher's test is conservative. 
The Yates' continuity correction tries to mimic the conservatism of 
Fisher's test.  I don't like to get larger P-values when I can avoid it. 
  And the recommendations about worrying about the chi-square 
approximation when some cell sizes are small are a bit overdone.  Better 
might be to use the likelihood ratio tests, and many other tests are 
available.
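
If Fisher's test is nevertheless wanted, a user-written catTest can be supplied; a rough sketch follows (the list components a catTest must return are documented in ?summary.formula, and the component names used here are assumed from that documentation):

fisherTest <- function(tab) {
  ft <- fisher.test(tab)
  list(P = ft$p.value, stat = NA, df = NA,
       testname = "Fisher's exact", statname = '',
       latexstat = '', plotmathstat = '')
}
desqual.f <- summary(varqual$X ~ ., data = subset(varqual, select = -X),
                     method = 'reverse', test = TRUE, catTest = fisherTest)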

Frank

> 
> Regards.
> 
> David


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] exclude1 in summary.formula from Hmisc

2007-07-20 Thread Frank E Harrell Jr
david dav wrote:
> Dear Users,
> Still having troubles with sharing my results, I'm trying to display a
> contingency table using summary.formula.
> Computing works well but I'd like to display information on redundant
> entries say, males AND females.

Please tell us what output you got.  And it is sometimes helpful to get 
a trivial example of the failure with simulated data.

Also see a related parameter "long".

Frank

> I wonder if this code is correct :
> 
> desqualjum <- summary.formula(varqual1$X ~.,  data = subset( varqual1,
> select = -X), method = "reverse",  overall = T, test = T, exclude1 = FALSE)
> where varqual1 is the data frame containing the variable "sex".
> 
> I'm using R 2.5.1 and Hmisc 3.4-2 on a Mac (OSX.4, intel)  but I had the
> same trouble with  former versions of R and the package.
> 
> Thank you for your help.
> David
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Package for .632 (and .632+) bootstrap and the cross-validation of ROC Parameters

2007-07-13 Thread Frank E Harrell Jr
spime wrote:
> Suppose I have
> 
> Training data: my.train
> Testing data: my.test

The bootstrap does not need split samples.

> 
> I want to calculate bootstrap error rate for logistic model. My wrapper
> function for prediction
> 
> pred.glm <- function(object, newdata) {
> ret <- as.factor(ifelse(predict.glm(object, newdata,
> type='response') < 0.4, 0, 1))
> return(ret)
> }
> 
> But I can't work out: if I want to calculate the misclassification error
> for my testing data, what should 'data' be in the following formula?

Misclassification error has many problems because it is not a proper 
scoring rule, i.e., it is optimized by bogus models.

Frank

> 
> errorest(RES ~., data=???, model=glm, estimator="boot", predict=pred.glm, 
>est.para=control.errorest(nboot = 10))
> 
> Using my.test got following error,
> 
> Error in predict(mymodel, newdata = outbootdata) : 
> unused argument(s) (newdata = list(RES = c(1, 0, 0, 0, 1, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
> 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
> 0), CAT01 = c(4, 4, 2, 4, 4, 4, 4, 4, 4, 2, 1, 2, 2, 4, 4, 4, 1, 1, 2, 2, 1,
> 4, 1, 4, 1, 4, 2, 4, 1, 4, 2, 3, 1, 1, 3, 3, 4, 2, 4, 2, 1, 2, 2, 1, 1, 
> 
> please reply...
> 
> 
> 
> 
> 
> 
> Frank E Harrell Jr wrote:
>> spime wrote:
>>> Hi users,
>>>
>>> I need to calculate .632 (and .632+) bootstrap and the cross-validation
>>> of
>>> area under curve (AUC) to compare my models. Is there any package for the
>>> same. I know about 'ipred' and using it i can calculate misclassification
>>> errors. 
>>>
>>> Please help. It's urgent. 
>> See the validate* functions in the Design package.
>>
>> Note that some simulations (see http://biostat.mc.vanderbilt.edu/rms) 
>> indicate that the advantages of .632 and .632+ over the ordinary 
>> bootstrap are highly dependent on the choice of the accuracy measure 
>> being validated.  The bootstrap variants seem to have advantages mainly 
>> if an improper, inefficient, discontinuous scoring rule such as the 
>> percent classified correct is used.
>>
>> -- 
>> Frank E Harrell Jr   Professor and Chair   School of Medicine
>>   Department of Biostatistics   Vanderbilt University
>>
>> __
>> R-help@stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Package for .632 (and .632+) bootstrap and the cross-validation of ROC Parameters

2007-07-12 Thread Frank E Harrell Jr
spime wrote:
> 
> Hi users,
> 
> I need to calculate .632 (and .632+) bootstrap and the cross-validation of
> area under curve (AUC) to compare my models. Is there any package for the
> same. I know about 'ipred' and using it i can calculate misclassification
> errors. 
> 
> Please help. It's urgent. 

See the validate* functions in the Design package.

Note that some simulations (see http://biostat.mc.vanderbilt.edu/rms) 
indicate that the advantages of .632 and .632+ over the ordinary 
bootstrap are highly dependent on the choice of the accuracy measure 
being validated.  The bootstrap variants seem to have advantages mainly 
if an improper, inefficient, discontinuous scoring rule such as the 
percent classified correct is used.

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] variable and value labels

2007-07-11 Thread Frank E Harrell Jr
Steve Powell wrote:
> Dear list
> I am new here - enjoying the power of R compared to SPSS.
> Looking for sets of tips and tricks for people with old SPSS habits.
> In particular, would like to know an easy way to set variable labels 
> across a dataset and to set value labels for sets of variables.
> Grateful for any help,
> Steve Powell

In the Hmisc package see the functions spss.get, label, upData, and 
describe.
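
A minimal sketch of that workflow (the file name and labels are invented,
and argument names may differ slightly across Hmisc versions):

library(Hmisc)
d <- spss.get("survey.sav", use.value.labels = TRUE)  # carries SPSS labels over
label(d$age) <- "Age at interview (years)"            # label a single variable
d <- upData(d, labels = c(sex = "Sex", inc = "Household income"))
describe(d)                                           # labelled summary of every variable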
Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Incidence estimated from Kaplan-Meier

2007-07-05 Thread Frank E Harrell Jr
Nguyen Dinh Nguyen wrote:
> Dear all,
> 
> I have a stat question that may not be related to R, but I would like to
> have your advice.
> 
>  
> 
> I have just read a medical paper in which the authors report the 1-p (where
> p is the cumulative survival probability from the Kaplan Meier curve) as
> incidence of disease.  
> 
>  
> 
> Specifically, the study followed ~12000 women on drug A and ~2 women on
> drug B for 12 months.  During that period 29 women on drug A and 80 on drug
> B had the disease.  The incidence of disease for A and B was 0.24% and 0.30%
> respectively.  However, instead of reporting these numbers, they report the
> 1-p figure which was 0.3% for A and 0.6% for B. 
> 
>  
> 
> So, the incidence from 1-p was substantially higher than the actual
> incidence.  My question is: is it appropriate to use 1-p estimated from
> Kaplan-Meier as the incidence of disease?  If not, why not? 
> 
>  
> 
> Regards,
> 
> Nguyen

Yes it's appropriate, and it makes you state the cumulative incidence by 
time t rather than leaving time unspecified.  In your example it is 
likely that all women weren't followed completely, so simple incidences 
are not appropriate to compute because the denominator is not constant.
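
A minimal sketch of reading the cumulative incidence 1 - S(12 months) off a
Kaplan-Meier fit (the data frame and variable names are hypothetical):

library(survival)
fit <- survfit(Surv(months, disease) ~ drug, data = women)
s12 <- summary(fit, times = 12)
data.frame(group = s12$strata, cum.incidence = 1 - s12$surv)  # 1 - S(12) per drug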

Frank

> 
>  
> 
>  
> Nguyen Dinh Nguyen, 
> 
> Bone and Mineral Research Program 
> Garvan Institute of Medical Research 
> St Vincent's Hospital 
> 384 Victoria Street, Darlinghurst 
> Sydney, NSW 2010 
> Australia 
> Tel; 61-2-9295 8274 
> Fax: 61-2-9295 8241 
> E-mail: [EMAIL PROTECTED] 
> 
>  
> 
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Harrell's C

2007-07-03 Thread Frank E Harrell Jr
Madero Jarabo, Rosario wrote:
> I need to calculate Harrell's C for some survival analyses using the Design 
> package with R version 2.4.1. How can I do it?
> 
> Rosario Madero
> Sección de Bioestadística
> Hospital Universitario La Paz
> Pºde la Castellana, 261
> 28046 Madrid, España
> Tfno: 917277112
> [EMAIL PROTECTED]

It's in the documentation for the Hmisc package.  Type
?rcorr.cens
?rcorrp.cens
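
A minimal sketch of a typical call (hypothetical data; the minus sign makes
larger predictions correspond to longer survival, and the element name of
the returned vector may differ slightly by Hmisc version):

library(Hmisc)
library(survival)
fit <- coxph(Surv(time, status) ~ age + sex, data = mydata)
rcorr.cens(-predict(fit, type = "lp"), Surv(mydata$time, mydata$status))["C Index"]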

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Formula syntax question

2007-07-03 Thread Frank E Harrell Jr
Isaac Kohane wrote:
> Forgive me if this is obvious:
> 
>   I have a frame of data with the variables in each column (e.g.  
> Discrete_Variable1, ContinuousVariable_1, ContinuousVariable_2,  ...   
> ContinuousVariable_n)
> 
>   and I want to create a model using lrm i.e.
>   model <- lrm(Discrete_Variable1 ~ ContinuousVariable_1,  
> data=lotsofdata)
> 
>   Is there a syntax for having all the continuous variables referenced  
> in the formula without having to enumerate them all?
> 
>   I've seen the ~ . notation but when I try
> 
> 
>   model <- lrm(Discrete_Variable1 ~  ., data=lotsofdata)
> 
>   I get this error:
> 
>   Error in terms.formula(formula, specials = "strat") :
>   '.' in formula and no 'data' argument
>   
> 
>   Any help is appreciated.
> 
> -Zak

It may be best to write a function to determine what is continuous (>= 
10 unique values for example, and numeric) and to run sapply on that 
function, over your data frame.  Then you could use lrm(y ~ ., 
data=mydata[continuous]) if it were not for a problem with lrm which 
Charles Thomas Dupont (the Design package maintainer) and I will work 
on.  Until then you can write a command to compose a formula, e.g.,

form <- as.formula(paste('y', paste(names(mydata)[continuous], 
collapse='+'), sep='~'))
lrm(form, data=mydata)
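
Putting the two steps together, a minimal sketch (the ">= 10 distinct
values and numeric" rule is just the example given above):

continuous <- sapply(mydata, function(z) is.numeric(z) && length(unique(z)) >= 10)
preds <- setdiff(names(mydata)[continuous], "y")   # drop the response if it is numeric
form  <- as.formula(paste("y", paste(preds, collapse = "+"), sep = "~"))
lrm(form, data = mydata)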


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Warning message in Design ...

2007-06-29 Thread Frank E Harrell Jr
Petar Milin wrote:
> Hello!
> There were some questions regarding warning messages in Design library,
> but no answer how to fix them. What about:
> 
> "use of storage.mode(x) <- "single" is deprecated: use mode<- instead"
> 
> What does it mean? How to use properly simple model like:
> 
> m1 = ols(Y ~ A + B + C, data = someData)
> 
> After specifying the model above, warning message to use "mode(x)"
> instead of "storage.mode(x)" appears.
> 
> Best,
> PM

It's just a warning to be ignored.  It will go away in a future release 
of Design.

Frank

> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression and dummy variable coding

2007-06-29 Thread Frank E Harrell Jr
Bingshan Li wrote:
> Hi All,
> 
> Now it works. Thanks for all your answers and the explanations are  
> very clear.
> 
> Bingshan

But note that you are not using R correctly unless you are doing a 
simulation and have some special speed issues.  Let the model functions 
do all this for you.
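
A minimal sketch of letting the model function do the dummy coding, with
'b' as the reference level (simulated data for illustration):

set.seed(1)
d <- data.frame(grp = factor(sample(letters[1:5], 200, replace = TRUE)),
                y   = rbinom(200, 1, 0.4))
d$grp <- relevel(d$grp, ref = "b")          # 'b' becomes the reference category
glm(y ~ grp, family = binomial, data = d)   # coefficients are log odds ratios vs. 'b'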

Frank

> 
> On Jun 28, 2007, at 7:44 PM, Seyed Reza Jafarzadeh wrote:
> 
>> NewVar <- relevel( factor(OldVar), ref = "b")
>> should create a dummy variable, and change the reference category  
>> for the model.
>>
>> Reza
>>
>>
>> On 6/28/07, Bingshan Li <[EMAIL PROTECTED]> wrote:
>>> Hello everyone,
>>>
>>> I have a variable with several categories and I want to convert this
>>> into dummy variables and do logistic regression on it. I used
>>> model.matrix to create dummy variables but it always picked the
>>> smallest one as the reference. For example,
>>>
>>> model.matrix(~.,data=as.data.frame(letters[1:5]))
>>>
>>> will code 'a' as '0 0 0 0'. But I want to code another category as
>>> reference, say 'b'. How to do it in R using model.matrix? Is there
>>> other way to do it if model.matrix  has no such functionality?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> __
>>> R-help@stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting- 
>>> guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Marginal Effects of continuous variable in lrm model (Design package)

2007-06-26 Thread Frank E Harrell Jr
Minyu Chen wrote:
> Dear all:
> 
> When I am trying to get the marginal effects:
> 
> summary(result7,adv_inc_ratio=mean(m9201 
> $adv_inc_ratio),adv_price_ratio=mean(m9201$adv_price_ratio), ...(SOME  
> MORE CONTINUOUS AND DISCRETE VARIABLES BUT I AM NOT LISTING)... regW=c 
> (0,mean(m9201$regW),1), regWM=c(0,mean(m9201$regWM),1))
> 
> It gave out an error message:
> 
> Error in summary.Design(result7, adv_inc_ratio = mean(m9201 
> $adv_inc_ratio),  :
>   ranges not defined here or with datadist for adv_inc_ratio  
> adv_price_ratio agem1 change2 yr1Libor yr1Libor1mtLag yr1Libor1yrLag  
> change21yrLag change21mtLag fwdReal4yr fwdInfl4yr
> 
> But I remember from my previous operation a few months ago (I  
> recorded the commands) that to summary the marginal effect, I don't  
> have to specify the ranges for the discrete variables. However, I use  
> the command again (with slight modification because of the newly  
> added variables) it doesn't work. I don't know what went wrong.
> 
> Thank you very much.
> 
> Thanks,
> Minyu Chen

Please read the package's help files especially about the datadist function.
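
A minimal sketch of the datadist setup that summary.Design relies on (the
model formula below is only a placeholder built from variable names in the
post):

library(Design)
dd <- datadist(m9201)
options(datadist = "dd")    # ranges for continuous predictors now known to summary( )
result7 <- lrm(y ~ adv_inc_ratio + adv_price_ratio + regW * regWM, data = m9201)
summary(result7)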

Frank


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] help on the use of ldBand

2007-06-22 Thread Frank E Harrell Jr
Tomas Goicoa wrote:
>   Hi R Users,
> 
>   I am trying to use the ldBand package. Together 
> with the package, I have downloaded the ld98 
> program (version for windows) as indicated in the 
> help page on ldBand. I did it, but obtained an 
> error message "Error in (head + 1):length(w) : 
> Argument NA/NaN" when I copied the help examples, 
> so it seems that a conection between R and ld98 
> is not well performed in my computer.  Did I put 
> ld98.exe in the wrong place? If so,  where should 
> I put it? Thanks a lot in advance,
> 
> 
>   Berta Ibañez Beroiz

Do you mean the Hmisc package?  Do you mean the ldBands function?  Did 
you put ld98.exe in a place that is in your system path as specified in 
the ldBands help file?  And please read the posting guide referenced below.

Frank

> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] BIC and Hosmer-Lemeshow statistic for logistic regression

2007-06-19 Thread Frank E Harrell Jr
spime wrote:
> 
> Is there any windows version of Design package???

Soon the new version will make its way to Windows, probably in a 
day or two.

Frank

> 
> 
> 
> 
> 
> 
> Frank E Harrell Jr wrote:
>> spime wrote:
>>> I haven't found any helpful thread. How can I calculate the BIC and
>>> Hosmer-Lemeshow statistic for a logistic regression model? I have used
>>> glm
>>> for the logistic fit.
>> See the Design package's lrm function and residuals.lrm for a better GOF 
>> test.
>>
>>
>>
>> -- 
>> Frank E Harrell Jr   Professor and Chair   School of Medicine
>>   Department of Biostatistics   Vanderbilt University
>>
>> __
>> R-help@stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] BIC and Hosmer-Lemeshow statistic for logistic regression

2007-06-19 Thread Frank E Harrell Jr
spime wrote:
> 
> I haven't found any helpful thread. How can I calculate the BIC and
> Hosmer-Lemeshow statistic for a logistic regression model? I have used glm
> for the logistic fit.

See the Design package's lrm function and residuals.lrm for a better GOF 
test.
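
A minimal sketch with hypothetical data; as I read the help file, lrm must
be called with x=TRUE, y=TRUE for the goodness-of-fit residuals to be
available:

library(Design)
f <- lrm(y ~ x1 + x2, data = mydata, x = TRUE, y = TRUE)
resid(f, type = "gof")   # le Cessie-van Houwelingen global goodness-of-fit test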

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] psm/survreg coefficient values ?

2007-06-18 Thread Frank E Harrell Jr
sj wrote:
> I am using psm to model some parametric survival data, the data is for
> length of stay in an emergency department. There are several ways a
> patient's stay in the emergency department can end (discharge, admit, etc.)
> so I am looking at modeling the effects of several covariates on the various
> outcomes. Initially I am trying to fit a  survival model for each type of
> outcome using the psm function in the Design package, i.e., all patients
> whose visits come to an end due to any event other than the event of
> interest are considered to be censored.  Being new to the psm and  survreg
> packages (and to parametric survival modeling) I am not entirely sure how to
> interpret the coefficient values that psm returns. I have included the
> following code to illustrate code similar to what I am using on my data. I
> suppose that the coefficients are somehow rescaled , but I am not sure how
> to return them to the original scale and make sense out of the coefficients,
> e.g., estimate the the effect of higher acuity on time to event in minutes.
> Any explanation or direction on how to interpret the  coefficient values
> would be greatly appreciated.
> 
> this is from the documentation for survreg.object.
> coefficientsthe coefficients of the linear.predictors, which multiply the
> columns of the model matrix. It does not include the estimate of error
> (sigma). The names of the coefficients are the names of the
> single-degree-of-freedom effects (the columns of the model matrix). If the
> model is over-determined there will be missing values in the coefficients
> corresponding to non-estimable coefficients.
> 
> code:
> LOS <- sort(rweibull(1000,1.4,108))
> AGE <- sort(rnorm(1000,41,12))
> ACUITY <- sort(rep(1:5,200))
> EVENT <-  sample(x=c(0,1),replace=TRUE,1000)
> psm(Surv(LOS,EVENT)~AGE+as.factor(ACUITY),dist='weibull')
> 
> output:
> 
> psm(formula = Surv(LOS, CENS) ~ AGE + as.factor(ACUITY), dist = "weibull")
> 
>          Obs  Events  Model L.R.  d.f.  P    R2
>         1000     513     2387.62     5  0  0.91
> 
>                Value  Std. Error       z          p
> (Intercept)   1.1055     0.04425   24.98  8.92e-138
> AGE           0.0772     0.00152   50.93   0.00e+00
> ACUITY=2      0.0944     0.01357    6.96   3.39e-12
> ACUITY=3      0.1752     0.02111    8.30   1.03e-16
> ACUITY=4      0.1391     0.02722    5.11   3.18e-07
> ACUITY=5     -0.0544     0.03789   -1.43   1.51e-01
> Log(scale)   -2.7287     0.03780  -72.18   0.00e+00
> 
> Scale= 0.0653
> 
> best,
> 
> Spencer

I have a case study using psm (survreg wrapper) in my book.  Briefly, 
coefficients are on the log median survival time scale.
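
A minimal sketch of getting back to the time scale for the model in the
post; under this accelerated failure time parameterization, exp(coefficient)
multiplies the median (and every other quantile of) survival time:

f <- psm(Surv(LOS, EVENT) ~ AGE + as.factor(ACUITY), dist = "weibull")
exp(coef(f))   # e.g. exp(0.0944) ~ 1.10: ACUITY=2 stays are about 10% longer than ACUITY=1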

Frank


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] error bars on survival curve

2007-06-17 Thread Frank E Harrell Jr
Murray Pung wrote:
> I am using plot(survfit(Surv(time,status) ~...) and would like to add
> error bars rather than the confidence intervals. Am I able to do this
> at specified times? e.g. when time = 20 & 40.
> 
> 
> leukemia.surv <- survfit(Surv(time, status) ~ x, data = aml)
> plot(leukemia.surv, lty = 2:3,xlim = c(0,50))
> #can i add error bars at times 20 & 40?
> legend(100, .9, c("Maintenance", "No Maintenance"), lty = 2:3)
> 
> 
> Thanks
> Murray

Error bars when done correctly should equal confidence intervals in the 
minds of many.

When the Design package is available again for the latest version of R, 
you can have more control using the survplot function, with bars and 
bands options.  Do ?survplot.survfit
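
A minimal sketch of what that might look like; the conf and time.inc
argument names are from memory, so check ?survplot.survfit before relying
on them:

library(Design)
leukemia.surv <- survfit(Surv(time, status) ~ x, data = aml)
survplot(leukemia.surv, conf = "bars", time.inc = 20, xlim = c(0, 50))  # bars at 20, 40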

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] complex contrasts and logistic regression

2007-06-16 Thread Frank E Harrell Jr
Nicholas Lewin-Koh wrote:
> Hi,
> I am doing a retrospective analysis on a cohort from a designed trial,
> and I am fitting
> the model
> 
> fit<-glmD(survived ~ Covariate*Therapy + confounder,myDat,X=TRUE,
> Y=TRUE, family=binomial()) 

For logistic regression you can also use Design's lrm function which 
gives you more options.

> 
> My covariate has three levels ("A","B" and "C") and therapy has two
> (treated and control), confounder is a continuous variable.
> Also patients were randomized to treatment in the trial, but Covariate
> is something that is measured
> posthoc and can vary in the population.

If by posthoc you mean that the covariate is measured after baseline, it 
is difficult to get an interpretable analysis.

>  
> I am trying to wrap my head around how to calculate a few quantities
> from the model
> and get reasonable confidence intervals for them, namely I would like to
> test
> 
> H0: gamma=0, where gamma is the regression coefficient of the odds
> ratios of surviving
>  under treatment vs control at each level of Covariate
>  (adjusted for the confounder)

You mean regression coefficient on the log odds ratio scale.  This is 
easy to do with the contrast( ) function in Design.  Do ?contrast.Design 
for details and examples.
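
A minimal sketch of that contrast, treatment vs. control at each level of
Covariate with the confounder held at its median (syntax as in
?contrast.Design; lrm is used here as suggested above):

fit <- lrm(survived ~ Covariate * Therapy + confounder, data = myDat)
contrast(fit,
         list(Therapy = "Treated", Covariate = c("A", "B", "C"),
              confounder = median(myDat$confounder)),
         list(Therapy = "Control", Covariate = c("A", "B", "C"),
              confounder = median(myDat$confounder)))  # log odds ratios with CIs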

> 
> and I would like to get the odds of surviving at each level of Covariate
> under treatment and control
> for each level of covariate adjusted for the confounder. I have looked
> at contrast in the Design 
> library but I don't think it gives me the right quantity, for instance 
> 
> contrast(fit,list(covariate="A", Therapy="Treated",
> confounder=median(myDat$confounder), X=TRUE)
> ( "A" is the baseline level of Covariate) 
> 
> gives me beta0 + beta_Treated + beta_confounder*68  
> 
> Is this correctly interpreted as the conditional odds of dying? 
> As to the 1st contrast I am not sure how to get it, would it be using
> type = 'average' with some weights 
> in contrast? The answers are probably staring me in the face, i am just
> not seeing them today.

contrast( ) is for contrasts (differences).  Sounds like you want 
predicted values.  Do ?predict  ?predict.lrm  ?predict.Design.  Also do 
?gendata which will generate a data frame for getting predictors, with 
unspecified predictors set to reference values such as medians.
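
For the predicted values, a minimal sketch (gendata/predict argument names
vary a little between versions, so treat this as a starting point):

newd <- gendata(fit, Covariate = c("A", "B", "C"), Therapy = "Treated")
## unspecified predictors (here the confounder) are set to reference values
predict(fit, newd, type = "fitted")   # predicted probabilities under treatment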

Frank

> 
> Nicholas
> 
> 
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] selecting cut-off in Logistic regression using ROCR package

2007-06-16 Thread Frank E Harrell Jr
Tirthadeep wrote:
> 
> Hi,
> 
> I am using logistic regression to classify binary psychometric data. Using
> glm() and then predict.glm() I got the predicted odds ratios for the testing
> data. Next I am going to plot a ROC curve for the analysis of my study.
> 
> Now what i will do:
> 
> 1. first select a cut-off (say 0.4) and classify the output of predict.glm()
> into {0,1} segment and then use it to draw ROC curve using ROCR package 
> 
> OR
> 
> 2. just use the predicted odds ratio in ROCR package to get "error rate" and
> use the minimum error rate (as new cut-off) to draw new ROC curve.
> 
> waiting for reply.
> 
> with regards and thanks.
> 
> Tirtha.

It's not clear why any cutoff or ROC curve is needed.  Please give us 
more information about why a continuous variable should be dichotomized, 
and read

@Article{roy06dic,
  author  = {Royston, Patrick and Altman, Douglas G. and Sauerbrei, Willi},
  title   = {Dichotomizing continuous predictors in multiple regression:
             a bad idea},
  journal = {Statistics in Medicine},
  year    = 2006,
  volume  = 25,
  pages   = {127-141},
  annote  = {continuous covariates; dichotomization; categorization;
             regression; efficiency; clinical research; residual
             confounding; destruction of statistical inference when
             cutpoints are chosen using the response variable; varying
             effect estimates from change in cutpoints; difficult to
             interpret effects when dichotomizing; nice plot showing
             effect of categorization; PBC data}
}

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Design library installation problem

2007-06-13 Thread Frank E Harrell Jr
Uwe Ligges wrote:
> 
> 
> Ian Watson wrote:
>> Dear Listers
>>
>> I have tried to install Frank Harrell's two libaries: Hmisc and Design.
>>
>> I found that Hmisc was listed in the list of packages from the Install 
>> Packages command on the Packages menu, but Design was not. I installed 
>> Hmisc from this list, and when I   issued the library(Hmisc) command, 
>> it loaded into memory correctly.
>>
>> I then copied the Design 1.1-1.zip file from the 
>> http://lib.stat.cmu.edu/S/Harrell/library/r/ site and used the Install 
>> Packages from Local Zip file command.
>> I received no error messages and a visual inspection of the R\library 
>> directory shows Design has been installed.
>>
>> However, when I issued the library(Design) command I get the following 
>> error message:
>>
>> Error in library(Design) : 'Design' is not a valid package -- 
>> installed < 2.0.0?
>>
>>
>> I also notice, from a visual inspection of the R\library\Design\R 
>> directory that there is only one file: design. In other directories, 
>> eg. R\library\Hmisc\R there are usually 3 files:
>> Hmisc
>> Hmisc.rdx
>> Hmisc.rdb
>>
>> I am new to R, and a bit lost. I have read the R-admin.pdf 
>> documentation on packages but am still unsure how to proceed from here.
>>
>> I would appreciate any advice, and any answers to the following 
>> questions:
>>
>> 1) is there a reason why Design is not listed in the Install Packages 
>> list as Hmisc is?
> 
> Yes. The current version does not pass the checks under Windows. Please 
> convince the maintainer to fix the package, and a binary will be made 
> available shortly.

Please note that this would be a lot easier if we did not get a segfault 
when running R CMD CHECK --- a segfault that does not happen when the 
example code is run outside of CMD CHECK.  Charles Thomas Dupont is 
working hard on this.

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R vs. Splus in Pharma/Devices Industry

2007-06-11 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
> Following up to some extent on Friday's discussion regarding the
> 'validation' of R, could I ask the list group's opinion on possible
> advantages of R over Splus from a pharma/devices perspective?  I wish to
> exclude the obvious price difference, which doesn’t seem to carry as much
> weight as I would have thought.  Besides, I have noticed many former Splus
> users gravitating towards R, and I suspect that the reasons are not purely
> economic.
> 
> I can think of a few advantages of Splus:
> 1. SeqTrial (of course that means more $)
> 2. Tech support
> 3. The warm fuzzies that management seems to get from proprietary software
> 
> I can also think of a few advantages of R:
> 1. Based on my personal experiences, simulations requiring a lot of looping
> seem to run faster.
> 2. R interfaces with BUGS, for example through BRUGS.
> 3. The wonderful help list!
> 
> As always, I am speaking for myself and not necessarily for Edwards
> Lifesciences.
> 
> Regards,
>-Cody
> 
> Cody Hamilton, PhD
> Edwards Lifesciences
>   [[alternative HTML version deleted]]

A big one for us is plotmath (for clinical trial reports we put a lot of 
greek letters and subscripts on plots), and later we will consider 
migrating a lot of our stuff to the ggplot package which I don't think 
is available in S-Plus.  Lexical scoping is another advantage of R as is 
the ability to reference files on the internet.

Frank

> 
> 
> 
> 
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Frank E Harrell Jr
Dale Steele wrote:
> For windows users, EpiData Entry  is an
> excellent (free) tool for data entry and documentation.--Dale

Note that EpiData seems to work well under linux using wine.
Frank

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] "R is not a validated software package.."

2007-06-08 Thread Frank E Harrell Jr
Bert Gunter wrote:
> Frank et. al:
> 
> I believe this is a bit too facile. 21 CFR Part 11 does necessitate a
> software validation **process** -- but this process does not require any

For database software and for medical devices -

> particular software. Rather, it requires that those using whatever software
> demonstrate to the FDA's satisfaction that the software does what it's
> supposed to do appropriately. This includes a lot more than assuring, say,
> the numerical accuracy of computations; I think it also requires
> demonstration that the data are "secure," that it is properly transferred
> from one source to another, etc. I assume that the statistical validation of
> R would be relatively simple, as R already has an extensive test suite, and
> it would simply be a matter of providing that test suite info. A bit more
> might be required, but I don't think it's such a big deal. 
> 
> I think Wensui Liu's characterization of clinical statisticians as having a
> mentality "related to job security" is a canard. Although I work in
> nonclinical, my observation is that clinical statistics is complex and
> difficult, not only because of many challenging statistical issues, but also
> because of the labyrinthian complexities of the regulated and extremely
> costly environment in which they work. It is certainly a job that I could
> not do.
> 
> That said, probably the greatest obstacle to change from SAS is neither
> obstinacy nor ignorance, but rather inertia: pharmaceutical companies have
> over the decades made a huge investment in SAS infrastructure to support the
> collection, organization, analysis, and submission of data for clinical
> trials. To convert this to anything else would be a herculean task involving
> huge expense, risk, and resources. R, S-Plus (and much else -- e.g. numerous
> "unvalidated" data mining software packages) are routinely used by clinical
> statisticians to better understand their data and for "exploratory" analyses
> that are used to supplement official analyses (e.g. for trying to justify
> collection of tissue samples or a pivotal study in a patient subpopulation).
> But it is difficult for me to see how one could make a business case to
> change clinical trial analysis software infrastructure from SAS to S-Plus,
> SPSS, or anything else.

What I would love to have is some efficiency estimates for SAS macro 
programming as done in pharma vs. using a high-level language.  My bias 
is that SAS macro programming, which costs companies more than SAS 
licenses, is incredibly inefficient.

Frank

> 
> **DISCLAIMER** 
> My opinions only. They do not in any way represent the view of my company or
> its employees.
> 
> 
> Bert Gunter
> Genentech Nonclinical Statistics
> South San Francisco, CA 94404
> 650-467-7374
> 
> 
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Frank E Harrell Jr
> Sent: Friday, June 08, 2007 7:45 AM
> To: Giovanni Parrinello
> Cc: r-help@stat.math.ethz.ch
> Subject: Re: [R] "R is not a validated software package.."
> 
> Giovanni Parrinello wrote:
>> Dear All,
>> discussing with a statistician of a pharmaceutical company I received 
>> this answer about the statistical package that I have planned to use:
>>
>> As R is not a validated software package, we would like to ask if it 
>> would rather be possible for you to use SAS, SPSS or another approved 
>> statistical software system.
>>
>> Could someone suggest me a 'polite' answer?
>> TIA
>> Giovanni
>>
> 
> Search the archives and you'll find a LOT of responses.
> 
> Briefly, in my view there are no requirements, just some pharma 
> companies that think there are.  FDA is required to accept all 
> submissions, and they get some where only Excel was used, or Minitab, 
> and lots more.  There is a session on this at the upcoming R 
> International Users Meeting in Iowa in August.  The session will include 
> discussions of federal regulation compliance for R, for those users who 
> feel that such compliance is actually needed.
> 
> Frank
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] "R is not a validated software package.."

2007-06-08 Thread Frank E Harrell Jr
Sicotte, Hugues Ph.D. wrote:
> People, don't get angry at the pharma statistician, he is just trying to
> abide by an FDA requirement that is designed to ensure that tests perform
> reliably the same. There is no point in getting into which product is
> better. As far as the FDA rules are concerned a validated system beats a
> "better" system any day of the week.

There is no such requirement.

> 
> Here is your polite answer.
> You can develop and try your software in R.
> Should they need to use those results in a report that will matter to
> the FDA, then you can work together with him to set up a validated
> environment for S-plus. You then have to commit to port your code to
> S-plus.

That doesn't follow.  What matters is good statistical analysis practice 
no matter which environment you use.  Note that more errors are made in 
the data preparation / derived variables stage than are made by 
statistical software.

Frank

> 
> As I assume that you do not work in a regulated environment, you
> probably wouldn't have access to a validated SAS environment anyways. It
> is not usually enough to install a piece of software, you have to
> validate every step of the installation. Since AFAIK the FDA uses
> S-plus, it would be to your pharma person's advantage to speed-up
> submissions if they also had a validated S-plus environment.
> 
> http://www.msmiami.com/custom/downloads/S-PLUSValidationdatasheet_Final.
> pdf
> 
> 
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Wensui Liu
> Sent: Friday, June 08, 2007 9:24 AM
> To: Giovanni Parrinello
> Cc: r-help@stat.math.ethz.ch
> Subject: Re: [R] "R is not a validated software package.."
> 
> I like to know the answer as well.
> To be honest, I really have hard time to understand the mentality of
> clinical trial guys and rather believe it is something related to job
> security.
> 
> On 6/8/07, Giovanni Parrinello <[EMAIL PROTECTED]> wrote:
>> Dear All,
>> discussing with a statistician of a pharmaceutical company I received
>> this answer about the statistical package that I have planned to use:
>>
>> As R is not a validated software package, we would like to ask if it
>> would rather be possible for you to use SAS, SPSS or another approved
>> statistical software system.
>>
>> Could someone suggest me a 'polite' answer?
>> TIA
>> Giovanni
>>
>> --
>> dr. Giovanni Parrinello
>> External Lecturer
>> Medical Statistics Unit
>> Department of Biomedical Sciences
>> Viale Europa, 11 - 25123 Brescia Italy
>> Tel: +390303717528
>> Fax: +390303717488
>> email: [EMAIL PROTECTED]
>>
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-help@stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] "R is not a validated software package.."

2007-06-08 Thread Frank E Harrell Jr
Giovanni Parrinello wrote:
> Dear All,
> discussing with a statistician of a pharmaceutical company I received 
> this answer about the statistical package that I have planned to use:
> 
> As R is not a validated software package, we would like to ask if it 
> would rather be possible for you to use SAS, SPSS or another approved 
> statistical software system.
> 
> Could someone suggest me a 'polite' answer?
> TIA
> Giovanni
> 

Search the archives and you'll find a LOT of responses.

Briefly, in my view there are no requirements, just some pharma 
companies that think there are.  FDA is required to accept all 
submissions, and they get some where only Excel was used, or Minitab, 
and lots more.  There is a session on this at the upcoming R 
International Users Meeting in Iowa in August.  The session will include 
discussions of federal regulation compliance for R, for those users who 
feel that such compliance is actually needed.

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-07 Thread Frank E Harrell Jr
> much data transformation software. Not many people work on data
> preparation software. In fact, the category is so obscure that there
> isn't one agreed term: data cleaning, data munging, data crunching,
> or just getting the data ready for analysis.
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Harrell's C

2007-06-02 Thread Frank E Harrell Jr
Wachtel, Mitchell wrote:
> R will not let me load Design or Hmisc anymore. I need to calculate Harrell's 
> C for some survival analyses. Can anyone provide an R formula for Harrell's C?
> With kindest regards.
> Mitchell Wachtel, MD

You provided very little information.  We have had no reports of Hmisc 
not loading under Windows but Design is not currently available for 
Windows for R 2.5.  We are working to fix that.

Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problem with Weighted Variance in Hmisc

2007-06-01 Thread Frank E Harrell Jr
jiho wrote:
> On 2007-June-01  , at 01:03 , Tom La Bone wrote:
>> The function wtd.var(x,w) in Hmisc calculates the weighted variance of x,
>> where w are the weights.  I would expect wtd.var(x,w) to equal var(x) when
>> all of the weights are equal, but this does not appear to be the case.  Can
>> someone point out to me where I am going wrong here?  Thanks.
> 
> The true formula for the weighted variance is this one:
>   http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weighvar.pdf
> But for computation purposes, wtd.var uses another definition, which
> treats the weights as repeat counts instead of true weights. However, if
> the weights are normalized (sum to one) the two formulas are equal. If
> you consider the weights as real weights instead of repeats, I would
> recommend using this option.
> With normwt=T, your issue is solved:
> 
>  > a=1:10
>  > b=a
>  > b[]=2
>  > b
> [1] 2 2 2 2 2 2 2 2 2 2
>  > wtd.var(a,b)
> [1] 8.68421
> # all weights equal 2 <=> there are two repeats of each element of a
>  > var(c(a,a))
> [1] 8.68421
>  > wtd.var(a,b,normwt=T)
> [1] 9.17
>  > var(a)
> [1] 9.17
> 
> Cheers,
> 
> JiHO

The issue is what is being assumed for N in the denominator of the 
variance formula, since the unbiased estimator subtracts one.  Using 
normwt=TRUE means you are in effect assuming N is the number of elements 
in the data vector, ignoring the weights.
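
A small sketch of the two denominators being discussed, for the example
above (this mirrors the idea, not necessarily the exact internal code of
wtd.var):

x <- 1:10; w <- rep(2, 10)
xbar <- sum(w * x) / sum(w)
sum(w * (x - xbar)^2) / (sum(w) - 1)                          # weights as repeats: N = 20 -> 8.68
sum(w * (x - xbar)^2 * length(x) / sum(w)) / (length(x) - 1)  # normwt idea:        N = 10 -> 9.17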

Frank Harrell

> ---
> http://jo.irisson.free.fr/
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R's Spearman

2007-05-31 Thread Frank E Harrell Jr
Mendiburu, Felipe (CIP) wrote:
> Dear Ray,
> 
> The Spearman correlation calculated by R is correct whether or not there are 
> ties; what is not correct is the p-value in the case of ties. I am sending you 
> the formula for the correlation with ties, which agrees with R. 
> 
> Regards,
> 
> Felipe de Mendiburu
> Statistician

Just use midranks for ties (as with rank()) and compute the ordinary 
correlation on those (see also the spearman2 and rcorr functions in 
Hmisc package).  No need to use complex formulas.  And the t 
approximation for p-values works pretty well.
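
A two-line illustration with the data from this thread:

iq    <- c(106, 86, 97, 113, 120, 110)
hours <- c(7, 0, 20, 12, 12, 17)
cor(rank(iq), rank(hours))            # midranks for the tie, then ordinary correlation
cor(iq, hours, method = "spearman")   # same value, 0.2319084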

Frank Harrell

> 
> 
> # Spearman correlation "rs" with ties or no ties
> rs<-function(x,y) {
> d<-rank(x)-rank(y)
> tx<-as.numeric(table(x))
> ty<-as.numeric(table(y))
> Lx<-sum((tx^3-tx)/12)
> Ly<-sum((ty^3-ty)/12)
> N<-length(x)
> SX2<- (N^3-N)/12 - Lx
> SY2<- (N^3-N)/12 - Ly
> rs<- (SX2+SY2-sum(d^2))/(2*sqrt(SX2*SY2))
> return(rs)
> }
> 
> # Application
>> cor(y[,1],y[,2],method="spearman")
> [1] 0.2319084
>> rs(y[,1],y[,2])
> [1] 0.2319084
> 
> 
> 
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Raymond Wan
> Sent: Monday, May 28, 2007 10:29 PM
> To: r-help@stat.math.ethz.ch
> Subject: [R] R's Spearman
> 
> 
> 
> Hi all,
> 
> I am trying to figure out the formula used by R's Spearman rho (using 
> cor(method="spearman")) because I can't seem to get the same value as by 
> calculating "by hand".  Perhaps I'm using "cor" wrong, but I don't know 
> where.  Basically, I am running these commands:
> 
>  > y=read.table(file="tmp",header=TRUE,sep="\t")
>  > y
>    IQ Hours
> 1 106     7
> 2  86     0
> 3  97    20
> 4 113    12
> 5 120    12
> 6 110    17
>  > cor(y[1],y[2],method="spearman")
>Hours
> IQ 0.2319084
> 
> [it's an abbreviated example of one I took from Wikipedia].  I 
> calculated by hand (apologies if the table looks strange when pasted 
> into e-mail):
> 
>      IQ  Hours  rank(IQ)  rank(Hours)  diff  diff^2
> 1   106      7         3            2     1    1.00
> 2    86      0         1            1     0    0.00
> 3    97     20         2            6    -4   16.00
> 4   113     12         5          3.5   1.5    2.25
> 5   120     12         6          3.5   2.5    6.25
> 6   110     17         4            5    -1    1.00
>                                        sum    26.50
>
>   rho = 0.242857
> 
> where rho = 1 - (6 * 26.5) / (6 * (6^2 - 1)).  I kept modifying the 
> table and realized that the difference in result comes from ties.  i.e., 
> if I remove the tie in rows 4 and 5, I get the same result from both cor 
> and calculating by hand.  Perhaps I'm handling ties wrong...does anyone 
> know how R does it or perhaps I need to change how I'm using it?
> 
> Thank you!
> 
> Ray
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] AIC for lrm(Hmisc/Design) model.

2007-05-29 Thread Frank E Harrell Jr
Milton Cezar Ribeiro wrote:
> Dear all,
> 
> I am adjusting a Logistic Regression Model using lmr() function of 
> Hmisc/Design package. Now I would like to compute AIC for this model. How can 
> I do that?
> 
> Kind regards,
> 
> miltinho
> Brazil

I like to change AIC to have it on the chi-square scale.  For that you 
can do

aic <- function(fit)
   round(unname(fit$stats['Model L.R.'] - 2*fit$stats['d.f.']),2)

f <- lrm( )
aic(f)

If unname doesn't exist in S-Plus as it does in R, you can remove that part.

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] normality tests [Broadcast]

2007-05-25 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
> Following up on Frank's thought, why is it that parametric tests are so
> much more popular than their non-parametric counterparts?  As
> non-parametric tests require fewer assumptions, why aren't they the
> default?  The relative efficiency of the Wilcoxon test as compared to the
> t-test is 0.955, and yet I still see t-tests in the medical literature all
> the time.  Granted, the Wilcoxon still requires the assumption of symmetry
> (I'm curious as to why the Wilcoxon is often used when asymmetry is
> suspected, since the Wilcoxon assumes symmetry), but that's less stringent
> than requiring normally distributed data.  In a similar vein, one usually
> sees the mean and standard deviation reported as summary statistics for a
> continuous variable - these are not very informative unless you assume the
> variable is normally distributed.  However, clinicians often insist that I
> include these figures in reports.
> 
> Cody Hamilton, PhD
> Edwards Lifesciences

Well said Cody, just want to add that Wilcoxon does not assume symmetry 
if you are interested in testing for stochastic ordering and not just 
for a mean.

Frank

> 
> 
> 
> From: Frank E Harrell Jr
> To: Lucke, Joseph F
> Cc: r-help
> Subject: Re: [R] normality tests [Broadcast]
> Date: 05/25/2007 02:42 PM
>
> Lucke, Joseph F wrote:
>>  Most standard tests, such as t-tests and ANOVA, are fairly resistant to
>> non-normalilty for significance testing. It's the sample means that have
>> to be normal, not the data.  The CLT kicks in fairly quickly.  Testing
>> for normality prior to choosing a test statistic is generally not a good
>> idea.
> 
> I beg to differ Joseph.  I have had many datasets in which the CLT was
> of no use whatsoever, i.e., where bootstrap confidence limits were
> asymmetric because the data were so skewed, and where symmetric
> normality-based confidence intervals had bad coverage in both tails
> (though correct on the average).  I see this the opposite way:
nonparametric tests work fine if normality holds.
> 
> Note that the CLT helps with type I error but not so much with type II
> error.
> 
> Frank
> 
>> -Original Message-
>> From: [EMAIL PROTECTED]
>> [mailto:[EMAIL PROTECTED] On Behalf Of Liaw, Andy
>> Sent: Friday, May 25, 2007 12:04 PM
>> To: [EMAIL PROTECTED]; Frank E Harrell Jr
>> Cc: r-help
>> Subject: Re: [R] normality tests [Broadcast]
>>
>> From: [EMAIL PROTECTED]
>>> On 25/05/07, Frank E Harrell Jr <[EMAIL PROTECTED]> wrote:
>>>> [EMAIL PROTECTED] wrote:
>>>>> Hi all,
>>>>>
>>>>> apologies for seeking advice on a general stats question. I ve run
>>>>> normality tests using 8 different methods:
>>>>> - Lilliefors
>>>>> - Shapiro-Wilk
>>>>> - Robust Jarque Bera
>>>>> - Jarque Bera
>>>>> - Anderson-Darling
>>>>> - Pearson chi-square
>>>>> - Cramer-von Mises
>>>>> - Shapiro-Francia
>>>>>
>>>>> All show that the null hypothesis that the data come from a normal
>>>>> distro cannot be rejected. Great. However, I don't think
>>> it looks nice
>>>>> to report the values of 8 different tests on a report. One note is
>>>>> that my sample size is really tiny (less than 20
>>> independent cases).
>>>>> Without wanting to start a flame war, are th

Re: [R] normality tests [Broadcast]

2007-05-25 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
> Thank you all for your replies; they have been most useful... well
> in my case I have chosen to do some parametric tests (more precisely,
> correlation and linear regressions among some variables)... so it
> would be nice if I had an extra bit of support for my decisions... If I
> understood your replies well... I shouldn't pay so much
> attention to the normality tests, so it wouldn't matter which one/ones
> I use to report... but rather focus on issues such as the power of the
> test...

If doing regression I assume your normality tests were on residuals 
rather than raw data.

Frank

> 
> Thanks again.
> 
> On 25/05/07, Lucke, Joseph F <[EMAIL PROTECTED]> wrote:
>>  Most standard tests, such as t-tests and ANOVA, are fairly resistant to
>> non-normalilty for significance testing. It's the sample means that have
>> to be normal, not the data.  The CLT kicks in fairly quickly.  Testing
>> for normality prior to choosing a test statistic is generally not a good
>> idea.
>>
>> -Original Message-
>> From: [EMAIL PROTECTED]
>> [mailto:[EMAIL PROTECTED] On Behalf Of Liaw, Andy
>> Sent: Friday, May 25, 2007 12:04 PM
>> To: [EMAIL PROTECTED]; Frank E Harrell Jr
>> Cc: r-help
>> Subject: Re: [R] normality tests [Broadcast]
>>
>> From: [EMAIL PROTECTED]
>> >
>> > On 25/05/07, Frank E Harrell Jr <[EMAIL PROTECTED]> wrote:
>> > > [EMAIL PROTECTED] wrote:
>> > > > Hi all,
>> > > >
>> > > > apologies for seeking advice on a general stats question. I ve run
>>
>> > > > normality tests using 8 different methods:
>> > > > - Lilliefors
>> > > > - Shapiro-Wilk
>> > > > - Robust Jarque Bera
>> > > > - Jarque Bera
>> > > > - Anderson-Darling
>> > > > - Pearson chi-square
>> > > > - Cramer-von Mises
>> > > > - Shapiro-Francia
>> > > >
>> > > > All show that the null hypothesis that the data come from a normal
>>
>> > > > distro cannot be rejected. Great. However, I don't think
>> > it looks nice
>> > > > to report the values of 8 different tests on a report. One note is
>>
>> > > > that my sample size is really tiny (less than 20
>> > independent cases).
>> > > > Without wanting to start a flame war, are there any
>> > advices of which
>> > > > one/ones would be more appropriate and should be reported
>> > (along with
>> > > > a Q-Q plot). Thank you.
>> > > >
>> > > > Regards,
>> > > >
>> > >
>> > > Wow - I have so many concerns with that approach that it's
>> > hard to know
>> > > where to begin.  But first of all, why care about
>> > normality?  Why not
>> > > use distribution-free methods?
>> > >
>> > > You should examine the power of the tests for n=20.  You'll probably
>>
>> > > find it's not good enough to reach a reliable conclusion.
>> >
>> > And wouldn't it be even worse if I used non-parametric tests?
>>
>> I believe what Frank meant was that it's probably better to use a
>> distribution-free procedure to do the real test of interest (if there is
>> one) instead of testing for normality, and then use a test that assumes
>> normality.
>>
>> I guess the question is, what exactly do you want to do with the outcome
>> of the normality tests?  If those are going to be used as basis for
>> deciding which test(s) to do next, then I concur with Frank's
>> reservation.
>>
>> Generally speaking, I do not find goodness-of-fit for distributions very
>> useful, mostly for the reason that failure to reject the null is no
>> evidence in favor of the null.  It's difficult for me to imagine why
>> "there's insufficient evidence to show that the data did not come from a
>> normal distribution" would be interesting.
>>
>> Andy
>>
>>
>> > >
>> > > Frank
>> > >
>> > >
>> > > --
>> > > Frank E Harrell Jr   Professor and Chair   School
>> > of Medicine
>> > >   Department of Biostatistics
>> > Vanderbilt University
>> > >
>> >
>> >
>> > --
>> > yianni
>> >
>> > __
>> > R-hel

Re: [R] normality tests [Broadcast]

2007-05-25 Thread Frank E Harrell Jr
Lucke, Joseph F wrote:
>  Most standard tests, such as t-tests and ANOVA, are fairly resistant to
> non-normality for significance testing. It's the sample means that have
> to be normal, not the data.  The CLT kicks in fairly quickly.  Testing
> for normality prior to choosing a test statistic is generally not a good
> idea. 

I beg to differ Joseph.  I have had many datasets in which the CLT was 
of no use whatsoever, i.e., where bootstrap confidence limits were 
asymmetric because the data were so skewed, and where symmetric 
normality-based confidence intervals had bad coverage in both tails 
(though correct on the average).  I see this the opposite way: 
nonparametric tests work fine if normality holds.

Note that the CLT helps with type I error but not so much with type II 
error.
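
A small simulated illustration of the kind of thing I mean (made-up data, not from a real study):

set.seed(1)
x <- rexp(30)                                   # markedly skewed sample
t.test(x)$conf.int                              # symmetric normal-theory limits
b <- replicate(2000, mean(sample(x, replace = TRUE)))
quantile(b, c(.025, .975))                      # asymmetric percentile bootstrap limits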

Frank

> 
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Liaw, Andy
> Sent: Friday, May 25, 2007 12:04 PM
> To: [EMAIL PROTECTED]; Frank E Harrell Jr
> Cc: r-help
> Subject: Re: [R] normality tests [Broadcast]
> 
> From: [EMAIL PROTECTED]
>> On 25/05/07, Frank E Harrell Jr <[EMAIL PROTECTED]> wrote:
>>> [EMAIL PROTECTED] wrote:
>>>> Hi all,
>>>>
>>>> apologies for seeking advice on a general stats question. I ve run
> 
>>>> normality tests using 8 different methods:
>>>> - Lilliefors
>>>> - Shapiro-Wilk
>>>> - Robust Jarque Bera
>>>> - Jarque Bera
>>>> - Anderson-Darling
>>>> - Pearson chi-square
>>>> - Cramer-von Mises
>>>> - Shapiro-Francia
>>>>
>>>> All show that the null hypothesis that the data come from a normal
> 
>>>> distro cannot be rejected. Great. However, I don't think
>> it looks nice
>>>> to report the values of 8 different tests on a report. One note is
> 
>>>> that my sample size is really tiny (less than 20
>> independent cases).
>>>> Without wanting to start a flame war, is there any
>> advice on which
>>>> one/ones would be more appropriate and should be reported
>> (along with
>>>> a Q-Q plot). Thank you.
>>>>
>>>> Regards,
>>>>
>>> Wow - I have so many concerns with that approach that it's
>> hard to know
>>> where to begin.  But first of all, why care about
>> normality?  Why not
>>> use distribution-free methods?
>>>
>>> You should examine the power of the tests for n=20.  You'll probably
> 
>>> find it's not good enough to reach a reliable conclusion.
>> And wouldn't it be even worse if I used non-parametric tests?
> 
> I believe what Frank meant was that it's probably better to use a
> distribution-free procedure to do the real test of interest (if there is
> one) instead of testing for normality, and then use a test that assumes
> normality.
> 
> I guess the question is, what exactly do you want to do with the outcome
> of the normality tests?  If those are going to be used as basis for
> deciding which test(s) to do next, then I concur with Frank's
> reservation.
> 
> Generally speaking, I do not find goodness-of-fit for distributions very
> useful, mostly for the reason that failure to reject the null is no
> evidence in favor of the null.  It's difficult for me to imagine why
> "there's insufficient evidence to show that the data did not come from a
> normal distribution" would be interesting.
> 
> Andy
> 
>  
>>> Frank
>>>
>>>
>>> --
>>> Frank E Harrell Jr   Professor and Chair   School 
>> of Medicine
>>>   Department of Biostatistics   
>> Vanderbilt University
>>
>> --
>> yianni
>>
>> __
>> R-help@stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
> 
> 
> 
> --
> Notice:  This e-mail message, together with any
> attachments,...{{dropped}}
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] normality tests

2007-05-25 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
> Hi all,
> 
> apologies for seeking advice on a general stats question. I ve run
> normality tests using 8 different methods:
> - Lilliefors
> - Shapiro-Wilk
> - Robust Jarque Bera
> - Jarque Bera
> - Anderson-Darling
> - Pearson chi-square
> - Cramer-von Mises
> - Shapiro-Francia
> 
> All show that the null hypothesis that the data come from a normal
> distro cannot be rejected. Great. However, I don't think it looks nice
> to report the values of 8 different tests on a report. One note is
> that my sample size is really tiny (less than 20 independent cases).
> Without wanting to start a flame war, is there any advice on which
> one/ones would be more appropriate and should be reported (along with
> a Q-Q plot). Thank you.
> 
> Regards,
> 

Wow - I have so many concerns with that approach that it's hard to know 
where to begin.  But first of all, why care about normality?  Why not 
use distribution-free methods?

You should examine the power of the tests for n=20.  You'll probably 
find it's not good enough to reach a reliable conclusion.
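
A quick simulation makes the point; the mildly heavy-tailed t alternative below is purely illustrative:

set.seed(1)
p <- replicate(5000, shapiro.test(rt(20, df = 5))$p.value)
mean(p < 0.05)   # power of Shapiro-Wilk at n = 20; typically well under 0.5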

Frank


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] clustered standarderrors using design package

2007-05-19 Thread Frank E Harrell Jr
Anders Eklund wrote:
> Please help,
> 
> I have a strange problem. I've got a balanced panel data set. I use dummy
> variable regression and I've got results with lm function.
> 
> summary(lm(y ~ post + t19961 + t19962 + t19963 + t19964 + t19971 + t19972
> + t19973 + t19974 + t19981+factor( id)))
> 
> 
> The problem is that I would like to get my standard errors clustered but
> then gets the following error message:
> 
> f<-(lm(y ~ post + t19961 + t19962 + t19963 + t19964 + t19971 + t19972 +
> t19973 + t19974 + t19981+factor( id)))

Omit id from the model (usually) and do
library(Design)

f <- ols(y ~ post + . . . , x=TRUE,  y=TRUE)
g.clust1 <- robcov(f, id)
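
A self-contained sketch of the same pattern with simulated data (the variable names are invented):

library(Design)
set.seed(1)
d <- data.frame(id = rep(1:50, each = 5), x = rnorm(250))
d$y <- 1 + 0.5*d$x + rnorm(50)[d$id] + rnorm(250)   # induces within-id correlation
fit <- ols(y ~ x, data = d, x = TRUE, y = TRUE)
robcov(fit, d$id)   # same coefficients, cluster-sandwich standard errors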

Frank

> library(Design)
> g.clust1 <- robcov(f, id)
> Error in match.arg(type) : 'arg' should be one of working, response,
> deviance, pearson, partial
> 
> All my variables is vectors and I've tried with other variables inside and
> outside the model and all results in the same errormessage.
> 
> Best regards
> 
> Anders Eklund
> Stockholm, Sweden.
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to analyse simple study: Placebo-controlled (2 groups) repeated measurements (ANOVA, ANCOA???)

2007-05-17 Thread Frank E Harrell Jr
Karl Knoblick wrote:
> Hallo!
> 
> I have two groups (placebo/verum), every subject is measured at 5 times, the 
> first time t0 is the baseline measurement, t1 to t4 are the measurements 
> after applying the medication (placebo or verum). The question is, if there 
> is a significant difference in the two groups and how large the difference is 
> (95% confidence intervals).
> 
> Let me give sample data
> # Data
> ID<-factor(rep(1:50,each=5)) # 50 subjects
> GROUP<-factor(c(rep("Verum", 115), rep("Placebo", 135)))
> TIME<-factor(rep(paste("t",0:4,sep=""), 50))
> set.seed(1234)
> Y<-rnorm(250)
> # to have an effect:
> Y[GROUP=="Verum" & TIME=="t1"]<-Y[GROUP=="Verum" & TIME=="t1"] + 0.6 
> Y[GROUP=="Verum" & TIME=="t2"]<-Y[GROUP=="Verum" & TIME=="t2"] + 0.3 
> Y[GROUP=="Verum" & TIME=="t3"]<-Y[GROUP=="Verum" & TIME=="t3"] + 0.9 
> Y[GROUP=="Verum" & TIME=="t4"]<-Y[GROUP=="Verum" & TIME=="t4"] + 0.9 
> DF<-data.frame(Y, ID, GROUP, TIME)
> 
> I have heard of different ways to analyse the data
> 1) Comparing the endpoint t4 between the groups (t-test), ignoring baseline

Don't even consider this

> 2) Comparing the difference t4 minus t0 between the two groups (t-test)

This is not optimal

> 3) Comparing the endpoint t4 with t0 as a covariate between the groups (ANOVA 
> - how can this model be calculated in R?)

Using t0 as a covariate is the way to go.  A question is whether to just 
use t4.  Generally this is not optimum.

> 4) Taking a summary score (I'm not sure but this may be a suggestion of 
> Altman) instead of t4
> 5) ANOVA (repeated measurements, times t0 to t4, group placebo/verum), 
> subject as random factor - interested in interaction times*groups (How to do 
> this in R?)
> 6) as 5) but times t1 to t4, ignoring baseline (How to do this in R?)
> 7) as 6) but additional covariate baseline t0 (How to do this in R?)
> 
> What will be best? - (Advantages / disadvantages?)
> How to analyse these models in R with nested and random effects and possible 
> covariates (ID, group - at least I think so) and random parameter ID? Or is 
> there a more simple possibility?

It's not obvious that random effects are needed if you take the 
correlation into account in a good way.  Generalized least squares using 
for example an AR1 correlation structure (and there are many others) is 
something I often prefer.  A detailed case study with R code (similar to 
your situation) is in http://biostat.mc.vanderbilt.edu/FrankHarrellGLS . 
  This includes details about why t0 is best to consider as a covariate. 
  One reason is that the t0 effect may not be linear.

If you want to focus on t4 it is easy to specify a contrast (after 
fitting is completed) that tests t4.  If time is continuous this 
contrast would involve predicted values at the 4th time, otherwise 
testing single parameters.
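
With your simulated DF above, the bare-bones machinery would look something like this (for brevity this keeps t0 inside the repeated-measures response rather than moving it to the right-hand side as I recommend):

library(nlme)
f <- gls(Y ~ GROUP * TIME, data = DF,
         correlation = corAR1(form = ~ 1 | ID))
anova(f)     # look at the GROUP:TIME interaction
summary(f)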

Frank Harrell

> 
> Perhaps somebody can recommend a book or weblink where these different 
> strategies of analysing are discussed - preferably with examples with raw 
> data which I can recalculate. And if the R syntax is included - this 
> would be best!
> 
> Any help will be appreciate!
> 
> Thanks!
> Karl

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Re : Bootstrap sampling for repeated measures

2007-05-15 Thread Frank E Harrell Jr
justin bem wrote:
> Hi, 
> 
> If it was me I would have done this
> - First reshape the data frame to get something like
> 
> header    measure1  measure2  measure3
> 1280001   2.47      1.48      2.23      ...
> 
> Since you have the same number of measures for all subjects, you then define your 
> statistic with the data frame in this form, and you can use the boot function 
> in boot or the Hmisc bootstrap function.
> 
> 
> Justin BEM

I don't think that's the best way to go.  As in the Design package (see 
the predab.resample, validate, calibrate functions) and the Hmisc 
rm.boot function you can easily sample from subject IDs and put together 
the needed records.
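
The essential idea in a few lines (dat standing for your data frame, with its header column identifying subjects):

ids  <- sample(unique(dat$header), replace = TRUE)
boot <- do.call(rbind, lapply(seq_along(ids), function(i) {
  d <- dat[dat$header == ids[i], ]
  d$header <- i              # keep re-drawn subjects distinct
  d
}))
# compute your statistic on boot, and repeat many times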

Frank Harrell

> Elève Ingénieur Statisticien Economiste
> BP 294 Yaoundé.
> Tél (00237)9597295.
> 
> - Message d'origine 
> De : Niccolò Bassani <[EMAIL PROTECTED]>
> À : r-help@stat.math.ethz.ch
> Envoyé le : Mardi, 15 Mai 2007, 11h15mn 51s
> Objet : [R] Bootstrap sampling for repeated measures
> 
> Dear R users,
> I'm having some problems trying to create a routine for a bootstrap
> resampling issue. Suppose I've got a dataset like this:
> 
> Header    inr    weeks
> 1280001   2.47     0   ...
> 1280001   1.48     1   ...
> 1280001   2.23     2   ...
> 
> 
> 1280369  2.5   ...   56
> 
> i.e. a dataset with n subjects identified by the column "header", with a set
> of repeated measures. The number of repeated measures for each subject is
> 57, with a few of them being more or less frequent. That is, generalizing,
> that I haven't got the same number of observations for each patient.
> I've created a function allowing me to reorder, subset and calculate
> some statistics over this dataset, but now I need to bootstrap it all. I was
> looking for a routine in R that could resample longitudinal data, in order
> to resample "on the ID of the subjects". This means that while resampling
> (suppose m samples of n length) I wish to consider (better with replacement)
> either none or all of the observations related to a subject.
> So, if my bootstrap 1st sample takes the patient with header 1280001, I want
> the routine to consider all of the observations related with a subject with
> such a header.
> Thus, I shall obtain a bootstrap sample of my original dataset to wich apply
> the function cited before (whose only argument is the dataset).
> Can anybody help me? I'm trying to understand how the rm.boot function from
> Hmisc package resamples this way, but it's not that easy, so if anyone could
> help me I'd be very grateful.
> Thanks in advance
> Niccolò
> 
> [[alternative HTML version deleted]]
> 
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Nicely formatted summary table with mean, standard deviation or number and proportion

2007-05-13 Thread Frank E Harrell Jr
Keith Wong wrote:
> Dear all,
> 
> The incredibly useful Hmisc package provides a method to generate 
> summary tables that can be typeset in latex. The Alzola and Harrell book 
>   "An introduction to S and the Hmisc and Design libraries" provides an 
> example that generates mean and quartiles for continuous variables, and 
> numbers and percentages for count variables: summary() with method = 
> 'reverse'.
> 
> I wonder if there is a way to change it so the mean and standard 
> deviation are reported instead for continuous variables.
> 
> I illustrate my question below using an example from the book.
> 
> Thank you.
> 
> Keith

Newer versions of Hmisc have an option to add mean and SD for 
method='reverse'.  Quartiles are always there.
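
If I remember the argument name correctly, it is an option of the print and latex methods rather than of summary itself, e.g.:

x <- summary(treatment ~ age + sex, method = 'reverse')
print(x, prmsd = TRUE)    # adds mean and SD alongside the quartiles
# latex(x, prmsd = TRUE)  # likewise for the LaTeX table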

Frank

> 
> 
>  > 
>  > library(Hmisc)
>  >
>  > set.seed(173)
>  > sex = factor(sample(c("m", "f"), 500, rep = T))
>  > age = rnorm(500, 50, 5)
>  > treatment = factor(sample(c("Drug", "Placebo"), 500, rep = T))
>  > summary(sex ~ treatment, fun = table)
> sexN=500
> 
> +-+---+---+---+---+
> | |   |N  |f  |m  |
> +-+---+---+---+---+
> |treatment|Drug   |263|140|123|
> | |Placebo|237|133|104|
> +-+---+---+---+---+
> |Overall  |   |500|273|227|
> +-+---+---+---+---+
>  >
>  >
>  >
>  > (x = summary(treatment ~ age + sex, method = "reverse"))
>  > # generates quartiles for continuous variables
> 
> 
> Descriptive Statistics by treatment
> 
> +---+--+--+
> |   |Drug  |Placebo   |
> |   |(N=263)   |(N=237)   |
> +---+--+--+
> |age|46.5/49.9/53.2|46.7/50.0/53.4|
> +---+--+--+
> |sex : m|   47% (123)  |   44% (104)  |
> +---+--+--+
>  >
>  >
>  > # latex(x) generates a very nicely formatted table
>  > # but I'd like "mean (standard deviation)" instead of quartiles.
> 
> 
> 
>  > # this function from 
> http://tolstoy.newcastle.edu.au/R/e2/help/06/11/4713.html
>  > g <- function(y) {
> +   s <- apply(y, 2,
> +              function(z) {
> +                z <- z[!is.na(z)]
> +                n <- length(z)
> +                if(n==0) c(NA,NA,NA,0) else
> +                if(n==1) c(z, NA,NA,1) else {
> +                  m <- mean(z)
> +                  s <- sd(z)
> +                  c(N=n, Mean=m, SD=s)
> +                }
> +              })
> +   w <- as.vector(s)
> +   names(w) <-  as.vector(outer(rownames(s), colnames(s), paste, sep=''))
> +   w
> + }
> 
>  >
>  > summary(treatment ~ age + sex, method = "reverse", fun = g)
>  > # does not work, 'fun' or 'FUN' argument is ignored.
> 
> 
> Descriptive Statistics by treatment
> 
> +---+--+--+
> |   |Drug  |Placebo   |
> |   |(N=263)   |(N=237)   |
> +---+------+------+
> |age|46.5/49.9/53.2|46.7/50.0/53.4|
> +---+--+--+
> |sex : m|   47% (123)  |   44% (104)  |
> +---+--+--+
>  >
>  >
>  > (x1 = summarize(cbind(age), llist(treatment), FUN = g, 
> stat.name=c("n", "mean", "sd")))
>treatment   n mean   sd
> 1  Drug 263 49.9 4.94
> 2   Placebo 237 50.1 4.97
>  >
>  > # this works but table is rotated, and count data has to be
>  > # treated separately.
> 
> 
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Predicted values from a logistic model

2007-05-13 Thread Frank E Harrell Jr
MANASI VYDYANATH wrote:
> Hello -
> 
> I apologize if this question is simple/obvious, but I couldn't find a  
> satisfactory answer online, and I am not very accustomed to working  
> with R (Matlab is my poison. :-)). Any help would be greatly  
> appreciated.
> 
> I have a model with a three-level factor and a continuous covariate.  
> The call I use is:
> 
> mymodel <- glm(Response ~ Factor_covariate + continuous_covariate -  
> 1, family = binomial(link = "logit"))
> 
> I would like to generate predicted values for a given level of the  
> covariate, and a given level of the factor. For instance, I want it  
> to give me a fitted value for the response at factor level 1 and  
> continuous covariate value 10. How would I go about expressing this?  
> I tried to look at the package Design, and specifically, at the  
> command "predict.lrt". But I was unable to quite understand how I  
> ought to enter my x-values.  Again, any help would be much appreciated.
> 
> Thank you for taking the time to read this!
> 
> Cheers,
> 
> Manasi

With Design you do predict(mymodel, data.frame(age=30, sex='male'), 
type='fitted')

For ordinal responses there are several options for predicting different 
things.  If you want to leave some covariates unspecified (default to 
reference values - medians or modes) you can use predict(mymodel, 
gendata(mymodel, list of covariates you care to specify))
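
Using your variable names, roughly (untested sketch; 'level1' stands for whichever factor level you mean, and you may need datadist()/options(datadist=) set up first):

library(Design)
mymodel <- lrm(Response ~ Factor_covariate + continuous_covariate)
predict(mymodel,
        data.frame(Factor_covariate = 'level1', continuous_covariate = 10),
        type = 'fitted')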

Frank Harrell

> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] OT: Predicted probabilities from ordinal regressions

2007-05-11 Thread Frank E Harrell Jr
Andrew Perrin wrote:
> Sorry if this is too off-topic, as we may not implement this in R 
> (although we may do so).
> 
> A student of mine is looking at some public opinion data in which there 
> appears to be a statistically significant difference between levels of 
> support for a proposal based on which of two question wordings is used. 
> That is, roughly half the sample was asked the question with wording 1, 
> the other half with wording 2, and the difference is large enough to be of 
> interest (approx. 6 percentage points different with an N of about 
> 30,000).
> 
> The question is how best to model this. In the past I have generated 
> predicted probabilities based on the sample asked wording 1, used these to 
> assign those asked wording 2 to predicted categories, and used a logistic 
> regression to predict difference between predicted and actual response. In 
> this case, though, the question of interest uses a four-level ordinal 
> response, so ordinary predicted probabilities based on, e.g., an ordinal 
> logistic regression generate literally probabilities of being in each of 
> the four categories. Transforming this outcome into a prediction of 
> membership in one of the given categories is not straightforward. Can 
> anyone provide some insight into how to model predicted vs. actual 
> outcomes on an ordinal scale?
> 
> Thanks.

With the lrm function in the Design package you can get predicted 
probabilities for each class as well as predicted mean scores.
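
Schematically (y being the 4-level response, x the wording indicator, newdata a data frame of covariate settings; the names are made up):

library(Design)
f <- lrm(y ~ x)
predict(f, newdata, type = 'fitted.ind')  # P(Y = j) for each category
predict(f, newdata, type = 'mean')        # predicted mean score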

Frank

> 
> --
> Andrew J Perrin - andrew_perrin (at) unc.edu - http://perrin.socsci.unc.edu
> Assistant Professor of Sociology; Book Review Editor, _Social Forces_
> University of North Carolina - CB#3210, Chapel Hill, NC 27599-3210 USA
> New Book: http://www.press.uchicago.edu/cgi-bin/hfs.cgi/00/178592.ctl
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Follow-up about ordinal logit with mixtures: how about 'continuation ratio' strategy?

2007-05-11 Thread Frank E Harrell Jr
Paul Johnson wrote:
> On 5/10/07, Frank E Harrell Jr <[EMAIL PROTECTED]> wrote:
>> Paul Johnson wrote:
>>> This is a follow up to the message I posted 3 days ago about how to
>>> estimate mixed ordinal logit models.  I hope you don't mind that I am
>>> just pasting in the code and comments from an R file for your
>>> feedback.  Actual estimates are at the end of the post.
>> . . .
>>
>> Paul,
>>
>> lrm does not give an incorrect sign on the intercepts.  Just look at how
>> it states the model in terms of Prob(Y>=j) so that its coefficients are
>> consistent with the way people state binary models.
>>
>> I'm not clear on your generation of simulated data.  I specify the
>> population logit, anti-logit that, and generate binary responses with
>> those probabilities.  I don't use rlogis.
> 
> Thank you.
> 
> I don't think I'm telling you anything you don't already know, but for
> the record, here goes.  I think the difference in signs is just
> convention within fields.  In choice models (the econometric
> tradition), we usually write that the response is in a higher category
> if
> 
> eta + random > cutpoint
> 
> and that's how I created the data--rlogis supplies the random noise.  Then

Just want to make sure that the samples it generates are from the correct 
probability model.  I'm just used to doing this with 
ifelse(runif(n) < plogis(population.logit), 1, 0) for the binary case.

> 
> eta - cutpoint > random
> 
> or
> 
> cutpoint - eta < random
> 
> and so
> 
> Prob ( higher outcome ) = Prob ( random > cutpoint - eta)
> 
> In the docs on polr from MASS, V&R say they have the logit equal to
> 
> cutpoint - eta
> 
> so their parameterization is consistent with mine.  On the other hand,
> your approach is to say the response is in a higher category if
> 
>  intercept + eta > random,
> 
> where I think your intercept is -cutpoint. So the signs in your
> results are reversed.
> 
> -cutpoint + eta > random
> 
> 
> But this is aside from the major question I am asking.  Do we think
> that the augmented data frame approach described in Cole, Allison, and
> Ananth is a good alternative to maximum likelihood estimation of
> ordinal logit models, whether they are interpreted as proportional
> odds, continuation, or stopping models?   In the cases I've tested,
> the parameter estimates from the augmented data frame are consistent
> with polr or lrm, but the standard errors and other diagnostic
> informations are quite different.

Then I wouldn't use the augmented data approach.

> 
> I do not think I can follow your suggestion to use penalties in lrm
> because I have to allow for the possibilities that there are random
> effects across clusters of observations, possibly including random
> slope effects, but certainly including random intercepts for 2 levels
> of groupings (in the HLM sense of these things).

My suggestion doesn't handle random slopes but does handle random 
intercepts in a sense.

Good luck
Frank

> 
> Meanwhile, I'm studying how to use optim and numeric integration to
> see if the results are comparable.
> 
> pj
> 
>> See if using the PO model with lrm with penalization on the factor does
>> what you need.
>>
>> lrm is not set up to omit an intercept with the -1 notation.
>>
>> My book goes into details about the continuation ratio model.
>>
>> Frank Harrell
>>
> 
> 
> 
> 
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Read SAS data into R

2007-05-11 Thread Frank E Harrell Jr
kseefeld wrote:
> Kim’s EZ Recipe for….
> 
> SAS TO R, perfectly formatted table with no loss of data
> 
> • In SAS: Export SAS DB as access db
> • In R go to directory where access db is stored
> • Use package RODBC
> 
> #R code
> #Create connection object (Can set up DSN but I’m too lazy to)
> c<-odbcConnectAccess("x.mdb")
> #Create table object, store db in object
> x<-sqlFetch(c, "test")
> #Close connection object
> odbcClose(c)
> 
> 

That doesn't help people who don't have SAS.

Note that an upcoming release of the Hmisc package has a new Access 
import function for users who have access to the mdbtools package on 
their operating system (e.g., linux).

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Read SAS data into R

2007-05-10 Thread Frank E Harrell Jr
John Kane wrote:
> Have a look at the Hmisc package:sas.get or
> sasxport.get.  In either case you need to have a
> working SAS installation on the same machine. 
> 
> Another way, although you lose label is simply to
> export the SAS data file as a csv or delim file and
> import it.

I'm also looking into running the SAS Viewer under wine on linux to read 
non-transport SAS files in a new Hmisc function.  But so far there is a 
problem - SAS in all its wisdom doesn't quote character string output 
when creating csv files with the Viewer, so you can't have commas in 
your character strings.  There is a tab delimiter option as long as you 
don't have tabs in the strings.  Amazing that SAS couldn't do better 
than that.

Frank

> 
> 
> --- AbouEl-Makarim Aboueissa
> <[EMAIL PROTECTED]> wrote:
> 
>> Dear ALL:
>>
>> Could you please let me know how to read SAS data
>> file into R.
>>
>> Thank you so much for your help.
>>
>> Regards;
>>
>> Abou
>>
>>
>> ==
>> AbouEl-Makarim Aboueissa, Ph.D.
>> Assistant Professor of Statistics
>> Department of Mathematics & Statistics
>> University of Southern Maine
>> 96 Falmouth Street
>> P.O. Box 9300
>> Portland, ME 04104-9300
>>
>> Tel: (207) 228-8389
>> Email: [EMAIL PROTECTED]
>>   [EMAIL PROTECTED]
>> Office: 301C Payson Smith
>>
>> __
>> R-help@stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained,
>> reproducible code.
>>
> 
> ______________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Follow-up about ordinal logit with mixtures: how about 'continuation ratio' strategy?

2007-05-10 Thread Frank E Harrell Jr
Paul Johnson wrote:
> This is a follow up to the message I posted 3 days ago about how to
> estimate mixed ordinal logit models.  I hope you don't mind that I am
> just pasting in the code and comments from an R file for your
> feedback.  Actual estimates are at the end of the post.

. . .

Paul,

lrm does not give an incorrect sign on the intercepts.  Just look at how 
it states the model in terms of Prob(Y>=j) so that its coefficients are 
consistent with the way people state binary models.

I'm not clear on your generation of simulated data.  I specify the 
population logit, anti-logit that, and generate binary responses with 
those probabilities.  I don't use rlogis.

See if using the PO model with lrm with penalization on the factor does 
what you need.

lrm is not set up to omit an intercept with the -1 notation.

My book goes into details about the continuation ratio model.

Frank Harrell

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Mantel-Haenszel relative risk with Greenland-Robins variance estimate

2007-05-08 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
> Would this function help:
> http://www.csm.ornl.gov/~frome/ES/RRMHex/MHanalysis.txt ?
> 
> Regards, -Cody

I think so.  Thank you Cody.  If you have time would you mind defining, 
probably offline, the input data elements?  I assume that one of them is 
a stratification factor other than occupational group.

Thanks again
Frank

> 
> 
> 
> From:    Frank E Harrell Jr <[EMAIL PROTECTED]>
> Sent by: [EMAIL PROTECTED]
> To:      R list
> Date:    05/08/2007 02:38 PM
> Subject: [R] Mantel-Haenszel relative risk with Greenland-Robins variance estimate
> 
> Does anyone know of an R function for computing the Greenland-Robins
> variance for Mantel-Haenszel relative risks?
> 
> Thanks
> Frank
> --

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Mantel-Haenszel relative risk with Greenland-Robins variance estimate

2007-05-08 Thread Frank E Harrell Jr
Does anyone know of an R function for computing the Greenland-Robins 
variance for Mantel-Haenszel relative risks?

Thanks
Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] pseudo-R2 or GOF for regression trees?

2007-05-05 Thread Frank E Harrell Jr
Prof. Jeffrey Cardille wrote:
> Hello,
> 
> Is there an accepted way to convey, for regression trees, something  
> akin to R-squared?
> 
> I'm developing regression trees for a continuous y variable and I'd  
> like to say how well they are doing. In particular, I'm analyzing the  
> results of a simulation model having highly non-linear behavior, and  
> asking what characteristics of the inputs are related to a particular  
> output measure.  I've got a very large number of points: n=4000.  I'm  
> not able to do a model sensitivity analysis because of the large  
> number of inputs and the model run time.
> 
> I've been googling around both on the archives and on the rest of the  
> web for several hours, but I'm still having trouble getting a firm  
> sense of the state of the art.  Could someone help me to quickly  
> understand what strategy, if any, is acceptable to say something like  
> "The regression tree in Figure 3 captures 42% of the variance"?  The  
> target audience is readers who will be interested in the subsequent  
> verbal explanation of the relationship, but only once they are  
> comfortable that the tree really does capture something.  I've run  
> across methods to say how well a tree does relative to a set of trees  
> on the same data, but that doesn't help much unless I'm sure the  
> trees in question are really capturing the essence of the system.
> 
> I'm happy to be pointed to a web site or to a thread I may have  
> missed that answers this exact question.
> 
> Thanks very much,
> 
> Jeff
> 
> --
> Prof. Jeffrey Cardille
> [EMAIL PROTECTED]
> R-help@stat.math.ethz.ch mailing list

Ye (below) has a method to get a nearly unbiased estimate of R^2 from 
recursive partitioning.  In his examples the result was similar to using 
the formula for adjusted R^2 with regression degrees of freedom equal to 
about 3n/4.  You can also use something like 10-fold cross-validation 
repeated 20 times to get a fairly precise and unbiased estimate of R^2.
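
The cross-validation version is easy to code directly; a sketch with rpart (dat and y are placeholders for your data frame and response):

library(rpart)
set.seed(1)
r2 <- replicate(20, {
  fold <- sample(rep(1:10, length.out = nrow(dat)))
  press <- sum(sapply(1:10, function(k) {
    fit <- rpart(y ~ ., data = dat[fold != k, ])
    sum((dat$y[fold == k] - predict(fit, dat[fold == k, ]))^2)
  }))
  1 - press/sum((dat$y - mean(dat$y))^2)
})
mean(r2)   # repeated 10-fold cross-validated R^2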

Frank


@ARTICLE{ye98mea,
  author  = {Ye, Jianming},
  year    = 1998,
  title   = {On measuring and correcting the effects of data mining and model selection},
  journal = {Journal of the American Statistical Association},
  volume  = 93,
  pages   = {120-131},
  annote  = {generalized degrees of freedom; GDF; effective degrees of freedom;
             data mining; model selection; model uncertainty; overfitting;
             nonparametric regression; CART; simulation setup}
}
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Error in if (!length(fname) || !any(fname == zname)) { :

2007-05-04 Thread Frank E Harrell Jr
hongyuan cao wrote:
> Dear R users,
> 
> I tried to fit a cox proportional hazard model to get estimation of 
> stratified survival probability. my R code is as follows:
> 
> cph(Surv(time.sur, status.sur)~ strat(colon[,13])+colon[,18] 
> +colon[,20]+colon[,9], surv=TRUE)
> Error in if (!length(fname) || !any(fname == zname)) { : 
> missing value where TRUE/FALSE needed
> Here colon[,13] is the one that I want to stratify and the others are all 
> coefficients. How can I solve this problem?  Thanks a lot!
> 
> Grace

The Design package does not like you to have complex variable names like 
that, and in general storing your data in a matrix when you want to 
treat columns as separate variables is not the best approach.  I would 
use something like

S <- with(d, Surv(  ))     # d = data frame
h <- as.data.frame(colon)  # assumes colon is a matrix; make sure it has column names
cph(S ~ strat(v1) + v2 + v3 + v4, data=h)
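
Then, to get the stratified survival probability estimates you are after, something along these lines (a sketch only; check the survest documentation, and use whatever times and newdata you need):

f <- cph(S ~ strat(v1) + v2 + v3 + v4, data=h, surv=TRUE, x=TRUE, y=TRUE)
survest(f, newdata=h[1:2, ], times=c(1, 3, 5))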

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Hmisc curve label size & cex

2007-04-28 Thread Frank E Harrell Jr
Brian O'Connor wrote:
> R-Masters,
> 
> I need to produce high resolution line plots and place labels on the 
> curves. It seems that cex must be high relative to the other cex 
> values in order to produce sufficiently large & legible tick labels 
> at high resolutions. But high cex values cause the curve labels to 
> become gigantic when using Hmisc. I've struggled and searched the 
> archives, but cannot find a way of controlling the sizes of the curve 
> labels in this situation.
> 
> These commands produce the problem on my PC using XP:
> 
> 
> png("trial.png", width=3000, height=2400, res = 600, pointsize=12 )
> par(ann=F, font.main=1, font.lab=1, font.axis=1, cex=5, cex.main=1, 
> cex.lab=1, cex.axis=1,
> lwd=12, las=1, mar=c(4, 4, 2, 2)   )
> 
> x = seq(-2.5, 2.5, length=100)
> 
> labcurve( list( One=  list( x,sin(x)), Two=  list( x,cos(x)),
>Three=list( x,(x*x)), Four= list( x,exp(x)) ),
>keys=c('1','2','3','4'),  keyloc="none", pl=TRUE )
> 
> dev.off()
> 
> 
> Thanks for your time.
> 

cex.main, cex.lab, cex.axis etc. are relative, so for your case you need to 
specify something like cex.axis=1/5

Not sure why you are using keys of 1-4 when you've already given nice 
labels.  I tried

  labcurve( list( One=  list( x,sin(x)), Two=  list( x,cos(x)),
 Three=list( x,(x*x)), Four= list( x,exp(x)) ),
 pl=TRUE )

and got some nice output after reducing cex.*

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Integrating R-programs into larger systems

2007-04-27 Thread Frank E Harrell Jr
Ralf Finne wrote:
> Hi experts!
> 
> Does anybody have experience in including an R-program 
> as part of a larger system?  In Matlab there is a toolbox
> that converts an m-script into C-code. 
> One application in mind is that I do the model building in R,
> for estimating the risk for cancer based on clinical measurements.
> 
> When the model is ready, a small R-program can simulate
> the model to estimate the risk for a new patient. The idea is
> that a doctor gets the measurements for the patient sitting in his
> office.  Then he goes to the Internet and types in the test measurements
> and gets the estimated risk.
> Look at www.finne.info for an early version to get the idea.
> There I developed the model in Matlab and converted it to Excel.
> Don't use the results!  Much better are available in R!
> There are many more applications that need a higher degree
> of integration.
> 
> Than you in advance.
> 
> Ralf Finne
> SYH, University of Applied Sciences
> Vasa Finland

The Design package for R has a version for S-Plus.  In S-Plus you can 
use its Dialog function to automatically create a GUI for getting 
predicted values from a series of fitted models.  We will be working on 
an R version but it will publish the model to a web server.

Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Not showing dvi with Hmisc latex()

2007-04-26 Thread Frank E Harrell Jr
Duncan Murdoch wrote:
> On 4/26/2007 9:20 PM, Gad Abraham wrote:
>> Hi,
>>
>> I'm using latex() from Frank Harrell's Hmisc library to produce LaTeX 
>> files. By default, it calls xdvi and displays the dvi.
>>
>> How can I make xdvi not show? I couldn't find a clue in the extensive 
>> documentation.
> 
> Unclass the result so it doesn't print as a latex object.  For example,
> 
>  > unclass(latex("1", file="test.tex"))
> $file
> [1] "test.tex"
> 
> $style
> character(0)
> 
> Alternatively, if you just assign the result you can print it later. 
> It's when you print that the latex'ing happens.
> 
> Duncan Murdoch

Just say

w <- latex(object, file='foo.tex')

Frank

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Biostatistician Opportunities at Vanderbilt

2007-04-25 Thread Frank E Harrell Jr
The Department of Biostatistics at Vanderbilt University's School of 
Medicine has openings for biostatisticians at all levels.  Details and 
application procedures may be found at 
http://biostat.mc.vanderbilt.edu/JobOpenings .  For M.S. and B.S. 
biostatisticians we are especially interested in statisticians 
proficient in R, S-Plus, or Stata.  We have faculty positions available 
at the Assistant, Associate, and Professor levels.

Frank Harrell
Chairman, Dept. of Biostatistics

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Coercing data types for use in model.frame

2007-04-25 Thread Frank E Harrell Jr
Prof Brian Ripley wrote:
> Moved to R-devel 
> 
> What is the 'data class'?  In particular what is its underlying type? 
> And where in model.frame[.default] are you trying to use it (in the 
> formula, data, in ..., etc).
> 
> This is an example of where some reproducible code and the error 
> messages would be very helpful indeed.
> 
> Brian

Brian,

Sorry - this was one of those "too late in the day" errors.  The problem 
was in a function called just before model.frame.  model.frame seems to 
work fine with an object of class c('mChoice', 'labelled').  It keeps 
mChoice variables as mChoice.  After model.frame is finished I'll change 
such variables to factors or matrices.

Frank


> 
> On Tue, 24 Apr 2007, Frank E Harrell Jr wrote:
> 
>> In the Hmisc package there is a new data class 'mChoice' for multiple
>> choice variables.  There are format and as.numeric methods (the latter
>> creates a matrix of dummy variables).  mChoice variables are not allowed
>> by model.frame.  Is there a way to specify a conversion function that
>> model.frame will use automatically?  I would use as.factor here.
>> model.frame does not seem to use as.data.frame.foo for individual 
>> variables.
>>
>> Thanks
>> Frank
>>
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Coercing data types for use in model.frame

2007-04-24 Thread Frank E Harrell Jr
In the Hmisc package there is a new data class 'mChoice' for multiple 
choice variables.  There are format and as.numeric methods (the latter 
creates a matrix of dummy variables).  mChoice variables are not allowed 
by model.frame.  Is there a way to specify a conversion function that 
model.frame will use automatically?  I would use as.factor here. 
model.frame does not seem to use as.data.frame.foo for individual variables.

Thanks
Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] help comparing two median with R

2007-04-18 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
> Has anyone proposed using a bootstrap for Pedro's problem?
> 
> What about taking a bootstrap sample from x, a bootstrap sample from y, take
> the difference in the medians for these two bootstrap samples, repeat the
> process 1,000 times and calculate the 95th percentiles of the 1,000
> computed differences?  You would get a CI on the difference between the
> medians for these two groups, with which you could determine whether the
> difference was greater/less than zero.  Too crude?
> 
> Regards,
>-Cody

As hinted at by Brian Ripley, the following code will approximate that. 
  It gets the nonparametric confidence interval for the median and 
solves for the variance that would give the same confidence interval 
width if normality of the median held.

g <- function(y) {
  y <- sort(y[!is.na(y)])
  n <- length(y)
  if(n < 4) return(c(median=median(y), q1=NA, q3=NA, variance=NA))
  qu <- quantile(y, c(.5,.25,.75))
  names(qu) <- NULL
  r <- pmin(qbinom(c(.025,.975), n, .5) + 1, n)  ## Exact 0.95 C.L.
  w <- y[r[2]] - y[r[1]]                         ## Width of C.L.
  var.med <- ((w/1.96)^2)/4                      ## Approximate variance of median
  c(median=qu[1], q1=qu[2], q3=qu[3], variance=var.med)
}

Run g separately by group, add the two variances, and take the square 
root to approximate the standard error of the difference in medians, from 
which you can get a confidence interval.
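
For example, with the two samples in x and y:

gx <- g(x); gy <- g(y)
dif <- gx['median'] - gy['median']
se  <- sqrt(gx['variance'] + gy['variance'])
dif + c(-1.96, 1.96) * se   # approximate 0.95 CI for the difference in medians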

Frank
> 
> 
> 
> 
> From:    Frank E Harrell Jr <[EMAIL PROTECTED]>
> Sent by: [EMAIL PROTECTED]
> To:      Thomas Lumley <[EMAIL PROTECTED]>
> Cc:      r-help@stat.math.ethz.ch
> Date:    04/18/2007 05:02 AM
> Subject: Re: [R] help comparing two median with R
> 
> Thomas Lumley wrote:
>> On Tue, 17 Apr 2007, Frank E Harrell Jr wrote:
>>
>>> The points that Thomas and Brian have made are certainly correct, if
>>> one is truly interested in testing for differences in medians or
>>> means.  But the Wilcoxon test provides a valid test of x > y more
>>> generally.  The test is consonant with the Hodges-Lehmann estimator:
>>> the median of all possible differences between an X and a Y.
>>>
>> Yes, but there is no ordering of distributions (taken one at a time)
>> that agrees with the Wilcoxon two-sample test, only orderings of pairs
>> of distributions.
>>
>> The Wilcoxon test provides a test of x>y if it is known a priori that
>> the two distributions are stochastically ordered, but not under weaker
>> assumptions.  Otherwise you can get x>y>z>x. This is in contrast to the
>> t-test, which orders distributions (by their mean) whether or not they
>> are stochastically ordered.
>>
>> Now, it is not unreasonable to say that the problems are unlikely to
>> occur very often and aren't worth worrying too much about. It does imply
>> that it cannot possibly be true that there is any summary of a single
>> distribution that the Wilcoxon test tests for (and the same is true for
>> other two-sample rank tests, eg the logrank test).
>>
>> I know Frank knows this, because I gave a talk on it at Vanderbilt, but
>> most people don't know it. (I thought for a long time that the Wilcoxon
>> rank-sum test was a test for the median pairwise mean, which is actually
>> the R-estimator corresponding to the *one*-sample Wilcoxon test).
>>
>>
>> -thomas
>>
> 
> Thanks for your note Thomas.  I do feel that the problems you have
> rightly listed occur infrequently and that often I only care about two
> groups.  Rank tests generally are good at relatives

Re: [R] help comparing two median with R

2007-04-18 Thread Frank E Harrell Jr
Thomas Lumley wrote:
> On Tue, 17 Apr 2007, Frank E Harrell Jr wrote:
> 
>> The points that Thomas and Brian have made are certainly correct, if 
>> one is truly interested in testing for differences in medians or 
>> means.  But the Wilcoxon test provides a valid test of x > y more 
>> generally.  The test is consonant with the Hodges-Lehmann estimator: 
>> the median of all possible differences between an X and a Y.
>>
> 
> Yes, but there is no ordering of distributions (taken one at a time) 
> that agrees with the Wilcoxon two-sample test, only orderings of pairs 
> of distributions.
> 
> The Wilcoxon test provides a test of x>y if it is known a priori that 
> the two distributions are stochastically ordered, but not under weaker 
> assumptions.  Otherwise you can get x>y>z>x. This is in contrast to the 
> t-test, which orders distributions (by their mean) whether or not they 
> are stochastically ordered.
> 
> Now, it is not unreasonable to say that the problems are unlikely to 
> occur very often and aren't worth worrying too much about. It does imply 
> that it cannot possibly be true that there is any summary of a single 
> distribution that the Wilcoxon test tests for (and the same is true for 
> other two-sample rank tests, eg the logrank test).
> 
> I know Frank knows this, because I gave a talk on it at Vanderbilt, but 
> most people don't know it. (I thought for a long time that the Wilcoxon 
> rank-sum test was a test for the median pairwise mean, which is actually 
> the R-estimator corresponding to the *one*-sample Wilcoxon test).
> 
> 
> -thomas
> 

Thanks for your note Thomas.  I do feel that the problems you have 
rightly listed occur infrequently and that often I only care about two 
groups.  Rank tests generally are good at relatives, not absolutes.  We 
have an efficient test (Wilcoxon) for relative shift but for estimating 
an absolute one-sample quantity (e.g., median) the nonparametric 
estimator is not very efficient.  Ironically there is an exact 
nonparametric confidence interval for the median (unrelated to Wilcoxon) 
but none exists for the mean.

Cheers,
Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] read.spss (package foreign) and SPSS 15.0 files

2007-04-17 Thread Frank E Harrell Jr
John Kane wrote:
> --- Charilaos Skiadas <[EMAIL PROTECTED]> wrote:
> 
>> On Apr 16, 2007, at 10:41 AM, John Kane wrote:
>>> --- Charilaos Skiadas <[EMAIL PROTECTED]> wrote:
>>>> It is not an "export" option, it is a "save as"
>>>> option. I don't have
>>>> a 14 to check, but on a 15 I just go to File ->
>> Save
>>>> As, and change
>>>> the "Save as type" field to "Comma Delimited
>>>> (.*.csv)". (I suppose
>>>> tab delimited would be another option). Then
>> there
>>>> are two check-
>>>> boxes below the window that allow a bit further
>>>> customizing, one of
>>>> them is about using value labels where defined
>>>> instead of data values.
>>> I'm now back on a machine with SPSS 14.  No csv
>> option
>>> that I can see.  Perhaps an enhancement to v15.
>> I don't have a 14, but I did check a 13 today and
>> you are correct, no  
>> csv option is there, which in my opinion is quite
>> unacceptable for a  
>> statistical package that is on its 13/14th version.
>> But there was an  
>> option for "Excel 97 and ...", and that seemed to
>> allow using value  
>> labels instead of the values (again you have to
>> check the  
>> corresponding box). So perhaps that would be an
>> option.
> 
> Not my problem at the moment but a colleague pointed
> out that he had 750+ variables. Excel handles 256. 
> 
> Actually my problem is now a SAS one where I can get a
> clean csv export but lose the variable labels.  My
> crude workaround was just to do a proc contents and
> cut-and-paste the results.  A  pain but it worked.  
> 
> I've got to figure out why I cannot get Hmisc to work!

And note that Hmisc has other SAS import options besides sas.get that 
keep labels.  Some of them require you to run PROC CONTENTS CNTLOUT= to 
save metadata in a dataset.

Frank

> 
> 
> Thanks for the suggestion.
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] help comparing two median with R

2007-04-17 Thread Frank E Harrell Jr
Thomas Lumley wrote:
> On Tue, 17 Apr 2007, Robert McFadden wrote:
> 
>>> -Original Message-
>>> From: [EMAIL PROTECTED]
>>> [mailto:[EMAIL PROTECTED] On Behalf Of Jim Lemon
>>> Sent: Tuesday, April 17, 2007 12:37 PM
>>> To: Pedro A Reche
>>> Cc: r-help@stat.math.ethz.ch
>>> Subject: Re: [R] help comparing two median with R
>>>
>>> Pedro A Reche wrote:
>>>> Dear R users,
>>>> I am new to R and  I would like to ask your help with the following
>>>> topic. I have three sets of numeral data, 2 sets are paired and a
>>>> third is independent of the other two. For each of these sets I have
>>>> obtained their basic statistics (mean, median, stdv, range ...).
>>>> Now I want to compare if these sets differ. I could compare
>>> the mean
>>>> doing a basic T test . However, I was looking for a test to compare
>>>> the medians using R.   If that is possible I would love to
>>> hear the
>>>> specifics.
>>> Hi Pedro,
>>> You can use the Mann-Whitney test ("wilcox" with two
>>> samples), but you would have to check that the second and
>>> third moments of the variable distributions were the same, I think.
>>>
>>> Jim
>> Use the Mann-Whitney U test, but remember about 2 assumptions:
>> 1. samples come from a continuous distribution (there are no tied
>> observations)
>> 2. distributions are identical in shape. It's very similar to the t-test, but
>> the Mann-Whitney U test is not as affected by violation of the homogeneity of
>> variance assumption as the t-test is.
>>
> 
> This turns out not to be quite correct.
> 
> If the two distributions differ only by a location shift then the 
> hypothesis that the shift is zero is equivalent to the medians being the 
> same (or the means, or the 3.14159th percentile), and the Mann-Whitney U 
> test will test this hypothesis. Otherwise the Mann-Whitney U test does not 
> test for equal medians.
> 
> The assumption that the distributions are continuous is for convenience -- 
> it makes the distribution of the test statistic easier to calculate and 
> otherwise R uses an approximation.  The assumption of a location shift is 
> critical -- otherwise it is easy to construct three data sets x,y,z so 
> that the Mann-Whitney U test thinks x is larger than y, y is larger than z 
> and z is larger than x (Google for Efron Dice). That is, the Mann-Whitney 
> U test cannot be a test for any location statistic.
> 
> There actually is an exact test for the median that does not assume a 
> location shift:  dichotomize your data at the pooled median to get a 2x2 
> table of above/below median by group, and do Fisher's exact test on the 
> table.  This is almost never useful (because it doesn't come with an 
> interval estimate), but is interesting because it (and the generalizations 
> to other quantiles) is the only exactly distribution-free location test 
> that does not have the 'non-transitivity' problem of the Mann-Whitney U 
> test.  I believe this median test is attributed to Mood, but I have not 
> seen the primary source.
> 
>   -thomas

The Mood test is so inefficient that its use is no longer recommended:

@Article{fri00sho,
  author  = {Freidlin, Boris and Gastwirth, Joseph L.},
  title   = {Should the median test be retired from general use?},
  journal = {American Statistician},
  year    = 2000,
  volume  = 54,
  number  = 3,
  pages   = {161-164},
  annote  = {inefficiency of Mood median test}
}

The points that Thomas and Brian have made are certainly correct, if one 
is truly interested in testing for differences in medians or means.  But 
the Wilcoxon test provides a valid test of x > y more generally.  The 
test is consonant with the Hodges-Lehmann estimator: the median of all 
possible differences between an X and a Y.
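
A brief sketch of that point on simulated data: wilcox.test() with conf.int=TRUE reports the Hodges-Lehmann shift estimate, i.e. the median of all pairwise X-Y differences.

set.seed(1)
x <- rnorm(30)
y <- rnorm(30, mean = 0.5)
wilcox.test(x, y, conf.int = TRUE)   # "difference in location" is the Hodges-Lehmann estimate
median(outer(x, y, "-"))             # the same estimator computed directly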

Frank


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Nonparametric Effect size indices

2007-04-13 Thread Frank E Harrell Jr
Chuck Cleland wrote:
> Martin Plöderl wrote:
>> Hello!
>>
>> For comparing two non-normally distributed samples, Leech (2002) suggested
>> to report nonparametric effect size indices, such as Vargha & Delaney's A or
>> Cliff's d. I tried to search the R-sites, but could not find related
>> procedures or packages that include nonparametric effect sizes. 
>> Thank you for your help!
>>
>> Citation: Leech (2002). A call for greater use of nonparametric statistics.
>> Paper presented at the Annual Meeting of the Mid-South Educational Research
>> Association, Chattanooga, TN, November 6-8.

Please beware.  That literature is just giving new names to much older 
concepts.  Look at Mann-Whitney, Wilcoxon, Kendall, Somers for better 
citations.  And Cohen's d in the pdf below should just be called a z-score.
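
A short sketch of that correspondence, reusing the g1/g2 data Chuck gives below: Cliff's d is Somers' Dxy for the two-group comparison, and Vargha & Delaney's A is just the concordance probability (the C index), here obtained from Hmisc's somers2.

library(Hmisc)
g1 <- c(1,1,2,2,2,3,3,3,4,5)
g2 <- c(1,2,3,4,4,5)
mean(sign(outer(g1, g2, "-")))   # Cliff's d = P(X>Y) - P(X<Y) = -0.25
somers2(c(g1, g2), rep(c(1, 0), c(length(g1), length(g2))))
# Dxy above equals Cliff's d; C is Vargha & Delaney's A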

Frank Harrell

> 
>   Based on the description of Cliff's d in
> http://www.florida-air.org/romano06.pdf, you could do something like the
> following:
> 
> g1 <- c(1,1,2,2,2,3,3,3,4,5)
> g2 <- c(1,2,3,4,4,5)
> 
> # Dominance matrix
> sign(outer(g1, g2, FUN="-"))
>       [,1] [,2] [,3] [,4] [,5] [,6]
>  [1,]    0   -1   -1   -1   -1   -1
>  [2,]    0   -1   -1   -1   -1   -1
>  [3,]    1    0   -1   -1   -1   -1
>  [4,]    1    0   -1   -1   -1   -1
>  [5,]    1    0   -1   -1   -1   -1
>  [6,]    1    1    0   -1   -1   -1
>  [7,]    1    1    0   -1   -1   -1
>  [8,]    1    1    0   -1   -1   -1
>  [9,]    1    1    1    0    0   -1
> [10,]    1    1    1    1    1    0
> 
> mean(rowMeans(sign(outer(g1, g2, FUN="-"))))
> [1] -0.25
> 
>   If you can point us to a description of Vargha & Delaney's A, someone
> can likely suggest a way of obtaining that too.
> 
>> Regards, 
>>
>> Martin Plöderl PhD
>> Suicide Prevention Research Program, Institute of Public Health
>> Paracelsus Private Medical University
>> Dept. of Suicide Prevention, University Clinic of Psychiatry and
>> Psychotherapy
>> Christian - Doppler - Klinik
>> Ignaz Harrerstrasse 79
>> A-5020 Salzburg
>> AUSTRIA
>> Phone: +43-662-4483-4345
>> Fax: +43-662-4483-4344
>> E-mail: [EMAIL PROTECTED]
>>
>> __
>> [EMAIL PROTECTED] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code. 
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] sas.get problem

2007-04-11 Thread Frank E Harrell Jr
>> formats.sas7bcat
>>>> not found. Formatting ignored.
>>>>  in: sas.get(library = "F:/sas", mem =
>> "form.ea1",
>>>> format.library = "sas.fmts.sas7bdat",
>>>> 2: 'cmd' execution failed with error code 1 in:
>>>> shell(cmd, wait = TRUE, intern = output)
>>> The sas.get function in the Hmisc library is
>> broken
>>> under Windows.
>>>
>>> Change line 127 from:
>>>
>>> status <- sys(paste(shQuote(sasprog),
>>> shQuote(sasin), "-log",
>>> shQuote(log.file)), output = FALSE)
>>>
>>> to:
>>>
>>> status <- system(paste(shQuote(sasprog),
>>> shQuote(sasin), "-log",
>>> shQuote(log.file)))
>>>
>>> Tim C
>> Thanks Tim,
>>
>> How do I make this change? I naively have tried by
>> a) list sas.get and copy to editor
>> b) reload R without loading Hmisc
>> c) made recommended changes to sas.get
>> d) stuck a "sas.get <- " in front of the function
>> and
>> ran it.
>>
>> R returns the same error messages.  I have also
>> corrected the typo in sasprog and done some renaming
>> following Cody's suggestions.
>>
>> Currently I have:
>>formea1.sas7bdat
>>formea2.sas7bdat
>>formats.sas7bdat
>>
>> detach("package:Hmisc")
>>  mydata <- sas.get(library="F:/sas", mem="formea1",
>>format.library="F:/sas",
>>  sasprog = '"C:/Program
>> Files/SAS/SAS9.1/sas.exe"')
>>
>> RESULTS
>> Error in sas.get(library = "F:/sas", mem =
>> "formea1",
>> format.library = "F:/sas",  :
>> SAS job failed with status -1
>> In addition: Warning messages:
>> 1: F:/sas/formats.sc? or formats.sas7bcat  not
>> found.
>> Formatting ignored.
>>  in: sas.get(library = "F:/sas", mem = "formea1",
>> format.library = "F:/sas",
>> 2: "\"C:/Program not found
>> -
>> I really don't see why the sasprog does not work
>> unless an early error is falling through.
>>
>> Thanks for all the help
>>
>> __
>> R-help@stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained,
>> reproducible code.
>>
>>
>>
>>
>>
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
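
One hedged guess about the remaining "\"C:/Program not found" message: the sasprog path above already carries embedded quotes, and the patched call applies shQuote() to it (see the changed line earlier in the thread), so the quoting may be doubled.  A sketch worth trying, not a verified fix:

mydata <- sas.get(library = "F:/sas", mem = "formea1",
                  format.library = "F:/sas",
                  sasprog = "C:/Program Files/SAS/SAS9.1/sas.exe")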


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Reasons to Use R

2007-04-10 Thread Frank E Harrell Jr
Taylor, Z Todd wrote:
> On Monday, April 09, 2007 3:23 PM, someone named Wilfred wrote:
> 
>> So what's the big deal about S using files instead of memory
>> like R. I don't get the point. Isn't there enough swap space
>> for S? (Who cares anyway: it works, isn't it?) Or are there
>> any problems with S and large datasets? I don't get it. You
>> use them, Greg. So you might discuss that issue.
> 
> S's one-to-one correspondence between S objects and filesystem
> objects is the single remaining reason I haven't completely
> converted over to R.  With S I can manage my objects via
> makefiles.  Corrections to raw data or changes to analysis
> scripts get applied to all objects in the project (and there
> are often thousands of them) by simply typing 'make'.  That
> includes everything right down to the graphics that will go
> in the report.
> 
> How do people live without that?

Personally I'd rather have R's save( ) and load( ).
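
A minimal sketch of that workflow (object and file names are arbitrary):

fit <- lm(mpg ~ wt, data = mtcars)
save(fit, file = "fit.rda")   # write the object to disk
rm(fit)
load("fit.rda")               # restores 'fit' into the workspace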

Frank

> 
> --Todd


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] read.spss (package foreign) and SPSS 15.0 files

2007-04-06 Thread Frank E Harrell Jr
Charilaos Skiadas wrote:
> On Apr 6, 2007, at 12:32 PM, John Kane wrote:
> 
>> I have simply moved to exporting the SPSS file to a
>> delimited file and loading it. Unfortunately I'm
>> losing all the labelling which can be time-consuming
>> to redo.Some of the data has something like 10
>> categories for a variable.
> 
> I save as csv format all the time, and it offers me a choice to use  
> the labels instead of the corresponding numbers. So you shouldn't  
> have to lose that labelling.
> 
> Haris Skiadas
> Department of Mathematics and Computer Science
> Hanover College

That's a different point.  The great advantage of read.spss (and the 
spss.get function in Hmisc that uses it) is that long variable labels 
are supported in addition to variable names.  That's why I like getting 
SPSS or Stata files instead of csv files.  I'm going to enhance csv.get 
in Hmisc to allow a row number to be specified, to contain long variable 
labels.
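
A small sketch of the spss.get route (the file and variable names are hypothetical):

library(Hmisc)
d <- spss.get("survey.sav")   # builds on foreign::read.spss
label(d$q1)                   # long variable label carried over from SPSS
contents(d)                   # names, labels, and levels for all variables at once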

Frank

> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Doing partial-f test for stepwise regression

2007-04-01 Thread Frank E Harrell Jr
Petr Klasterecky wrote:
> And what about reading the help page, ?anova ...?
> 
>  >>>
> When given a sequence of objects, 'anova' tests the models against
>   one another in the order specified.
> <<<
> 
> Generally you almost never fit a full model (including all possible 
> interactions etc) - no one can interpret such complicated models. Anova 
> gives you a comparison between a broader model (the first argument to 
> anova) and its submodel(s).

True you might not fit a model with high-order interactions, but the 
full pre-specified model is the only one whose standard errors and test 
statistics work as advertised.
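
For the mechanics asked about in the quoted thread, a minimal illustration on a built-in data set (the variables are arbitrary): the order of the arguments does not change the F statistic; anova() simply compares the two fits.

full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt,             data = mtcars)
anova(reduced, full)   # partial F test for dropping hp and qsec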

Frank

> 
> Petr
> 
> [EMAIL PROTECTED] napsal(a):
>> Hello all,
>> I am trying to figure out an optimal linear model by using stepwise
>> regression which requires partial f-test, I did some Googling on the
>> Internet and realised that someone seemed to ask the question before:
>>
>> Jim Milks <[EMAIL PROTECTED]> writes: 
>>> Dear all: 
>>>
>>> I have a regression model that has collinearity problems (between 
>>> three regressor variables). I need a F-test that will allow me to 
>>> compare between full (with all variables) and partial models (minus 
>>> 1=< variables). The general F-test formula I'm using is: 
>>>
>>> F = {[SS(full model) - SS(reduced model)] / (#variables taken out)} / 
>>> MSS(full model) 
>>>
>>> Unfortunately, the ANOVA table parses the SS and MSS between the 
>>> variables and does not give the statistics for the regression model as 
>>> a whole, otherwise I'd do this by hand. 
>>>
>>> So, really, I have two questions: 1) Can I just add up all the SS and 
>>> MSS for all the variables to get the model SS and MSS and 2) Are 
>>> there any functions or packages I can use to calculate the F-statistic? 
>>> Just use anova(model1, model2). 
>>> (One potential catch: Make sure that both models are fitted to the same
>>> data set. Missing values in predictors may interfere.) 
>> However, in the answer provided by Mr. Peter Dalgaard (use
>> anova(model1,model2)), I could not understand what model1 and model2 are
>> supposed to be referring to: which one is supposed to be the full model and
>> which one the partial model? Or does it not matter?
>>
>> Thanks in advance for help from anyone!
>>
>> Regards,
>> Anyi Zhu
>>
>> __
>> R-help@stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Wikibooks

2007-03-29 Thread Frank E Harrell Jr
Ben Bolker wrote:
> Alberto Monteiro  centroin.com.br> writes:
> 
>> As a big fan of Wikipedia, it's frustrating to see how little there is about 
>> R in the correlated project, the Wikibooks:
>>
>> http://en.wikibooks.org/wiki/R_Programming
>>
>> Alberto Monteiro
>>
> 
>   Well, we do have an R wiki -- http://wiki.r-project.org/rwiki/doku.php --
> although it is not as active as I'd like.  (We got stuck halfway through
> porting Paul Johnson's "R Tips" to it ...)   Please contribute!
>   Most of the (considerable) effort people expend in answering
> questions about R goes to the mailing lists -- I personally would like it if 
> some
> tiny fraction of that energy could be redirected toward the wiki, where
> information can be presented in a nicer format and (ideally) polished
> over time -- rather than having to dig back through multiple threads on the
> mailing lists to get answers.  (After that we have to get people
> to look for the answers on the wiki.)

I would like to strongly second Ben.  In some ways, R experts are too 
nice.  Continuing to answer the same questions over and over does not 
build a lasting resource the way the R wiki would.  I would rather see the 
work go into enhancing the wiki and refactoring information, and see 
responses to many r-help pleas for help be "see wiki topic x".  While 
doing this let's consider putting a little more burden on new users to 
look for good answers already provided.

Frank

> 
>   Just my two cents -- and I've been delinquent in my 
> wiki'ing recently too ...
> 
>   Ben Bolker
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] what is the difference between survival analysis and (...)

2007-03-29 Thread Frank E Harrell Jr
Thomas Lumley wrote:
> On Wed, 28 Mar 2007, Frank E Harrell Jr wrote:
> 
>> Eric Elguero wrote:
>>> Hi everybody,
>>>
>>> recently I had to teach a course on Cox model, of which I am
>>> not a specialist, to an audience of medical epidemiologists.
>>> Not a good idea you might say.. anyway, someone in the
>>> audience was very hostile. At some point, he said that
>>> Cox model was useless, since all you have to do is count
>>> who dies and who survives, divide by the sample sizes
>>> and compute a relative risk, and if there was significant
>>> censoring, use cumulated follow-up instead of sample
>>> sizes and that's it!
>>> I began arguing that in Cox model you could introduce
>>> several variables, interactions, etc, then I remembered
>>> of logistic models ;-)
>>> The only (and poor) argument I could think of was that
>>> if mr Cox took pains to devise his model, there should
>>> be some reason...
>>
>> That is a very ignorant person, concerning statistical
>> efficiency/power/precision and how to handle incomplete follow-up
>> (variable follow-up duration).  There are papers in the literature (I
>> wish I had them at my fingertips) that go into the efficiency loss of
>> just counting events.  If the events are very rare, knowing the time
>> doesn't help as much, but the Cox model still can handle censoring
>> correctly and that person's approach doesn't.
>>
> 
> Certainly just counting the events is inefficient -- the simplest 
> example would be studies of some advanced cancers where nearly everyone 
> dies during followup, so that there is little or no censoring but simple 
> counts are completely uninformative.
> 
> It's relatively hard to come up with an example where using the 
> total-time-on-test (rather than sample size) as a denominator is much 
> worse than the Cox mode, though. You need the baseline hazard to vary a 
> lot over time and the censoring patterns to be quite different in the 
> groups, but proportional hazards to still hold.
> 
> I think the advantages of the Cox model over a reasonably sensible 
> person-time analysis are real, but not dramatic -- it would be hard to 
> find a data set that would convince the sort of person who would make 
> that sort of claim.
> 
> I would argue that computational convenience on the one hand, and the 
> ability to exercise lots of nice mathematical tools on the other hand 
> have also contributed to the continuing popularity of the Cox model.
> 
> 
> -thomas
> 
> Thomas Lumley            Assoc. Professor, Biostatistics
> [EMAIL PROTECTED]        University of Washington, Seattle
> 
> 
> 

Nicely put Thomas.  I have seen examples from surgical research where 
the hazard function is bathtub shaped and the epidemiologist's use of 
the exponential distribution is very problematic.  I have also seen 
examples in acute illness and medical treatment where time until death 
is important even with only 30-day follow-up.
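
For readers who want to try the comparison themselves, a sketch using the survival package's lung data contrasts the three analyses discussed above; it illustrates the mechanics only, not a dramatic difference.

library(survival)
d <- na.omit(lung[, c("time", "status", "sex")])
with(d, tapply(status == 2, sex, mean))                          # crude proportion dying
with(d, tapply(status == 2, sex, sum) / tapply(time, sex, sum))  # person-time event rates
coxph(Surv(time, status) ~ sex, data = d)                        # Cox model uses the event times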

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] weight factor in somers2 function

2007-03-28 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
> Hi!
> I’m trying to calculate the C index (concordance probability) through the
> somers2 function (library Hmisc). I’m interested in including the
> sampling effort as a weight factor for the evaluation of model predictions
> with real data. I have some questions about that: first of all, I’m not
> really sure if I can include sampling effort as a weight factor. Since the
> weight factor should be a numeric vector of observations (usually
> frequencies), I would expect that sampling effort could be a surrogate for
> the frequency count of the number of subjects (i.e. frequency of
> observation). However, when I use sampling effort as a weight factor, I
> get a C index larger than one. I guess/know this is statistically wrong.
> Then, if these values were frequencies of observation, what is working
> incorrectly? What should be the characteristics of the weight vector? Or
> what exactly could be included as the weight factor?
> Thank you very much!

Send me the smallest artificial example you can construct and I'll work 
on it.
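
As a starting point for such an artificial example (hypothetical data): with ordinary frequency-type weights the C index should stay in [0, 1], and all-equal weights should reproduce the unweighted result.

library(Hmisc)
set.seed(1)
y <- rbinom(50, 1, 0.4)               # binary outcome
x <- runif(50)                        # predicted probability or score
somers2(x, y)                         # unweighted
somers2(x, y, weights = rep(2, 50))   # equal frequency weights: same C expected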

Frank


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] what is the difference between survival analysis and (...)

2007-03-28 Thread Frank E Harrell Jr
Eric Elguero wrote:
> Hi everybody,
> 
> recently I had to teach a course on Cox model, of which I am
> not a specialist, to an audience of medical epidemiologists.
> Not a good idea you might say.. anyway, someone in the
> audience was very hostile. At some point, he said that
> Cox model was useless, since all you have to do is count
> who dies and who survives, divide by the sample sizes
> and compute a relative risk, and if there was significant
> censoring, use cumulated follow-up instead of sample
> sizes and that's it!
> I began arguing that in Cox model you could introduce
> several variables, interactions, etc, then I remembered
> logistic models ;-)
> The only (and poor) argument I could think of was that
> if mr Cox took pains to devise his model, there should
> be some reason...

That is a very ignorant person, concerning statistical 
efficiency/power/precision and how to handle incomplete follow-up 
(variable follow-up duration).  There are papers in the literature (I 
wish I had them at my fingertips) that go into the efficiency loss of 
just counting events.  If the events are very rare, knowing the time 
doesn't help as much, but the Cox model still can handle censoring 
correctly and that person's approach doesn't.

Frank

> 
> but the story doesn't end here. When I came back to my office,
> I tried these two methods on a couple of data sets, and true,
> crude RRs are very close to those coming from Cox model.
> 
> hence this question: could someone provide me with a
> dataset (preferably real) where there is a striking
> difference between estimated RRs and/or between
> P-values? and of course I am interested in theoretical
> arguments and references.
> 
> sorry that this question has nothing to do with R
> and thank you in advance for your leniency.
> 
> Eric Elguero
> GEMI-UMR 2724 IRD-CNRS,
> Équipe "Évolution des Systèmes Symbiotiques"
> 911 avenue Agropolis, BP 64501,
> 34394 Montpellier cedex 5 FRANCE
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to get "lsmeans"?

2007-03-22 Thread Frank E Harrell Jr
John Fox wrote:
> Dear Frank,
> 
> The object is to focus on a high-order term of the model, holding other
> terms "constant" at typical values; in the case of a factor, a "typical
> value" is ambiguous, so an average is taken over the levels of the factor.
> If the factor is, e.g., gender, one could produce an estimate of the average
> fitted value for a population composed equally of men and women, or of a
> population composed of men and women in proportion to their distribution in
> the data. Otherwise, one would have to produce separate sets of fitted
> values for men and women, with the number of such sets increasing as the
> number of levels of the factors held constant increase. On the scale of the
> linear predictor, these sets would differ only by constants.
> 
> Regards,
>  John

Makes sense.  I just set other factors to the mode, and if it is 
important to see estimates for other categories, I give estimates for 
each factor level.  If I want to uncondition on a variable (not often) I 
take the proper weighted average of predicted values.
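
A small base-R sketch of the "predictions at typical settings" idea (the data set and variables are arbitrary): the factor of interest is varied while the other covariate is held at a typical value.

fit <- lm(mpg ~ factor(cyl) + wt, data = mtcars)
nd  <- data.frame(cyl = c(4, 6, 8), wt = median(mtcars$wt))
predict(fit, newdata = nd)   # adjusted predictions for each cyl level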

Cheers
Frank

> 
> 
> John Fox
> Department of Sociology
> McMaster University
> Hamilton, Ontario
> Canada L8S 4M4
> 905-525-9140x23604
> http://socserv.mcmaster.ca/jfox 
>  
> 
>> -Original Message-
>> From: Frank E Harrell Jr [mailto:[EMAIL PROTECTED] 
>> Sent: Thursday, March 22, 2007 8:41 AM
>> To: John Fox
>> Cc: 'Prof Brian Ripley'; 'r-help'; 'Chuck Cleland'; 'Liaw, Andy'
>> Subject: Re: [R] how to get "lsmeans"?
>>
>> John Fox wrote:
>>> Dear Brian et al.,
>>>
>>> My apologies for chiming in late: It's been a busy day.
>>>
>>> First some general comments on "least-squares means" and 
>> "effect displays."
>> ... much good stuff deleted ...
>>
>> John - the one thing I didn't get from your post is a 
>> motivation to do all that as opposed to easy-to-explain 
>> predicted values.  I would appreciate your thoughts.
>>
>> Thanks
>> Frank
>>
>> -- 
>> Frank E Harrell Jr   Professor and Chair   School of Medicine
>>   Department of Biostatistics   
>> Vanderbilt University
> 
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to get "lsmeans"?

2007-03-22 Thread Frank E Harrell Jr
John Fox wrote:
> Dear Brian et al.,
> 
> My apologies for chiming in late: It's been a busy day.
> 
> First some general comments on "least-squares means" and "effect displays."
... much good stuff deleted ...

John - the one thing I didn't get from your post is a motivation to do 
all that as opposed to easy-to-explain predicted values.  I would 
appreciate your thoughts.

Thanks
Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Stepwise Logistic Regression

2007-03-21 Thread Frank E Harrell Jr
Sergio Della Franca wrote:
> Dear R-Helpers,
> 
> I'd like to perform a Logistic Regression whit a Stepwise method.
> 
> 
> Can you tell me how can i proceed to develop this procedure?
> 
> 
> Thank you in advance.
> 
> 
> Sergio Della Franca.

If the number of events is not incredibly large, you can get almost the 
same result with the following code :-)

candidates <- setdiff(names(mydataframe), 'y')
p <- length(candidates)
sample(candidates, sample(round(p/4):round(p/2), 1))

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to get "lsmeans"?

2007-03-21 Thread Frank E Harrell Jr
ata=a;  class A B C;  model Y=A B C d; 
>>>  lsmeans A B C/cl; run;  
>>>
>>> In R, I tried this:  
>>>  library(Design)
>>>  ddist<-datadist(a)
>>>  options(datadist="ddist")
>>>  f<-ols(Y~A+B+C+D,data=a,x=TRUE,y=TRUE,se.fit=TRUE)  
>>>
>>> then how to get the "lsmeans" for A, B, and C, respectively 
>>> with predict function?
>>>
>>>  
>>>
>>> Best wishes
>>> yours, sincerely 
>>> Xingwang Ye
>>> PhD candidate 
>>> Research Group of Nutrition Related Cancers and Other Chronic 
>>> Diseases  
>>> Institute for Nutritional Sciences,  
>>> Shanghai Institutes of Biological Sciences, 
>>> Chinese Academy of Sciences 
>>> P.O.Box 32 
>>> 294 Taiyuan Road 
>>> Shanghai 200031 
>>> P.R.CHINA
>>>
>>>
>>
>> --
>> Notice:  This e-mail message, together with any attachments,...{{dropped}}
>>
>> __
>> R-help@stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R and clinical studies

2007-03-20 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
> Thank you to all those that responded to Delphine's original post on R and
> clinical studies.  They have provided much food for thought.
> 
> I had a couple of follow up questions/comments.  Andrew is very correct in
> pointing out that there are classes and workshops available for R.  It's my
> understanding that there are even commercial versions of R that now provide
> formal commercial-style courses.  And at any rate, the money saved by
> potentially avoiding pricey software could certainly justify any training
> expense in time or money  - this assumes of course that the pricey software
> could be dispensed with (I suspect that would take considerable time at my
> current company as so many legacy projects have been done in proprietary
> software).  I still think that R provides less 'hand-holding' and requires
> more initiative (which may be more or less present on a per
> programmer/statistician basis).
> 
> I guess one could always integrate R/Splus in with SAS, as Terry's group
> has done at Mayo - I will probably do this at least as a start.  I have a
> few concerns with regards to this approach (these may be needless concerns,
> but I will venture expressing them anyway).  First, I'm worried about the
> possibility of compatability concerns (will anyone be worried about a SAS
> dataset read into R or vice-versa?).  Second, I would prefer focusing all
> my learning on one package if possible.  I actually have more experience
> with SAS (as do others in my group), and if the switch to R is to be made I
> would like to make that switch as complete as possible.   This would also
> avoid requiring new hires to know both languages.  Third, if SAS is to be
> kept around, it defeats one of the main advantages of having open source
> code in the first place (R is wonderfully free!).  Like Mayo, Baylor Health
> (my previous employer) used both Splus and SAS.  I was warned that data
> manipulation would be much more difficult in R/Splus than it was in SAS.
> To be honest, and I say this humbly realizing that most posters to this
> list have much more experience than I, I haven't found data manipulation to
> be that much more difficult in R/Splus (at least as I have gained
> experience in R/Splus).   I can think of two exceptions (1) large datasets
> and (2) SAS seems to play nicer with MS products (e.g. PROC IMPORT seemed
> to read in messy Excel spreadsheets better than importData in Splus).  Is
> it possible (and I again say this with MUCH humility) that the perceived
> advantages of SAS with regards to data manipulation may be due in part to
> some users only using R/Splus for stat modeling and graphics (thus never
> becoming familiar with the data manipulation capabilities of R/Splus) or to
> the reluctance of SAS-trained individuals and companies to make the
> complete switch?

You are exactly correct on this point.  Some graduate programs only 
teach students how to use R/S-Plus for modeling and graphics.  R/S-Plus 
are wonderful for data manipulation - more powerful than SAS but not 
easy to learn (plus in R there are sometimes too many ways to do 
something; new users get lost - e.g. the reshape and reShape functions 
and the reshape package). 
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf has many 
examples of complex data manipulation as do some web sites.  We do 
analysis for pharmaceutical companies with 100% of the data manipulation 
done in R after importing say 50 SAS datasets into R.  Doing tasks such 
as finding a lab value measured the closest in time to some event is 
much more elegant in R/S-Plus than in SAS.
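
A sketch of that kind of manipulation on made-up data: for each subject, pick the lab value measured closest in time to an event.

labs   <- data.frame(id = c(1, 1, 1, 2, 2), day = c(-3, 1, 10, -1, 4),
                     value = c(5.1, 5.4, 6.0, 4.8, 5.2))
events <- data.frame(id = c(1, 2), event.day = c(2, 3))
m <- merge(labs, events, by = "id")
m <- m[order(m$id, abs(m$day - m$event.day)), ]
m[!duplicated(m$id), ]   # first row per id = lab value closest in time to the event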

Frank

> 
> Tony, the story about the "famous software" and the "certain operating
> system" at the "large company" was priceless.
> 
> In closing, I should mention that in all posts I am speaking for myself and
> not for Edwards LifeSciences.
> 
> Regards,
> -Cody
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R and clinical studies

2007-03-16 Thread Frank E Harrell Jr
Delphine Fontaine wrote:
> Thanks for your answer which was very helpfull. I have another question:
> 
> I have read in this document  
> (http://cran.r-project.org/doc/manuals/R-intro.pdf) that most of the  
> programs written in R are ephemeral and that new releases are not  
> always compatible with previous releases. What I would like to know is
> whether R functions are already validated and, if not, what we should do to
> validate an R function?
> 

In the sense in which most persons use the term 'validate', it means to 
show with one or more datasets that the function is capable of producing 
the right answer.  It doesn't mean that it produces the right answer for 
every dataset although we hope it does.  [As an aside, most errors are 
in the data manipulation phase, not in the analysis phase.]  So I think 
that instead of validating functions we should spend more effort on 
validating analyses [and validating analysis file derivation].  Pivotal 
analyses can be re-done a variety of ways, in R or in separate 
programmable packages such as Stata.

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ols Error : missing value where TRUE/FALSE needed

2007-03-14 Thread Frank E Harrell Jr
Charles Evans wrote:
> I have installed Hmisc and Design.  When I use ols, I get the  
> following error message:
> 
> Error in if (!length(fname) || !any(fname == zname)) { :
>   missing value where TRUE/FALSE needed
> 
> The model that I am running is:
> 
>  > ecools <- ols(eco$exp ~ eco$age + eco$own + eco$inc + inc2, x=TRUE)

ecools <- ols(exp ~ age + own + inc + inc2, data=eco, x=TRUE)

Watch out for variables named exp (exp is also the name of a built-in function), but it is probably OK here.
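
Since the original question mentions robust standard errors: Design's robcov() can be applied to the fit afterwards.  A sketch assuming the poster's eco data frame; I believe robcov requires the fit to have been made with x=TRUE and y=TRUE.

ecools <- ols(exp ~ age + own + inc + inc2, data = eco, x = TRUE, y = TRUE)
robcov(ecools)   # Huber sandwich covariance estimates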

Frank Harrell

> 
> I have tried several other combinations of arguments that take TRUE/ 
> FALSE values, but no luck.
> 
> Ultimately, I am trying to calculate robust standard errors.
> 
> Any help would be appreciated.
> 
> Charles Evans
> Executive Director
> Free Curricula Center
> 
> 
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R and SAS proc format

2007-03-07 Thread Frank E Harrell Jr
Ulrike Grömping wrote:
> 
> 
>>> The down side to R's factor solution: 
>>> The numerical values of factors are always 1 to number of levels. Thus, it
>>> can be tough and requires great care to work with studies that have both
>>> numerical values different from this and value labels. This situation is
>>> currently not well-supported by R.
>>>
>>> Regards, Ulrike
>>>
>>> P.S.: I fully agree with Frank regarding the annoyance one sometimes
>>> encounters with formats in SAS! 
> 
>  > You can add an attribute to a variable.  In the sas.get function in the
>  > Hmisc package for example, when importing SAS variables that have PROC
>  > FORMAT value labels, an attribute 'sas.codes' keeps the original codes;
>  > these can be retrieved using sas.codes(variable name).  This could be
>  > done outside the SAS import context also.
>  >
>  > Frank
>  > --
>  > Frank E Harrell Jr   Professor and Chair   School of Medicine
>  >   Department of Biostatistics   Vanderbilt 
> University
> 
> Frank,
> 
> are these attributes preserved when merging or subsetting a data frame?
> Are they used in R packages other than Hmisc and Design (e.g. in a 
> simple table request)?

No; one would need to add methods like those used by the Hmisc label or 
impute functions.  And they are not used outside Hmisc/Design.  In fact I 
have little need for them, as I always find the final labels to be the key 
for analysis.

> 
> If this is the case, my wishlist items 8658 and 8659 
> (http://bugs.r-project.org/cgi-bin/R/wishlist?id=8658;user=guest, 
> http://bugs.r-project.org/cgi-bin/R/wishlist?id=8659;user=guest) can be 
> closed.
> Otherwise, I maintain the opinion that there are workarounds but that R 
> is not satisfactorily able to handle this type of data.

R gives the framework for doing this elegantly but the user has an 
overhead of implementing new methods for such attributes.

Cheers
Frank

> 
> Regards, Ulrike
> 
> 
> *--- End of Original Message ---*


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R and SAS proc format

2007-03-06 Thread Frank E Harrell Jr
Ulrike Grömping wrote:
> The down side to R's factor solution: 
> The numerical values of factors are always 1 to number of levels. Thus, it
> can be tough and requires great care to work with studies that have both
> numerical values different from this and value labels. This situation is
> currently not well-supported by R.

You can add an attribute to a variable.  In the sas.get function in the 
Hmisc package for example, when importing SAS variables that have PROC 
FORMAT value labels, an attribute 'sas.codes' keeps the original codes; 
these can be retrieved using sas.codes(variable name).  This could be 
done outside the SAS import context also.
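
A tiny sketch of doing that by hand (names and codes are hypothetical):

codes <- c(10, 20, 20, 30)                       # original numeric codes
trt <- factor(codes, levels = c(10, 20, 30),
              labels = c("placebo", "low dose", "high dose"))
attr(trt, "sas.codes") <- codes                  # keep the codes on the variable
attr(trt, "sas.codes")                           # retrieve them later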

Frank

> 
> Regards, Ulrike
> 
> P.S.: I fully agree with Frank regarding the annoyance one sometimes
> encounters with formats in SAS! 
> 
> 
> lamack lamack wrote:
>> Dear all, Is there an R equivalent to SAS's proc format?
>>
>> Best regards
>>
>> J. Lamack
>>
>>
>> __
>> R-help@stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R and SAS proc format

2007-03-06 Thread Frank E Harrell Jr
lamack lamack wrote:
> Dear all, Is there an R equivalent to SAS's proc format?
> 
> Best regards
> 
> J. Lamack

Fortunately not.  SAS is one of the few large systems that does not 
implicitly support value labels and that separates label information 
from the database [I can't count the number of times someone has sent me 
a SAS dataset and forgotten to send the PROC FORMAT value labels].  See 
the factor function for information about how R does this.

Frank Harrell

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] bootstrap

2007-02-28 Thread Frank E Harrell Jr
Indermaur Lukas wrote:
> Hi,
> I would like to evaluate the frequency of the variables within the best 
> selected model by AIC among a set of 12 competing models (I fit them with 
> GLM) with a bootstrap procedure to get unbiased results. So I would like to 
> do the ranking of the 12-model-set 10'000 times separately and calculate the 
> frequency of variables of the 10'000 best ranked models. I wrote a script 
> doing the model ranking (with some difficulty) for the pre-defined 
> 12-model-set, that works. My model results (one row per model) are stored in 
> a dataframe called "mydataframe". How can I implement my script simply into a 
> bootstrap and sum up the frequencies of variables of the 10'000 bootstraped 
> results?
>  
> I am not so good in R and would appreciate any hint.
>  
> Regards
> Lukas

I'm not entirely clear on your goals but the bootstrap selection 
frequency is mainly a replay of the significance of variables in the 
full model.  I don't see that it provides new information.  And 
selection frequency is marred by collinearity.
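
With those caveats in mind, a rough sketch of the bootstrap ranking Lukas describes, on simulated data (the model set and sample size are arbitrary):

set.seed(1)
dat <- data.frame(y = rbinom(100, 1, 0.5),
                  x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
models <- list(y ~ x1, y ~ x2, y ~ x1 + x2, y ~ x1 + x2 + x3)
B <- 200
winners <- lapply(seq_len(B), function(i) {
  b    <- dat[sample(nrow(dat), replace = TRUE), ]
  aics <- sapply(models, function(f) AIC(glm(f, family = binomial, data = b)))
  all.vars(models[[which.min(aics)]])[-1]   # predictors in the best-AIC model
})
table(unlist(winners)) / B                  # bootstrap selection frequency per variable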

Frank

>  
>  
> °°° 
> Lukas Indermaur, PhD student 
> eawag / Swiss Federal Institute of Aquatic Science and Technology 
> ECO - Department of Aquatic Ecology
> Überlandstrasse 133
> CH-8600 Dübendorf
> Switzerland
>  
> Phone: +41 (0) 71 220 38 25
> Fax: +41 (0) 44 823 53 15 
> Email: [EMAIL PROTECTED]
> www.lukasindermaur.ch
> 
> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

