Re: [R] UseR! 2010 - my impressions

2010-07-24 Thread Frank E Harrell Jr

On 07/23/2010 06:50 PM, Ravi Varadhan wrote:

Dear UseRs!,

Everything about UseR! 2010 was terrific!  I really mean everything - the 
tutorials, invited talks, kaleidoscope sessions, focus sessions, breakfast, snacks, 
lunch, conference dinner, shuttle services, and the participants. The organization was 
fabulous.  NIST was a gracious host, and provided top-notch facilities.  The rousing 
speech by Antonio Possolo, chief of the Statistical Engineering Division at NIST, 
set the tempo for the entire conference.  There were excellent invited lectures by Luke Tierney, 
Frank Harrell, Mark Handcock, Diethelm Wurtz, Uwe Ligges, and Fritz Leisch.  All the 
sessions that I attended had many interesting ideas and useful contributions.  During the 
whole time that I was there, I could not help but feel that I was a part of 
something great.

Before I end, let me add a few words about a special person.  This conference 
would not have been as great as it was without the tireless efforts of Kate 
Mullen.  The great thing about Kate is that she did so much without ever 
hogging the limelight.  Thank you, Kate and thank you NIST!

I cannot wait for UseR! 2011!

Best,
Ravi.



Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvarad...@jhmi.edu


I want to echo what Ravi said.  The talks were terrific (thanks to the 
program committee and the speakers) and Kate Mullen and her team did an 
extraordinary job in putting the conference together and running it.  I 
am proud to have been a part of it.  Thank you all!


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] c-statiscs 95% CI for cox regression model

2010-07-24 Thread Frank E Harrell Jr

On 07/24/2010 09:51 PM, paaventhan jeyaganth wrote:


Dear all,
How can I calculate the C-statistic and its
95% confidence interval for a Cox regression model?
  Thanks very much for any help.
  Paaveenthan


install.packages('Hmisc')
require(Hmisc)
?rcorr.cens   # there is an example at the bottom
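
For instance, a minimal sketch (made-up data, not from the thread; rcorr.cens
reports Dxy and its S.D., and C = (Dxy + 1)/2, so SE(C) is S.D./2):

library(Hmisc)
library(survival)
set.seed(1)
d <- data.frame(time = rexp(100), status = rbinom(100, 1, 0.7),
                x = rnorm(100))
fit <- coxph(Surv(time, status) ~ x, data = d)
# higher linear predictor = higher risk = shorter survival, hence the minus
w  <- rcorr.cens(-predict(fit, type = "lp"), with(d, Surv(time, status)))
C  <- w["C Index"]
se <- w["S.D."] / 2   # reported S.D. is for Dxy; C = (Dxy + 1)/2
c(C, C - 1.96 * se, C + 1.96 * se)   # C and an approximate 95% CI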

Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problem with offset (Glm or glmD) in Design library

2010-07-15 Thread Frank E Harrell Jr

On 07/15/2010 07:59 AM, Claus Dethlefsen / Region Nordjylland wrote:

Dear List

My question relates to the rms (or Design) package. I have modified
the example from help(Glm) to include an offset argument. It works
with glm but Glm gives the error

Error in if (!length(fname) || !any(fname == zname)) { : missing
value where TRUE/FALSE needed


# -- begin example --
library(rms)
of <- 100:92

counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
f <- glm(counts ~ outcome + treatment, family=poisson(), offset=of)
f
anova(f)
summary(f)
f <- Glm(counts ~ outcome + treatment, family=poisson(), offset=of)
# -- end example --


That should be counts ~ outcome + treatment + offset(of)
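
For example, with the data above, a minimal sketch of the corrected call
(not from the original message) would be:

f <- Glm(counts ~ outcome + treatment + offset(of), family = poisson())
f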

Frank



The same error occurs using glmD from the Design package.

I use R version 2.11.1 for Windows and have installed the rms
package together with Design and Hmisc.

Is this a bug or am I missing something?

Best regards,

Claus  Claus Dethlefsen, MSc,
PhD Biostatistician at Center for Cardiovascular Research Aalborg
Hospital, Aarhus University Hospital, Denmark

__ R-help@r-project.org
mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do
read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R's Data Dredging Philosophy for Distribution Fitting

2010-07-14 Thread Frank E Harrell Jr

On 07/14/2010 06:22 PM, emorway wrote:


Forum,

I'm a grad student in Civil Eng, took some Stats classes that required
students learn R, and I have since taken to R and use it for as much as I
can.  Back in my lab/office, many of my fellow grad students still use
proprietary software at the behest of advisers who are familiar with the
recommended software (Statistica, @Risk (Excel Add-on), etc).  I have spent
a lot of time learning R and am confident it can generally out-process,
out-graph, or more simply stated, out-perform most of these other software
packages.  However, one area in which my view has been humbled is distribution
fitting.

I started by reading through
http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf  After that
I started digging around on this forum and found posts like this one
http://r.789695.n4.nabble.com/Fitting-usual-distributions-td80.html#a80
that are close to what I'm after.  That is, given an observation dataset, I
would like to call a function that cycles through numerous distributions
(common or not) and then ranks them for me based on Chi-Square,
Kolmogorov-Smirnov and/or Anderson-Darling, for example.

This question was asked back in 2004:
http://finzi.psych.upenn.edu/R/Rhelp02a/archive/37053.html but the response
was that this kind of thing wasn't in R nor in proprietary software to the
best of the responding author's memory.  In 2010, however, this is no longer
true as @Risk's
(http://www.palisade.com/risk/?gclid=CKvblPSM7KICFZQz5wodDRI2fg)
Distribution Fitting function does this very thing.  And it is here that
my R pride has taken a hit.  Based on the first response to the question
posed here
http://r.789695.n4.nabble.com/Which-distribution-best-fits-the-data-td859448.html#a859448
is it fair to say that the R community (I realize this is only 1 view) would
take exception to this kind of data mining?

Unless I've missed a discussion of a package that does this very thing, it
seems as though I would need to code something up using fitdistr() and do
all the ranking myself.  Undoubtedly that would be a good exercise for me,
but it's hard for me to believe R would be a runner-up to something like
@Risk for distribution fitting.

Eric


Eric,

I didn't read the links you provided but the approach you have advocated 
(and you are not alone) is futile.  If you entertain more than about 2 
distributions, the variance of the final fits is no better than the 
variance of the empirical cumulative distribution function (once you 
properly adjust variances for model uncertainty).  So just go empirical. 
 In general if your touchstone is the observed data (as in checking 
goodness of fit of various parametric distributions), your final 
estimators will have the variance of empirical estimators.
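
For example (a quick sketch, not part of the original reply), the empirical
CDF involves no distributional choice at all:

set.seed(1)
x  <- rgamma(200, shape = 2)
Fn <- ecdf(x)   # the empirical CDF; no distribution needs to be chosen
plot(Fn)
Fn(1.5)         # estimated P(X <= 1.5)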


Frank
--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] SAS Proc summary/means as a R function

2010-07-13 Thread Frank E Harrell Jr
What is the original intent?  The bandwidth:productivity ratio is not 
looking encouraging for this problem.


Frank

On 07/13/2010 12:38 PM, schuster wrote:


Hello,

are you trying to parse SAS code (or lightly modified SAS code) and run it in R?

Then you are right: the hard part is parsing the code. I don't believe that's
possible without a custom parser, and even then it's really hard to parse all
the SAS sub languages right: data step, macro code and macro variables, IML,
SAS Procedures etc.



On Tuesday 13 July 2010 02:39:22 pm Roger Deangelis wrote:

Thanks Richard and Erik,

I hate to buy the book and not find the solution to the following:

proc.means <- function() {
   deparse(match.call()[-1])
}

proc.means(this is a sentence)

Error: unexpected symbol in "proc.means(this is"

One possible solution would be to 'peek' into the memory buffer that holds
the
function arguments.

It is easy to replicate the 'dataset' output for many SAS procs (i.e.
transpose, freq, summary, means, ...).
I am not interested in 'report writing' in R.

The hard part is parsing the SAS syntax; I wish R had a drop-down to Perl:

perl on;

some perl code

perl off;

also

sas on;

   some SAS code

sas off;

The purpose of PARMBUFF is to turn off R's scanning and resolution of
function arguments
and just provide the bare text between '(' and ')' in the function call.

This is a very powerful construct.

A function would provide something like

sas.on(


)
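
A rough R analogue of PARMBUFF, as a sketch only (grab is a made-up name):
substitute() captures the unevaluated argument text, but the call must
still parse as R, which is exactly why the unquoted text above fails.

grab <- function(...) sapply(as.list(substitute(list(...)))[-1], deparse)
grab(a, b + 1)   # returns c("a", "b + 1") without evaluating a or b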






--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to plot two histograms overlapped in the same plane coordinate

2010-07-09 Thread Frank E Harrell Jr
Empirical CDFs are much better for this purpose, and allow 
superpositioning (see e.g. the Ecdf function in the Hmisc package). 
Otherwise look at histbackback in Hmisc.
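
For example, a minimal sketch (with made-up data shaped like the question):

library(Hmisc)
set.seed(1)
age <- c(rnorm(100, 1.5, 1), rnorm(100, 5, 1))
sex <- factor(rep(c("F", "M"), each = 100))
Ecdf(age, group = sex, label.curves = TRUE)  # superposed ECDFs by group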


Frank

On 07/09/2010 11:40 AM, Andrew Miles wrote:



I'm not sure what you are trying to do. Do you want one histogram for
males and one for females on the same graph? If so, the simplest way to
put two histograms together is to simply use the add parameter:

age.males <- age[which(sex == "M")]
age.females <- age[which(sex == "F")]

hist(age.males, col = "blue")
hist(age.females, add = TRUE)

The only problem is that the hist() function does not do
semi-transparency. I am not sure if other packages do. The code above
will give you a blue histogram for males, and clear histogram for
females on top of it. You'll probably have to manually alter the axes of
the histogram to give the histograms for males and females the same
break points (i.e. where one bar stops and another begins). See ?hist
for more information about that.
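
One workaround, as a rough sketch (base graphics can do semi-transparency
via the alpha argument of rgb(), on devices that support it):

set.seed(1)
age.males   <- rnorm(100, 50, 5)
age.females <- rnorm(100, 45, 5)
brk <- pretty(range(c(age.males, age.females)), 20)  # shared break points
hist(age.males,   breaks = brk, col = rgb(0, 0, 1, 0.5),
     main = "Age by sex", xlab = "age")
hist(age.females, breaks = brk, col = rgb(1, 0, 0, 0.5), add = TRUE)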

Andrew Miles
Department of Sociology
Duke University

On Jul 9, 2010, at 9:29 AM, Mao Jianfeng wrote:


Dear R-help listers,

I am new. I just want to get help on how to plot two histograms
overlapped in the same plane coordinate. What I did is very ugly.
Could you please help me to improve it? I want to get a plot with a semi-
transparent overlapping region. And I want to know how to specify the
fill colors of the different histograms.

I would also prefer solutions other than ggplot2.

Many thanks to you.


What I have done:

library(ggplot2)

age <- c(rnorm(100, 1.5, 1), rnorm(100, 5, 1))
sex <- c(rep("F", 100), rep("M", 100))
mydata <- cbind(age, sex)
mydata <- as.data.frame(mydata)
head(mydata)


qplot(age, data = mydata, geom = "histogram", fill = sex, xlab = "age",
      ylab = "count", alpha = I(0.5))


Best,


Mao J-F

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Function similar to combine.levels in Hmisc package

2010-07-09 Thread Frank E Harrell Jr

On 07/09/2010 05:33 PM, Saeed Abu Nimeh wrote:

Is there a function similar to combine.levels (in the Hmisc package)
that combines the levels of factors, but not based on their frequency?
Alternatively, I am looking into using the significance of the dummy
variables of factors based on their Pr(>|t|) value in the linear
model, then deleting the non-significant levels. Any other
suggestions?
Thanks,
Saeed


That is not a valid statistical procedure.  You will not have the 
correct d.f. in the final F test.


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] rms::ols I(.) in formulas

2010-07-02 Thread Frank E Harrell Jr
ols has always been fussy about this.  And don't use poly; use pol with 
rms/Design.


Frank

On 07/02/2010 07:28 AM, Peter Ehlers wrote:

Otto,

The current version of ols() is fairly fussy about the
way the predictors are used. I'm not fond of the I()
construction anyway and so I would either use poly()
or define a new predictor as you suggest in your
original post.

See also this thread:

https://stat.ethz.ch/pipermail/r-help/2010-June/241587.html


-Peter Ehlers


On 2010-07-02 4:51, Otto Kässi wrote:

Hi, Lexi!

I am aware that lm() is the standard way to do ols regression in R.
The reason why I opted for rms::ols() is that later on in my work I
need some rms functions which are not available for lm().

In retrospect, I should have mentioned this already in my original
post. Nonetheless, thanks for a good suggestion :-)

Br,
OK


On Fri, Jul 2, 2010 at 1:35 PM, Setlhare
Lekgatlhamang setlha...@bob.bw wrote:

Try this:
lm(y ~ X + I(X^2), data = dd) # this runs OLS regression and it worked
for me

Hope it helps

Lexi

-Original Message-
From: r-help-boun...@r-project.org
[mailto:r-help-boun...@r-project.org] On Behalf Of Otto Kässi
Sent: Thursday, July 01, 2010 3:28 PM
To: r-help@r-project.org
Subject: [R] rms::ols I(.) in formulas

Dear R-helpers,

To start I would like to thank Prof. Harrell for package rms. It is
one of the most useful packages for R that I have encountered.

Turning to my problem, I encountered a surprising issue when working
with rms::ols. It seems that insulating terms in a formula with
I() occasionally triggers a bug.
See the example below:
library(rms)
x <- rnorm(100)
y <- rnorm(100)
dd <- data.frame(cbind(x, y))
ols(y ~ x + I(x^2), data = dd)

The ols() function gives the error:
Error in if (!length(fname) || !any(fname == zname)) { :
missing value where TRUE/FALSE needed

Has anyone else encountered something similar? Is this a bug or does
this behavior have a reason?

There are of course trivial workarounds: one can either use poly(x, 2)
or save x^2 as a new column to dd, but trying to debug this was a
pain.

With kind regards,
Otto Kassi



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Visualization of coefficients

2010-07-02 Thread Frank E Harrell Jr
Please note that Statlib is about 10 years out of date with respect to 
my software.


See http://biostat.mc.vanderbilt.edu/Rrms

Frank


On 07/02/2010 01:12 PM, Tal Galili wrote:

BTW, another visualization that might be useful in your case is the
nomogram (http://en.wikipedia.org/wiki/Nomogram):
http://lib.stat.cmu.edu/S/Harrell/help/Design/html/nomogram.html

(I remember first encountering it in a lecture by Frank Harrell and
being very happy about the discovery.)



Tal

Contact
Details:---
Contact me: tal.gal...@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
--




On Fri, Jul 2, 2010 at 9:10 AM, Wincent ronggui.hu...@gmail.com wrote:


Dear all,

I am trying to show a subset of coefficients in my presentation. It seems
that a standard table is not a good way to go. I found that figure 9
(page 9) in this file (

http://www.destatis.de/jetspeed/portal/cms/Sites/destatis/Internet/DE/Content/Wissenschaftsforum/Kolloquien/VisualisierungModellierung__Beitrag,property=file.pdf
) looks pretty good. I wonder if there is a function for such a plot,
or any suggestions on how to present statistical models in a
presentation?

Thank you.

--
Wincent Rong-gui HUANG
Doctoral Candidate
Dept of Public and Social Administration
City University of Hong Kong
http://asrr.r-forge.r-project.org/rghuang.html



--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ROC curve in R

2010-07-01 Thread Frank E Harrell Jr

On 07/01/2010 01:33 PM, Changbin Du wrote:

Read the ROCR package, it is very good.
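
A minimal sketch (column names taken from the question; mydata and the
assumption that Reference is 0/1 and Variation is a numeric score are mine):

library(ROCR)
pred <- prediction(predictions = mydata$Variation, labels = mydata$Reference)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
abline(0, 1, lty = 2)   # chance line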


Just be sure you really need an ROC curve.  More often than not it gets 
in the way of understanding.


Frank







On Thu, Jul 1, 2010 at 9:50 AM, ashu6886 ashu.infy.m...@gmail.com wrote:



Hi,

I have a fairly large amount of genomic data. I have created a data frame
which has Reference as one column and Variation as another. I want to
plot a ROC curve based on these 2 columns. I have searched the R manual but
could not understand it. Can anybody help me with the R code for plotting a
ROC curve?

Thnx
ashu6886
--
View this message in context:
http://r.789695.n4.nabble.com/ROC-curve-in-R-tp2275431p2275431.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.








--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic regression with multiple imputation

2010-06-30 Thread Frank E Harrell Jr
There are titanic datasets in R binary format at 
http://biostat.mc.vanderbilt.edu/DataSets


Note that the aregImpute function in the Hmisc package streamlines many 
of the steps, in conjunction with the fit.mult.impute function.
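
A minimal sketch of that route (assuming the titanic data frame read in
below; untested here):

library(Hmisc)
library(rms)
set.seed(2)
a <- aregImpute(~ survived + pclass + sex + age, data = titanic, n.impute = 10)
f <- fit.mult.impute(survived ~ pclass + sex + age, lrm, a, data = titanic)
f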


Frank


On 06/30/2010 05:02 AM, Chuck Cleland wrote:

On 6/30/2010 1:14 AM, Daniel Chen wrote:

Hi,

I am a long time SPSS user but new to R, so please bear with me if my
questions seem to be too basic for you guys.

I am trying to figure out how to analyze survey data using logistic
regression with multiple imputation.

I have survey data of about 200,000 cases and I am trying to predict the
odds ratio of a dependent variable using 6 categorical independent variables
(dummy-coded). Approximately 10% of the cases (~20,000) have missing data
in one or more of the independent variables. The percentage of missing
values ranges from 0.01% to 10% for the independent variables.

My current thinking is to conduct a logistic regression with multiple
imputation, but I don't know how to do it in R. I searched the web but
couldn't find instructions or examples on how to do this. Since SPSS is
hopeless with missing data, I have to learn to do this in R. I am new to R,
so I would really appreciate if someone can show me some examples or tell me
where to find resources.


   Here is an example using the Amelia package to generate imputations
and the mitools and mix packages to make the pooled inferences.

titanic <- read.table("http://lib.stat.cmu.edu/S/Harrell/data/ascii/titanic.txt",
sep=',', header=TRUE)

set.seed(4321)

titanic$sex[sample(nrow(titanic), 10)] <- NA
titanic$pclass[sample(nrow(titanic), 10)] <- NA
titanic$survived[sample(nrow(titanic), 10)] <- NA

library(Amelia) # generate multiple imputations
library(mitools) # for MIextract()
library(mix) # for mi.inference()

titanic.amelia <- amelia(subset(titanic,
select=c('survived','pclass','sex','age')),
  m=10, noms=c('survived','pclass','sex'),
emburn=c(500,500))

allimplogreg <- lapply(titanic.amelia$imputations,
function(x){glm(survived ~ pclass + sex + age, family=binomial, data = x)})

mice.betas.glm <- MIextract(allimplogreg, fun=function(x){coef(x)})
mice.se.glm <- MIextract(allimplogreg, fun=function(x){sqrt(diag(vcov(x)))})

as.data.frame(mi.inference(mice.betas.glm, mice.se.glm))

# Or using only mitools for pooled inference

betas <- MIextract(allimplogreg, fun=coef)
vars <- MIextract(allimplogreg, fun=vcov)
summary(MIcombine(betas, vars))


Thank you!

Daniel

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Wilcoxon signed rank test and its requirements

2010-06-25 Thread Frank E Harrell Jr
The central limit theorem doesn't help.  It just addresses type I error,
not power.

Frank

On 06/25/2010 04:29 AM, Joris Meys wrote:
 As a remark on your histogram : use less breaks! This histogram tells
 you nothing. An interesting function is ?density , eg :
 
 x <- rnorm(250)
 hist(x, freq=FALSE)
 lines(density(x), col="red")
 
 See also this ppt, a very nice and short introduction to graphics in R :
 http://csg.sph.umich.edu/docs/R/graphics-1.pdf
 
 2010/6/25 Atte Tenkanen atte...@utu.fi:
 Is there anything for me?

 There is a lot of data, n=2418, but there are also a lot of ties.
 My sample n≈250-300
 
 You should think about the central limit theorem. Actually, you can
 just use a t-test to compare means, as with those sample sizes the
 mean is almost certainly normally distributed.

 I would like to test whether the mean of the sample differs significantly
 from the population mean.

 According to probability theory, this will be so in 5% of the cases if
 you repeat your sampling infinitely. But as David asked: why on earth
 do you want to test that?
 
 cheers
 Joris
 


-- 
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Wilcoxon signed rank test and its requirements

2010-06-25 Thread Frank E Harrell Jr
You still are stating the effect of the central limit theorem
incorrectly.  Please see my previous note.

Frank

On 06/25/2010 10:27 AM, Joris Meys wrote:
 2010/6/25 Frank E Harrell Jr f.harr...@vanderbilt.edu:
 The central limit theorem doesn't help.  It just addresses type I error,
 not power.

 Frank
 
 I don't think I stated otherwise. I am aware of the fact that the
 Wilcoxon has an asymptotic relative efficiency greater than 1 compared
 to the t-test in the case of skewed distributions. Apologies if I caused
 more confusion.

 The problem with the Wilcoxon is twofold, as far as I understand these
 data:
 - there are quite some ties
 - the Wilcoxon assumes under the null that the distributions are the
 same, not only the location. The influence of unequal variances and/or
 shapes of the distribution is enhanced in the case of unequal sample
 sizes.

 The central limit theorem means that:
 - the t-test will do correct inference in the presence of ties
 - unequal variances can be taken into account using the modified
 t-test, both in the case of equal and unequal sample sizes
 
 For these reasons, I would personally use the t-test for comparing two
 samples from the described population. Your mileage may vary.
 
 Cheers
 Joris
 

 On 06/25/2010 04:29 AM, Joris Meys wrote:
 As a remark on your histogram : use less breaks! This histogram tells
 you nothing. An interesting function is ?density , eg :

 x <- rnorm(250)
 hist(x, freq=FALSE)
 lines(density(x), col="red")

 See also this ppt, a very nice and short introduction to graphics in R :
 http://csg.sph.umich.edu/docs/R/graphics-1.pdf

 2010/6/25 Atte Tenkanen atte...@utu.fi:
 Is there anything for me?

 There is a lot of data, n=2418, but there are also a lot of ties.
 My sample n≈250-300

 You should think about the central limit theorem. Actually, you can
 just use a t-test to compare means, as with those sample sizes the
 mean is almost certainly normally distributed.

 I would like to test whether the mean of the sample differs significantly
 from the population mean.

 According to probability theory, this will be so in 5% of the cases if
 you repeat your sampling infinitely. But as David asked: why on earth
 do you want to test that?

 cheers
 Joris



 --
 Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
  Department of Biostatistics   Vanderbilt University

 
 
 


-- 
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Wilcoxon signed rank test and its requirements

2010-06-24 Thread Frank E Harrell Jr

On 06/24/2010 12:40 PM, David Winsemius wrote:


On Jun 23, 2010, at 9:58 PM, Atte Tenkanen wrote:


Thanks. What I should have asked is:

how do you test whether the data are symmetric enough?
If they are not, is it OK to use some data transformation?

when it is said:

"The Wilcoxon signed rank test does not assume that the data are
sampled from a Gaussian distribution. However it does assume that the
data are distributed symmetrically around the median. If the
distribution is asymmetrical, the P value will not tell you much about
whether the median is different from the hypothetical value."


You are being misled. Simply finding a statement on a statistics
software website, even one as reputable as GraphPad (???), does not mean
that it is necessarily true. My understanding (confirmed by reviewing
"Nonparametric Statistical Methods for Complete and Censored Data" by M.
M. Desu and Damaraju Raghavarao) is that the Wilcoxon signed-rank test does
not require that the underlying distributions be symmetric. The above
quotation is highly inaccurate.



To add to what David and others have said, look at the kernel that the 
U-statistic associated with the WSR test uses: the indicator (0/1) of xi 
+ xj > 0.  So WSR tests H0: p = 0.5, where p = the probability that the 
average of a randomly chosen pair of values is positive.  [If there are 
ties this probably needs to be worded as P[xi + xj > 0] = P[xi + xj < 
0], i neq j.]
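
A quick illustration (my sketch, not from the post) of estimating that p:

set.seed(1)
x <- rnorm(20, 0.3)
s <- outer(x, x, "+")
mean(s[upper.tri(s)] > 0)   # estimates P[xi + xj > 0] over pairs i < j
wilcox.test(x)              # compare with the signed rank test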


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] glm

2010-06-23 Thread Frank E Harrell Jr

On 06/23/2010 06:30 AM, Samuel Okoye wrote:

Thank you very much.

Is there a function in R which does penalized cubic
regression, say spl.plr(), such that if I have weeks = 1:9 I can use
something like pp <- spl.plr(weeks, c(1,3,5,7)) and the fit will be
linear for 8 and 9? Does rcs() in library(Design) do this?


rcs in Design (which you should replace with rms before long) does 
unpenalized cubic splines that are linear in the tails.  You can also 
penalize such fits using the lrm and ols functions in rms/Design.  Also 
see the pspline function in the survival package, the splines package, 
and others.
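
For example, a minimal sketch with rms (my code, not from the thread):

library(rms)
set.seed(1)
weeks <- 1:9
y <- weeks + rnorm(9)
dd <- datadist(weeks); options(datadist = "dd")
f <- ols(y ~ rcs(weeks, 4))   # restricted cubic spline, linear in the tails
f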


Frank




Many thanks, Samuel

--- On Tue, 22/6/10, Joris Meysjorism...@gmail.com  wrote:

From: Joris Meysjorism...@gmail.com Subject: Re: [R] glm To:
Samuel Okoyesamu...@yahoo.com Cc: r-help@r-project.org Date:
Tuesday, 22 June, 2010, 9:50

On Tue, Jun 22, 2010 at 1:00 AM, Samuel Okoyesamu...@yahoo.com
wrote:

Hi,

I have the following data

data1 <- data.frame(count = c(0,1,1,2,4,5,13,16,14), weeks = 1:9,
                    treat = c(rep("1mg",3), rep("5mg",3), rep("10mg",3)))

and I am using

library(splines)

to fit

glm.m <- glm(count ~ bs(weeks) + as.factor(treat), family=poisson, data=data1)

and I am interested in predicting the count variable for weeks
10, 11 and 12 with treat 10mg and 15mg.


bad luck for you.

newdat <- data.frame(weeks = rep(10:12, each=2),
                     treat = rep(c("5mg","10mg"), times=3))

preds <- predict(glm.m, type="response", newdata=newdat, se.fit=TRUE)
cbind(newdat, preds)

gives as expected:

Warning message:
In bs(weeks, degree = 3L, knots = numeric(0), Boundary.knots = c(1L, :
  some 'x' values beyond boundary knots may cause ill-conditioned bases

  weeks treat       fit    se.fit residual.scale
1    10   5mg  5.934881  5.205426              1
2    10  10mg 12.041639  9.514347              1
3    11   5mg  4.345165  6.924663              1
4    11  10mg  8.816168 15.805171              1
5    12   5mg  2.781063  8.123436              1
6    12  10mg  5.642667 18.221007              1


Watch the standard errors on the predicted values. No, you shouldn't
predict outside your data space, especially when using splines. And
when interested in 15mg, well, you shouldn't put treatment as a
factor to start with.

Cheers Joris




__ R-help@r-project.org
mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do
read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Popularity of R, SAS, SPSS, Stata...

2010-06-21 Thread Frank E Harrell Jr
It may be worthwhile tracking citations in early adopter journals such 
as statistical methodology journals, then watching trends in later 
adopter subject matter journals (in medical research this might be JAMA, 
NEJM).


Frank


On 06/21/2010 08:11 AM, Kjetil Halvorsen wrote:

One should also take into account the other R list. For example, as of
today the number of subscribers to
R-help-es (R-help for spanish speakers) is 290, increasing.

Kjetil Halvorsen

On Sun, Jun 20, 2010 at 6:28 PM, Muenchen, Robert A (Bob)
muenc...@utk.edu  wrote:




-Original Message-
From: r-help-boun...@r-project.org

[mailto:r-help-boun...@r-project.org]

On Behalf Of Ted Harding
Sent: Sunday, June 20, 2010 3:42 PM
To: r-help@r-project.org
Subject: Re: [R] Popularity of R, SAS, SPSS, Stata...


I've given thought in the past to the question of estimating the R
user base, and came to the conclusion that it is impossible to get
an estimate of the number of users that one could trust (or even
put anything like a margin of error to).

I think one could get a number which represented a moderately
informative lower bound -- just count the number of different email
addresses that have ever posted to the R-help list. This will of
course include people who post (or have posted) from more than one
email address, and people who tried R for a while and then dropped
it, but my feeling is that these are likely to be outweighed by the
number of people who have used R but have never posted (for example
students who are getting their R help from their instructors, people
using R in a corporate context who are discouraged from posting to
public lists, etc.).


Ted, that's a very interesting suggestion. Do you know of a practical
way of getting that count?



The number of subscribers to R-help (currently about 10200) is
a definite lower bound for the number of R users, but many users
post to R-help without being subscribed.


10,200 is quite an amazing number! Here are the number of subscribers
to:

SAS-L     3,251
SPSSX-L   2,103
Statlist  1,847
S-PLUS  - haven't figured out how to get this yet

How did you get the R-help figure?



I would expect that the total number of different email addresses
that have posted to R-help would be considerably larger than 10200.

I don't think a Mark-Recapture approach is feasible.

Further, I don't know how one might take account of the fact that
some installations of R (e.g. on a corporate or institutional
or departmental server) may each be used by several users.


The server question in particular intrigues me. Research organizations
are stuffed with high performance clusters. The cost of all the
commercial packages is just incredible. Even at the heavily discounted
rate academia gets, they're still unaffordable. However, if queried we'd
find the commercial packages on them, but limited to 4 out of 2,500
nodes! You might see the reverse in industry, with one mainframe copy of
SAS serving hundreds of users.

Cheers,
Bob



Ted.


E-Mail: (Ted Harding)ted.hard...@manchester.ac.uk
Fax-to-email: +44 (0)870 094 0861
Date: 20-Jun-10   Time: 20:41:43
-- XFMail --


--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Extracting P-values from the lrm function in the rms library

2010-06-19 Thread Frank E Harrell Jr

On 06/19/2010 10:30 AM, David Winsemius wrote:


On Jun 19, 2010, at 7:45 AM, Christos Argyropoulos wrote:



Hi,

mod.poly3$coef/sqrt(diag(mod.poly3$var))

will give you the Wald stat values, so

pnorm(abs(mod.poly3$coef/sqrt(diag(mod.poly3$var))),lower.tail=F)*2

will yield the corresponding p-vals


It will, although it may appear as magic to those untutored in examining
R model objects. Josh B should also consider looking at the output of
str(mod.poly3) and then tracing through the logic used in the rms/Design
function print.lrm(). It's not a hidden function, so simply typing its
name at the console will let him see what steps Harrell uses. They are a
bit different, but are mathematically equivalent. Stripped of quite a
bit of code that is not essential in this case:

print.lrm.simple <- function (x, digits = 4)
{
# first a couple of utility formatting functions:
sg <- function(x, d) {
    oldopt <- options(digits = d)
    on.exit(options(oldopt))
    format(x)
}
rn <- function(x, d) format(round(as.single(x), d))
# Then the extraction of components from the model, x
cof <- x$coef # using name completion, since the full name is
# x$coefficients
vv <- diag(x$var) # the diagonal of the variance-covariance matrix
z <- cof/sqrt(vv) # the Wald Z value
stats <- cbind(sg(cof, digits),
               sg(sqrt(vv), digits),
               rn(cof/sqrt(vv), 2))
stats <- cbind(stats,
               # This is the step that calculates the p-values
               rn(1 - pchisq(z^2, 1), 4))
#
dimnames(stats) <- list(names(cof), c("Coef", "S.E.", "Wald Z", "P"))
print(stats, quote = FALSE)
cat("\n")
# the regular print.lrm does not return anything, ... it just prints,
# but if you add this line you will be able to access the components of:
invisible(stats)
}

print.lrm.simple(mod.poly3)[ , 4]  # still prints first
              Coef    S.E.    Wald Z  P
Intercept -5.68583 5.23295 -1.09 0.2772
x1         1.87020 2.14635  0.87 0.3836
x1^2      -0.42494 0.48286 -0.88 0.3788
x1^3       0.02845 0.03120  0.91 0.3618
x2         3.49560 3.54796  0.99 0.3245
x2^2      -0.94888 0.82067 -1.16 0.2476
x2^3       0.06362 0.05098  1.25 0.2121

# the 4th column holds the p-values:
Intercept     x1   x1^2   x1^3     x2   x2^2   x2^3
   0.2772 0.3836 0.3788 0.3618 0.3245 0.2476 0.2121



For once all the solutions offered are incorrect.  They only work in the 
special case where each variable has only one degree of freedom.

Just do a <- anova(mod.poly3) and treat the result as a matrix.  You'll 
get the needed multiple degree of freedom test and P-value.
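
For example (a sketch):

a <- anova(mod.poly3)   # one multiple-d.f. Wald test per variable
a
m <- as.matrix(a)
m[, "P"]                # the P-values, including the TOTAL row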


Frank
--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Latex problem in Hmisc (3.8-1) and Mac Os X with R 2.11.1

2010-06-18 Thread Frank E Harrell Jr

On 06/18/2010 12:32 AM, moleps islon wrote:

Dear all,

I did post this more or less identical mail in a follow up to another
question I posted, but under another heading. I try again, but now
under the correct header.


Upon running this code (the latex function from the Hmisc library), I
believe the call to summary.formula is all right and produces wonderful
tables, but the latex command results in a correctly formatted table in
which all the numbers and the test columns are wrong. I've pasted in
both the R code and the resulting LaTeX code, annotated with comments
from the run. Does the same code produce correct cell entries in other
installations?

//M


I could reproduce your problem.  We'll get to work on it.
Frank





library(Hmisc)

options(digits=3)
set.seed(173)
sex <- factor(sample(c("m","f"), 500, rep=TRUE))
age <- rnorm(500, 50, 5)
treatment <- factor(sample(c("Drug","Placebo"), 500, rep=TRUE))
symp <- c('Headache','Stomach Ache','Hangnail',
          'Muscle Ache','Depressed')
symptom1 <- sample(symp, 500, TRUE)
symptom2 <- sample(symp, 500, TRUE)
symptom3 <- sample(symp, 500, TRUE)
Symptoms <- mChoice(symptom1, symptom2, symptom3, label='Primary Symptoms')
table(Symptoms)
table(symptom1, symptom2)
f <- summary(treatment ~ age + sex + Symptoms, method="reverse", test=TRUE)
g <- summary(treatment ~ age + sex + symptom1, method="reverse", test=TRUE)

latex(g)

latex(g, file="")

% latex.default(cstats, title = title, caption = caption, rowlabel =
% rowlabel, col.just = col.just, numeric.dollar = FALSE,
% insert.bottom = legend, rowname = lab, dcolumn = dcolumn,
% extracolheads = extracolheads, extracolsize = Nsize, ...)
%
\begin{table}[!tbp]
  \caption{Descriptive Statistics by treatment\label{g}}
  \begin{center}
  \begin{tabular}{lccc}\hline\hline
\multicolumn{1}{l}{}&\multicolumn{1}{c}{Drug}&\multicolumn{1}{c}{Placebo}&\multicolumn{1}{c}{Test
Statistic}\tabularnewline

&\multicolumn{1}{c}{{\scriptsize
$N=263$}}&\multicolumn{1}{c}{{\scriptsize $N=237$}}&\tabularnewline
\hline
age&1&1&4\tabularnewline
sex~:~m&6&7&2\tabularnewline
symptom1~:~Depressed&4&3&3\tabularnewline
Hangnail&5&6&1\tabularnewline
Headache&4&2&1\tabularnewline
Muscle~Ache&3&5&1\tabularnewline
Stomach~Ache&2&4&1\tabularnewline
\hline
\end{tabular}

\end{center}

\noindent {\scriptsize $a$\ }{$b$\ }{\scriptsize $c$\ } represent the
lower quartile $a$, the median $b$, and the upper quartile $c$\ for
continuous variables.\\Numbers after percents are
frequencies.\\\indent Tests used:\\\textsuperscript{\normalfont
1}Wilcoxon test; \textsuperscript{\normalfont 2}Pearson test
\end{table}


### Then I did another example from Harrell's "Statistical Tables and Plots" document

rm(list=ls())



library(Hmisc)
getHdata(prostate)
# Variables in prostate had units in ( ) inside variable labels. Move
# these units of measurement to separate units attributes.
# wt is an exception: it has ( ) in its label but this does not denote units.
# Also make hg have a legal R plotmath expression
prostate <- upData(prostate, moveUnits=TRUE,
                   units=c(wt="", hg="g/100*ml"),
                   labels=c(wt="Weight Index = wt(kg)-ht(cm)+200"))

attach(prostate)
stage <- factor(stage, 3:4, c("Stage 3","Stage 4"))
s6 <- summary(stage ~ rx+age+wt+pf+hx+sbp+dbp+ekg+hg+sz+sg+ap+bm,
              method="reverse", overall=TRUE, test=TRUE)
options(digits=2)
w <- latex(s6, size="smaller[3]", outer.size="smaller",
           Nsize="smaller", long=TRUE, prmsd=TRUE,
           msdsize="smaller", middle.bold=TRUE, ctable=TRUE)


## This refused to run (as long as ctable=TRUE was included), but without it

latex(s6)

## I do get a nicely formatted table, but again the numbers are all wrong... Also

## latex(s6, long=TRUE, prmsd=TRUE, msdsize="smaller", middle.bold=TRUE)

## makes no difference from latex(s6) alone with regards to formatting...



Quite frustrating - any suggestions?


//M

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Fitting a polynomial using lrm from the Design library

2010-06-18 Thread Frank E Harrell Jr

On 06/18/2010 04:36 PM, David Winsemius wrote:


On Jun 18, 2010, at 5:16 PM, David Winsemius wrote:



On Jun 18, 2010, at 5:13 PM, David Winsemius wrote:



On Jun 18, 2010, at 12:02 PM, Josh B wrote:


Hi all,

I am looking to fit a logistic regression using the lrm function
from the Design library. I am interested in this function because I
would like to obtain pseudo-R2 values (see
http://tolstoy.newcastle.edu.au/R/help/02b/1011.html).

Can anyone help me with the syntax?

If I fit the model using the stats library, the code looks like this:
model <- glm(x$trait ~ x$PC1 + I((x$PC1)^2) + I((x$PC1)^3), family =
binomial)

What would be the equivalent syntax for the lrm function?


Not sure if the code you gave above produces an orthogonal set, but
perhaps this will be meaningful to some of r-help's readers (but not
necessarily to me):

require(Design)
mod.poly3 <- lrm( trait ~ poly(PC1, 3), data=x)

This does report results, but I'm not sure how you would interpret.
(See below for one attempt)

I think Harrell would probably recommend using restricted cubic
splines, however.

mod.rcs3 <- lrm( trait ~ rcs(PC1, 3), data=x)

For plotting with Design/Hmisc functions, you will get better results
with the datadist facilities.
 ddx <- datadist(x)
 options(datadist="ddx")
 plot(mod3, PC1=NA)


Forgot to fix this:
plot(mod.rcs3, PC1=NA)


# Perfectly sensible plot which includes the OR=0 line that would be
the theoretically ideal result.


That would be log(odds) = 0 or OR=1. I wonder how many other errors I
committed?


Best to convert to the rms package - see 
http://biostat.mc.vanderbilt.edu/Rrms for differences with Design.


If using ordinary polynomials, use e.g.

mod.poly3 <- lrm(trait ~ pol(PC1, 3), data=x) # note pol, not poly

Frank


--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R licensing query

2010-06-17 Thread Frank E Harrell Jr
Pardon my English, but you're working for idiots.  I'd look elsewhere if 
there are other options.  IT departments should be here to help get 
things done, not to help prevent good work from being done.


Frank

On 06/17/2010 04:28 AM, McAllister, Gina wrote:

I have recently started a new job at an NHS hospital in Scotland.  Since
I took up this post 6 months ago I have had an ongoing dispute with the
IT security dept., who refuse to install R on my computer.  I previously
worked in another branch of the NHS where R was widely used, and yet
there is nothing I can say which will persuade the IT dept. here to even
visit the website!  With some help from our head of department, they
have now agreed to install R, but only if they receive an email from 'R'
ensuring that it is licensed for commercial use, is compatible with
Windows XP, and will not affect the networked computer system here.  My
only other option for data analysis is Excel; we have no money for
S-PLUS or any other stats programme.  Can anyone suggest anything or
send me a suitable email?

Many thanks,
Georgina




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Is there a non-parametric repeated-measures Anova in R ?

2010-06-17 Thread Frank E Harrell Jr

On 06/17/2010 07:12 PM, David Winsemius wrote:


On Jun 16, 2010, at 1:43 PM, Tal Galili wrote:


Hello Jeremy,
Thank you for replying.

I came across friedman test (I even wrote and published R code to easily
perform a post-hoc analysis of friedman
testhttp://www.r-statistics.com/2010/02/post-hoc-analysis-for-friedmans-test-r-code/

).
But what I am after is *multi-way* repeated-measures anova. Thank you for
your reply which allowed me to clarify my intentions.


Many years ago I remember reading advice in Conover and Iman's
"Practical Nonparametric Statistics" that one could apply a rank
transformation to the dependent and independent variables and then run a
typical anova test. This is probably inferior in many ways to doing
quantile regression (I don't know if this has a repeated-measures
extension) or to the use of robust standard errors for examining
inferential issues in regression models, but it certainly represents a
useful consistency check when all you are worried about is influential
points in skewed distributions. I cannot comment on how it would
theoretically behave in a repeated-measures analysis, but I suspect that
there are readers of this list who can comment with greater authority,
and I invite them to do so.



David - the rank transform method doesn't handle interactions properly, 
among other problems.  The proportional odds model is the logical 
extension of the Wilcoxon-Kruskal-Wallis approach.  It relies only on 
the rank of Y and reduces to the regular nonparametric tests as special 
cases.
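
For instance, a two-group sketch (my example, not from the thread):

library(rms)
set.seed(1)
g <- factor(rep(c("a", "b"), each = 30))
y <- round(rnorm(60) + (g == "b"), 1)   # rounding induces some ties
lrm(y ~ g)   # proportional odds fit; its score test parallels the Wilcoxon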


Frank
--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Is there a non-parametric repeated-measures Anova in R ?

2010-06-16 Thread Frank E Harrell Jr
The Friedman test lacks power.  When there are only 2 blocks it reduces 
to the inefficient sign test.


Frank

On 06/16/2010 12:43 PM, Tal Galili wrote:

Hello Jeremy,
Thank you for replying.

I came across friedman test (I even wrote and published R code to easily
perform a post-hoc analysis of friedman
testhttp://www.r-statistics.com/2010/02/post-hoc-analysis-for-friedmans-test-r-code/
).
But what I am after is *multi-way* repeated-measures anova.  Thank you for
your reply which allowed me to clarify my intentions.

Best,
Tal




Contact
Details:---
Contact me: tal.gal...@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
--




On Wed, Jun 16, 2010 at 8:30 PM, Jeremy Miles jeremy.mi...@gmail.com wrote:


It's possible to use the ordinal regression model if your data are
ordered categories.  The standard non-parametric test is the Friedman
test.

?friedman.test

Jeremy


On 16 June 2010 10:22, Tal Galili tal.gal...@gmail.com wrote:

Hello Prof. Harrell and dear R-help mailing list,

I wish to perform a non-parametric repeated-measures anova.

If what I read online is true, this could be achieved using a mixed Ordinal
Regression model (a.k.a. Proportional Odds Model).
I found two packages that seem relevant, but couldn't find any vignette on
the subject:
http://cran.r-project.org/web/packages/repolr/
http://cran.r-project.org/web/packages/ordinal/

So being new to the subject matter, I was hoping for some directions from
people here.

Are there any tutorials/suggested readings on the subject?  Even better, can
someone suggest simple example code for how to run and analyse this in R
(e.g. a non-parametric repeated-measures anova)?

I waited a week to repost this question.  If I should have waited longer, or
should not have reposted this at all - then I am truly sorry.

Thanks for any help,
Tal







Contact
Details:---
Contact me: tal.gal...@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)



--






[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide

http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.





--
Jeremy Miles
Psychology Research Methods Wiki: www.researchmethodsinpsychology.com



[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ols function in rms package

2010-06-08 Thread Frank E Harrell Jr

On 06/08/2010 05:29 AM, Mark Seeto wrote:



On 06/06/2010 10:49 PM, Mark Seeto wrote:

Hello,

I have a couple of questions about the ols function in Frank Harrell's
rms
package.

Is there any way to specify variables by their column number in the data
frame rather than by the variable name?

For example,

library(rms)
x1 <- rnorm(100, 0, 1)
x2 <- rnorm(100, 0, 1)
x3 <- rnorm(100, 0, 1)
y <- x2 + x3 + rnorm(100, 0, 5)
d <- data.frame(x1, x2, x3, y)
rm(x1, x2, x3, y)
lm(y ~ d[,2] + d[,3], data = d)  # This works
ols(y ~ d[,2] + d[,3], data = d) # Gives error
Error in if (!length(fname) || !any(fname == zname)) { :
missing value where TRUE/FALSE needed

However, this works:
ols(y ~ x2 + d[,3], data = d)

The reason I want to do this is to program variable selection for
bootstrap model validation.

A related question: does ols allow y ~ . notation?

lm(y ~ ., data = d[, 2:4])  # This works
ols(y ~ ., data = d[, 2:4]) # Gives error
Error in terms.formula(formula) : '.' in formula and no 'data' argument

Thanks for any help you can give.

Regards,
Mark


Hi Mark,

It appears that you answered the questions yourself.  rms wants real
variables or transformations of them.  It makes certain assumptions
about names of terms.   The y ~ . should work though; sometime I'll have
a look at that.

But these are the small questions compared to what you really want.  Why
do you need variable selection, i.e., what is wrong with having
insignificant variables in a model?  If you indeed need variable
selection see if backwards stepdown works for you.  It is built-in to
rms bootstrap validation and calibration functions.

Frank



Thank you for your reply, Frank. I would have reached the conclusion
that rms only accepts real variables had this not worked:
ols(y ~ x2 + d[,3], data = d)


Hi Mark - that probably worked by accident.



The reason I want to program variable selection is so that I can use the
bootstrap to check the performance of a model-selection method. My
co-workers and I have used a variable selection method which combines
forward selection, backward elimination, and best subsets (the forward and
backward methods were run using different software).

I want to do bootstrap validation to (1) check the over-optimism in R^2,
and (2) justify using a different approach, if R^2 turns out to be very
over-optimistic. The different approach would probably be data reduction
using variable clustering, as you describe in your book.


The validate.ols function, which calls the predab.resample function, may 
give you some code to start with.  Note however that the performance of 
the approach you are suggesting has already been shown to be poor in 
many cases.  You might run the following in parallel: full model fits 
and penalized least squares using penalties selected by AIC (using 
special arguments to ols along with the pentrace function).
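
As a sketch (using your simulated d from earlier in the thread):

f <- ols(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)
p <- pentrace(f, seq(0, 20, by = 0.5))   # scan candidate penalties
f.pen <- update(f, penalty = p$penalty)  # refit with the chosen penalty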


Frank



Regards,
Mark



--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ols function in rms package

2010-06-07 Thread Frank E Harrell Jr

On 06/06/2010 10:49 PM, Mark Seeto wrote:

Hello,

I have a couple of questions about the ols function in Frank Harrell's rms
package.

Is there any way to specify variables by their column number in the data
frame rather than by the variable name?

For example,

library(rms)
x1 <- rnorm(100, 0, 1)
x2 <- rnorm(100, 0, 1)
x3 <- rnorm(100, 0, 1)
y <- x2 + x3 + rnorm(100, 0, 5)
d <- data.frame(x1, x2, x3, y)
rm(x1, x2, x3, y)
lm(y ~ d[,2] + d[,3], data = d)  # This works
ols(y ~ d[,2] + d[,3], data = d) # Gives error
Error in if (!length(fname) || !any(fname == zname)) { :
   missing value where TRUE/FALSE needed

However, this works:
ols(y ~ x2 + d[,3], data = d)

The reason I want to do this is to program variable selection for
bootstrap model validation.

A related question: does ols allow y ~ . notation?

lm(y ~ ., data = d[, 2:4])  # This works
ols(y ~ ., data = d[, 2:4]) # Gives error
Error in terms.formula(formula) : '.' in formula and no 'data' argument

Thanks for any help you can give.

Regards,
Mark


Hi Mark,

It appears that you answered the questions yourself.  rms wants real 
variables or transformations of them.  It makes certain assumptions 
about names of terms.   The y ~ . should work though; sometime I'll have 
a look at that.


But these are the small questions compared to what you really want.  Why 
do you need variable selection, i.e., what is wrong with having 
insignificant variables in a model?  If you indeed need variable 
selection see if backwards stepdown works for you.  It is built-in to 
rms bootstrap validation and calibration functions.
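
For example, a minimal sketch with your simulated d above:

f <- ols(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)
validate(f, B = 200, bw = TRUE)  # bootstrap validation with backwards stepdown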


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Continous variables with implausible transformation?

2010-06-03 Thread Frank E Harrell Jr

On 06/03/2010 11:32 AM, Joris Meys wrote:

You're right, it is the same. Using I() won't work, for the same reason
sqrt doesn't, so:


x2 <- x^2
lrm(y ~ x + x2)


Thanks for the correction.
Cheers
Joris

On Thu, Jun 3, 2010 at 6:14 PM, Bert Gunter gunter.ber...@gene.com wrote:


Below. -- Bert

Bert Gunter
Genentech Nonclinical Biostatistics
--
But you wrote "linear + square". Don't you mean:
lrm(Y ~ x + x^2)

--- I believe this is the same as lrm(Y ~ x).
You must protect the x^2 via

lrm(Y ~ x + I(x^2))


But don't use that construct.  Use lrm(Y ~ pol(x, 2))

Frank
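
A minimal sketch of the difference (hypothetical simulated data; anova() 
then reports pooled and nonlinearity tests for x):

library(rms)
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-1 + x + 0.5*x^2))
f <- lrm(y ~ pol(x, 2))  # quadratic kept as one grouped term
anova(f)                 # overall association + test of nonlinearity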



--

--

Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent



--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ICD9 codes

2010-05-29 Thread Frank E Harrell Jr
Thanks for both your note Ted and Vishwanath.  I had no idea this was 
going on.


Sincerely,
Frank

On 05/29/2010 05:43 AM, (Ted Harding) wrote:

It is perhaps time to bring to people's general attention that
there is an issue with postings from gmail.com addresses being
held for moderation because "The headers matched a filter rule".
It is not only gmail.com, but there is a marked excess of these.
Vishwanath (@gmail.com) is one of many victims.

There is a group of about a dozen people who act as moderators,
approving or discarding messages which have been held. Reasons
for holding are almost always either "Posting from Non-Member"
or "Matched a Filter Rule". We do this somewhat non-systematically,
looking at the list of held messages on a convenience basis.
However, it would be unlikely that any valid message would be
held for more than an hour or so; usually it would be less.

The situation has been discussed recently amongst the moderators,
including the list manager Martin Maechler. We are undecided
about the best action (if any) -- while the excessive false
trapping of gmail.com postings is at least an inconvenience
for the trapped, and sometimes leads to visible distress or
embarrassment, the presence of the filter rules does prevent
spam from reaching the list.

Vishwanath's reaction in repeatedly re-posting is perhaps
understandable on the part of someone who is not familiar with
how the moderation system operates. In fact he posted the same
message 4 times (morning and evening on 27 May, and again on 28 May)
as can be seen from the archives. He is subscribed to the list,
so would normally receive copies of his own postings via the list
(and therefore know that it had got through) unless he has opted
to not receive his own postings. In that case, having received
the "held for moderation" message, not receiving his posting via
the list, and not thinking of looking in the archives, he could
form the impression that it/they had vanished into a black hole,
and try again.

Since we began to discuss this about a fortnight ago, I have
been keeping a count of those held messages which I have seen
myself. Not counting true spam, only genuine messages, the results
are as follows. "NM" denotes a posting from a non-member, "FR"
denotes a posting that matched a filter rule, "GM" denotes a
posting from a gmail.com address, "NGM" one not from gmail.com.

          FR     NM
     +------+------
 GM  |  45  |   4
     +------+------
 NGM |  19  |  24

Fisher test: P=7.3e-07, OR=13.7, CI=(4.0, 62.0)

About 30% of R-help subscribers have gmail.com addresses, so they
are clearly over-represented in the FR group (70%)!

It is suspected that the gmail.com filter rule may perhaps be
triggered by HTML, though the reason is not yet clear.

Meanwhile, a reminder to people who receive notification that
their posting has been held for moderation: Please check the
archives to see whether your message has reached the list within
a reasonable time (say 2 hours -- there can be a delay before a
message is placed in the archives) before trying to do anything
about it: https://stat.ethz.ch/pipermail/r-help

Hoping this helps to clarify an issue which can lead to unwanted
consequences.

Ted.

On 29-May-10 05:42:27, Vishwanath Sindagi wrote:

Dear Prof Frank Harrel:

I am extremely sorry for having reposted the same question numerous
times. Earlier when I had posted I got  replies stating that my post
had matched a filter rule and hence was being held by the moderator.
So I assumed that the question was never posted and I reposted with
different subject lines just to make sure that it gets posted.

I sincerely apologise for the inconvenience caused.

Regards,
Vishwanath

On Sat, May 29, 2010 at 4:10 AM, Frank E Harrell Jr
<f.harr...@vanderbilt.edu> wrote:

Your notes are bordering on harassment. Do you expect that everyone who
reads this list will reply "I do not have anything that will help you"
if they don't? By my count this is your 4th note asking for this help.

That being said I hope that you do find help somewhere or implement it
yourself and share the result, as your question is an important one.

Also, please be sure to state your affiliation on your notes.

Frank

On 05/28/2010 02:19 PM, Vishwanath Sindagi wrote:


Hello:

I am working on getting some statistics related to clinical trials
and
stuff. I have to work with ICD9 codes.

Is anyone aware of any R method that deals with ICD9 code
verification and manipulation?

Thanks
Vishwanath




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University





__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] median test

2010-05-28 Thread Frank E Harrell Jr

Linda,

There are different views about whether someone doing statistical 
analysis should first take a certain number of statistics course.  I 
think for your issue some background information would certainly help. 
You have not correctly interpreted the paper.  The main point is that 
for most cases likely to be seen in practice, the median test is 
tantamount to discarding about 1/3 of your animals.  The Wilcoxon test 
is a good choice for a huge variety of situations.  Even if the data are 
Gaussian it has efficiency 3/pi whereas the median test has efficiency 
2/pi in that case.


Frank
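
A minimal sketch with the data from this thread (the many ties make the 
exact p-value unavailable, so expect a warning and a normal approximation):

group1 <- c(2, 2, 2, 1, 4, 3, 1, 1)
group2 <- c(3, 1, 3, 1, 4, 1, 1, 1, 7, 1, 1, 1, 1, 1, 2)
wilcox.test(group1, group2)  # Wilcoxon rank-sum test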

On 05/28/2010 08:58 AM, linda Porz wrote:

Hello,

I can't have different data; these data came from mice that have lived under
certain conditions in the lab! I have just read the mentioned publication
"Should the median test be retired from general use?" It says in the
conclusion "If one felt that the data could not come from a Cauchy or slash
distribution, the Wilcoxon should be used."! What is this? Is there any
test in R for a Cauchy or slash distribution? Can I use the unpaired
Wilcoxon, or do I have Cauchy-distributed data?

Many thanks,
Linda

2010/5/27 Joshua Wiley <jwiley.ps...@gmail.com>


Hello Linda,

The problem is actually the median of your data.  What the function
median.test() does first is combine both groups.  Look at this:

median(c(group1, group2))

the median is 1, but the lowest value of the groups is also 1.  So
when the function does the logical check z < m where z = c(group1,
group2) and m is the median, there are no values that are less than
the median value.  Therefore there is only 1 level, and the fisher
test fails.

You would either need different data or adjust the function to be:

fisher.test(z <= m, g)$p.value

that way it's less than or equal to the median.

Hope that helps,

Josh

On Thu, May 27, 2010 at 7:24 AM, linda Porz <linda.p...@gmail.com> wrote:

Hi all,

I have found the following function online

median.test <- function(y1, y2) {
  z <- c(y1, y2)
  g <- rep(1:2, c(length(y1), length(y2)))
  m <- median(z)
  fisher.test(z < m, g)$p.value
}

in

http://www.mail-archive.com/r-help@r-project.org/msg95278.html

I have the following data


group1 <- c(2, 2, 2, 1, 4, 3, 1, 1)
group2 <- c(3, 1, 3, 1, 4, 1, 1, 1, 7, 1, 1, 1, 1, 1, 2)
median.test(w1, group1)

[1] 1

median.test(group1,group2)

Error in fisher.test(z < m, g) : 'x' and 'y' must have at least 2 levels

I am very thankful in advance for any suggestion and help.

Regards,
Linda

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide

http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.





--
Joshua Wiley
Senior in Psychology
University of California, Riverside
http://www.joshuawiley.com/



[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] latex.rms and models fit with GLS

2010-05-28 Thread Frank E Harrell Jr

On 05/28/2010 03:49 PM, Dylan Beaudette wrote:

Hi,

I have fit a model using the rms package with the Gls() function.

Is there a way to get the model estimates, std errors, and p-values (i.e. what
you get with print(fit)) into latex format?

I have tried:

f <- Gls(...)
latex(f, file='')

... but I get the following error

Error in replace.substring.wild(s, old, new, test = test, front = front,  :
   does not handle > 1 * in old

I think that this might have something to do with the fact that I fit the
model with GLS. (?)


What is GLS?  I thought you fit the model with Gls.  And please follow 
the posting guide instead of submitting only parts of lines of code.


The latex method for a Gls object typesets the mathematical form of the 
fitted model formula.  To get what you want:


w <- nlme:::summary.gls(f)$tTable
w
latex(w, file="")

Frank



I am looking for a simple way to get a table of estimated coefficients without
hand-making a table in a text editor for inclusion within a LATEX document.

Thanks!





--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ICD9 codes

2010-05-28 Thread Frank E Harrell Jr
Your notes are bordering on harassment.  Do you expect that everyone who 
reads this list will reply "I do not have anything that will help you" 
if they don't?  By my count this is your 4th note asking for this help.


That being said I hope that you do find help somewhere or implement it 
yourself and share the result, as your question is an important one.


Also, please be sure to state your affiliation on your notes.

Frank

On 05/28/2010 02:19 PM, Vishwanath Sindagi wrote:

Hello:

I am working on getting some statistics related to clinical trials and
stuff. I have to work with ICD9 codes.

Is anyone aware of any R method that deals with ICD9 code
verification and manipulation?

Thanks
Vishwanath




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] latex.rms and models fit with Gls

2010-05-28 Thread Frank E Harrell Jr

On 05/28/2010 05:31 PM, Dylan Beaudette wrote:

On Friday 28 May 2010, Frank E Harrell Jr wrote:

On 05/28/2010 03:49 PM, Dylan Beaudette wrote:

Hi,

I have fit a model using the rms package with the Gls() function.

Is there a way to get the model estimates, std errors, and p-values (i.e.
what you get with print(fit)) into latex format?

I have tried:

f <- Gls(...)
latex(f, file='')

... but I get the following error

Error in replace.substring.wild(s, old, new, test = test, front = front,
: does not handle > 1 * in old

I think that this might have something to do with the fact that I fit the
model with GLS. (?)


Thanks for the quick reply Frank.


What is GLS?  I thought you fit the model with Gls.  And please follow
the posting guide instead of submitting only parts of lines of code.


Right, that would have been helpful. I didn't want to paste in the enormous
example from the Gls() manual page...


The latex method for a Gls object typesets the mathematical form of the
fitted model formula.


OK. That would have been nice too, but why did I get the above error message?
Let's recreate my data with an example:

# fake data
w <- rnorm(50)
x <- rnorm(50)
y <- rnorm(50)
z <- rnorm(50)
p <- runif(50)
g <- rep(c('a','b'), each=25)

# sprinkle in some NA
w[sample(1:50, 4)] <- NA

# fit data with similar correlation structure
f <- Gls(p ~ y * x * (z + w), na.action='na.omit', cor=corCAR1(form= ~ z | g))

# hmmm Error in terms.default(f) : no terms component


Thanks for the good example and sorry about the error.  I've posted a 
fix for Gls.  Use the following to override the current version with the 
fixed one:


source('http://biostat.mc.vanderbilt.edu/tmp/Gls.s')

I've also changed the default na.action to na.omit.  For some reason 
model.frame doesn't work with the output of the Hmisc package's more 
comprehensive na.delete function when used with Gls.



latex(f, file='')

The attached packages are:
nlme_3.1-96 rms_3.0-0   Hmisc_3.8-0 survival_2.35-8



To get what you want:

w <- nlme:::summary.gls(f)$tTable
w
latex(w, file="")


Got it. Thanks-- this does exactly what I was looking for. I wonder if it
might be worth the effort to include a small snippet on this in the manual
pages...


done

Frank



Cheers,
Dylan



Frank


I am looking for a simple way to get a table of estimated coefficients
without hand-making a table in a text editor for inclusion within a LATEX
document.

Thanks!







--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] median test

2010-05-27 Thread Frank E Harrell Jr
Please note that the median test is considered obsolete by many, because 
of its very low power.  See the reference below.  -Frank


@Article{fri00sho,
  author =   {Freidlin, Boris and Gastwirth, Joseph L.},
  title ={Should the median test be retired from general use?},
  journal =  {The American Statistician},
  year = 2000,
  volume =   54,
  number =   3,
  pages ={161-164},
  annote =   {inefficiency of Mood median test}
}


On 05/27/2010 10:20 AM, Joshua Wiley wrote:

Hello Linda,

The problem is actually the median of your data.  What the function
median.test() does first is combine both groups.  Look at this:

median(c(group1, group2))

the median is 1, but the lowest value of the groups is also 1.  So
when the function does the logical check z < m where z = c(group1,
group2) and m is the median, there are no values that are less than
the median value.  Therefore there is only 1 level, and the fisher
test fails.

You would either need different data or adjust the function to be:

fisher.test(z <= m, g)$p.value

that way it's less than or equal to the median.

Hope that helps,

Josh

On Thu, May 27, 2010 at 7:24 AM, linda Porz <linda.p...@gmail.com> wrote:

Hi all,

I have found the following function online

median.test <- function(y1, y2) {
  z <- c(y1, y2)
  g <- rep(1:2, c(length(y1), length(y2)))
  m <- median(z)
  fisher.test(z < m, g)$p.value
}

in

http://www.mail-archive.com/r-help@r-project.org/msg95278.html

I have the following data


group1- c(2, 2, 2, 1, 4, 3, 1, 1)
group2- c(3, 1, 3, 1, 4, 1, 1, 1, 7, 1, 1, 1, 1, 1, 2)
median.test(w1,group1)

[1] 1

median.test(group1,group2)

Error in fisher.test(z < m, g) : 'x' and 'y' must have at least 2 levels

I am very thankful in advance for any suggestion and help.

Regards,
Linda

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.








--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] validation logistic regression

2010-05-26 Thread Frank E Harrell Jr

On 05/26/2010 07:00 AM, Joris Meys wrote:

Hi,

first of all, you shouldn't backtransform your prediction; use the option
type="response" instead:

salichpred <- predict(salic.lr, newdata=profilevalidation, type="response")

limit <- 0.5
salichpredcat <- ifelse(salichpred < limit, 0, 1) # prediction of categories

Read up on sensitivity, specificity and ROC curves. By changing the limit
you can calculate sensitivity and specificity, and you can construct a ROC
curve that will tell you how good your predictions are. It all depends on
how much error you allow in the predictions.

Cheers
Joris


If you want to use split-sample validation, your validation sample is 
perhaps 100 times too small.


There are more direct ways to validate predictions than using 
sensitivity, specificity, and ROC, for example smooth calibration curves 
and various indexes of predictive accuracy.  These are implemented in 
the rms package.  See the validate.lrm and calibrate.lrm functions.


Frank
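
A minimal sketch of that rms workflow (variable names taken from the 
thread below; the fit must keep its design matrix and response via 
x=TRUE, y=TRUE):

library(rms)
f <- lrm(SALIC.H.1 ~ wetnessindex, data=profilesample, x=TRUE, y=TRUE)
validate(f, B=200)           # optimism-corrected Dxy, R2, slope, ...
cal <- calibrate(f, B=200)   # overfitting-corrected calibration curve
plot(cal)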




On Wed, May 26, 2010 at 10:04 AM, azam jaafari <azamjaaf...@yahoo.com> wrote:


Hi

I did validation for prediction by logistic regression according to
following:

validationsize <- 23
set.seed(1)
random <- runif(123)
order(random)
nrprofilesinsample <- sort(order(random)[1:100])
profilesample <- data[nrprofilesinsample,]
profilevalidation <- data[-nrprofilesinsample,]
salich <- profilesample$SALIC.H.1
salic.lr <- glm(salich ~ wetnessindex, profilesample,
family=binomial('logit'))
summary(salic.lr)
salichpred <- predict(salic.lr, newdata=profilevalidation)
expsalichpred <- exp(salichpred)
salichprediction <- (expsalichpred/(1+expsalichpred))

So,
  table(salichprediction, profilevalidation$SALIC.H.1)

in result:
salichprediction       0 1
   0.0408806327422231 1 0
   0.094509645033899  1 0
   0.118665480273383  1 0
   0.129685441514168  1 0
   0.135452955695111 0
   0.137580612201769  1 0
   0.197265822234215  1 0
   0.199278585548248  0 1
   0.202436276322278  1 0
   0.211278767985746  1 0
   0.261036846823867  1 0
   0.283792703256058  1 0
   0.362229486187581  0 1
   0.362795636267779  1 0
   0.409067386115694  1 0
   0.410860613509484  0 1
   0.423960962956254  1 0
   0.428164288793652  1 0
   0.448509687866763  0 1
   0.538401659478058  0 1
   0.557282539294224  1 0
   0.603881788227797  0 1
   0.63633478460736   0 1

So, I have salichprediction between 0 and 1 and a binary variable (observed
values) 0 or 1. I want to compare these data and to know whether this model
(logistic regression) is OK for prediction or not.

please help me?

Thanks alot

Azam




[[alternative HTML version deleted]]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.








--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] validation logistic regression

2010-05-26 Thread Frank E Harrell Jr
Better would be 100 repeats of 10-fold cross-validation, or 
bootstrapping, as implemented in the rms package.


Frank
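
A minimal sketch of repeated cross-validation (assuming the lrm fit f from 
the sketch earlier in this thread; averaging the returned index matrices 
is a rough convenience, not an rms feature):

set.seed(1)
runs <- replicate(100, validate(f, method='crossvalidation', B=10),
                  simplify=FALSE)
Reduce(`+`, runs)/100   # average the validation indexes over 100 repeats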

On 05/26/2010 08:21 AM, azam jaafari wrote:


Hi

Thank you for your reply.

I'm new in R. So I'm slow

If I want to do leave-one-out cross-validation with these data (100), how do I 
tell R to omit one observation at a time? Is validationsize=100?

  Thanks alot

Azam

--- On Wed, 5/26/10, Joris Meys <jorism...@gmail.com> wrote:


From: Joris Meys <jorism...@gmail.com>
Subject: Re: [R] validation logistic regression
To: azam jaafari <azamjaaf...@yahoo.com>
Cc: r-help@r-project.org
Date: Wednesday, May 26, 2010, 5:00 AM


Hi,

first of all, you shouldn't backtransform your prediction; use the option 
type="response" instead:

salichpred <- predict(salic.lr, newdata=profilevalidation, type="response")

limit <- 0.5
salichpredcat <- ifelse(salichpred < limit, 0, 1) # prediction of categories

Read up on sensitivity, specificity and ROC curves. By changing the limit, 
you can calculate sensitivity and specificity, and you can construct a ROC 
curve that will tell you how good your predictions are. It all depends on how 
much error you allow in the predictions.

Cheers
Joris



On Wed, May 26, 2010 at 10:04 AM, azam jaafari <azamjaaf...@yahoo.com> wrote:

Hi

I did validation for prediction by logistic regression according to following:

validationsize <- 23
set.seed(1)
random <- runif(123)
order(random)
nrprofilesinsample <- sort(order(random)[1:100])
profilesample <- data[nrprofilesinsample,]
profilevalidation <- data[-nrprofilesinsample,]
salich <- profilesample$SALIC.H.1
salic.lr <- glm(salich ~ wetnessindex, profilesample, family=binomial('logit'))
summary(salic.lr)
salichpred <- predict(salic.lr, newdata=profilevalidation)
expsalichpred <- exp(salichpred)
salichprediction <- (expsalichpred/(1+expsalichpred))

So,
  table(salichprediction, profilevalidation$SALIC.H.1)

in result:
salichprediction       0 1
   0.0408806327422231 1 0
   0.094509645033899  1 0
   0.118665480273383  1 0
   0.129685441514168  1 0
   0.135452955695111 0
   0.137580612201769  1 0
   0.197265822234215  1 0
   0.199278585548248  0 1
   0.202436276322278  1 0
   0.211278767985746  1 0
   0.261036846823867  1 0
   0.283792703256058  1 0
   0.362229486187581  0 1
   0.362795636267779  1 0
   0.409067386115694  1 0
   0.410860613509484  0 1
   0.423960962956254  1 0
   0.428164288793652  1 0
   0.448509687866763  0 1
   0.538401659478058  0 1
   0.557282539294224  1 0
   0.603881788227797  0 1
   0.63633478460736   0 1

So, I have salichprediction between 0 and 1 and a binary variable (observed values) 
0 or 1. I want to compare these data and to know whether this model (logistic 
regression) is OK for prediction or not.

please help me?

Thanks alot

Azam




[[alternative HTML version deleted]]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.







__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problem with plotting survival predictions from cph model

2010-05-26 Thread Frank E Harrell Jr

On 05/26/2010 10:55 AM, Michal Figurski wrote:

Dear R-helpers,

I am working with 'cph' models from 'rms' library. When I build simple
survival models, based on 'Surv(time, event)', everything is fine and I
can make nice plots using plot(Predict(f, time=3)).

However, recently I tried to be more specific and used 'Surv(start,
stop, event)' type model. Using this model 'plot(Predict(f))' works OK,
but 'plot(Predict(f, time=3))' throws an error:

Error in summary.survfit(g, print.it = FALSE, times = times) :
object 'n.enter' not found

Actually, using 'survest(f, time=3)' has the same result.

Has anybody encountered this kind of error? Is there a workaround?

Best regards,



Please provide a tiny self-contained reproducible example, e.g., using 
simulated data.


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Relative Risk/Hazard Ratio plots for continuous variables

2010-05-25 Thread Frank E Harrell Jr

On 05/25/2010 07:28 AM, Laura Bonnett wrote:

Dear all,

I am using Windows and R 2.9.2 for my analyses.  I have a large dataset and
I am particularly interested in looking at time to an event for a continuous
variable.  I would like to produce a plot of log(relative risk) or relative
risk (also known as hazard ratio) against the continuous variable.


Please use correct terminology.  A risk is a probability (except perhaps 
in finance) whereas a hazard is a rate (instantaneous conditional risk). 
 What you want is relative hazard.




I have spent a long time looking for advice on how to do this but my search
has proved fruitless - sorry if I've missed something obvious.  It seems
that there are options such as muhaz, survfit, coxph and cph that may enable
some plots to be produced but none that specifically look at the relative
risk one.

In addition to the survival analysis, I have incorporated the mfp function
(from package mfp).

I currently use code such as,

library(mfp)
library(Design)

coxfit1 <- coxph(Surv(rtime,rcens)~cts, data=data1)
or
coxfit2 <-
mfp(Surv(rtime,rcens)~fp(cts), family=cox, data=data1, select=0.05, verbose=TRUE)

Neither plot(coxfit1) nor plot(coxfit2) produces the relative risk vs.
continuous variable plot that I need.


Replace Design with the new rms package and see 
http://biostat.mc.vanderbilt.edu/Rrms


The plot you need is a basic one provided by rms (or its ancestor Design).

Frank
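
A minimal sketch of that basic plot (hypothetical names from the question; 
the rcs() transformation and datadist reference settings are arbitrary):

library(rms)
dd <- datadist(data1); options(datadist='dd')
f <- cph(Surv(rtime, rcens) ~ rcs(cts, 4), data=data1)
plot(Predict(f, cts, fun=exp), ylab='Hazard Ratio')  # relative hazard vs. cts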


Can anyone help?

Thank you,
Laura

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ROC curve

2010-05-24 Thread Frank E Harrell Jr

On 05/24/2010 02:14 AM, Claudia Beleites wrote:

Dear Changbin,


I want to know how to select the optimal decision threshold from the ROC
curve?

Depends on what "optimal" means. I think there are a bunch of different
criteria used:

- point closest to the ideal model
- point furthest from the guessing model
- these criteria may include costs, i.e. a FP/FN ratio != 1
- ...

More practical:
If you use ROCR: the help of the performance class explains the slots in
the object. You find there the data of the curve, incl. the thresholds.


What threshold will give the highest accuracy?

to know that, optimize the accuracy as a function of the threshold.

Remember: finding the optimal threshold from a ROC curve is a
data-driven optimization. You need to validate the resulting model with
independent test data afterwards.


That point is excellent.  In addition, such decision analysis assumes 
that (1) a forced yes/no decision is acceptable, i.e., a predicted 
probability in the middle is forced to be categorized as "low" or "high" 
as opposed to "no decision; get more data", and (2) the 
utility/cost/loss function is identical across subjects (which it almost 
never is).


Frank
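
A minimal sketch of the threshold search Claudia describes (hypothetical 
vectors: p holds predicted probabilities, y the observed 0/1 outcomes):

ths <- seq(0, 1, by=0.01)
acc <- sapply(ths, function(t) mean((p >= t) == (y == 1)))
ths[which.max(acc)]  # accuracy-maximizing threshold; validate on new data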

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Re : Re : Re : Nomogram with multiple interactions (package rms)

2010-05-23 Thread Frank E Harrell Jr
 argue that a nomogram is not a good
choice... I could use the parameter
interact=list(sex=("male","female"), x1=c(a,b,c))... but with rcs or
pol transformations of x1, x2 and x3, the choice of the
categorization (a,b,c,...) is arbitrary and the nomogram not so
useful... Considering that sex is the problem, I thought I could draw
two nomograms, one for each sex... based on one model. These would be
great. Do you think it's possible ?


Yes, you can specify constant predictors not to draw with the omit=
argument.  But try first to draw everything.  Shouldn't you just get 2
axes each for x1 x2 x3?



Taking the example of the help of nomogram() (package rms) : f <-
psm(Surv(d.time,death) ~ sex*age, dist=if(.R.)'lognormal' else
'gaussian')


Drop the if(.R.) which was just corrected in the documentation.  Use
dist='lognormal'

Frank



Let's add the previously defined blood.pressure effect with an
interaction with sex too (with cph) : f2 <- cph(Surv(d.time,death) ~
sex*(age+blood.pressure))

I thought of the parameter adj.to : plot(nomogram(f2,
adj.to=list(sex="male"))) # and "female" for the other one

But nomogram() still wants me to define interact=list(...). Thanks for
any advice you might have (with adj.to or any alternative...)

Marc Carpentier













--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Regression with sparse matricies

2010-05-22 Thread Frank E Harrell Jr

On 05/22/2010 02:19 PM, Robin Jeffries wrote:

I would like to run a logistic regression on some factor variables (main
effects and eventually an interaction) that are very sparse. I have a
moderately large dataset, ~100k observations with 1500 factor levels for one
variable (x1) and 600 for another (X2), creating ~19000 levels for the
interaction (X1:X2).

I would like to take advantage of the sparseness in these factors to avoid
using GLM. Actually glm is not an option given the size of the design
matrix.

I have looked through the Matrix package as well as other packages without
much help.

Is there some option, some modification of glm, some way that it will
recognize a sparse matrix and avoid large matrix inversions?

-Robin



Robin,

It is doubtful that fixed effects are appropriate for your situation, 
but if you do want to use them there is experimental code in the lrm 
function in the rms package to handle strat (strata) factors that 
makes use of the sparse matrix representation.  I am not sure whether it 
handles more than one factor, and you'll have to play with the code to 
make sure this method is activated.  Take a look at lrm.fit.strat.s that 
comes with the source package, then see what is needed in lrm to use it.


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Re : Re : Nomogram with multiple interactions (package rms)

2010-05-20 Thread Frank E Harrell Jr

On 05/20/2010 01:42 AM, Marc Carpentier wrote:

Thank you for your responses, but I don't think you're right about the doc...
I carefully looked at it before posting and ran the examples, looked in 
the Vanderbilt Biostat doc, and just ran example(nomogram) again :
1st example : categorical*continuous : two axes for each sex
f <- lrm(y ~ lsp(age,50)+sex*rcs(cholesterol,4)+blood.pressure)


Hi Marc,

My apologies; I misread my own example.  This will take some digging 
into the code.  If you have time to do this before I do, code change 
suggestions welcomed.


Frank




2nd : continuous*continuous : one age axis for each specified value of 
cholesterol
g <- lrm(y ~ sex + rcs(age,3)*rcs(cholesterol,3))

3rd and 4th : categorical*continuous : two axes for each sex (4th with fun)
f <- psm(Surv(d.time,death) ~ sex*age, dist='lognormal')

5th : categorical*continuous : two axes for each sex (with fun)
g <- lrm(Y ~ age+rcs(cholesterol,4)*sex)

I'm desperately trying to represent a case of categorical*(continuous+continuous) :
f2 <- cph(Surv(d.time,death) ~ sex*(rcs(cholesterol,4)+blood.pressure))
The best solution I can think of is to draw one nomogram for each sex :
Assuming 'male' is the ref level of sex :
1st nomogram : one axis for rcs(cholesterol,4), one axis for blood.pressure
2nd nomogram : one axis for sex:rcs(cholesterol,4), one axis for 
sex:blood.pressure, both shifted because of the own effect of sex.
(I drew it badly in my previous mail)
I didn't see any example of this adjustment of the nomogram to 'male' or 
'female'...

I hope I gave a clearer explanation and I'm not wrong about this unmentioned 
case.

Marc




----- Original message -----
From: Frank E Harrell Jr <f.harr...@vanderbilt.edu>
To: Marc Carpentier <marc.carpent...@ymail.com>
Cc: r-help-request Mailing List <r-help@r-project.org>
Sent: Thu 20 May 2010, 00:55:32
Subject: Re: Re : [R] Nomogram with multiple interactions (package rms)

On 05/19/2010 04:36 PM, Marc Carpentier wrote:

I'm sorry. I don't understand the omit solution, and maybe I misled you with 
my explanation.

With the data from the f example of nomogram() :
Let's declare :
f2 <- cph(Surv(d.time,death) ~ sex*(age+blood.pressure))
I guess the best (and maybe the only) way to represent it with a nomogram is to 
plot two nomograms (I couldn't find better).
Is there a way to have :

Nomogram1 : male :
- points 1-100 ---
- age (men) ---
- blood.pressure (men) ---
- linear predictor ---

And nomogram2 : female :
- points 1-100 ---
- age (female) ---
- blood.pressure (female) ---
- linear predictor ---

As I said I tried and failed (nomogram() still wants me to define
interact=list(...)) with :
plot(nomogram(f2, adj.to=list(sex="male"))) # and "female" for the other one

Marc


I think the documentation tells you how to do this.  But you failed to
look at the output from example(nomogram).  In one of the examples two
continuous predictors have two axes each, with male and female in close
proximity.  Or maybe I'm just missing your point.

Frank





----- Original message -----
From: Frank E Harrell Jr <f.harr...@vanderbilt.edu>
To: Marc Carpentier <marc.carpent...@ymail.com>; r-help-request Mailing 
List <r-help@r-project.org>
Sent: Wed 19 May 2010, 22:28:51
Subject: Re: [R] Nomogram with multiple interactions (package rms)

On 05/19/2010 03:17 PM, Marc Carpentier wrote:

Dear list, I'm facing the following problem : A cox model with my sex
variable interacting with several continuous variables :
cph(S~sex*(x1+x2+x3)) And I'd like to make a nomogram. I know it's a
bit tricky and one might argue that a nomogram is not a good
choice... I could use the parameter
interact=list(sex=("male","female"), x1=c(a,b,c))... but with rcs or
pol transformations of x1, x2 and x3, the choice of the
categorization (a,b,c,...) is arbitrary and the nomogram not so
useful... Considering that sex is the problem, I thought I could draw
two nomograms, one for each sex... based on one model. These would be
great. Do you think it's possible ?


Yes, you can specify constant predictors not to draw with the omit=
argument.  But try first to draw everything.  Shouldn't you just get 2
axes each for x1 x2 x3?



Taking the example of the help of nomogram() (package rms) : f <-
psm(Surv(d.time,death) ~ sex*age, dist=if(.R.)'lognormal' else
'gaussian')


Drop the if(.R.) which was just corrected in the documentation.  Use
dist='lognormal'

Frank



Let's add the previously defined blood.pressure effect with an
interaction with sex too (with cph) : f2 <- cph(Surv(d.time,death) ~
sex*(age+blood.pressure))

I thought of the parameter adj.to : plot(nomogram(f2,
adj.to=list(sex="male"))) # and "female" for the other one

But nomogram() still wants me to define interact=list(...). Thanks for
any advice you might have (with adj.to or any alternative...)

Marc Carpentier










--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine

Re: [R] Generating all possible models from full model

2010-05-19 Thread Frank E Harrell Jr
+sinmonth+coslunar+sinlunar+plankton, data=mydata))

m23456 <- glm.convert(glm.nb(mantas~year+cosmonth+sinmonth+coslunar+sinlunar, data=mydata))
m23457 <- glm.convert(glm.nb(mantas~year+cosmonth+sinmonth+coslunar+plankton, data=mydata))
m23467 <- glm.convert(glm.nb(mantas~year+cosmonth+sinmonth+sinlunar+plankton, data=mydata))
m23567 <- glm.convert(glm.nb(mantas~year+cosmonth+coslunar+sinlunar+plankton, data=mydata))
m24567 <- glm.convert(glm.nb(mantas~year+sinmonth+coslunar+sinlunar+plankton, data=mydata))
m34567 <- glm.convert(glm.nb(mantas~cosmonth+sinmonth+coslunar+sinlunar+plankton, data=mydata))

#Six terms - 7 models
m123456 <- glm.convert(glm.nb(mantas~site*year+cosmonth+sinmonth+coslunar+sinlunar, data=mydata))
m123457 <- glm.convert(glm.nb(mantas~site*year+cosmonth+sinmonth+coslunar+plankton, data=mydata))
m123467 <- glm.convert(glm.nb(mantas~site*year+cosmonth+sinmonth+sinlunar+plankton, data=mydata))
m123567 <- glm.convert(glm.nb(mantas~site*year+cosmonth+coslunar+sinlunar+plankton, data=mydata))
m124567 <- glm.convert(glm.nb(mantas~site*year+sinmonth+coslunar+sinlunar+plankton, data=mydata))
m134567 <- glm.convert(glm.nb(mantas~site+cosmonth+sinmonth+coslunar+sinlunar+plankton, data=mydata))
m234567 <- glm.convert(glm.nb(mantas~year+cosmonth+sinmonth+coslunar+sinlunar+plankton, data=mydata))

#Seven terms - 1 model
m1234567 <- glm.convert(glm.nb(mantas~site*year+cosmonth+sinmonth+coslunar+sinlunar+plankton, data=mydata))


Tim Clark
Department of Zoology
University of Hawaii


Please read the large number of notes in the e-mail archive about the 
invalidity of such modeling procedures.


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Generating all possible models from full model

2010-05-19 Thread Frank E Harrell Jr

On 05/19/2010 01:39 PM, Ben Bolker wrote:

Frank E Harrell Jr <f.harrell at Vanderbilt.Edu> writes:



Please read the large number of notes in the e-mail archive about the
invalidity of such modeling procedures.

Frank



   I'm curious: do you have an objection to multi-model averaging
a la Burnham, Anderson, and White (as implemented in the MuMIn
package)?  i.e., *not* just picking the
best model, and *not* trying to interpret statistical significance
of particular coefficients, but trying to maximize predictive
capability by computing the AIC values of many candidate models
and weighting predictions accordingly (and incorporating among-model variation
when computing prediction uncertainty)?  (I would look for the
answer in your book, but I have lost my copy by loaning it out and
haven't got a new one yet ...)


Hi Ben,

I think that model averaging (e.g., Bayesian model averaging) works 
extremely well.  But if you are staying within one model family, it is a 
lot more work than the equally excellent penalized maximum likelihood 
estimation of a single (big) model.  The latter uses more standard tools 
and can isolate the effect of one variable and result in ordinary model 
graphics.


I haven't seen a variable selection method that works well without 
penalization (shrinkage).


Frank
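
A minimal sketch of the single-big-model alternative (hypothetical binary 
y and predictors; the penalty grid is arbitrary):

library(rms)
f <- lrm(y ~ x1 + x2 + x3 + x4, x=TRUE, y=TRUE)
p <- pentrace(f, c(0, .5, 1, 2, 4, 8))  # penalty chosen by effective AIC
f.pen <- update(f, penalty=p$penalty)   # one shrunken model, no selection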

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Nomogram with multiple interactions (package rms)

2010-05-19 Thread Frank E Harrell Jr

On 05/19/2010 03:17 PM, Marc Carpentier wrote:

Dear list, I'm facing the following problem : A cox model with my sex
variable interacting with several continuous variables :
cph(S~sex*(x1+x2+x3)) And I'd like to make a nomogram. I know it's a
bit tricky and one might argue that a nomogram is not a good
choice... I could use the parameter
interact=list(sex=("male","female"), x1=c(a,b,c))... but with rcs or
pol transformations of x1, x2 and x3, the choice of the
categorization (a,b,c,...) is arbitrary and the nomogram not so
useful... Considering that sex is the problem, I thought I could draw
two nomograms, one for each sex... based on one model. These would be
great. Do you think it's possible ?


Yes, you can specify constant predictors not to draw with the omit= 
argument.  But try first to draw everything.  Shouldn't you just get 2 
axes each for x1 x2 x3?




Taking the example of the help of nomogram() (package rms) : f <-
psm(Surv(d.time,death) ~ sex*age, dist=if(.R.)'lognormal' else
'gaussian')


Drop the if(.R.) which was just corrected in the documentation.  Use 
dist='lognormal'


Frank



Let's add the previously defined blood.pressure effect with an
interaction with sex too (with cph) : f2 <- cph(Surv(d.time,death) ~
sex*(age+blood.pressure))

I thought of the parameter adj.to : plot(nomogram(f2,
adj.to=list(sex="male"))) # and "female" for the other one

But nomogram() still wants me to define interact=list(...). Thanks for
any advice you might have (with adj.to or any alternative...)

Marc Carpentier




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Re : Nomogram with multiple interactions (package rms)

2010-05-19 Thread Frank E Harrell Jr

On 05/19/2010 04:36 PM, Marc Carpentier wrote:

I'm sorry. I don't understand the omit solution, and maybe I misled you with 
my explanation.

With the data from the f example of nomogram() :
Let's declare :
f2 <- cph(Surv(d.time,death) ~ sex*(age+blood.pressure))
I guess the best (and maybe the only) way to represent it with a nomogram is to 
plot two nomograms (I couldn't find better).
Is there a way to have :

Nomogram1 : male :
- points 1-100 ---
- age (men) ---
- blood.pressure (men) ---
- linear predictor ---

And nomogram2 : female :
- points 1-100 ---
- age (female) ---
- blood.pressure (female) ---
- linear predictor ---

As I said I tried and failed (nomogram() still wants me to define
interact=list(...)) with :
plot(nomogram(f2, adj.to=list(sex="male"))) # and "female" for the other one

Marc


I think the documentation tells you how to do this.  But you failed to 
look at the output from example(nomogram).  In one of the examples two 
continuous predictors have two axes each, with male and female in close 
proximity.  Or maybe I'm just missing your point.


Frank





----- Original message -----
From: Frank E Harrell Jr <f.harr...@vanderbilt.edu>
To: Marc Carpentier <marc.carpent...@ymail.com>; r-help-request Mailing 
List <r-help@r-project.org>
Sent: Wed 19 May 2010, 22:28:51
Subject: Re: [R] Nomogram with multiple interactions (package rms)

On 05/19/2010 03:17 PM, Marc Carpentier wrote:

Dear list, I'm facing the following problem : A cox model with my sex
variable interacting with several continuous variables :
cph(S~sex*(x1+x2+x3)) And I'd like to make a nomogram. I know it's a
bit tricky and one might argue that a nomogram is not a good
choice... I could use the parameter
interact=list(sex=("male","female"), x1=c(a,b,c))... but with rcs or
pol transformations of x1, x2 and x3, the choice of the
categorization (a,b,c,...) is arbitrary and the nomogram not so
useful... Considering that sex is the problem, I thought I could draw
two nomograms, one for each sex... based on one model. These would be
great. Do you think it's possible ?


Yes, you can specify constant predictors not to draw with the omit=
argument.  But try first to draw everything.  Shouldn't you just get 2
axes each for x1 x2 x3?



Taking the example of the help of nomogram() (package rms) : f <-
psm(Surv(d.time,death) ~ sex*age, dist=if(.R.)'lognormal' else
'gaussian')


Drop the if(.R.) which was just corrected in the documentation.  Use
dist='lognormal'

Frank



Let's add the previously defined blood.pressure effect with an
interaction with sex too (with cph) : f2 <- cph(Surv(d.time,death) ~
sex*(age+blood.pressure))

I thought of the parameter adj.to : plot(nomogram(f2,
adj.to=list(sex="male"))) # and "female" for the other one

But nomogram() still wants me to define interact=list(...). Thanks for
any advice you might have (with adj.to or any alternative...)

Marc Carpentier







--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Getting dates in an SPSS file in right format.

2010-05-18 Thread Frank E Harrell Jr

On 05/18/2010 11:52 AM, Chuck Cleland wrote:

On 5/18/2010 12:38 PM, Praveen Surendran wrote:

Dear all,



I am trying to read an SPSS file into a data frame in R using method
read.spss(),

sample <- read.spss(file.name, to.data.frame=TRUE)



But dates in the data.frame 'sample' are coming as integers and not in the
actual date format given in the SPSS file.

Appreciate if anyone can help me to solve this problem.


   Date variables in SPSS contain the number of seconds since
October 14, 1582.  You might try something like this:

sample$MYDATE <- as.Date(as.POSIXct(sample$MYDATE, origin="1582-10-14",
tz="GMT"))


Kind Regards,



Praveen Surendran

2G, Complex and Adaptive Systems Laboratory (UCD CASL)

School of Medicine and Medical Sciences

University College Dublin

Belfield, Dublin 4

Ireland.



Office : +353-(0)1716 5334

Mobile : +353-(0)8793 13071



The spss.get function in the Hmisc package handles SPSS dates.
Frank
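
A minimal sketch (MYDATE stands in for whatever date variables the file 
contains):

library(Hmisc)
sample <- spss.get(file.name, datevars='MYDATE', to.data.frame=TRUE)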

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Box-Cox Transformation: Drastic differences when varying added constants

2010-05-18 Thread Frank E Harrell Jr

On 05/18/2010 10:41 PM, Greg Snow wrote:

Have you read the BoxCox paper?  It has the theory in there for dealing with an 
offset parameter (though I don't know of any existing functions that help in 
estimating both lambdas at the same time).  Though another important point (in 
the paper as well) is that the lambda values used should be based on sound 
science, not just what fits best.



Sensitivity of log-like and exponential transformations to the choice of 
origin is a significant limitation in my view.  If there is no subject 
matter theory to back up a particular origin, then either the origin 
should be a parameter to be estimated or you should consider a 
nonparametric transformation of Y.  avas is one such approach, based on 
variance stabilization.  The areg.boot function in the Hmisc package 
gives you confidence bands for various quantities using avas and ace 
transform-both-sides nonparametric regression approaches.


Frank
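
A minimal sketch of areg.boot (hypothetical simulated data; B=100 
bootstrap repetitions chosen arbitrarily):

library(Hmisc)
set.seed(1)
d <- data.frame(x1=runif(200), x2=runif(200))
d$y <- exp(d$x1 + 0.5*d$x2 + rnorm(200, 0, 0.3))
f <- areg.boot(y ~ x1 + x2, data=d, B=100, method='avas')
plot(f)     # estimated transformations with bootstrap bands
summary(f)  # effects of predictors on various quantities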

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Cubic B-spline, how to numerically integrate?

2010-05-14 Thread Frank E Harrell Jr

On 05/14/2010 11:36 AM, David Winsemius wrote:


On May 14, 2010, at 11:57 AM, David Winsemius wrote:



On May 14, 2010, at 11:41 AM, Claudia Penaloza wrote:


(corrected version of previous posting)

I fit a GAM to turtle growth data following methods in Limpus &
Chaloupka 1997 (http://www.int-res.com/articles/meps/149/m149p023.pdf).

I want to obtain figures similar to Fig 3 c & f in Limpus & Chaloupka
(1997), which according to the figure legend are expected size-at-age
functions obtained by numerically integrating size-specific growth
rate functions derived using cubic B-spline fit to GAM predicted values.

I was able to fit the cubic-B spline, but I do not know how to
numerically
integrate it.


You need to give us the function and the appropriate limits of
integration.


Maybe it is even easier than I thought. Assuming your interest lies with
the last fitted line ... the one on the third page ... you could perhaps
just see how successful this strategy would be:

# Presumably you have already done:
# require(mgcv)

Int.fit <- seq(100, 600, by=0.1)*predict(bspline3,
newdata=data.frame(size=seq(100, 600, by=0.1) ) )

You would need to do a sensibility check by comparing the result to a
back of the envelope estimate: say 55*500=27500 . I'm a bit concerned
that a dimensional analysis suggest this is an estimate of mm^2/yr,
although I suppose a yearly surface area increase could be a meaningful
value in some domain or another





Can anybody help please?

Code and figures here:
https://docs.google.com/fileview?id=0B1cQ7z9xYFl2ZTZhMmMyMjAtYTA3Zi00N2QyLTkxNzMtOGYyMjdiOGU2ZWE4hl=en


Thank you,
Claudia

--

David Winsemius, MD
West Hartford, CT


There are advantages to using the truncated power basis for splines. 
Derivatives and antiderivatives are trivial to write down.  The Hmisc 
package has rcspline.eval and rcspline.restate functions to help, the 
latter having an option to express the antiderivative in text form.


Frank
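
A minimal sketch of the truncated-power-basis route (hypothetical 
growth-rate data; four knots placed by rcspline.eval's defaults):

library(Hmisc)
set.seed(1)
x <- seq(100, 600, length=200)
y <- 50*exp(-x/300) + rnorm(200, 0, 2)         # stand-in growth rates
xx <- rcspline.eval(x, nk=4, inclx=TRUE)
k <- attr(xx, 'knots')
cof <- lsfit(xx, y)$coef
rcspline.restate(k, cof[-1], type='integral')  # antiderivative in text form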


--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Simple question on binning data

2010-05-14 Thread Frank E Harrell Jr

On 05/14/2010 07:35 PM, kMan wrote:

Wow! This definitely contributed to my evening.

If you could indulge, I would like some clarification on this matter of
binning and distortion, particularly wrt time series (perhaps related to
long-memory processes?). I had thought binning was standard practice in
spectral analysis and ANPOW.

...snip
Bert is correct.  Binning is seldom needed and usually distorts.  It is the
statistical equivalent of a former governor from Alaska.

Frank
...snip

Sincerely,
KeithC.




My experience in time series is limited so I'll defer to others.
Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] read table for Fisher Exact

2010-05-13 Thread Frank E Harrell Jr

On 05/12/2010 03:31 PM, visser wrote:


i have 2 groups i want to compare: group A and group B
each group contains let's say 20 patients
i want to perform a Fisher Exact test on genotype distribution
so, see if there is a significant difference in genotype frequency/distribution (#AA, #AB,
#BB) between group A and B
not for 1, but for 1000 different genes

my question: how should i build my table so i can do:

test <- read.table("table1.txt")
fisher.test(test)

i know a lot is still missing in the syntax, but i do not know what. any
help would be soo much appreciated!!


Note that in this case, Fisher's exact test has a good chance of being 
less accurate than an approximate Pearson chi-square test.
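
A minimal sketch of the comparison (hypothetical 2 x 3 genotype counts):

tab <- matrix(c(8, 9, 3,
                5, 10, 5), nrow=2, byrow=TRUE,
              dimnames=list(group=c('A','B'), genotype=c('AA','AB','BB')))
chisq.test(tab)   # Pearson chi-square (may warn with small expected counts)
fisher.test(tab)  # exact test, for comparison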


--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Simple question on binning data

2010-05-13 Thread Frank E Harrell Jr

On 05/13/2010 04:05 PM, Bert Gunter wrote:

?hist

However, I'm going to go way out on a limb here, as I know absolutely
nothing of what you're up to, and suggest that you do NOT need to bin your
data.

-- Bert


Bert is correct.  Binning is seldom needed and usually distorts.  It is 
the statistical equivalent of a former governor from Alaska.


Frank



Bert Gunter
Genentech Nonclinical Biostatistics



-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Erik Iverson
Sent: Thursday, May 13, 2010 1:51 PM
To: george5000
Cc: r-help@r-project.org
Subject: Re: [R] Simple question on binning data

There would be several people who could help if you gave us a minimal,
reproducible example like the posting guide asks for.

If you have a vector of continuous data, and need to create a
categorical variable (in R, a factor) from that continuous variable,
then ?cut can help you.
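
A minimal sketch of ?cut (hypothetical data; quartile breakpoints chosen
arbitrarily):

x <- rnorm(100)
b <- cut(x, breaks=quantile(x, probs=0:4/4), include.lowest=TRUE)
table(b)  # counts per bin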



george5000 wrote:

Hello everyone,

I have a data set, and I need to bin my data using a bin width of say

g(n).

Would anyone be willing to tell me how to do this in R?

Thanks





--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] read table for Fisher Exact

2010-05-13 Thread Frank E Harrell Jr

On 05/13/2010 08:16 PM, Shi, Tao wrote:

Hi Prof. Harrell,

Could you please elaborate on why the chi-square test is more appropriate in this 
case?  Thank you very much!

...Tao


Exact tests tend to not be very accurate.  Typically their P-values are 
too large.  See


@Article{cra08how,
  author =   {Crans, Gerald G. and Shuster, Jonathan J.},
  title = 		 {How conservative is {Fisher's} exact test? {A} 
quantitative evaluation of the two-sample comparative binomial trial},

  journal =  {Statistics in Medicine},
  year = 2008,
  volume =   27,
  pages ={3598-3611},
  annote = 	 {Fisher's exact test; $2\times 2$ contingency table;size 
of test; comparative binomial experiment;first paper to truly quantify 
the conservativeness of Fisher's test;``the test size of FET was less 
than 0.035 for nearly all sample sizes before 50 and did not approach 
0.05 even for sample sizes over 100.'';conservativeness of ``exact'' 
methods;see \emph{Stat in Med} \textbf{28}:173-179, 2009 for a criticism 
which was unanswered}

}








- Original Message 

From: Frank E Harrell Jr <f.harr...@vanderbilt.edu>
To: r-help@r-project.org
Sent: Thu, May 13, 2010 5:35:07 AM
Subject: Re: [R] read table for Fisher Exact

On 05/12/2010 03:31 PM, visser wrote:

i have 2 groups i want to compare: group A and group B
each group contains let's say 20 patients
i want to perform a Fisher Exact test on genotype distribution
so, see if there is a sign diff in genotype frequency/distribution
(#AA, #AB, #BB) between group A and B
not for 1, but for 1000 different genes

my question: how should i build my table so i can do:

test <- read.table("table1.txt")
fisher.test(test)

i know a lot is still missing in the syntax, but i do not know what. any
help would be soo much appreciated!!


Note that in this case, Fisher's exact test has a good chance of being
less accurate than an approximate Pearson chi-square test.
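
A minimal sketch for a single gene (hypothetical 2 x 3 genotype counts;
the chi-square test is the one suggested above):

tab <- matrix(c(8, 9, 3, 5, 10, 5), nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              genotype = c("AA", "AB", "BB")))
chisq.test(tab)    # approximate Pearson chi-square
fisher.test(tab)   # for comparison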





--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] a question about latex in Hmisc

2010-05-12 Thread Frank E Harrell Jr

On 05/12/2010 03:33 PM, Shi, Tao wrote:

Hi Ista,

Thanks for the reply!

You actually misunderstood me.  I never objected to the tmp <- latex(x) method
(in fact, that's what I'm doing now in my .Rnw file).  As I stated in my original
post, I'm simply curious about what causes the error window and wanted to get to
the bottom of it.

...Tao


Not clear why the earlier suggestion someone made to use latex(object, 
file='') is not the method of choice with Sweave.
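
A minimal sketch of that approach (the chunk options here are assumptions):

<<results=tex, echo=FALSE>>=
latex(x, file='')   # writes the table's LaTeX code straight into the document
@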


Frank




If you would rather have a separate file for your table you can continue to
use your original method, e.g.,

<<>>=
tmp <- latex(x, file="x.tex")
@
\include{x.tex}

(not sure what your objection to this was in the first place).


-Ista





--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] a matter of etiquette

2010-05-11 Thread Frank E Harrell Jr

On 05/11/2010 08:20 AM, Gabor Grothendieck wrote:

You could send a reminder to the author.  Its happened to me that
nothing occurred and 6 months later I sent another email and that
seemed enough to get it fixed.

You could also ask the maintainer if the package has been abandoned
and offer to take over as maintainer.


I know this can be frustrating.  I reported a critical but trivial to 
fix error in a package 3 years ago and offered to take over the 
maintenance of the package one year ago, but the current maintainer does 
not get around to responding.


Frank



If you just want to record the problems and offer some advice to
others on a workaround you could add a page to the R wiki:
http://rwiki.r-project.org .  That way if it's fixed you can later
modify the page to remove the note or remove the entire page.


2010/5/11 Adelchi Azzalini azzal...@stat.unipd.it:

Dear useRs and expeRts,

I would like to ask your views on an issue for which there possibly
exists an established policy of the R-help list, but I am not aware
of it.

In 2008, I have spotted some errors in a package, one which is likely
to have many users (I am not one myself). The more serious errors are
in the documentation, since they lead to a completely distorted
interpretation of the outcome; in addition, there is (at least) one
programming error which produces some wrong computations. A few weeks
later, the maintainers of the package replied, with a promise to handle
these issues.

Today, I have looked at that package again, and everything is as in 2008,
as for the points raised then. There has been at least one update of
that package, about one year after our e-mail exchange.

When I discover a bug in a program of a colleague, I point out it by a
private communication to the author. If the problem is of some
relevance, I expect that the bug is fixed in a reasonable time. In the
present case, this fix has not been provided after over 18 months, and
users potentially keep producing nonsense out of that package.

What is the appropriate way to proceed now? Announce these errors to
the whole R-help list, or what else?  I only rule out writing (again) to
the maintainers of the package.

Adelchi Azzalini

--
Adelchi Azzalini  azzal...@stat.unipd.it
Dipart.Scienze Statistiche, Università di Padova, Italia
tel. +39 049 8274147,  http://azzalini.stat.unipd.it/




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] P values

2010-05-11 Thread Frank E Harrell Jr

On 05/11/2010 04:50 PM, Bert Gunter wrote:

Inline below.

-- Bert


Bert - I disagree with you.  You have understated the problems with 
P-values  :-)


Frank



Bert Gunter
Genentech Nonclinical Statistics

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Greg Snow
Sent: Tuesday, May 11, 2010 2:37 PM
To: Bak Kuss; murdoch.dun...@gmail.com; jorism...@gmail.com
Cc: R-help@r-project.org
Subject: Re: [R] P values

Bak,

...

  Small p-values indicate a hypothesis and data that are not very
consistent and for small enough p-values we would rather disbelieve the
hypothesis (though it may still be theoretically possible). 


This is false.

It is only true when the hypotheses are prespecified (not derived from
looking at the data first), when there's only one being tested (not,say,
10,000), etc. etc.

(Incidentally, "small enough" is not meaningful; making it so is usually
impossible).

IMHO far from being niggling details, these are the crux of the reason that
the conventional use of P values to decide what is scientifically valid
and what is not is a pox upon science. This is not an original view, of
course.

Don't want to stir up a hornet's nest, so feel free to silently dismiss. But
contrary views are welcome, as always.

No more from me on this -- I promise!

-- Bert

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to estimate whether overfitting?

2010-05-10 Thread Frank E Harrell Jr

On 05/10/2010 12:32 AM, bbslover wrote:


many thanks .  I can try to use test set with 100 samples.

another question is how can I rationally split my data into a training set
and a test set? (training set with 108 samples, and test set with 100 samples)

as I know, the test set should have the same distribution as the training set, and
what method can deal with splitting it rationally?

and what packages in R can deal with splitting training/test set rationally
question?


if the split is random. it seems to need many times splits, and the average
results consider as the final results.

however, I want to several methods to perform split and get the firm
training set and test set instead of random split.

training set and test set should be like this: ideally, the division must be
performed such that points representing both the training and test sets are
distributed within the whole feature space occupied by the entire dataset,
and each point of the test set is close to at least one point of the
training set. this approach ensures that the similarity principle can be
employed for the output prediction of the test set. Certainly, this
condition can not always be satisfied.

thus, generally, what algorithms often be perform to split? and more
rational? some paper often say, they split the data set  randomly, thus,
what is randomly?  just selection random? or have some clear method? e.g.
output order,  I really know, which package can do with split data
rationally?

also, if one wants to get better results, some tricks can be done, e.g.
one can select the test set again and again, and use the test set with the best
results as the final test set and say that the test set was selected randomly,
but it is not truly random; it is false.

thank you, sorry to so many questions. but it puzzled me always.  up to now,
I have no good method to split rationally my data into training set and test
set.

at last, splitting into training and test sets should be done before modeling, and it
seems that this can be done just from the features? (SOM)  (or features and
output? (algorithm SPXY; paper: a method for calibration and validation
subset partitioning)  or just output? (output order)).

but often there are many features to be calculated, and some features
are zero or have low standard deviation (sd < 0.5); should we delete these features
before splitting the whole data?

and use the remaining feature to split data, and just using the training set
to build the regression model and to perform feature selection as well as to
do cross-validation,  and the independent test set just used to test the
built model, yes?

maybe, my thinking is not clear about the whole model precess. but I think
it is like this:
1) get samples
2) calculate features
3) preprocess features calculated (e.g.remove zero)
4)rational split data into training and test set (always puzzle me, how to
split on earth?)
5)build model and at the same time tune parameter of model  based on the
resample methods using just training set. and get the final model.
6) test the model performance using independent test set (unseen samples).
7) estimate the model. good? or bad?  overfitting?  (generally, what case is
overfitting? can you give me an example? as i know, it is overfitting when
the training set fits well but the independent test set is bad, but what is
good? what is bad? r2=0.94 in the training set and r2=0.70 in the test:
in this case, is the model overfitting?  can the model be accepted?  and
generally what model can be well accepted?)
8) conclusion. how is the model.

above is my thinking.  and many question wait for answering.

thanks

kevin





Kevin: I'm sorry I don't have time to deal with such a long note, but 
briefly data splitting is not a good idea no matter how you do it unless 
N > perhaps 20,000.  I suggest resampling, e.g., either the bootstrap 
with 300 resamples or 50-fold repeats of 10-fold cross-validation. 
Among other places these are implemented in my rms package.
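
A minimal sketch of the bootstrap approach (hypothetical response y and
predictors x1, x2 in a data frame d; assumes rms is installed):

require(rms)
f <- ols(y ~ x1 + x2, data = d, x = TRUE, y = TRUE)
validate(f, method = "boot", B = 300)   # optimism-corrected indexes, 300 resamples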


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to estimate whether overfitting?

2010-05-09 Thread Frank E Harrell Jr

On 05/09/2010 10:53 AM, David Winsemius wrote:


On May 9, 2010, at 9:20 AM, bbslover wrote:



1. is there some criterion to estimate overfitting? e.g. R2 and Q2 in the
training set, as well as R2 in the test set, and which values mean overfitting. for
example, in my data, I have R2=0.94 for the training set and for the test
set R2=0.70; is that overfitting?
2. in this scatter, can one say this is overfitting?

3. my results are obtained by svm, and the samples are 156 and 52 for the
training and test sets, and the predictors are 96. In this case, can svm be
employed to perform prediction? or is the number of predictors
too many?


Your test sample is too small by a factor of 100 for split sample 
validation to work well.


Frank





I think you need to buy a copy of Hastie, Tibshirani, and Friedman and
do some self-study of chapters 7 and 12.



4.from this picture, can you give me some suggestion to improve model
performance? and is the picture bad?


5. the picture and data below.
thank you!


http://n4.nabble.com/file/n2164417/scatter.jpg scatter.jpg

http://n4.nabble.com/file/n2164417/pkc-svm.txt pkc-svm.txt
--



--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] help on hmisc

2010-05-08 Thread Frank E Harrell Jr

On 05/07/2010 10:12 AM, nvanzuy...@gmail.com wrote:

Hi,

I thought I would just jump in on this as I am running an i7 as well. I use
hmisc for the doBy functions and it would make a huge difference
particularly with large data sets to run this on 64bit windows. I'm not
sure how to compile from source and usually use the install.packages
option. At the moment I have two versions of R installed and switch between
them depending on what I'm working with. Having an hmisc package for 64bit
windows would really help.

Thanks Natalie


Natalie, you must be thinking of another package.  doBy is not in Hmisc;
summarize, mApply, etc., are in Hmisc.


You might look at the data.table package too.

Frank



On May 7, 2010 1:52pm, Joris Meys jorism...@gmail.com wrote:

Zach,

The R-gurus will correct me when I'm wrong, but as far as my very limited
experience goes, the 64bit version only gives you an advantage when throwing
around huge datasets or doing very memory-intensive tasks. For most of the
things I do with R, there is no difference at all. Now the difference
between an old x86 and a new quadcore i7, that's another story...

Cheers
Joris

On Fri, May 7, 2010 at 2:32 PM, zach Li zach...@hotmail.com wrote:

thanks Joris,

the reason I am looking for the instructions is that I hope 64 bit hmisc
will run better (faster) than 32 bit on a 64 bit environment.

Regards,
Zach.

--
Date: Fri, 7 May 2010 11:10:36 +0200
Subject: Re: [R] help on hmisc
From: jorism...@gmail.com
To: zach...@hotmail.com
CC: r-help@r-project.org

Puzzling question. You install R, you click on install packages, you
select a mirror, you select hmisc, and done. There is a 64bit version of R,
but a 32bit runs smooth on a Windows 7 64bit as well. If you love the
command line, look at ?install.packages.

I can't see why you would like to compile an R package yourself. So in case
you have a specific problem, a bit more information would come in handy.

Cheers
Joris

On Fri, May 7, 2010 at 3:30 AM, zach Li zach...@hotmail.com wrote:

can anyone know where i can find information on compiling hmisc on
windows, especially 64 bit windows?

thanks,

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

[[alternative HTML version deleted]]







--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] What is the best way to have R output tables in an MS

2010-05-07 Thread Frank E Harrell Jr
... into document preparation software. One would not need the full
functionality of XML.

Up to a point (I'm far from being an XML guru) I'd be prepared to
assist with this, and in particular to test it out with groff.

Any comments? Might there be a better suggestion than XML?

Ted.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.








--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] What is the best way to have R output tables in an MS

2010-05-06 Thread Frank E Harrell Jr
Ted I can't resist offering my $.02, which is that I'm puzzled why 
LaTeX, being free, flexible, and powerful, is used only by millions of 
people and not tens of millions.


Frank


On 05/06/2010 03:07 PM, (Ted Harding) wrote:

Replying to Chris's latest message for the sake of preserving the
thread, but deleting all of it to save space. Except:

I had sympathy with Chris's original query, on the grounds that
it was a good enquiry in principle, essentially pointing towards
the problem of incorporating R's formatted output (be it tables,
graphics, ...) into document-preparation software, whether it
be noddy WYSIWYG like Word, or more sophisticated typesetting
software such as TeX, the Adobe stable and other DTP software,
or even the ancient UNIX troff dinosaur (re-evolved as GNU groff, but
still roaming the plains and consuming tough typesetting for breakfast
just as its ancestor did).

Given what he said in his latest message, I now have even more
sympathy. It's not about begging in the streets for someone to
charitably do the job for him! It's a job that could be a service
to many, and if it attracts enough enthusiasm from enough of those
who know how to do it then they will willingly plunge in. That's
how Free Software works.

The issue is about the "enough enthusiasm" and the "enough of
those who know".

Many (possibly almost all) of the people who have developed R
are mainly working with TeX/LaTex. R clearly has a well-developed
interface to that language. But there are many people (of whom
Chris has raised his head) who have needs or preferences for other
software of whom some (as Chris spelled out) may totally lack
support for R and LaTex, etc., from their organisations.

I've pondered such issues many times myself, being one of the
old nomadic troff-herders and still herding the groffs.
My routine approach is as Chris described: grab the output from
R (be it mouse-copied off-screen, from a saved file, or for
graphics from a file of the coordinates of the graphical objects,
or an EPS file), plant this into a groff document, and wrap it
in formatting tags so that it comes out nicely. A bit time
consuming, but since it's fairly straightforward in a markup
language like g/troff, not so very time-consuming; and I dare
say the same would be true for TeX/LaTeX if Sweave & Co did
not exist. However, I would hate to have to do it for Word and
the like! I bet that *is* time consuming.

All of which is leading on to a suggestion that has been lurking
in my mind for a while.

How about an R device called "xml"? This would implement the XML
"extensible markup language", which is basically capable of
encapsulating any formatted material.

The existing R devices seem to be confined to graphical output.
XML can in principle cope with anything. Naturally, its function
would be to save to file, not display on screen.

I believe that Word (and maybe other MS Office software) can import
XML. I know that XML can be converted to g/troff input (I've done it).
It can no doubt be converted to TeX/LaTeX input. I'm not familiar
enough with other document software to comment.

Then we would have a universal language for formatted R output,
suitable for importing formatted R output into document preparation
software. One would not need the full functionality of XML.

Up to a point (I'm far from being an XML guru) I'd be prepared to
assist with this, and in particular to test it out with groff.

Any comments? Might there be a better suggestion than XML?

Ted.



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] What is the best way to have R output tables in an MS

2010-05-06 Thread Frank E Harrell Jr

On 05/06/2010 07:28 PM, (Ted Harding) wrote:

On 06-May-10 23:44:50, Frank E Harrell Jr wrote:

Ted I can't resist offering my $.02, which is that I'm puzzled why
LaTeX, being free, flexible, and powerful, is used only by millions of
people and not tens of millions.

Frank


I think, Frank, that it's because when you use software like LaTex,
unless you have a resource like Sweave (or Lyx) which does a lot
of the hard work for you, you cannot avoid learning to use the reins
and spurs properly to control the beast.

It is the analogy of people who are happy to be helped onto a
donkey at the seaside, and enjoy a gentle comfortable donkey-ride
along the shore with the donkey in control (in turn led by the
donkey-master), as opposed to those who might try mounting on a
racehorse and seeing how they get along on their own!

And there is always the reason that Certain Other Software (COS)
is adopted as the Industry Standard (© & ®) for editable document
interchange. (PDF has become more prominent, and is now acknowledged
as a production format by COS, but for most people it's not editable
when sent to them by others; and PS has never been popular on that
front -- attempts to open a PS document in Word usually lead to
disaster: I've known it to be done).

That said, I am totally in agreement with you! (with a $0.02 for
my own preferences in that area, to which the same considerations
apply).

Ted.


Thanks Ted.  I wished that more people enjoyed learning for its own 
sake, and ultimately to have more command of how something works.


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] sample size for survival curves

2010-05-06 Thread Frank E Harrell Jr

On 05/06/2010 07:20 PM, Kevin E. Thorpe wrote:

array chip wrote:

Dear R users, I am not asking questions specifically on R, but I know
there are many statistical experts here in the R community, so here go
my questions:

Freedman (1982) proposes an approximation of sample size/power
calculation based on the log-rank test using the formula below (this is
what nQuery does):

N = (Z(1-α/side) + Z(power))^2 * (hazard.ratio + 1)^2
    / ((2 - p1 - p2) * (hazard.ratio - 1)^2)

Where Z is the quantile (inverse cumulative distribution) function of the
standard normal. p1 and p2 are the survival probabilities of the 2 groups
at a given time, say t.

As you can see, the sample size depends on the survival probabilities,
p1 and p2. This is where my question lies. Let’s say we have 2
survival curves. I can choose p1 and p2 at time 1 year, and calculate
a sample size. I can also choose p1 and p2 at time 5 years (still the
same hazard ratio since the same 2 survival curves), and calculate a
different sample size. How to interpret the 2 estimates of sample size?

This problem doesn't occur when we calculate the number of events
required using this formula:

Events = 4 * (Z(α/side) + Z(power))^2 / (log(hazard.ratio))^2


Note that this formula makes an unnecessary approximation that the 
number of events is the same in both groups.


See the Hmisc package cpower, spower, ciapower functions for more info.
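
For example, a minimal sketch (the argument values here are made up purely
for illustration; see ?cpower in Hmisc for the definitions):

require(Hmisc)
cpower(tref = 5, n = 1000, mc = 0.4, r = 25, accrual = 2, tmin = 5,
       noncomp.c = 0, noncomp.i = 0)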

Frank



Because number of events required only depends on hazard ratio.

Thanks for any suggestions.

John


As I recall, the survival probability used in Freedman is not at some
arbitrary time of your choosing, but rather at the average length of
follow-up time anticipated in the study.

Kevin




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Re : Re : aregImpute (Hmisc package) : error in matxv(X, xcof)...

2010-05-05 Thread Frank E Harrell Jr

Marc,

In this particular case the code alone would have been sufficient.

To isolate the problem try the following:

require(Hmisc)
v <- varclus(~ variable1 + variable2 + ... all other variables given to 
aregImpute, data=...)

plot(v)

redun(~variable1 + variable2 + ... all others, data=...)

Frank

On 05/04/2010 02:52 PM, Marc Carpentier wrote:

With txt extensions for the server filter (hope it's not the size: "Message body is 
too big: 182595 bytes with a limit of 150 KB").
Sorry.


- Message d'origine 
De : David Winsemiusdwinsem...@comcast.net
À : Marc Carpentiermarc.carpent...@ymail.com
Cc : Uwe Liggeslig...@statistik.tu-dortmund.de; r-help@r-project.org
Envoyé le : Mar 4 mai 2010, 20 h 28 min 53 s
Objet : Re: [R] Re :  aregImpute (Hmisc package) : error in matxv(X, xcof)...


On May 4, 2010, at 2:13 PM, Marc Carpentier wrote:


Ok. I was afraid to refer to a known and obvious error.
Here is a testing dataset (pb1.csv) and commented code (pb1.R) with the 
problems.
Thanks for any help.


Nothing attached. In all likelihood had you given these file names with 
extensions of .txt, they would have made it through the server filter


Marc


De : Uwe Liggeslig...@statistik.tu-dortmund.de
À : Marc Carpentiermarc.carpent...@ymail.com
Cc : r-help@r-project.org
Envoyé le : Mar 4 mai 2010, 13 h 52 min 31 s
Objet : Re: [R] aregImpute (Hmisc package) : error in matxv(X, xcof)...

Having reproducible examples including data and the actual call that
lead to the error would be really helpful to be able to help.

Uwe Ligges

On 04.05.2010 12:23, Marc Carpentier wrote:

Dear r-help list,
I'm trying to use multiple imputation for my MSc thesis.
Having good examples using the Hmisc package, I tried the aregImpute function. 
But with my own dataset, I have the following error :

Error in matxv(X, xcof) : columns in a (51) must be <= length of b (50)
In addition: Warning message:
In f$xcoef[, 1] * f$xcenter :
   longer object length is not a multiple of shorter object length

I first tried to I() all the continuous variables but the same error occurs 
with different numbers :
Error in matxv(X, xcof) : columns in a (37) must be <= length of b (36)...

I'm a student and I'm not familiar with possible constraints in a dataset to be 
effectively imputed. I just found this previous message, where the author's 
autoreply suggests that particular distributions might be an explanation of 
algorithms failure :
http://www.mail-archive.com/r-help@r-project.org/msg53534.html

Does anyone know if these messages reflect a specific problem in my dataset ? 
And if the number mentioned might give me a hint on which column to look at 
(and maybe transform or ignore for the imputation) ?
Thanks for any advice you might have.

Marc






David Winsemius, MD
West Hartford, CT






__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Latex and Stangle()

2010-05-05 Thread Frank E Harrell Jr

On 05/05/2010 07:21 AM, Silvano wrote:

Hi,

I'm using Sweave and I would like to include the R code in my LaTeX
file.

I extract the R code with Stangle(), whose output file is Relatorio.R, but I
can't include it in the LaTeX file as an appendix.

Suggests?

Thanks,


Silvano,

If you don't want to pretty-print the code and just want it included 
verbatim, try


\usepackage{moreverb}
...

{\small\verbatimtabinput{Relatorio.R}}  % or whatever your Stangle output file is called

Frank



--
Silvano Cesar da Costa
Departamento de Estatística
Universidade Estadual de Londrina
Fone: 3371-4346

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Error messages with psm and not cph in Hmisc

2010-05-05 Thread Frank E Harrell Jr

On 05/05/2010 11:29 AM, David Foreman wrote:

While

sm4.6ll <- fit.mult.impute(Surv(agesi, si) ~ partner + in.love + pubty + FPA +
strat(gender), fitter = cph, xtrans = dated.sexrisk2.i, data =
dated.sexrisk2, x=T, y=T, surv=T, time.inc=16)

runs perfectly using Hmisc, Design and mice under R11 run via Sciviews-K,
with

library(Design)
library(mice)
ds2d-datadist(dated.sexrisk2)
options(datadist=ds2d)
options(contrasts=c(contr.treatment,contr.treatment))

the equivalent

sm4.6ll <- fit.mult.impute(Surv(agesi, si) ~ partner + in.love + pubty + FPA +
strat(gender), fitter = *psm*, xtrans = dated.sexrisk2.i, data =
dated.sexrisk2, x=T, y=T, surv=T, time.inc=16)

returns the error message

Error in dimnames(X) <- list(rnam, c("(Intercept)", atr$colnames)) :
   length of 'dimnames' [2] not equal to array extent


Using survreg{survival} for which psm is a wrapper, also runs perfectly on
the unimputed dataset.

Is this a bug, or am I doing something wrong?  I particularly want to make
use of Design's validation and calibration utilities on multiply imputed
data.

With many thanks for everyone's help

David Foreman


The rms and Design packages do not support the user specifying a 
contrast option.


The validate and calibrate functions do not explicitly support multiple 
imputation.

Think about converting to the rms package.

Frank


Consultant and Visiting Professor in Child and Adolescent Psychiatry

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] aregImpute (Hmisc package) : error in matxv(X, xcof)...

2010-05-04 Thread Frank E Harrell Jr

On 05/04/2010 06:52 AM, Uwe Ligges wrote:

Having reproducible examples including data and the actual call that
lead to the error would be really helpful to be able to help.

Uwe Ligges


In addition to that, this kind of message usually means that you have a 
singularity somewhere, e.g., you are using too many knots for spline 
terms or have a tiny cell in a categorical variable.


Frank



On 04.05.2010 12:23, Marc Carpentier wrote:

Dear r-help list,
I'm trying to use multiple imputation for my MSc thesis.
Having good examples using the Hmisc package, I tried the aregImpute
function. But with my own dataset, I have the following error :

Error in matxv(X, xcof) : columns in a (51) must be <= length of b (50)
In addition: Warning message:
In f$xcoef[, 1] * f$xcenter :
longer object length is not a multiple of shorter object length

I first tried to I() all the continuous variables but the same error
occurs with different numbers :
Error in matxv(X, xcof) : columns in a (37) must be <= length of b
(36)...

I'm a student and I'm not familiar with possible constraints in a
dataset to be effectively imputed. I just found this previous message,
where the author's autoreply suggests that particular distributions
might be an explanation of algorithms failure :
http://www.mail-archive.com/r-help@r-project.org/msg53534.html

Does anyone know if these messages reflect a specific problem in my
dataset ? And if the number mentioned might give me a hint on which
column to look at (and maybe transform or ignore for the imputation) ?
Thanks for any advice you might have.

Marc




[[alternative HTML version deleted]]



--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Need help on having multiple distributions in one graph

2010-05-04 Thread Frank E Harrell Jr

On 05/03/2010 11:14 PM, Jorge Ivan Velez wrote:

Hi Joseph,

How about this?

matplot(cbind(m0, m1, m3, m4), type = 'l', lty = 1)
legend('topright', paste('m', c(0, 1, 3, 4), sep = ''), lty = 1, col = 1:4)

See ?matplot and ?legend for details.

HTH,
Jorge


Also see the labcurve function in the Hmisc package, which will draw 
curves and label them where they are most separated.
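
A minimal sketch (using y and m0, m1, m3, m4 as defined in the question
below):

require(Hmisc)
curves <- list(m0 = list(x = y, y = m0), m1 = list(x = y, y = m1),
               m3 = list(x = y, y = m3), m4 = list(x = y, y = m4))
labcurve(curves, pl = TRUE)   # plots the curves and places the labels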


Frank




On Mon, May 3, 2010 at 6:42 PM,  wrote:


R-listers:

I have searched the help files and everything I have related to R graphics.
I cannot find how to graph y against
several distributions on a single graph. Here is code for creating 4
Poisson distributions with different mean values, although I would prefer
having it in a loop: The top of the y axis for the first distribution, with
count of 0, is .6, which is the highest point for any  of the distributions.

obs <- 1:20
y <- obs - 1
m0 <- (exp(-.5) * .5^y)/factorial(y)
m1 <- (exp(-1) * 1^y)/factorial(y)
m3 <- (exp(-3) * 3^y)/factorial(y)
m4 <- (exp(-5) * 5^y)/factorial(y)

How do I plot the graph of each distribution on y, all on a single graph? I
have spent so many hours on this,
which is really quite simple in applications such as Stata. Thanks very
much for the assistance:

Joseph Hilbe
hi...@asu.edu  or jhi...@aol.com





--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Visualizing binary response data?

2010-05-04 Thread Frank E Harrell Jr

On 05/04/2010 09:12 PM, Thomas Stewart wrote:

For binary w.r.t. continuous, how about a smoothing spline?  As in,

x <- rnorm(100)
y <- rbinom(100, 1, exp(.3*x-.07*x^2)/(1+exp(.3*x-.07*x^2)))
plot(x,y)
lines(smooth.spline(x,y))

OR how about a more parametric approach, logistic regression?  As in,

glm1 <- glm(y ~ x + I(x^2), family=binomial)
plot(x,y)
lines(sort(x),predict(glm1,newdata=data.frame(x=sort(x)),type=response))

FOR binary w.r.t. categorical it depends.  Are the categories ordinal (is
there a natural ordering?) or are the categories nominal (no ordering)?  For
nominal categories, the data is essentially a contingency table, and
strength of the predictor is a test of independence.  You can still do a
graphical exploration: maybe plotting the proportion of Y=1 for each
category of X.   As in,

z <- cut(x, breaks=-3:3)
plot(tapply(y,z,mean))

If your goal is to find strong predictors of Y, you may want to consider
graphical measures that look at the predictors jointly.  Maybe with a
generalized additive model (gam)?

There is probably a lot more you can do.  Be creative.

-tgs


And you have to decide why you would look to a graph to select 
predictors.  This can badly distort later inferences (confidence 
intervals, P-values, biased regression coefficients, biased R^2, etc.).


Frank




On Tue, May 4, 2010 at 9:04 PM, Kim Jung Hwakimhwamaill...@gmail.comwrote:


Hi All,

I'm dealing with binary response data for the first time, and I'm confused
about what kind of graphics I could explore in order to pick relevant
predictors and their relation with response variable.

I have 8-10 continuous predictors and 4-5 categorical predictors. Can
anyone
suggest what kind of graphics I can explore to see how predictors behave
w.r.t. response variable...

Any help would be greatly appreciated, thanks,
Kim


--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] What is the best way to have R output tables in an MS Word format?

2010-05-03 Thread Frank E Harrell Jr

On 05/03/2010 08:47 AM, Chris Evans wrote:

Thanks Tal & Thomas, I am now experimenting with both SWord and R2wd and
both are certainly a huge step forward for me, tied as I am to Word and
the Windoze/M$ world for now.

Chris


Note that many of the general solutions offered produce documents (.doc, 
.html, .rtf) that can be then converted to formats Word can take as 
described at http://biostat.mc.vanderbilt.edu/SweaveConvert.  That page 
has an example where pdflatex is used to produce a complex table 
containing graphics in some of the cells, and I used a web service to 
convert to .doc.


Frank





Tal Galili sent the following  at 01/05/2010 09:44:


Hi all,
I forwarded this question to the r-com mailing list, and received the
following reply from Thomas Baier :



Hi Tal,

two solutions immediately come to my mind: SWord
(http://rcom.univie.ac.at) and R2wd (from CRAN).

If creating a paper in Word, then SWord may be the better choice, if you
want to create reports controlled from R, R2wd might be the better one.

Best,
Thomas


They both look potentially very useful and can do wonderful embedding of
tabulated data frames and graphics to judge form the help page for R2wd
and that works on my set up.  However, I can crash R2wd and hang R by
passing it lm output with:

lm.D9 <- lm(weight ~ group)  # from the lm help page
wdBody(lm.D9)
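
A possible workaround, as an untested sketch (the assumption being that
wdBody wants character text rather than an lm object):

require(R2wd)
wdGet()                                   # connect to a running Word instance
txt <- capture.output(summary(lm.D9))     # printed model output as text
wdBody(paste(txt, collapse = "\n"))       # send it to the document body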

I'll try to link up with whoever I should (Thomas, Christian?) to debug
this (and, of course, it may be particular to my set up) but I still
argue there's a problem letting these output capabilities go to packages
and not putting them in the core:
a) it's easy for us not to know of them, I didn't know of R2wd nor ascii
for example,
b) surely to have provided really excellent graphic output in the core
is a bit incongruent with having even provided tabs for matrices and tables?

I'll pick up more in response to Max Kuhn's message.

Very best,

Chris







--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] What is the best way to have R output tables in an MS Word format?

2010-04-30 Thread Frank E Harrell Jr

On 04/30/2010 05:13 PM, Max Gunther wrote:

Dear R list,

Our statisticians usually give us results back in a PDF format. I would like
to be able to copy and paste tables from R output directly into a Microsoft
Word table since this will save us tons of time, be more accurate to
minimize human copying errors and help us update data in our papers more
easily.

Do people have suggestions for the best way to do this?

I am a novice to R but I do work with a couple of
very knowledgeable statisticians who do most of the heavy statistical
lifting for our research group.

Many thanks,
Max


Max Gunther, PhD

Vanderbilt University - Radiology
Institute of Imaging Sciences - VUIIS
Center for Health Services Research
Nashville, TN www.ICUdelirium.org

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



See http://biostat.mc.vanderbilt.edu/SweaveConvert

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] What is the best way to have R output tables in an MS Word format?

2010-04-30 Thread Frank E Harrell Jr

On 04/30/2010 05:45 PM, Marc Schwartz wrote:

On Apr 30, 2010, at 5:13 PM, Max Gunther wrote:


Dear R list,

Our statisticians usually give us results back in a PDF format. I would like
to be able to copy and paste tables from R output directly into a Microsoft
Word table since this will save us tons of time, be more accurate to
minimize human copying errors and help us update data in our papers more
easily.

Do people have suggestions for the best way to do this?

I am a novice to R but I do work with a couple of
very knowledgeable statisticians who do most of the heavy statistical
lifting for our research group.

Many thanks,
Max


Max,

I would urge you to consider using Sweave. It enables the use of R and LaTeX to 
facilitate reproducible research, which seems to be your goal here.

You might want to talk to your campus neighbor Frank Harrell, who has extensive 
information on the Vanderbilt Biostatistics department web site on this:



And note Max that we have an R clinic every Thursday at 2pm.  LaTeX and 
Sweave are frequently discussed during the clinic.


Frank


   http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/StatReport

Frank also has some pointers for converting between various formats:

   http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/SweaveConvert

HTH,

Marc Schwartz

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Robust ANOVA functions in R?

2010-04-28 Thread Frank E Harrell Jr

Jim Lemon wrote:

On 04/28/2010 03:02 PM, jinyan fan wrote:

Hello,

Anyone familiar with robust ANOVA in R? Please help.

I am conducting a between-between-within (2 X 2 X 2) ANOVA, with the 
focus on the significance of the three-way interaction. Because of 
very uneven sample sizes, and the violation of covariance homogeneity, 
I need to use some sort of robust ANOVA method, and I prefer 
bootstrap. I am aware of Rand Wilcox's robust package that can be run 
in R, but although it has the bootstrap functions for between-within 
ANOVA, it does not contain the bootstrap functions for 
between-between-within ANOVA. Thanks a lot.



Hi Fan,
Have a look at the ICSNP package.

Jim


Also consider the proportional odds model, a generalization of the 
Wilcoxon-Kruskal-Wallis test.
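
A minimal sketch of such a fit (hypothetical data frame d with an ordered
response y and the three factors a, b, c):

require(rms)
f <- lrm(y ~ a * b * c, data = d)   # proportional odds model
anova(f)                            # Wald tests, incl. the three-way interaction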


Frank



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Practical work with logistic regression

2010-04-22 Thread Frank E Harrell Jr

Claus O'Rourke wrote:

Dear all,

I have a couple of short noob questions for whoever can take them. I'm
from a very non-stats background so sorry for offending anybody with
stupid questions ! :-)

I have been using logistic regression care of glm to analyse a binary
dependent variable against a couple of independent variables. All has
gone well so far. In my work I have to compare the accuracy of
analysis to a C4.5 machine learning approach. With the machine
learning, a straight-forward measure of the quality of the classifier
is simply the percentage of correctly classified instances. I can
calculate this for the resultant model by comparing predictions to
original values 'manually'. My question: is this not automatically -
or easily - calculated in the produced model or the summary of that
model?


The percent classified correctly is an improper scoring rule that will 
lead to a selection of a bogus model.  You can easily find examples 
where adding a very important variable to a binary logistic model 
results in a decrease in the percent correct.
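
A minimal sketch of a proper alternative (hypothetical glm fit 'fit' and
binary response y):

p <- fitted(fit)        # predicted probabilities
mean((p - y)^2)         # Brier score; smaller is better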


Frank



I want to use my model in real time to produce results for new inputs.
Basically this model is to be used as a classifier for a robot in real
time. Can anyone suggest the best way that a produced model can be
used directly in external code once the model has been developed in R?
If my external code is in Java, then using jri is one option. A more
efficient method would be to take the intercept and coefficients and
actually code up the function in the appropriate programming language.
Has anyone ever tried doing this?

Apologies again for the stupid questions, but the sooner I get some of
these things straight, the better.

Claus

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Results from clogit out of range?

2010-04-21 Thread Frank E Harrell Jr

Thomas Lumley wrote:

On Tue, 20 Apr 2010, Noah Silverman wrote:


I just read the help page for predict.coxph.

It indicates that the risk score is just exp(lp)

What I'm trying to find, and have seen with some other implementations
is the conditional probability within group.  Neither the lp or the
risk options seem to deliver this.


exp(lp)/(1+exp(lp))

or, as Martin Maechler would prefer,  qlogis(lp)


Minor correction: plogis
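
For example (a sketch, using the clogit fit m from below):

p <- plogis(predict(m, newdata, type = "lp"))   # probability scale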

Frank



   -thomas



What am I missing?

-N


On 4/20/10 4:22 PM, Noah Silverman wrote:

Thanks David,

That explains a lot.  I appreciate it.

--
Noah


On 4/20/10 3:48 PM, David Winsemius wrote:


On Apr 20, 2010, at 5:59 PM, Noah Silverman wrote:



Hi,

I'm calculating a conditional logit on some data stratified by group.
My understanding was that a conditional logit by definition returns a
value between 0 and 1 as a probability.  Can anyone suggest why I'm
seeing results outside of the {0,1} range??


Probably because you did not read the help page for the function
predict.coxph. Pay special attention to the type argument.

type=c("lp", "risk", "expected", "terms")



The call in R is:

m <- clogit(score ~ val_1 + val_2 + strata(group), data=data)

Then

prediction <- predict(m, newdata)


I do not see that you have defined any newdata.




A sample of the data with resulting predictive values is:
group  score   val_1   val_2prediction
1  2009-01-04_1 1 0.5913962 -1.121589  1.62455210
2  2009-01-04_1 1 0.6175472 -3.249820 -0.20093346
3  2009-01-04_1 1 0.5439640 -2.424501  0.46651849
4  2009-01-04_1 0 0.3745209 -2.477424  0.31263855
5  2009-01-04_1 0 0.6329855 -3.424174 -0.34200448
6  2009-01-04_1 0 0.4571999 -2.770247  0.11190788
7  2009-01-04_1 0 0.3822623 -2.259422  0.50627534
8  2009-01-04_1 0 0.2605742 -4.424806 -1.44566070
9  2009-01-04_1 0 0.4449604 -2.357060  0.46174993
10 2009-01-04_1 0 0.6595178 -2.246518  0.69427842
11 2009-01-04_1 0 0.5260032 -2.977887 -0.02393871


By default (which is what you have implicitly chosen) you are
requesting lp = linear predictors rather than risk.



Thomas LumleyAssoc. Professor, Biostatistics
tlum...@u.washington.eduUniversity of Washington, Seattle



--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Serverless databases in R

2010-04-19 Thread Frank E Harrell Jr

Barry Rowlingson wrote:

On Sun, Apr 18, 2010 at 11:30 PM, kMan kchambe...@gmail.com wrote:

It was my understanding that .Rdata files were not very portable, and do not
natively handle queries. Otherwise we'd all just use .RData files instead of
farming the work out to SQL drivers  external libraries, and colleagues who
use, e.g. SAS or SPSS would also have no trouble with them.


 The platform in cross-platform to me generally means the
operating system on which a program is running - and .Rdata files are
perfectly portable between R on Linux, MacOSX, Windows, Solaris etc
versions. You didn't mention portability to other statistical
packages. You also didn't mention needing SQL, or what you wanted to
do with your databases. I figured I'd just mention .Rdata files for
completeness!

 There's also RJDBC and RODBC which can interface to anything with a
JDBC or ODBC interface on your system.

 A .RData file could be considered as a serverless NoSQL database.
There's a GSOC proposal to investigate interfaces to NoSQL databases
and some info here:

http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:nosql_interface

 Isn't it odd that the open-source R community has developed functions
for reading in proprietary SAS and SPSS format files, but (AFAIK) the
commercial sector doesn't seem to support reading data from
open-sourced and open-specced R .Rdata files?

Barry


Hi Barry,

Stat Transfer can read and write R binary data frames (.rda files).

Frank
--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Help: coxph() in {survival} package

2010-04-19 Thread Frank E Harrell Jr

Xin Ge wrote:

Hi All,

I'm running coxph() on 90 different datasets (in a loop).

1. I'm wondering how I can get the log-likelihood value from coxph() output.
Currently I can only see the following:

Likelihood ratio test =

Wald test =

Score (logrank) test =

2. Once I have the likelihood value, I would like to extract it from the R
output (for each data set).
Can any one help please, thanks in advance,
Xin
P.S. OR, may be if anyone can suggest me some other function which can help
me with this.


If you are trying to pick a winner it would be advisable to rank the 
multiple models and to get bootstrap confidence intervals for the ranks. 
   I provided some code for this a couple of weeks ago.
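
For question 1, a minimal sketch (hypothetical data frame d):

fit <- coxph(Surv(time, status) ~ x, data = d)
fit$loglik    # c(initial, final) log partial likelihood
logLik(fit)   # final value, easy to collect inside a loop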


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] problem with FUN in Hmisc::summarize

2010-04-16 Thread Frank E Harrell Jr

arnaud chozo wrote:

Hi all,

I'd like to use the Hmisc::summarize function, but it uses a function (FUN)
of a single vector argument to create the statistical summaries.

Consider an easy case: I'd like to compute the correlation between two
variables in my dataframe, grouped according to other variables in the same
dataframe.

For example, consider the following dataframe D:
V1  V2  V3
 A   1  -1
 A   1   1
 A  -1  -1
 B   1   1
 B   1   1

I'd like to use Hmisc::summarize(X=D, by=llist(myvar=D$V1), FUN=corr.V2.V3)

where corr.V2.V3 is defined as follows:

corr.V2.V3 = function(x) {
  d = cbind(x$V2, x$V3)

  out = c(cor(d))
  names(out) = c(CORR)
  return(out)
}

I was not able to use Hmisc::summarize in this case because FUN should be a
function of a matrix argument. Any idea?

Thanks in advance,
Arnaud


See the Hmisc mApply or summary.formula functions, or use tapply using a 
vector of possible subscripts (1:n) as the first argument; then you can 
use the subscripts selected to address multiple variables.
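
A minimal sketch of the tapply idea, with D as in the question (note that
the constant V2 within group B will give an NA correlation):

with(D, tapply(seq_len(nrow(D)), V1,
               function(i) cor(V2[i], V3[i])))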


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Regression w/ interactions

2010-04-15 Thread Frank E Harrell Jr

Michael Dykes wrote:

I have a project due in my Linear Regression class re: regression on a data
set & my professor gave us a hint that there were *exactly* 2 sig
interactions. The data set is attached. We have to find which predictors are
significant, & which 2 interactions are sig. Also, I need some guidance for
this & selecting the best model. I tried the `full' model, that being:
z=lm(y~x1+x2+x3+x4+x1*x2+x2*x3+...+x3*x4). I then ran an anova(z) &
summary(z). My R^2 & R^2_a were *really* low. I am not sure how to do PRESS,
AIC & Cp in R yet though. Any help would be appreciated.




Michael this is not really the place for help on homework other than 
perhaps on technical roadblocks.  Note that the strategy you are being 
told to follow is one whose statistical properties have been severely 
criticized in the statistical literature.  Only with a very high signal 
to noise ratio (e.g., high true R^2) can torturing data lead to a 
confession to something other than what the analyst wants to hear.  I 
suppose that in simulated data there is a true model out there waiting 
to be found, but beware of using this approach with real data with low 
signal to noise ratios.


Frank


--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Regression w/ interactions

2010-04-15 Thread Frank E Harrell Jr

Michael Dykes wrote:
So, am I wrong to /assume/ that the reason my professor is asking us to
find a high R^2 & adjusted R^2, low Cp (near p+1, if I remember
correctly), low PRESS, & AIC is b/c the data is randomly generated (b/c
he has stated that all of the data for *all* of these hw assignments are
randomly generated)?  And I am not /exactly/ sure to what you are
referring when you say: 'low signal to noise ratios'. Do you mean /low/
R^2 to epsilon_i's? or /low/ predictors to epsilon_i's? Please excuse
my ignorance in these matters, but I am not only asking these questions
for hw's sake but for my future, as I hope to study for the actuarial
exams & take the Probability Test sometime either next Spring or Summer
[after taking this professor's Calculus-based Prob & Stat sequence in the
coming Fall & Spring Semesters].


Thanks again for your help, Professor.


Let me just add that a valid test of whether any of the variables or 
interactions is associated with Y is to formulate a model with all the 
parameters in it and to use the global F test.
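A one-line sketch of that global test (hypothetical data frame d containing y and x1..x4; all main effects plus all two-way interactions):

z <- lm(y ~ (x1 + x2 + x3 + x4)^2, data = d)
summary(z)   # the F-statistic on the last line is the global test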


Stepwise techniques such as you are being asked to use are not 
scientific.  If the true R^2 (which you do not know) is not high, the 
low signal:noise ratio makes the data incapable of telling you the 
right variables to include with any reliability.  Unfortunately, most 
teachers of statistics do not understand this point, so you might be 
graded off for providing the right answer.


Frank



On Thu, Apr 15, 2010 at 8:26 AM, Frank E Harrell Jr 
f.harr...@vanderbilt.edu wrote:


Michael Dykes wrote:

I have a project due in my Linear Regression class re:
regression on a data
set & my professor gave us a hint that there were *exactly* 2 sig
interactions. The data set is attached. We have to find which
predictors are
significant, & which 2 interactions are sig. Also, I need some
guidance for
this & selecting the best model. I tried the `full' model, that
being:
z=lm(y~x1+x2+x3+x4+x1*x2+x2*x3...+x3*x4). I then ran an anova(z), &
summary(z). My R^2 & R^2_a were *really* low. I am not sure how
to do PRESS,
AIC & Cp in R yet though. Any help would be appreciated.



Michael this is not really the place for help on homework other than
perhaps on technical roadblocks.  Note that the strategy you are
being told to follow is one whose statistical properties have been
severely criticized in the statistical literature.  Only with a very
high signal to noise ratio (e.g., high true R^2) can torturing data
lead to a confession to something other than what the analyst wants
to hear.  I suppose that in simulated data there is a true model
out there waiting to be found, but beware of using this approach
with real data with low signal to noise ratios.

Frank


-- 
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine

Department of Biostatistics   Vanderbilt University





__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Brier's score for bootstrap sample (coxph)

2010-04-10 Thread Frank E Harrell Jr

paaventhan jeyaganth wrote:

Dear all,

How can I get Brier's score for the bootstrap sample for survival analysis?

This is the code I am using for the validation:

 

f1 <- cph(Surv(time, dead) ~ strata(x1)+strata(x2)+strata(x3),
x=TRUE, y=TRUE, surv=TRUE, time.inc=12, data=new)


 


validate(f1,B=200,u=12,dxy=T)

 

 


Thanks

Paaveen



You did not follow the posting guide.

You mixed two packages; the functions you used did not match the subject 
line of your e-mail.


You are attempting to validate a model containing no covariates, which 
is not impossible but will be noisy.


You did not point to a function or package that claims to be able to 
compute a Brier score with censored data.


Frank


--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to get the penalty matrix for natural cubic spline?

2010-04-09 Thread Frank E Harrell Jr

Yan Li wrote:

Hi, all

I am trying to get the basis matrix and penalty matrix for natural
cubic splines. In the splines package of R, ns can
generate the B-spline basis matrix for a natural cubic spline. How can
I get the basis matrix and penalty matrix for a natural cubic
spline?

Thanks a lot!

Lee


In usual practice the natural cubic spline does not use penalization.
Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] C-index and Cox model

2010-04-08 Thread Frank E Harrell Jr

Bessy wrote:

Dear all R users,

I am building a Cox PH model on a small dataset. I am wondering how to
measure the predictive power of my cox model? Normally the ROC curve or Gini
value are used in logistic regression model. Is there any similar
measurement suitable for Cox model?

Also if I use C-index statistic to measure the predictive power, is it a
time-dependent value (i.e. do I need to calculate it for each time period?)
or we can calculate it as a single value for the whole model ignoring the
time?

Thank you so much.

Bessy


install.packages('Hmisc')
?rcorr.cens
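A minimal sketch of rcorr.cens applied to a Cox model's linear predictor (simulated stand-in data; note the sign flip, since rcorr.cens expects higher values to predict longer survival):

library(Hmisc); library(survival)
set.seed(1)
time  <- rexp(100)
event <- rbinom(100, 1, 0.7)
lp    <- rnorm(100)                 # stand-in for predict(coxfit, type='lp')
rcorr.cens(-lp, Surv(time, event))  # 'C Index' is one value for the whole model;
                                    # an approximate CI is C +- 1.96*S.D./2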

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Overfitting/Calibration plots (Statistics question)

2010-04-08 Thread Frank E Harrell Jr

Mark Seeto wrote:

This isn't a question about R, but I'm hoping someone will be willing
to help. I've been looking at calibration plots in multiple regression
(plotting observed response Y on the vertical axis versus predicted
response [Y hat] on the horizontal axis).

According to Frank Harrell's Regression Modeling Strategies book
(pp. 61-63), when making such a plot on new data (having obtained a
model from other data) we should expect the points to be around a line
with slope < 1, indicating overfitting. As he writes, "Typically, low
predictions will be too low and high predictions too high."

However, when I make these plots, both with real data and with simple
simulated data, I get the opposite: the points are scattered around a
line with slope 1. Low predictions are too high and high predictions
are too low.

For example, I generated 200 cases, fitted models on the first half of
the data, and made calibration plots for those models on the second
half of the data:


x1 <- rnorm(200, 0, 1)
x2 <- rnorm(200, 0, 1)
x3 <- rnorm(200, 0, 1)
x4 <- rnorm(200, 0, 1)
x5 <- rnorm(200, 0, 1)
x6 <- rnorm(200, 0, 1)
y <- x1 + x2 + rnorm(200, 0, 2)
d <- data.frame(y, x1, x2, x3, x4, x5, x6)

lm1 <- lm(y ~ ., data = d[1:100,])
lm2 <- lm(y ~ x1 + x2, data = d[1:100,])

plot(predict(lm1, d[101:200, ]), d$y[101:200]); abline(0,1)
x11(); plot(predict(lm2, d[101:200, ]), d$y[101:200]); abline(0,1)


The plots for both lm1 and lm2 show the points scattered around a line
with slope > 1, contrary to what Frank Harrell says should happen.

I am either misunderstanding something or making a mistake in the code
(I'm almost 100% certain I'm not mixing up the axes). I would be most
appreciative if anyone could explain where I'm going wrong.

Thanks for any help you can provide.

Mark Seeto


Mark,

Try

set.seed(1)
slope1 <- slope2 <- numeric(100)

for(i in 1:100) {
x1 <- rnorm(200, 0, 1)
x2 <- rnorm(200, 0, 1)
x3 <- rnorm(200, 0, 1)
x4 <- rnorm(200, 0, 1)
x5 <- rnorm(200, 0, 1)
x6 <- rnorm(200, 0, 1)
y <- x1 + x2 + rnorm(200, 0, 2)
d <- data.frame(y, x1, x2, x3, x4, x5, x6)

lm1 <- lm(y ~ ., data = d[1:100,])
lm2 <- lm(y ~ x1 + x2, data = d[1:100,])

slope1[i] <- coef(lsfit(predict(lm1, d[101:200, ]), d$y[101:200]))[2]
slope2[i] <- coef(lsfit(predict(lm2, d[101:200, ]), d$y[101:200]))[2]
}

mean(slope1)
mean(slope2)

I get

 [1] 0.8873878
 [1] 0.9603158

Frank





__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] SAS and R on multiple operating systems

2010-04-06 Thread Frank E Harrell Jr

Roger DeAngelis(xlr82sas) wrote:

Hi,

  About the forest plot.

   Some Pharma companies demand that the report and graphics follow very
restrictive layouts.


Thank goodness that the FDA does not require that.
Frank



   SAS allows users to use one template for graphs and tables.

   Margins have to be the same for all reports.
   Fonts, font sizes, line widths, boxing body, cell spacing, cell padding ...

   The output has to look perfect in MS-Word.

I am not an R graphics expert but I tried to run the forest plot and
output the graphs to png and svg.
In both cases the plots were mangled. This was probably not fair, because it
looked like the SVG issue was just a font size issue and the png issue was
a width issue, but I know SAS tends to produce graphs that can be scaled
in post-processing.

I also imported the png into Word and it was right-clipped.

devSVG(file = "c:\\temp\\Rplots.svg", width = 10, height = 8,
       bg = "white", fg = "black", onefile=TRUE, xmlHeader=TRUE)
example(forest)
dev.off()


png("c:\\temp\\myplot.png", bg="transparent", width=700, height=500)
example(forest)
dev.off()


 Also I have seen 5,000 page listings in SAS.






--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] glmpath in R

2010-04-06 Thread Frank E Harrell Jr
 mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] fitness of regression tree: how to measure???

2010-04-01 Thread Frank E Harrell Jr

vibha patel wrote:

Hello,

I'm using rpart function for creating regression trees.
now how to measure the fitness of regression tree???

thanks n Regards,
Vibha


If the sample size is less than 20,000, assume that the tree is a 
somewhat arbitrary representation of the relationships in the data and 
that the form of the tree will not replicate in future datasets.


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] sample size 20K? Was: fitness of regression tree: how to measure???

2010-04-01 Thread Frank E Harrell Jr
Good comments Bert.  Just 2 points to add: People rely a lot on the tree 
structure found by recursive partitioning, so the structure needs to be 
stable.  This requires a huge sample size.  Second, recursive 
partitioning is not competitive with other methods in terms of 
predictive discrimination unless the sample size is so large that the 
tree doesn't need to be pruned upon cross-validation.


Frank


Bert Gunter wrote:

Since Frank has made this somewhat cryptic remark (sample size > 20K)
several times now, perhaps I can add a few words of (what I hope is) further
clarification.

Despite any claims to the contrary, **all** statistical (i.e. empirical)
modeling procedures are just data interpolators: that is, all that they can
claim to do is produce reasonable predictions of what may be expected within
the extent of the data. The quality of the model is judged by the goodness
of fit/prediction over this extent. Ergo the standard textbook caveats about
the dangers of extrapolation when using fitted models for prediction. Note,
btw, the contrast to mechanistic models, which typically **are** assessed
by how well they **extrapolate** beyond current data. For example, Newton's
apple to the planets. They are often validated by their ability to work
in circumstances (or scales) much different than those from which they were
derived.

So statistical models are just fancy prediction engines. In particular,
there is no guarantee that they provide any meaningful assessment of
variable importance: how predictors causally relate to the response.
Obviously, empirical modeling can often be useful for this purpose,
especially in well-designed studies and experiments, but there's no
guarantee: it's an accidental byproduct of effective prediction.

This is particularly true for happenstance (un-designed) data and
non-parametric models like regression/classification trees. Typically, there
are many alternative models (trees) that give essentially the same quality
of prediction. You can see this empirically by removing a modest random
subset of the data and re-fitting. You should not be surprised to see the
fitted model -- the tree topology -- change quite radically. HOWEVER, the
predictions of the models within the extent of the data will be quite
similar to the original results. Frank's point is that unless the data set
is quite large and the predictive relationships quite strong -- which
usually implies parsimony -- this is exactly what one should expect. Thus it
is critical not to over-interpret the particular model one get, i.e. to
infer causality from the model (tree)structure.
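A minimal sketch of that refitting experiment (assumes the rpart package; simulated data with pure-noise predictors except X1):

library(rpart)
set.seed(1)
d <- data.frame(matrix(rnorm(500*6), ncol=6))
d$y <- d$X1 + rnorm(500)
fit1 <- rpart(y ~ ., data = d)
fit2 <- rpart(y ~ ., data = d[sample(500, 400), ])  # refit without a random 20%
fit1; fit2   # tree topologies often differ; predictions agree more closely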

Incidentally, there is nothing new or radical in this; indeed, John Tukey,
Leo Breiman, George Box, and others wrote eloquently about this decades ago.
And Breiman's random forest modeling procedure explicitly abandoned efforts
to build simply interpretable models (from which one might infer causality)
in favor of building better interpolators, although assessment of variable
importance does try to recover some of that interpretability (however, no
guarantees are given).

HTH. And contrary views welcome, as always.

Cheers to all,

Bert Gunter
Genentech Nonclinical Biostatistics
 
 
-Original Message-

From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Frank E Harrell Jr
Sent: Thursday, April 01, 2010 5:02 AM
To: vibha patel
Cc: r-help@r-project.org
Subject: Re: [R] fitness of regression tree: how to measure???

vibha patel wrote:

Hello,

I'm using rpart function for creating regression trees.
now how to measure the fitness of regression tree???

thanks n Regards,
Vibha


If the sample size is less than 20,000, assume that the tree is a 
somewhat arbitrary representation of the relationships in the data and 
that the form of the tree will not replicate in future datasets.


Frank




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problem comparing hazard ratios

2010-03-31 Thread Frank E Harrell Jr

Ravi Varadhan wrote:

Frank,

Is there an article that discusses this idea of bootstrapping the ranks of
the likelihood ratio chi-square statistics to assess relative importance
of predictors in time-to-event data (specifically Cox PH model)?


Thanks,
Ravi.


Do require(rms); ?anova.rms and see related articles:

@Article{hal09usi,
  author =   {Hall, Peter and Miller, Hugh},
  title = 		 {Using the bootstrap to quantify the authority of an 
empirical ranking},

  journal =  {Annals of Stat},
  year = 2009,
  volume =   37,
  number =   {6B},
  pages ={3929-3959},
  annote = 	 {confidence interval for ranks;genomics;high 
dimension;independent component bootstrap;$m$-out-of-$n$ 
bootstrap;ordering;overlap interval;prediction interval;synchronous 
bootstrap;ordinary bootstrap may not provide accurate confidence 
intervals for ranks;may need a different bootstrap if the number of 
parameters being ranked increases with $n$ or is large;estimating $m$ is 
difficult;in their first example, where $m=0.355n$, the ordinary 
bootstrap provided a lower bound to the lengths of more accurate 
confidence intervals of ranks}

}

@Article{xie09con,
  author =   {Xie, Minge and Singh, Kesar and Zhang, {Cun-Hui}},
  title = 		 {Confidence intervals for population ranks in the presence 
of ties and near ties},

  journal =  {JASA},
  year = 2009,
  volume =   104,
  number =   486,
  pages ={775-787},
  annote = 	 {bootstrap ranks;ranking;nonstandard bootstrap 
inference;rank inference;slow convergence rate;smooth ranks in the 
presence of near ties;rank inference for fixed effects risk adjustment 
models}

}



-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Frank E Harrell Jr
Sent: Tuesday, March 30, 2010 3:57 PM
To: Michal Figurski
Cc: r-help@r-project.org
Subject: Re: [R] Problem comparing hazard ratios

Michal Figurski wrote:

Dear R-Helpers,

I am a novice in survival analysis. I have the following code:
for (i in 3:12) print(coxph(Surv(time, status)~a[,i], data=a))

I used it to fit the Cox Proportional Hazard models separately for every 
available parameter (columns 3:12) in my data set - with intention to 
compare the Hazard Ratios.


However, some of my variables are in range 0.1 to 1.6, others in range 
5000 to 9000. How do I compare HRs between such variables?


I have rescaled all the variables to be in 0 to 1 range - is this the 
proper way to go? Is there a way to somehow calculate the same HRs (as 
for rescaled parameters) from the HRs for original parameters?


Many thanks in advance.



There are a lot of issues related to this that will require a good bit 
of study, both in survival analysis and in regression.  I would start 
with bootstrapping the ranks of the likelihood ratio chi-square 
statistics of the competing biomarkers.


Frank




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problem comparing hazard ratios

2010-03-31 Thread Frank E Harrell Jr

P.S.  Here is some code that is more directly relevant.

Frank

cphadjchisq <- function(adjust, candidates, S)
  {
    require(rms)

    f <- coxphFit(adjust, S, type='right', method='efron')
    if(f$fail) stop('could not fit a model with just adjust')
    ll2 <- -2*f$loglik[2]
    chisq.adjust <- 2*diff(f$loglik)

    p <- ncol(candidates)
    chisq <- numeric(p)
    names(chisq) <- colnames(candidates)

    for(i in 1:p)
      {
        f <- coxphFit(cbind(adjust, candidates[,i]), S,
                      type='right', method='efron')
        chisq[i] <- if(f$fail) NA else ll2 + 2*f$loglik[2]
      }
    list(chisq=chisq, P=1 - pchisq(chisq, 1), chisq.adjust=chisq.adjust,
         padjust=ncol(adjust), pcandidates=ncol(candidates))
  }

bootrankcph <- function(adjust, candidates, S, B=500, conf.int=0.95,
                        pr=TRUE)
  {
    adjust <- as.matrix(adjust)
    candidates <- as.matrix(candidates)
    i <- is.na(rowSums(adjust) + rowSums(candidates) + rowSums(S))
    if(any(i))
      {
        i <- !i
        adjust <- adjust[i,,drop=FALSE]
        candidates <- candidates[i,,drop=FALSE]
        S <- S[i,,drop=FALSE]
      }
    n <- nrow(S)

    orig <- cphadjchisq(adjust, candidates, S)
    orig.rank <- rank(-orig$chisq)
    p <- ncol(candidates)
    brank <- matrix(NA, ncol=p, nrow=B)

    for(i in 1:B)
      {
        if(pr) cat(i, '\r')
        s <- sample(1:n, n, replace=TRUE)
        w <- cphadjchisq(adjust[s,,drop=FALSE], candidates[s,,drop=FALSE],
                         S[s,,drop=FALSE])
        brank[i,] <- rank(-w$chisq)
      }
    if(pr) cat('\n')

    lower <- apply(brank, 2, function(w) quantile(w, (1-conf.int)/2,
                                                  na.rm=TRUE))
    upper <- apply(brank, 2, function(w) quantile(w, (1+conf.int)/2,
                                                  na.rm=TRUE))

    structure(list(stats=orig,
                   ranks=cbind(rank=orig.rank, lower=lower, upper=upper)),
              class='bootrankcph')
  }

print.bootrankcph <- function(x, ...)
  {
    s <- x$stats
    cat('\nOriginal Model L.R. Statistic for Adjustment Variables\n\n',
        'Chi-square=', round(s$chisq.adjust,2), 'with', s$padjust,
        'd.f.\n\nOriginal Partial L.R. Statistics for Candidates (1 d.f.)\n\n')

    i <- order(s$chisq)
    print(cbind('chi-square'=round(s$chisq[i],2),
                P=format.pval(s$P[i], digits=3, eps=.001)),
          quote=FALSE)
    invisible()
  }

plot.bootrankcph <- function(x, ...)
  {
    x <- x$ranks
    cand <- if(length(rownames(x))) as.factor(rownames(x)) else
             as.factor(1:nrow(x))
    cand <- reorder.factor(cand, x[,1])
    Dotplot(cand ~ Cbind(x), pch=16, xlab='Rank', ylab='Candidate')
  }


require(rms)
n <- 100
age <- rnorm(n, 50, 10)
sex <- sample(c('female','male'), n, TRUE)

adjust <- model.matrix(~ rcs(age,3)*sex)
candidates <- matrix(rnorm(n*50), ncol=50,
                     dimnames=list(NULL, c(letters,LETTERS)[1:50]))
S <- Surv(runif(n))
w <- bootrankcph(adjust, candidates, S, B=100)
w$stats
plot(w)


Frank E Harrell Jr wrote:

Ravi Varadhan wrote:

Frank,

Is there an article that discusses this idea of bootstrapping the
ranks of the likelihood ratio chi-square statistics to assess relative
importance of predictors in time-to-event data (specifically Cox PH model)?

Thanks,
Ravi.


Do require(rms); ?anova.rms and see related articles:

@Article{hal09usi,
  author =  {Hall, Peter and Miller, Hugh},
  title =  {Using the bootstrap to quantify the authority of an 
empirical ranking},

  journal =  {Annals of Stat},
  year =  2009,
  volume =  37,
  number =  {6B},
  pages =  {3929-3959},
  annote =  {confidence interval for ranks;genomics;high 
dimension;independent component bootstrap;$m$-out-of-$n$ 
bootstrap;ordering;overlap interval;prediction interval;synchronous 
bootstrap;ordinary bootstrap may not provide accurate confidence 
intervals for ranks;may need a different bootstrap if the number of 
parameters being ranked increases with $n$ or is large;estimating $m$ is 
difficult;in their first example, where $m=0.355n$, the ordinary 
bootstrap provided a lower bound to the lengths of more accurate 
confidence intervals of ranks}

}

@Article{xie09con,
  author =  {Xie, Minge and Singh, Kesar and Zhang, {Cun-Hui}},
  title =  {Confidence intervals for population ranks in the 
presence of ties and near ties},

  journal =  {JASA},
  year =  2009,
  volume =  104,
  number =  486,
  pages =  {775-787},
  annote =  {bootstrap ranks;ranking;nonstandard bootstrap 
inference;rank inference;slow convergence rate;smooth ranks in the 
presence of near ties;rank inference for fixed effects risk adjustment 
models}

}



-Original Message-
From: r-help-boun...@r-project.org 
[mailto:r-help-boun...@r-project.org] On

Behalf Of Frank E Harrell Jr
Sent: Tuesday, March 30, 2010 3:57 PM
To: Michal Figurski
Cc: r-help@r-project.org
Subject: Re: [R] Problem comparing hazard ratios

Michal Figurski wrote:

Dear R-Helpers,

I am a novice

Re: [R] Paik, et al., NEJM, 2004, Fig. 4, rate of event at 10 years as a function of covariate

2010-03-30 Thread Frank E Harrell Jr

Wittner, Ben, Ph.D. wrote:

Does anyone know how to make a plot like Fig. 4 of Paik, et al., New England
Journal of Medicine, Dec. 30, 2004?

Given survival data and a covariate, they plot a curve giving Rate of Distant
Recurrence at 10 Yr (% of patients) on the y-axis versus the covariate on the
x-axis. They also plot curves giving a 95% confidence interval.

Thanks very much.

-Ben





Such a plot is easy to do with the rms package if using a Cox or 
accelerated failure time model, e.g.


require(rms)
dd <- datadist(mydata); options(datadist='dd')
f <- cph(Surv(rtime, event) ~ rcs(covariate,4) + sex + ..., x=TRUE,
y=TRUE)  # restricted cubic spline with 4 knots
plot(Predict(f, covariate, sex, time=10))  # separate curves for male
and female; omit sex to make one curve; add age=50 to predict for a 50
year old




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problem comparing hazard ratios

2010-03-30 Thread Frank E Harrell Jr

Michal Figurski wrote:

Dear R-Helpers,

I am a novice in survival analysis. I have the following code:
for (i in 3:12) print(coxph(Surv(time, status)~a[,i], data=a))

I used it to fit the Cox Proportional Hazard models separately for every 
available parameter (columns 3:12) in my data set - with intention to 
compare the Hazard Ratios.


However, some of my variables are in range 0.1 to 1.6, others in range 
5000 to 9000. How do I compare HRs between such variables?


I have rescaled all the variables to be in 0 to 1 range - is this the 
proper way to go? Is there a way to somehow calculate the same HRs (as 
for rescaled parameters) from the HRs for original parameters?


Many thanks in advance.



There are a lot of issues related to this that will require a good bit 
of study, both in survival analysis and in regression.  I would start 
with bootstrapping the ranks of the likelihood ratio chi-square 
statistics of the competing biomarkers.


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to measure accuracy of regression tree?

2010-03-26 Thread Frank E Harrell Jr

vibha patel wrote:

Hello,

for constructing regression tree, I am using rpart function.

now after dividing dataset in to training and testing,
I'm using predict for forecasting.


Such data splitting yields unreliable validations unless N > 20,000.
Frank



how to measure accuracy of the predicted data?

Thanks and Regards,
Vibha.
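For completeness, a minimal sketch of one common accuracy measure on held-out data (simulated; root mean squared prediction error, subject to Frank's caveat above about data splitting):

library(rpart)
set.seed(1)
d  <- data.frame(x = runif(200))
d$y <- 2*d$x + rnorm(200, 0, 0.3)
tr <- d[1:100, ]; te <- d[101:200, ]
fit <- rpart(y ~ x, data = tr)
sqrt(mean((te$y - predict(fit, te))^2))  # held-out RMSE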

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] I have a question on nomograms.

2010-03-25 Thread Frank E Harrell Jr

Marc Schwartz wrote:

On Mar 25, 2010, at 8:54 AM, 笑啸 wrote:


Dear volunteer:
   I am a graduate student of medicine in China. And now, I am devoting myself to 
constructing a nomogram for bladder cancer. I want to do it with 
R-project. However, I do not know how to construct a nomogram with R-project. I 
want to get your help, thank you! I wish you can tell me the operating 
procedure of the R-project.
  And I apologize for my English, it is poor, sorry!

 truly 
yours
  ding ding  



If you are referring to a regression model based nomogram to be used for 
prediction, see the nomogram() function in the 'rms' CRAN package by Frank 
Harrell.

Also, if you go to rseek.org and search using the keyword 'nomogram' you will 
find this and some other functions that provide related functionality.

HTH,

Marc Schwartz



In addition, you will have to be expert in the type of regression 
modeling used to produce a reliable model that is then displayed in a 
nomogram.  The nomogram is the easy part; it is just a graphical display 
of a model.  It is not the model itself.
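A minimal sketch of the easy part only (simulated data; the hard part, building a reliable model, is exactly what is cautioned about above):

require(rms)
set.seed(1)
d <- data.frame(age   = rnorm(100, 60, 10),
                grade = factor(sample(1:3, 100, TRUE)))
d$y <- rbinom(100, 1, plogis(-4 + 0.06*d$age))
dd <- datadist(d); options(datadist='dd')
f <- lrm(y ~ age + grade, data=d)
plot(nomogram(f, fun=plogis, funlabel='Probability'))  # graphical display of f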


Frank

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] I have a question on nomograms.

2010-03-25 Thread Frank E Harrell Jr

(Ted Harding) wrote:

On 25-Mar-10 17:44:55, R Heberto Ghezzo, Dr wrote:

On the same topic but from a different perspective: a Nomogram, or
better a Line Aligned Nomogram, is a graph with 2 or more scales,
maybe linear, whereby by aligning values in each scale you can read
values in the other scale. The relationship can be linear, then
the scales are straight lines, or functional, then the scale is
curvilinear or even a bidimensional surface with 2 grids.


This is precisely what I normally understand by nomogram.
Hence I was initially puzzled by ding ding's original query, asking
about a nomogram for bladder cancer. This got me googling, and
I found the sort of thing you can try out at

 http://www.nomogram.org

They call this sort of interface nomogram too, and you can have
a go at the one they offer for bladder cancer. This is not a
graphical nomogram in the sense that you describe, rather it is
a form-filling interface with boxes into which you insert
information, and then press the button.


It is very curious that a site named nomogram.org would not contain a 
single nomogram.  I have pointed this out to the author of the site, who 
maintains that disagreements with dictionaries are inconsequential.


Frank



Hence my question, in my original response, to ding ding as to
whether this was the sort of thing he means (rather than the
graphical one).

Until we hear further from him about what he precisely means,
I don't think it it worth while anyone spending effort on chasing
solutions to what may be the wrong interpretation (either the
graphical or the form -- we don't know).


Is there any program in R, S, C, Fortran, etc. to compute such
a graph? I know there is an old book by Ott or something like
that on Nomograms but I could not find it.
Does anybody know anything about constructing LANs? Any books or
already-written programs to do it?
They are very useful for a 2-digit solution and they can be
pinned on the wall!
Thanks for any info
R.Heberto Ghezzo
McGill University


Constructing a good nomogram can demand considerable skill in
combinatorial art! The few I have made in my time were not done
quickly, nor did they come out well first time (nor second ... ).
I've always constructed such things by hand (good layout is
very important), with computer support for computation of relevant
numerical values. So I can't suggest general-purpose nomogram
generating software (and I doubt it would work well in many cases).

Ted.


E-Mail: (Ted Harding) ted.hard...@manchester.ac.uk
Fax-to-email: +44 (0)870 094 0861
Date: 25-Mar-10  

--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] I have a question on nomograms.

2010-03-25 Thread Frank E Harrell Jr

Gabor Grothendieck wrote:

On Thu, Mar 25, 2010 at 10:43 PM, Frank E Harrell Jr
f.harr...@vanderbilt.edu wrote:

(Ted Harding) wrote:

 http://www.nomogram.org

They call this sort of interface nomogram too, and you can have
a go at the one they offer for bladder cancer. This is not a
graphical nomogram in the sense that you describe, rather it is
a form-filling interface with boxes into which you insert
information, and then press the button.

It is very curious that a site named nomogram.org would not contain a single
nomogram.  I have pointed this out to the author of the site, who maintains
that disagreements with dictionaries are inconsequential.



It is strange, although I think that in certain areas of medicine it's
common to use the term nomogram to mean the model underlying the
nomogram as opposed to its graphical representation.

For example, in the following discussion the authors seem to be using
nomogram merely to denote any model that does not involve forming
discrete levels of the inputs in the way that forming risk groups
(high risk group, low risk group, etc.) would:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2777059/



Gabor,

It is not very common to use it in that manner although I have seen it a 
few times.  But it is never correct, even according to a medical 
dictionary: http://medical-dictionary.thefreedictionary.com/nomogram


Frank


--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ordinal regression

2010-03-24 Thread Frank E Harrell Jr

Iasonas Lamprianou wrote:

Dear colleagues,

I am carrying out an ordinal regression model. I try it on SPSS but I flirt 
with R as well. I have a few questions.

1. What is the most reliable/tested/trusted package for ordinal regression in the R world? 


2. Also, I have a statistical question. What is the danger of having too many 
'empty cells' in ordinal regression? How many empty cells are too many? Do you 
have a reference for me to read?

Thank you for the support


Dr. Iasonas Lamprianou


Not sure what is the most trusted but the lrm function in the rms 
package does proportional odds and continuation ratio models (the latter 
with a trick to expand the dataset).  Nearly empty cells are not a 
problem unless you allow for non-proportionality (e.g., interaction 
between X and Y).
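A minimal sketch of a proportional odds fit with lrm (simulated ordinal response; variable names are hypothetical):

require(rms)
set.seed(1)
x <- rnorm(100)
y <- cut(x + rnorm(100), 3, labels=c('low','mid','high'), ordered_result=TRUE)
lrm(y ~ x)   # proportional odds model: two intercepts, one slope for x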


Frank




Assistant Professor (Educational Research and Evaluation)
Department of Education Sciences
European University-Cyprus
P.O. Box 22006
1516 Nicosia
Cyprus 
Tel.: +357-22-713178

Fax: +357-22-590539


Honorary Research Fellow
Department of Education
The University of Manchester
Oxford Road, Manchester M13 9PL, UK
Tel. 0044  161 275 3485
iasonas.lampria...@manchester.ac.uk






--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] summary.formula and continuous variables

2010-03-22 Thread Frank E Harrell Jr

Erik Iverson wrote:

Hello,

I am using the summary.formula function in the Hmisc package to produce 
tables.


With the method argument set to response, the help says,

Continuous independent variables (see the ‘continuous’ parameter below) 
are automatically stratified into ‘g’ (see below) quantile groups.


By my reading, this makes it impossible to summarize a continuous 
variable with, for example, its correlation with the response variable.


Is there some sort of functionality I'm missing here, or is this just 
not possible with how summary.formula is written now?


Thanks,
Erik Iverson



That's an excellent question Erik.  I think that summary.formula's 
default treatment of continuous predictors for method='response' needs 
to be improved.  The table cell should be a loess plot and it could be 
summarized with a Spearman correlation coefficient.  Anyone wanting to 
work on the code should get the master code from our subversion repository.
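A two-line sketch of that suggested cell content (hypothetical continuous predictor x and response y):

plot(lowess(x, y), type='l')   # loess-type trend of response vs. predictor
cor(x, y, method='spearman')   # Spearman correlation as the summary statistic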


Frank
--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Find a rectangle of maximal area

2010-03-22 Thread Frank E Harrell Jr

A fast Fortran solution may be found by:
require(Hmisc)
?largest.empty
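A minimal usage sketch (simulated points; the exact argument list may differ across Hmisc versions, so check ?largest.empty):

require(Hmisc)
set.seed(1)
x <- runif(200); y <- runif(200)
plot(x, y)
largest.empty(x, y, width=0.1, height=0.1, pl=TRUE)  # draws/returns the rectangle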

Frank

Ray Brownrigg wrote:

On Tue, 23 Mar 2010, Hans W Borchers wrote:

Barry Rowlingson b.rowlingson at lancaster.ac.uk writes:

On Mon, Mar 22, 2010 at 4:28 PM, Hans W Borchers

hwborchers at googlemail.com wrote:

Still I believe that a clever approach might be possible avoiding the
need to call a commercial solver. I am getting this hope from one of
Jon Bentley's articles in the series Programming Pearls.

Is this the 'Largest Empty Rectangle' problem?

http://en.wikipedia.org/wiki/Largest_empty_rectangle

Dear Barry,

thanks for this pointer. I never suspected this problem could have a name
of its own. Rethinking the many possible applications makes it clear: I
should have searched for it before.

I looked in some of the references of the late 80s and found two algorithms
that appear to be appropriate for implementation in R. The goal is to solve
the problem for n=200 points in less than 10-15 secs.

How about less than 2 seconds? [And 500 points in less than 15 seconds - on a 2-year-old 
DELL Optiplex GX755.]


The implementation below (at end) loops over all 'feasible' pairs of x values, then 
selects the largest rectangle for each pair, subject to your specified constraints.  I 
have no idea if it implements a previously published algorithm.


Other constraints are reasonably easily accommodated.

HTH,
Ray Brownrigg


Thanks again, Hans Werner


I had a look at some of the references from Wikipedia, but they all
follow a similar pattern, one I have noticed in many computer science
journal articles:

 1. State a problem that looks tricky.
 2. Say "We have an efficient algorithm for the problem stated in #1".
 3. Proceed to derive, using much algebra and theory, the efficient
    algorithm.
 4. Stop.

The idea of actually producing some dirty, filthy, actual code to
implement their shiny algorithms never seems to cross their minds.

 I also found a similar question from 2008 asked on the R-sig-geo
mailing list. That didn't get much help either!

Sorry.

Barry


N = 200
x <- runif(N)
y <- runif(N)
ox <- order(x)
x <- x[ox]
y <- y[ox]
x <- c(0, x, 1)
y <- c(0, y, 1)
plot(x, y, xlim=c(0, 1), ylim=c(0, 1), pch="*", col=2)
omaxa <- 0
for(i in 1:(length(x) - 1))
  for(j in (i+1):length(x)) {
    x1 <- x[i]
    x2 <- x[j]
    XX <- x2 - x1
    if (XX > 0.5) break
    yy <- c(0, y[i:j], 1)
    oyy <- order(yy)
    yy <- yy[oyy]
    dyy <- diff(yy)
    whichdyy <- (dyy <= 0.5) & (dyy >= 0.5*XX) & (dyy <= 2*XX)
    wy1 <- yy[whichdyy]
    if (length(wy1) > 0) {
      wy2 <- yy[(1:length(dyy))[whichdyy] + 1]
      k <- which.max(dyy[whichdyy])
      maxa <- (x2 - x1)*(wy2[k] - wy1[k])
      if (maxa > omaxa) {
        omaxa <- maxa
        mx1 <- x1
        mx2 <- x2
        my1 <- wy1[k]
        my2 <- wy2[k]
      }
    }
  }
lines(c(mx1, mx2, mx2, mx1, mx1), c(my1, my1, my2, my2, my1), col=2)

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How good is R at making publication quality tables?

2010-03-17 Thread Frank E Harrell Jr

Hi Ista,

Our material on statlib is far out of date.  Please refer to the primary 
source at http://biostat.mc.vanderbilt.edu/StatReport


Thanks
Frank


Ista Zahn wrote:

Hi Paul,
For instructions and examples using the Hmisc latex() function you
might want to take a look at
http://lib.stat.cmu.edu/S/Harrell/doc/summary.pdf.

-Best,
Ista

On Wed, Mar 17, 2010 at 10:51 AM, Paul Miller pjmiller...@yahoo.com wrote:

Hello Everyone,

I have just started learning R and am in the process of figuring out what it 
can and can't do. I must say I am very impressed with R so far and am amazed 
that something this good can actually be free.

Recently, I finished reading R for SAS and SPSS Users and have begun reading 
SAS and R and Data Manipulation with R. Based on what I've read in these books 
and elsewhere, I get the impression that R is very good at drawing high quality 
graphs but maybe not so good at creating nice looking tables of the sort I'm 
used to getting through SAS ODS.

Am I right or wrong about this? If I am wrong, can anyone show me some examples 
of how R can be used to create really nice looking tables? I often make tables 
of adverse events in clinical trials that have n(%) values in the cells. I'd 
love to see an example that does a nice job of making that sort of table but 
would be happy to see any examples that someone might be willing to send to me.

Thanks,

Paul





   [[alternative HTML version deleted]]





--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic model diagnostics residuals.lrm {design}, residuals()

2010-03-11 Thread Frank E Harrell Jr

Chaudhari, Bimal wrote:

I am interested in a model diagnostic for logistic regression which is normally 
distributed (much like the residuals in linear regression, which are 
~ N(0, variance unknown)).

My understanding is that most (all?) of the residuals returned by residuals.lrm 
{design} either don't have a well defined distribution or are distributed as 
Chi-Square.

Have I overlooked a residual measure or would it be possible to transform one 
of the residual measures into something reasonably 'normal' while retaining 
information from the residual so I could compare between models (obviously I 
could blom transform any of the measures, but then I'd always get a standard 
normal)?

Cheers,
bimal


Hi Bimal,

What would make it necessary for the residuals to have a certain 
distribution?  Why would you expect a categorical Y variable to give 
rise to residuals with a nice distribution?


You can do residual diagnostics without worrying about the distribution.
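A minimal sketch of such diagnostics with residuals.lrm (simulated data; partial residuals require the fit to be made with x=TRUE, y=TRUE):

require(rms)
set.seed(1)
x1 <- rnorm(200); x2 <- rnorm(200)
y  <- rbinom(200, 1, plogis(x1 + 0.5*x2))
f  <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE)
resid(f, 'partial', pl=TRUE)   # smoothed partial-residual plot per predictor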

Frank



Bimal P Chaudhari, MPH
MD Candidate, 2011
Boston University
MS Candidate, 2010
Washington University in St Louis


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Cox Calibration regression test double graphs

2010-03-09 Thread Frank E Harrell Jr
See the calibrate.* functions and the val.surv (for external validation) 
functions in the rms package.  Note especially the new continuous 
calibration curve methods using hazard spline regression.


Frank


David Winsemius wrote:

A) Please do not highjack threads. Post new topics for new questions.

B) Calibration plots are very easy for either logistic regression or Cox 
models when using the calibrate function in either Harrell's rms or 
Design packages. (Not sure about how well they play with survfit 
objects, but you ought to be able to apply the same methods.)





--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Robust SE for lrm object

2010-03-06 Thread Frank E Harrell Jr

Achim Zeileis wrote:

On Sat, 6 Mar 2010, David Winsemius wrote:


On Mar 5, 2010, at 11:54 PM, Patrick Shea wrote:



I'm trying to obtain the robust standard errors for a multinomial 
ordered logit model:


mod6 <- lrm(wdlshea ~ initdesch + concap + capasst + qualrat +
terrain, data=full2)


The model is fine but when I try to get the RSE I get an error.

coeftest(mod6, vcov = vcovHAC(mod6))

Error in match.arg(type) :
'arg' should be one of 'ordinary', 'score', 'score.binary',
'pearson', 'deviance', 'pseudo.dep', 'partial', etc.


I'm a novice R user and am not sure how to address this problem. I 
have also tried to use alternatives   (zelig, polr) but have had no 
luck. Any assistance on generating RSE for a multinomial order logit 
model would be appreciated


Have you loaded the library that contains the vcovHAC function?


That is in the sandwich package. However, I doubt that it makes sense 
in this context. Using HAC covariances would imply that you have time 
series data and want to correct for heteroskedasticity and 
autocorrelation. I'm not even sure whether sandwich standard errors 
would be terribly useful. Both would require that you correctly 
specified the estimating functions of your proportional odds logistic 
regression but misspecified a few other aspects of the remaining 
likelihood. Not sure whether that can be obtained for an ordinal 
multinomial response.



(And do you know whether coeftest works with Design/rms objects?)


It does (unlike its own summary function in some situations):

library(rms)
library(lmtest)
data("BankWages", package = "AER")
fm <- lrm(job ~ ., data = BankWages)
summary(fm)
coeftest(fm)

The reason why vcovHAC() or sandwich() do not work is that bread() and 
estfun() methods would need to be available for lrm objects which is 
currently not the case (dito for polr objects). In principle they 
could be written, see

  vignette(sandwich-OOP, package = sandwich)
but as I said above I'm not sure whether it would be very useful.
Z


Excellent posts.  I'll just add that the Huber-White sandwich estimator 
is obtained in the rms and Design packages by using the robcov function on 
a fit object.  You can also use the bootstrap with bootcov.
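A minimal sketch of both (simulated data; the fit must store the design with x=TRUE, y=TRUE):

require(rms)
set.seed(1)
x <- rnorm(150)
y <- rbinom(150, 1, plogis(x))
f <- lrm(y ~ x, x=TRUE, y=TRUE)
robcov(f)            # Huber-White sandwich covariance estimates
bootcov(f, B=200)    # bootstrap covariance estimates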


Frank





--

David Winsemius, MD
West Hartford, CT

__



--
Frank E Harrell Jr   Professor and ChairmanSchool of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

