Re: [R] missing values in logistic regression
(Ted Harding) wrote:
> On 29-Oct-04 Avril Coghlan wrote:
>> Dear R help list, I am trying to do a logistic regression where I
>> have a categorical response variable Y and two numerical predictors
>> X1 and X2. There are quite a lot of missing values for predictor X2,
>> e.g.,
>>
>>   Y      X1    X2
>>   red    0.6   0.2  *
>>   red    0.5   0.2  *
>>   red    0.5   NA
>>   red    0.5   NA
>>   green  0.2   0.1  *
>>   green  0.1   NA
>>   green  0.1   NA
>>   green  0.05  0.05 *
>>
>> I am wondering: can I combine X1 and X2 in a logistic regression to
>> predict Y, using all the data for X1, even though there are NAs in
>> the X2 data? Or do I have to take only the cases for which there is
>> data for both X1 and X2 (marked with *s above)?
>
> I don't know of any R routine directly aimed at logistic regression
> with missing values as you describe.

The aregImpute function in the Hmisc package can handle this, using predictive mean matching with weighted multinomial sampling of donor observations' binary covariate values.

> [...] Ted.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
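A minimal sketch of the aregImpute route mentioned above, assuming the variables live in a data frame d (a hypothetical name); fit.mult.impute then pools logistic fits over the imputations:

library(Hmisc)
library(Design)   # provides lrm

set.seed(1)
a <- aregImpute(~ Y + X1 + X2, data=d, n.impute=5)  # multiple imputation of X2
f <- fit.mult.impute(Y ~ X1 + X2, lrm, a, data=d)   # logistic fits combined
f                                                   # over the 5 imputations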
Re: [R] time dependency of Cox regression
array chip wrote:
> Hi, how can I specify a Cox proportional hazards model with a
> covariate whose effect on survival I believe changes/diminishes with
> time? The value of the covariate was only recorded once at the
> beginning of the study for each individual (e.g. at the diagnosis of
> the disease), so I do not have time-course data for the covariate for
> any given individual. For example, I want to state at the end of the
> analysis that the hazard ratio for the covariate is 6 at the
> beginning, decreases to 3 after 2 years, and decreases to 1.5 after 5
> years. Is this the so-called time-dependent covariate? I guess not,
> because it's really about the influence of the covariate (which was
> measured once at the beginning) on survival changing over time.
> Thanks for any input.

You might try a log-normal or log-logistic accelerated failure time model. These models dictate decreasing hazard ratios over time for baseline covariates.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
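A sketch of how a log-normal AFT fit implies a hazard ratio that attenuates toward 1 over time; the data frame d, follow-up time futime, event indicator status, and binary covariate x are hypothetical names:

library(survival)

f <- survreg(Surv(futime, status) ~ x, data=d, dist='lognormal')

# hazard of a log-normal distribution: h(t) = f(t)/S(t)
hlnorm <- function(t, mu, sigma) {
  z <- (log(t) - mu)/sigma
  dnorm(z) / (sigma * t * pnorm(z, lower.tail=FALSE))
}

mu0 <- predict(f, newdata=data.frame(x=0), type='lp')
mu1 <- predict(f, newdata=data.frame(x=1), type='lp')
t   <- c(0.5, 2, 5)                                 # years, say
hlnorm(t, mu1, f$scale) / hlnorm(t, mu0, f$scale)   # HR at each time; declines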
Re: [R] OLS error
DrakeGis wrote:
> Thanks for the help. I'm really sorry, but I'm afraid I can't publish
> any data in order to allow a reproduction of the results (enterprise
> policies :( ).

That is silly. You can surely simulate data that provides an example of the problem you are having.

Frank Harrell
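For example, a few lines suffice to post a self-contained imitation of one's data (all names and coefficients below are made up):

set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2*x1 - x2 + rnorm(n)   # same structure as the private data
fit <- lm(y ~ x1 + x2)
summary(fit)                     # include this plus the error message seen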
Re: [R] an off-topic question - model validation
Wensui Liu wrote:
> Currently, I am working on a data mining project and plan to divide
> the data table into 2 parts, one for modeling and the other for
> validation, to compare several models. But I am not sure about the
> percentage of data I should use to build the model and the percentage
> I should keep to validate it. Is there any literature reference about
> this topic? Thank you so much!

Data splitting is very inefficient for model validation unless the sample size is extremely large. Consider using Efron's optimism bootstrap as is used in the validate function in the Design package. validate will also do data splitting and cross-validation, though.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
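A minimal sketch of the bootstrap validation described above, assuming a binary response y and predictors x1, x2 in a data frame d (hypothetical names):

library(Design)   # also requires Hmisc

f <- lrm(y ~ x1 + x2, data=d, x=TRUE, y=TRUE)  # x, y needed for resampling
validate(f, method='boot', B=200)              # Efron's optimism bootstrap
validate(f, method='crossvalidation', B=10)    # 10-fold CV alternative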
Re: [R] Barplot difficulties
Heather J. Branton wrote:
> Hello. I am an R newbie struggling to learn and use R. I have read
> many portions of the R Reference Manual, as well as the FAQs. Given
> that I learn something new each time, I know I might be missing
> something obvious, but I appeal to your good nature to help me through
> this initial problem. I have attached a pdf file to demonstrate what I
> desire and have listed what my data looks like in Excel (below).
> Following is the data and script I developed - which does not provide
> what I want.

Dear Heather - Please read Bill Cleveland's book The Elements of Graphing Data. A MUCH better plot can be produced. And let time be one of the first variables to vary.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
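In the spirit of that advice, a Cleveland-style dot plot is a one-liner in base R; the matrix below is an invented stand-in for the real data:

# hypothetical summary matrix: rows = categories, columns = years
m <- matrix(c(12, 15, 9, 20, 18, 11), nrow=3,
            dimnames=list(c('Site A','Site B','Site C'), c('1999','2000')))
dotchart(m, xlab='Measurement')  # one dot per value, grouped by column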
Re: [R] Adding mean and SEM in Hmisc ecdf
Jean-Louis Abitbol wrote:
> Dear R Gurus, sorry if this has been asked before, but I did not find
> it in the archives. I would like to add a horizontal display of the
> mean and SEM on Hmisc ecdf plots done by group (i.e., variate ~
> treatment). Has anyone written some code to do that? Thanks and kind
> regards, Jean-Louis

It is customary to show quantiles on ecdfs, and the functions make that easy. You can add other points or reference lines to an existing plot easily if using the non-lattice ecdf. To do it with the lattice version, ecdf.formula, you will have to change its default panel function.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
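A sketch of the non-lattice route for a single panel, assuming a numeric y and a grouping factor treatment (hypothetical names; the function is Ecdf in later Hmisc versions), with the mean +/- 1 SEM drawn by hand:

library(Hmisc)

Ecdf(y, group=treatment)                    # one step function per group
m  <- tapply(y, treatment, mean, na.rm=TRUE)
se <- tapply(y, treatment,
             function(u) sqrt(var(u, na.rm=TRUE)/sum(!is.na(u))))
abline(v=m, lty=2)                          # vertical lines at group means
segments(m - se, 0.5, m + se, 0.5)          # horizontal mean +/- 1 SEM bars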
Re: [R-gui] Re: [R] The hidden costs of GPL software?
Patrick Burns wrote:
> I'm a big advocate -- perhaps even fanatic -- of making R easier for
> novices in order to spread its use, but I'm not convinced that a GUI
> (at least in the traditional form) is the most valuable approach.
> Perhaps an overly harsh summary of some of Ted Harding's statements
> is: you can make a truck easier to get into by taking off the wheels,
> but that doesn't make it more useful. In terms of GUIs, I think what R
> should focus on is the ability for users to make their own specialized
> GUIs, so that a knowledgeable programmer at an installation can create
> a system that is easy for unsophisticated users for the limited number
> of tasks that are to be done. The ultimate users may not even need to
> know that R exists. I think Ted Harding was on the mark when he said
> that it is the help system that needs enhancement. I can imagine a
> system that gets the user to the right function and then helps fill in
> the arguments, all of the time pointing them towards the command line
> rather than away from it. The author of the referenced article
> highlighted some hidden costs of R, but did not highlight the hidden
> benefits (because they were hidden from him). A big benefit of R is
> all of the bugs that aren't in it (which may or may not be due to its
> free status).
>
> Patrick Burns, Burns Statistics, http://www.burns-stat.com
> (home of S Poetry and A Guide for the Unwilling S User)
>
> Jan P. Smit wrote:
>> Dear Philippe, very interesting. The URL of the article is
>> http://www.scientific-computing.com/scwsepoct04free_statistics.html.
>> Best regards, Jan Smit
>>
>> Philippe Grosjean wrote:
>>> Hello, in the latest 'Scientific Computing World' magazine (issue
>>> 78, p. 22), there is a review of free statistical software by Felix
>>> Grant (one doesn't have to pay good money to obtain good statistics
>>> software). As far as I know, this is the first time that R is even
>>> mentioned in this magazine, given that it usually discusses
>>> commercial products. [...]

I really agree with you, Patrick. To me the keys are: having better help search capabilities; linking help files to case studies or at least detailed examples; having a navigator by keywords (a rudimentary one is at http://biostat.mc.vanderbilt.edu/s/finder/finder.html); having a great library of examples keyed by statistical goals (a la the BUGS examples guides); and having a menu-driven skeleton code generator that gives beginners a starting script to edit to use their variable names, etc. Also I think we need a discussion board that has a better memory for new users, like some of the user forums currently on the web, or using a wiki.

Frank

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] The hidden costs of GPL software?
Mike Prager wrote:
> At 11/18/2004 07:01 AM Thursday, Thomas Schönhoff wrote:
>> To sum up, what I need is an extensive example-based help system,
>> focused on how to do things in R. In part this already exists, e.g.
>> simpleR by Verzani (contributed docs area), etc. Hopefully I can
>> contribute to this in the future, since it seems to me invaluable to
>> learn R by going through example-based lessons (some are found via
>> vignette()). These are much more comprehensible to me than the short,
>> reference-like entries in the current help system, mostly due to
>> their very technical approach (the same can be said of the official
>> GNU R manuals, especially The R Language, which wasn't a great help
>> for me when I took my first look at GNU R). In this context something
>> like the GuideMaps of ViSta come to mind! But to be as clear as
>> possible: I think GNU R is great and I appreciate all the efforts of
>> the R core team and associates! Nevertheless it seems valuable to
>> re-think the help system in R with respect to those who may have a
>> good understanding of statistics but lack basic experience in how to
>> introduce themselves to the sophisticated world of the R/S languages.
>
> (I posted similar material before, but it was moved to R-devel, and I
> wanted to express a bit of it here.) I have frequently felt, like
> Thomas, that what could make R easier to use is not a GUI, but a help
> system more focused on tasks and examples, rather than on functions
> and packages. This has obvious and large costs of development, and I
> am unlikely to contribute much myself, for reasons of time and
> ability. Yet I mention it for the sake of this discussion. Such a help
> system could be a tree (or key) structure in which, through making
> choices, the user's description of the desired task is gradually
> narrowed. At the end of each twig of the tree would be a list of
> suggested functions for solving the problem, hyperlinked into the
> existing help system (which in many ways is outstanding and has
> evolved just as fast as R itself). This could be coupled with the
> continued expansion of the number of examples in the help system. Now
> I must express appreciation for what exists already that helps in this
> regard: MASS (in its many editions), Introductory Statistics with R,
> simpleR, and the other free documentation that so many authors have
> generously provided. Not to mention the superlative contribution of R
> itself, and the work of the R development team. It is beyond my
> understanding how something so valuable and well thought out has been
> created by people with so many other responsibilities. Mike

I second all of that. What you are describing, Mike, could be done with a community-maintained wiki, with easy-to-add hyperlinks to other sites. Just think what a great value it would be to the statistical community to have an ever-growing set of examples with all code and output, taking a cue from the BUGS examples guides. The content could be broken down by major areas (data import examples, data manipulation examples, many analysis topics, many graphics topics, etc.). Ultimately the more elaborate case studies could be peer-reviewed (a la the Journal of Statistical Software) and updated.

Frank

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] SAS or R software
neela v wrote:
> Hi all. Can someone clarify this issue for me: feature-wise, which is
> better, R or SAS, leaving aside the commercial aspect? I suppose there
> are a few people who have worked with both R and SAS, and I hope they
> can help me decide. Thank you for the help.

I estimate that SAS is about 11 years behind R in statistical analysis and graphics capabilities, and that gap is growing. Also, code in SAS is not peer-reviewed as is code in R. SAS has advantages in two areas: dealing with massive datasets (generally speaking, larger than 1GB) and getting more accurate P-values in mixed effects models. See http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf for one comparison of SAS and S on technical features.

There are companies whose yearly license fees to SAS total millions of dollars. Then those companies hire armies of SAS programmers to program an archaic macro language using old statistical methods to produce ugly tables and the worst graphics in the statistical software world.

Frank Harrell
SAS User, 1969-1991

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
[R] Error with strwidth after lattice graphic drawn
Under R 2.0.1 (2004-11-15) on platform i386-pc-linux-gnu, I'm getting an error when using strwidth after a lattice graphic is drawn:

library(lattice)
xyplot(runif(20) ~ runif(20))
strwidth('xxx')
Error in strwidth("xxx") : invalid graphics state

Any help appreciated. I have version 2.0.1 of grid and version 0.10-14 of lattice.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] SAS or R software
BXC (Bendix Carstensen) wrote:
> Two major advantages of SAS that seem to have been overlooked in the
> previous replies are:
>
> 1) The data step language in SAS for data manipulation is more
> human-readable than R code in general. R is not a definitely
> write-only language like APL, but in data manipulation in particular
> it is easy to write code that is impossible to decipher after a few
> weeks. You can also produce unreadable code in SAS, but it generally
> takes more of an effort. Thus: data manipulation is easier to document
> in SAS than in R.

I agree with the readability of data manipulation code, especially for novice users. As far as functionality for data manipulation is concerned, R has more flexibility and is faster to program once one has experience. I do think that learning data manipulation techniques in R takes a while.

> 2) proc tabulate. This procedure enables you to do extensive, sensible
> tabulation of your data if you are prepared to read the manual. (This
> is not a product of the complexity of the software, but of the
> complexity of the tabulation features.) Compared to this, only
> rudimentary tools exist in R (afaik).

I could not disagree more with that statement. Look for example at http://biostat.mc.vanderbilt.edu/twiki/pub/Main/StatReport/summary.pdf. With R you can customize tables by specifying your own function for computing any statistic of interest. Let's see how to use PROC TABULATE to display stratified Kaplan-Meier 4-year survival estimates along with mean and median life length, not to mention match the fine control of formatting available using the combination of R and LaTeX.

> So if you want to do well documented data manipulation and clear and
> compact tables, go for SAS.

Readability is different from being well documented. And for clear and compact tables, R is the winner hands down.

Frank Harrell

> If you want to do statistical analyses and graphics (in finite time),
> go for R. Bendix Carstensen, Senior Statistician

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
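For readers wanting to try the tabulation route referenced above, a minimal sketch with invented variable names; summary.formula's method='reverse' stratifies each right-hand variable by treatment group, and latex() renders the result:

library(Hmisc)

s <- summary(treatment ~ age + sex + cholesterol,
             data=d, method='reverse')  # continuous -> quartiles,
print(s)                                # categorical -> percents
latex(s)                                # finely formatted LaTeX table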
Re: [R] Help with ooplot(gplots) and error bars
Jean-Louis Abitbol wrote:
> Dear All, I am trying to graph a proportion and its 95% CI by a factor
> with ooplot (any other better solution?). It works well until I try to
> add the confidence interval. This is the error message and a
> description of the data:
>
> dat1
>       PointEst
> TT1 1      3.6
> TT2 2      5.0
> TT3 3      5.8
> TT4 4     11.5
> TT5 5      7.5
> TT5 6      8.7
> TT7 7     17.4
>
> dat2
>       Lower
> TT1 1   1.0
> TT2 2   2.2
> TT3 3   2.7
> TT4 4   6.7
> TT5 5   3.9
> TT5 6   4.6
> TT7 7  11.5
>
> dat3
>       Upper
> TT1 1  12.3
> TT2 2  11.2
> TT3 3  12.1
> TT4 4  19.1
> TT5 5  14.2
> TT5 6  15.6
> TT7 7  25.6
>
> ooplot(dat1, type="barplot", col=rich.colors(7, "temperature"),
>        names.arg=c("X","Y","Z","A","B","C","D"), plot.ci=TRUE,
>        ci.l=dat2, ci.u=dat3, xlab="Treatment",
>        ylab="Percent Normalized Patients")
> Error in ooplot.default(dat1, type = "barplot", col = rich.colors(7,
>   "temperature"), : 'height' and 'ci.u' must have the same dimensions.
>
> I have tried various ways of supplying ci.l and ci.u (including a
> vector). Thanks for the help that anyone can bring. Regards, JL

One way is to look at the examples for Dotplot in the Hmisc package. Those examples display bootstrap percentile confidence intervals for a mean.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] Help with ooplot(gplots) and error bars
Jean-Louis Abitbol wrote:
> Dear Pr Harrell, thanks for your help on this occasion as well as with
> previous questions to the list. I just looked at the example in your
> intro doc. However, I am dealing with proportions (i.e., % of patients
> responding to a given treatment). In this case I am not sure I can use
> summarize and then xYplot. I am not aware of R graphing tools that can
> deal directly with proportions, adding CIs, not to mention producing
> by-factor/trellis plots. This is why I was trying to do it by hand
> (using binconf) with ooplot, without much success I am afraid. Best
> regards, JL Abitbol, MD

Jean-Louis,

Here is an example.

# Plot proportions and their Wilson confidence limits
set.seed(3)
d <- expand.grid(continent=c('USA','Europe'), year=1999:2001, reps=1:100)
# Generate binary events from a population probability of 0.2
# of the event, same for all years and continents
d$y <- ifelse(runif(6*100) <= .2, 1, 0)
s <- with(d,
  summarize(y, llist(continent, year),
            function(y) {
              n <- sum(!is.na(y))
              s <- sum(y, na.rm=TRUE)
              binconf(s, n)
            }, type='matrix'))
Dotplot(year ~ Cbind(y) | continent, data=s,
        ylab='Year', xlab='Probability')

I did have to temporarily override a function in Hmisc to fix a problem. This will be corrected in an upcoming release of Hmisc:

mApply <- function(X, INDEX, FUN=NULL, ..., simplify=TRUE) {
  ## Matrix tapply
  ## X: matrix with n rows; INDEX: vector or list of vectors of length n
  ## FUN: function to operate on submatrices of X by INDEX
  ## ...: arguments to FUN; simplify: see sapply
  ## Modification of code by Tony Plate [EMAIL PROTECTED] 10Oct02
  ## If FUN returns more than one number, mApply returns a matrix with
  ## rows corresponding to unique values of INDEX
  nr <- nrow(X)
  if(!length(nr)) {  ## X not a matrix
    r <- tapply(X, INDEX, FUN, ..., simplify=simplify)
    if(is.matrix(r)) r <- drop(t(r))
    else if(simplify && is.list(r))
      r <- drop(matrix(unlist(r), nrow=length(r),
                       dimnames=list(names(r), names(r[[1]])),
                       byrow=TRUE))
  } else {
    idx.list <- tapply(1:nr, INDEX, c)
    r <- sapply(idx.list,
                function(idx, x, fun, ...) fun(x[idx, , drop=FALSE], ...),
                x=X, fun=FUN, ..., simplify=simplify)
    if(simplify) r <- drop(t(r))
  }
  dn <- dimnames(r)
  if(length(dn) && !length(dn[[length(dn)]])) {
    fx <- FUN(X, ...)
    dnl <- if(length(names(fx))) names(fx) else dimnames(fx)[[2]]
    dn[[length(dn)]] <- dnl
    dimnames(r) <- dn
  }
  if(simplify && is.list(r) && is.array(r)) {
    ll <- sapply(r, length)
    maxl <- max(ll)
    empty <- (1:length(ll))[ll == 0]
    for(i in empty) r[[i]] <- rep(NA, maxl)
    ## unlist does not keep place for NULL entries for nonexistent categories
    first.not.empty <- ((1:length(ll))[ll > 0])[1]
    nam <- names(r[[first.not.empty]])
    dr <- dim(r)
    r <- aperm(array(unlist(r), dim=c(maxl, dr),
                     dimnames=c(list(nam), dimnames(r))),
               c(1 + seq(length(dr)), 1))
  }
  r
}

Frank

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] Error with strwidth after lattice graphic drawn
Uwe Ligges wrote:
> Prof Brian Ripley wrote:
>> On Sat, 20 Nov 2004, Frank E Harrell Jr wrote:
>>> I'm getting an error when using strwidth after a lattice graphic is
>>> drawn: [...] Error in strwidth("xxx") : invalid graphics state
>>
>> The advice is `don't do that'! strwidth() is a base graphics command,
>> and will only work if a device is currently plotting base graphics.
>> Lattice is built on grid, which has stringWidth().
>
> ... and convertWidth() is useful to display stuff afterwards in an
> interpretable way ...
>
> Uwe Ligges

Thanks Uwe and Brian. Before the latest versions, grid would let me use ordinary graphics functions to deal with the currently rendered plot after I did par(usr=c(0,1,0,1)), so I was lazy. I changed my code to use the specific grid functions.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
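For reference, a sketch of the grid equivalent of the base call in question:

library(grid)

# width of the string as a fraction of the current viewport
convertWidth(stringWidth('xxx'), 'npc', valueOnly=TRUE)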
Re: [R] SAS or R software
Michael Friendly wrote:
> I can't resist dipping my oar in here. For me, some significant
> advantages of SAS are:
>
> - Ability to input data in almost *any* conceivable form using the
> combination of features available through input/infile statements,
> SAS informats and formats, data step programming, etc. Dataset
> manipulation (merge, join, stack, subset, summarize, etc.) also favors
> SAS in more complex cases, IMHO.

I respectfully disagree, Michael, in cases where the file sizes are not enormous. And in many cases repetitive SAS data manipulation can be done much more easily using vectors (e.g., vectors of recodes) and matrices in R, as shown in the examples in Alzola & Harrell (http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf). We are working on a project in which a client sends us 50 SAS datasets. Not only can we do all the complex data manipulations needed in R, but we can treat the 50 datasets as an array (actually a list()) to handle repetitive operations over many datasets.

> - Output delivery system (ODS): *every* piece of SAS output is an
> output object that can be captured as a dataset, rendered in RTF,
> LaTeX, HTML, PDF, etc. with a relatively simple mechanism (including
> graphs):
>
>   ods pdf file='mystuff.pdf';
>   ... any sas stuff ...
>   ods pdf close;
>
> - If you think the output tables are ugly, it is not difficult to
> define a template for *any* output to display it the way you like.

I would like to see SAS ODS duplicate Table 10 in http://biostat.mc.vanderbilt.edu/twiki/pub/Main/StatReport/summary.pdf.

Cheers,
Frank

> - ODS Graphics (new in SAS 9.1) will extend much of this so that
> statistical procedures will produce many graphics themselves with
> ease.
>
> One significant disadvantage for me is the difficulty of composing
> multipanel graphic displays (trellis graphics, linked micromaps, etc.)
> due to the lack of an overall, top-down graphics environment. As well,
> there are a variety of kinds of graphics I've found extraordinarily
> frustrating to try to do in SAS because of a lack of coherence or
> generality in the output available from procedures -- an example would
> be effect displays, such as implemented in R in the effects package.
> I can't agree, however, with Frank Harrell that SAS produces 'the
> worst graphics in the statistical software world.' One can get ugly
> graphs in R almost as easily as in SAS just by accepting the 80-20
> rule: you can typically get 80% of what you want with 20% of the
> effort; to get what you really want takes the remaining 80% of effort.
> On the other hand, the active hard work of many R developers has led
> to R graphics for which the *default* results for many graphs avoid
> many of the egregious design errors introduced in SAS in the days of
> pen-plotters (+ signs for points, cross-hatching for area fill).
>
> -Michael

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] Weibull survival regression
Eric Lim wrote:
> Dear R users, please can you help me with a relatively straightforward
> problem that I am struggling with? I am simply trying to plot a
> baseline survivor and hazard function for a simple dataset of lung
> cancer survival, where `futime' is follow-up time in months and status
> is 1=dead and 0=alive.
>
> Using the survival package:
>
> lung.wbs <- survreg(Surv(futime, status) ~ 1, data=lung,
>                     dist='weibull')
> plot(lung.wbs)
>
> returns the error message:
>
> Error in xy.coords(x, y, xlabel, ylabel, log) :
>   x and y lengths differ
>
> Using the Design package:
>
> lung.wbd <- psm(Surv(futime, status) ~ 1, dist='weibull', data=lung,
>                 na.action=na.omit)

You don't need to specify na.action here.

> survplot(lung.wbd)
>
> returns the error message:
>
> Error in survplot.Design(lung.wbd) :
>   fit does not have design information

survplot only works when there is at least one covariate. Sorry. Maybe someday ...

-Frank Harrell

> Regards, Eric Lim, Papworth Hospital, Cambridge, UK

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
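In the meantime, the intercept-only Weibull survivor and hazard curves can be drawn by hand from the survreg fit; a sketch using the poster's variable names and survreg's parameterization (Weibull scale = exp(intercept), shape = 1/fit$scale):

library(survival)

fit <- survreg(Surv(futime, status) ~ 1, data=lung, dist='weibull')
b <- exp(coef(fit))   # Weibull scale parameter
a <- 1/fit$scale      # Weibull shape parameter
curve(exp(-(x/b)^a), from=0, to=max(lung$futime),
      xlab='Months', ylab='S(t)')              # survivor function
curve((a/b)*(x/b)^(a-1), from=1, to=max(lung$futime),
      xlab='Months', ylab='h(t)')              # hazard function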
Re: [R] number of pairwise present data in matrix with missings
Andreas Wolf wrote:
> Is there a smart way of determining the number of pairwise present
> data points in a data matrix with missings (maybe as a by-product of
> some statistical function)? So far, I used several loops like:
>
> for (column1 in 1:99) {
>   for (column2 in 2:100) {
>     for (row in 1:500) {
>       if (!is.na(matrix[row, column1]) && !is.na(matrix[row, column2]))
>         pairs[column1, column2] <- pairs[column1, column2] + 1
>     }
>   }
> }
>
> but this seems neither the most elegant nor an utterly fast solution.
> Thanks for suggestions. Andreas Wolf

library(Hmisc)
n <- naclus(mydataframe)
plot(n)    # show pairwise missingness in a dendrogram
naplot(n)  # show more details in multiple plots

Frank

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
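For the counts themselves, a vectorized sketch that replaces the triple loop: the (i, j) entry of the cross-product of the not-missing indicator matrix is exactly the number of rows in which both columns i and j are observed.

present <- !is.na(x)           # logical matrix, TRUE where a value is observed
pairs   <- crossprod(present)  # t(present) %*% present: pairwise-complete counts

This runs in a fraction of a second even for a 500 x 100 matrix.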
Re: [R] R vs SPSS
Vito Ricci wrote:
> Dear all, in recent weeks you discussed R vs. SAS. I want to ask your
> opinion about a comparison between R and SPSS. I don't know this
> software, but some weeks ago I went to a presentation of the product.
> I found it really user-friendly, with a GUI (even if I'd prefer a
> command line), and very useful and simple to use for creating and
> managing tables, OLAP techniques, and pivot tables. What do you think?
> Cordially, Vito

What worries me about SPSS is that it often results in poor statistical practice. The defaults in dialog boxes are not very good in some cases, and like SAS, SPSS tends to lead users to make too many assumptions (linearity in regression being one of the key ones).

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] problem with using transace
[EMAIL PROTECTED] wrote:
> I am trying to use the Hmisc function transace to transform
> predictors:
>
> test <- cbind(flowstress, pressres, alloy)
> xtrans <- transace(x, binary='pressres', monotonic='flowstress',
>                    categorical='alloy')
>
> and I am getting the following message:
>
> Error in ace(x[, -i], x[, i], monotone = im, categorical = ic) :
>   unused argument(s) (monotone ...)
>
> Any idea? Thanks for your help, Anne

The corrected version (below) will fix that problem, but note that there is a bug in ace causing it not to allow a monotonicity constraint when a variable is on the left-hand side. This is inconsistent with the ace documentation. There are other problems in ace in which it checks column numbers against the number of rows in the x matrix instead of the number of columns. The internal version of ace defined inside areg.boot fixes the latter problem. Note that I reported these problems a long time ago.

Frank

transace <- function(x, monotonic=NULL, categorical=NULL, binary=NULL,
                     pl=TRUE) {
  if(.R.) require('acepack')  # provides ace, avas
  nam <- dimnames(x)[[2]]
  omit <- is.na(x %*% rep(1, ncol(x)))
  omitted <- (1:nrow(x))[omit]
  if(length(omitted)) x <- x[!omit, ]
  p <- ncol(x)
  xt <- x  # binary variables retain original coding
  if(!length(nam)) stop('x must have column names')
  rsq <- rep(NA, p)
  names(rsq) <- nam
  for(i in (1:p)[!(nam %in% binary)]) {
    lab <- nam[-i]
    w <- 1:(p-1)
    im <- w[lab %in% monotonic]
    ic <- w[lab %in% categorical]
    if(nam[i] %in% monotonic)   im <- c(0, im)
    if(nam[i] %in% categorical) ic <- c(0, ic)
    m <- 10*(length(im) > 0) + (length(ic) > 0)
    if(m == 11)      a <- ace(x[,-i], x[,i], mon=im, cat=ic)
    else if(m == 10) a <- ace(x[,-i], x[,i], mon=im)
    else if(m == 1)  a <- ace(x[,-i], x[,i], cat=ic)
    else             a <- ace(x[,-i], x[,i])
    xt[,i] <- a$ty
    rsq[i] <- a$rsq
    if(pl) plot(x[,i], xt[,i], xlab=nam[i],
                ylab=paste('Transformed', nam[i]))
  }
  cat('R-squared achieved in predicting each variable:\n\n')
  print(rsq)
  attr(xt, 'rsq') <- rsq
  attr(xt, 'omitted') <- omitted
  invisible(xt)
}

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] impute missing values in correlated variables: transcan?
Jonathan Baron wrote:
> I would like to impute missing data in a set of correlated variables
> (columns of a matrix). It looks like transcan() from Hmisc is roughly
> what I want. It says, "transcan automatically transforms continuous
> and categorical variables to have maximum correlation with the best
> linear combination of the other variables" and, "by default, transcan
> imputes NAs with best guess expected values of transformed variables,
> back transformed to the original scale." But I can't get it to work.
> I say:
>
> m1 <- matrix(1:20 + rnorm(20), 5, )  # four correlated variables
> colnames(m1) <- paste("R", 1:4, sep="")
> m1[c(2,19)] <- NA                    # simulate some missing data
> library(Hmisc)
> transcan(m1, data=m1)
>
> and I get:
>
> Error in rcspline.eval(y, nk = nk, inclx = TRUE) :
>   fewer than 6 non-missing observations with knots omitted

Jonathan - you would need many more observations to be able to fit flexible additive models as transcan does. Also note that single imputation has problems, and you may want to consider multiple imputation as done by the Hmisc aregImpute function, if you had more data.

Frank

> I've tried a few other things, but I think it is time to ask for help.
> The specific problem is a real one. Our graduate admissions committee
> (4 members) rates applications, and we average the ratings to get an
> overall rating for each applicant. Sometimes one of the committee
> members is absent, or late; hence the missing data. The members differ
> in the way they use the rating scale, in both slope and intercept (if
> you regress each on the mean). Many decisions end up depending on the
> second decimal place of the averages, so we want to do better than
> just averaging the non-missing ratings. Maybe I'm just not seeing
> something really simple. In fact, the problem is simpler than transcan
> assumes, since we are willing to assume linearity of the regression of
> each variable on the other variables. Other members proposed solutions
> that assumed this, but they did not take into account the fact that
> missing data at the high or low end of each variable (each member's
> ratings) would change its mean. Jon

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
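Given the stated willingness to assume linearity, a bare-bones regression imputation for one rater might look like the sketch below (column names match the toy example; it needs more than the 5 toy rows to fit, and single imputation still understates uncertainty, as noted above):

d    <- as.data.frame(m1)                        # ratings R1-R4, NAs in R4
miss <- is.na(d$R4)
fit  <- lm(R4 ~ R1 + R2 + R3, data=d, subset=!miss)
d$R4[miss] <- predict(fit, newdata=d[miss, ])    # fill in predicted ratings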
[R] [R-pkgs] New Hmisc Package Available
An updated version of Hmisc is now available from CRAN. The Web site for Hmisc is http://biostat.mc.vanderbilt.edu/s/Hmisc. The change log may be found at http://biostat.mc.vanderbilt.edu/changelog/Hmisc.html. Changes made after 2004-11-24 should be ignored; these will be in the next version.

The most major change in Hmisc is that, thanks to discussions with a highly respected, persistent, and convincing member of the R community, Hmisc no longer overrides [.factor to drop unused factor levels by default upon subsetting. You can get the old behavior at any time by typing dropUnusedLevels(). A message to that effect appears when you attach the package. Also, Hmisc no longer overrides the interaction function, as the R built-in version provides the sep argument. In the next version, the summary.formula function will be updated to automatically drop unused levels when this is appropriate.

Some other key changes are the addition of bubble plot capability in xYplot and the addition of the capability of csv.get to correct certain date formatting inconsistencies. sasxport.get now has keep and drop arguments to restrict attention to a subset of datasets when many datasets are being imported. Other changes include bug and documentation fixes.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
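To see the behavior change in miniature (this is standard base R factor semantics, which Hmisc now leaves alone):

f <- factor(c('a', 'b', 'c'))
levels(f[1:2])           # "a" "b" "c" -- subsetting keeps unused levels
levels(factor(f[1:2]))   # "a" "b"     -- re-factoring drops them
dropUnusedLevels()       # restores the old Hmisc subsetting behavior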
Re: [R] What is the most useful way to detect nonlinearity in logistic regression
(Ted Harding) wrote:
> On 05-Dec-04 Patrick Foley wrote:
>> It is easy to spot response nonlinearity in normal linear models
>> using plot(something.lm). However, plot(something.glm) produces
>> artifactual peculiarities, since the diagnostic residuals are
>> constrained by the fact that y can only take values 0 or 1. What do R
>> users find most useful in checking the linearity assumption of
>> logistic regression (i.e. log-odds = a + bx)?
>> Patrick Foley [EMAIL PROTECTED]
>
> The most useful way to detect nonlinearity in logistic regression is:
> a) have an awful lot of data;
> b) have the x (covariate) values judiciously placed.
>
> Don't be optimistic about this problem. The amount of information,
> especially about non-linearity, in binary responses is often a lot
> less than people intuitively expect. This is an area where R can be
> especially useful for self-education, by exploring possibilities and
> simulation. For example, define the function (for quadratic
> nonlinearity):
>
> testlin2 <- function(a, b, N) {
>   x  <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
>   lp <- a*x + b*x^2
>   p  <- exp(lp)/(1 + exp(lp))
>   n  <- N*c(1,1,1,1,1)
>   r  <- c(rbinom(1, n[1], p[1]), rbinom(1, n[2], p[2]),
>           rbinom(1, n[3], p[3]), rbinom(1, n[4], p[4]),
>           rbinom(1, n[5], p[5]))
>   resp <- cbind(r, n - r)
>   X <- cbind(x, x^2); colnames(X) <- c("x", "x2")
>   summary(glm(formula = resp ~ X - 1, family = binomial),
>           correlation = TRUE)
> }
>
> This places N observations at each of (-1.0, -0.5, 0.0, 0.5, 1.0),
> generates the N binary responses with probability p(x) where
> log(p/(1-p)) = a*x + b*x^2, fits a logistic regression forcing the
> intercept term to be 0 (so that you're not diluting the info by
> estimating a parameter you know to be 0), and returns the
> summary(glm(...)) from which the p-values can be extracted. The
> p-value for x^2 is testlin2(a,b,N)$coefficients[2,4].
>
> You can run this function as a one-off for various values of a, b, N
> to get a feel for what happens. You can run a simulation on the lines
> of:
>
> pvals <- numeric(1000)
> for(i in (1:1000)) {
>   pvals[i] <- testlin2(1, 0.1, 500)$coefficients[2,4]
> }
>
> so that you can test how often you get a significant result. For
> example, adopting the ritual "significant == P < 0.05, power = 80%",
> you can see a histogram of the p-values over the conventional
> significance breaks with
>
> hist(pvals, breaks=c(0,0.01,0.03,0.1,0.5,0.9,0.95,0.99,1), freq=TRUE)
>
> and you can see your probability of getting a significant result as,
> e.g., sum(pvals < 0.05)/1000.
>
> I found that, with testlin2(1, 0.1, N), i.e. a = 1.0, b = 0.1,
> corresponding to log(p/(1-p)) = x + 0.1*x^2 (a possibly interesting
> degree of nonlinearity), I had to go up to N=2000 before I was getting
> more than 80% of the p-values < 0.05. That corresponds to 2000
> observations at each value of x, or 10,000 observations in all.
> Compare this with a similar test for non-linearity with
> normally-distributed responses [exercise for the reader].
>
> You can write functions similar to testlin2 for higher-order
> nonlinearities, e.g. testlin3 for a*x + b*x^3, testlin23 for
> a*x + b*x^2 + c*x^3, etc. (the modifications required are obvious),
> and see how you get on. As I say, don't be optimistic! In particular,
> run testlin3 a few times and see the sort of mess that can come out --
> in particular gruesome correlations, which is why correlation=TRUE is
> set in the call to summary(glm(...), correlation=TRUE).
>
> Best wishes, Ted.
library(Design)  # also requires Hmisc
f <- lrm(sick ~ sex + rcs(age,4) + rcs(blood.pressure,5))
# Restricted cubic spline in age with 4 knots, blood.pressure with 5
anova(f)    # automatic tests of linearity of all predictors
latex(f)    # see algebraic form of fit
summary(f)  # get odds ratios for meaningful changes in predictors

But beware of using the tests of linearity. If non-significant results cause you to reduce an effect to linear, confidence levels and type I errors are no longer preserved. I use tests of linearity mainly to demonstrate that effects are more often nonlinear than linear, given sufficient sample size. I.e., good analysts are needed. I usually leave non-significant nonlinearities in the model.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] Gam() function in R
Yves Magliulo wrote:
> So the mgcv package is the one I need! Indeed, I want integrated
> smoothness selection and smooth interactions rather than stepwise
> selection. I have a lot of predictors, and I use gam to select those
> that are useful and exclude the others (using p-values).

It is interesting that you use P-values but do not care that the strategy you use (variable selection, as opposed to pre-specifying models or just using shrinkage) does not preserve type I error or confidence interval coverage probabilities in subsequent analyses with mgcv.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
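A sketch of the shrinkage alternative within mgcv itself (the select argument exists in later mgcv versions; variable names are hypothetical): fit the full model once and let the extra penalty shrink uninformative terms, instead of excluding them by p-value.

library(mgcv)

f <- gam(y ~ s(x1) + s(x2) + s(x3), data=d, select=TRUE)
summary(f)   # the added penalty can shrink entire smooth terms toward zero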
Re: [R] cbind() and factors.
Gabor Grothendieck wrote:
> michael watson (IAH-C) <michael.watson at bbsrc.ac.uk> writes:
>
>> Hi, I'm seeing some odd behaviour with cbind(). My code is:
>>
>> cat <- read.table("cogs_category.txt", sep="\t", header=TRUE,
>>                   quote=NULL, colClasses="character")
>> colnames(cat)
>> [1] "Code"        "Description"
>> is.factor(cat$Code)
>> [1] FALSE
>> is.factor(cat$Description)
>> [1] FALSE
>> is.factor(rainbow(nrow(cat)))
>> [1] FALSE
>> cat <- cbind(cat, Color=rainbow(nrow(cat)))
>> is.factor(cat$Color)
>> [1] TRUE
>> ?cbind
>>
>> I read in a text file which has two columns, Code and Description.
>> Neither of these is a factor. I want to add a column of colours to
>> the data frame using rainbow(). The rainbow function also does not
>> return a factor. However, if I cbind my data frame (which has no
>> factors in it) and the result of rainbow() (which is a vector, not a
>> factor), then for some reason the new column is a factor...??
>
> Others have already explained the problem and given what is likely the
> best solution, but here is one other idea, just in case. You may
> require a data frame depending on what you want to do, but if you
> don't, then you could alternately use a character matrix, since that
> won't result in any conversions to factor. Let's call the data frame
> from read.table Cat.df, and our matrix Cat.m. ("cat" is not wrong, but
> it's confusing, since there is a common R function called cat.) Now we
> can write the following and don't have to worry about factors:
>
> Cat.df <- read.table(...)
> # create a character matrix and cbind Colors to it
> Cat.m <- cbind(as.matrix(Cat.df), Color = rainbow(nrow(Cat.df)))
>
> If you do find you need a data frame later, you can convert it back
> like this:
>
> Cat.df <- as.data.frame(Cat.m)
> Cat.df[] <- Cat.m  # clobber factors with character data

For speed, the mApply function in the Hmisc package (used by the Hmisc summarize function) does looping for stratified statistical summaries by operating on matrices rather than data frames. Factors are converted to numerics, and service routines can save and restore the levels and other attributes. Here is an example from the summarize help file, plus related examples:

# To run mApply on a data frame:
m <- mApply(asNumericMatrix(x), race, h)
# Here assume h is a function that returns a matrix similar to x
at <- subsAttr(x)       # get original attributes and storage modes
matrix2dataFrame(m, at)

# Get stratified weighted means
g <- function(y) wtd.mean(y[,1], y[,2])
summarize(cbind(y, wts), llist(sex,race), g, stat.name='y')
mApply(cbind(y,wts), llist(sex,race), g)

# Compare speed of mApply vs. by
d <- data.frame(sex=sample(c('female','male'), 10, TRUE),
                country=sample(letters, 10, TRUE),
                y1=runif(10), y2=runif(10))
g <- function(x) {
  y <- c(median(x[,'y1'] - x[,'y2']),
         med.sum=median(x[,'y1'] + x[,'y2']))
  names(y) <- c('med.diff', 'med.sum')
  y
}

system.time(by(d, llist(sex=d$sex, country=d$country), g))
system.time({
  x <- asNumericMatrix(d)
  a <- subsAttr(d)
  m <- mApply(x, llist(sex=d$sex, country=d$country), g)
})
system.time({
  x <- asNumericMatrix(d)
  summarize(x, llist(sex=d$sex, country=d$country), g)
})

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
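One more variant worth noting as a sketch: assigning the new column directly avoids the cbind dispatch entirely, so no factor coercion happens (same file as above):

Cat.df <- read.table('cogs_category.txt', sep='\t', header=TRUE,
                     quote=NULL, colClasses='character')
Cat.df$Color <- rainbow(nrow(Cat.df))  # plain character column, no cbind
is.factor(Cat.df$Color)                # FALSE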
Re: [R] how to fit a weighted logistic regression?
Kerry Bush wrote:
> I tried lrm in library(Design), but there is always some error
> message. Is this function really doing weighted logistic regression,
> maximizing the following likelihood:
>
>   sum_i w_i * (y_i * x_i'beta - log(1 + exp(x_i'beta)))
>
> Does anybody know a better way to fit this kind of model in R? FYI,
> one example of getting an error message is:
>
> x <- runif(10, 0, 3)
> y <- c(rep(0,5), rep(1,5))
> w <- rep(1/10, 10)
> fit <- lrm(y ~ x, weights=w)
> Warning message:
> currently weights are ignored in model validation and bootstrapping
>   lrm fits in: lrm(y ~ x, weights = w)
>
> Although the model can be fit, the above warning makes me
> uncomfortable. Can anybody explain it a little bit?

The message means exactly what it says. Model validation in Design currently cannot incorporate weights for lrm. Everything else is OK.

Frank Harrell

> Best wishes, Feixia

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
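As a sketch for the likelihood in the question itself: glm maximizes exactly that weighted binomial log-likelihood when given prior weights, and quasibinomial gives the same point estimates while suppressing the non-integer-count warnings (using the poster's toy data):

x <- runif(10, 0, 3)
y <- c(rep(0, 5), rep(1, 5))
w <- rep(1/10, 10)
fit <- glm(y ~ x, family=quasibinomial, weights=w)  # weighted logistic fit
coef(fit)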
Re: [R] MIME decoding in Mozilla Thunderbird
CG Pettersson wrote:
> Hello all! I recently switched mail programs to Mozilla Thunderbird,
> running on W2k. Everything works fine except the MIME digests from
> this list: the decoding doesn't work properly. I had contact with
> Martin Maechler some time ago, and he suggested a try on this list for
> ideas on how to do the decoding in Windows, even if it's not a proper
> R question. Outlook does the proper job on the MIME, but using that is
> to go a bit far! Or? /CG

Thunderbird, which is an otherwise wonderful mail client, does not work for r-help digests. The reason is that if you receive 100 messages in a day, Thunderbird inefficiently handles all the MIME 'attachments', and navigating them all is incredibly slow. I didn't have the decoding problem you mentioned, though.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] AUC for logistic regression [was: (no subject)]
Joe Nocera wrote:
> I believe that Roman is referring to AUC as the Area Under the Curve
> from a Receiver Operating Characteristic. If this is indeed your
> quantity of interest, it can be calculated in R. You can download code
> at:
>
> http://www.bioconductor.org/repository/release1.5/package/Win32/
> and/or
> http://biostat.ku.dk/~bxc/SPE/library/
>
> Check out the archives - I'm sure there is more there if you search
> "ROC" instead. Cheers, Joe
>
> Quoting Spencer Graves [EMAIL PROTECTED]:
>> What's AUC? If you mean AIC (Akaike Information Criterion), and if
>> you fit logistic regression using glm, the help file says that glm
>> returns an object of class "glm", which is a list containing among
>> other things a component "aic". For example, suppose you fit a model
>> as follows:
>>
>> fit <- glm(y ~ x, family=binomial(), ...)
>>
>> Then fit$aic returns the AIC. You may also wish to consider anova and
>> anova.glm. Hope this helps. Spencer Graves
>>
>>> [EMAIL PROTECTED] wrote:
>>> Dear R-helper, I would like to compare the AUC of two logistic
>>> regression models (same population). Is it possible with R?
>>> Thank you. Roman Rouzier

AUC is standard output of the lrm function in the Design package (the "C" index). validate.lrm computes the overfitting-corrected C index.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
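A sketch of both routes with invented names: lrm reports the concordance directly, and somers2 in Hmisc computes it from any vector of predicted probabilities.

library(Design)   # loads Hmisc too

f <- lrm(y ~ x1 + x2, data=d)
f$stats['C']                  # AUC (C index) for this model

phat <- plogis(predict(f))    # predicted probabilities
somers2(phat, d$y)            # C, Dxy, n from raw predictions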
Re: [R] SAS or R software
Alexander C Cambon wrote:
> I apologize for adding this so late to the "SAS or R software" thread.
> This is a question, not a reply, but it seems to me to fit in well
> with the subject of this thread. I would like to know anyone's
> experiences in the following two areas (I should add I have no
> experience myself in these areas):
>
> 1) Migrating from SAS to R in the choice of statistical software used
> for FDA reporting. (For example, was there more effort involved in
> areas of documentation, revision tracking, or validation of software
> code?)

FDA has no requirements. They accept Minitab and even accept Excel. The requirement is to be a good statistician doing quality, reproducible work for its own sake.

> 2) Migrating from SAS to R in the choice of statistical software used
> for NIH reporting (or other US or non-US government agencies).

No issues.

Frank

> I find myself using R more and more and being continually amazed by
> its breadth of capabilities, though I have not tried ordering pizza
> yet. I use SAS, S-Plus, and, more recently, R for survival analysis
> and recurrent events in clinical trials.
>
> Alex Cambon, Biostatistician, School of Public Health and Information
> Sciences, University of Louisville

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] SAS or R software
[...] imported against the number reported by PROC CONTENTS, so this problem is easily detected and corrected with emacs. Note that with literally billions of dollars at their disposal, SAS didn't take the time to really write a procedure for PROC EXPORT. Like the R sas.get function, it generates voluminous SAS DATA step code to do the work.

Regarding CDISC, the SAS transport format that is now accepted by FDA is deficient because there is no place for certain metadata (e.g., units of measurement; value labels are remote from the datasets; variable names are truncated to 8 characters). The preferred format for CDISC will become XML.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] SAS or R software
Henric Nilsson wrote:
> Marc Schwartz said the following on 2004-12-18 01:19:
>> As you are likely aware, other statistically relevant issues are
>> contained in various ICH guidance documents regarding GCP
>> considerations and principles for clinical trials:
>> http://www.ich.org/[EMAIL PROTECTED]@_TEMPLATE=272
>
> ICH E9 states that (p. 27): "The computer software used for data
> management and statistical analysis should be reliable, and
> documentation of appropriate software testing procedures should be
> available."
>
> Some commercial software vendors (SAS, Insightful, and StatSoft) offer
> white papers stating that their software can work within a 21 CFR Part
> 11 compliant system:
>
> http://www.sas.com/industry/pharma/develop/papers.html
> http://www.insightful.com/industry/pharm/21cfr_part11_Final.pdf
> http://www.statsoft.com/support/whitepapers/pdf/STATISTICA_CFR.pdf
>
> Some commercial vendors (SAS and Insightful) also offer tools for
> validation of the installation and operation of the software. SAS has
> http://support.sas.com/documentation/installcenter/common/91/ts1m3/qualification_tools_guide.pdf
> and S-PLUS has validate().
>
> As a statistical consultant working within the pharmaceutical
> industry, I think that our clients find the white papers to be some
> kind of quality seal. It signals that someone has actually thought
> about the issues involved, written a document about it, and even
> stated that it can be done. Of course, there's a lot of FUD going on
> here. But if our lives can be made simpler by producing similar white
> papers and QA tools, why not? (But for some people, only SAS will do:
> last week we were audited on behalf of a client. One of the specific
> issues discussed was validation and the Part 11 compliance of S-PLUS.
> In this specific trial, data are to be transferred from Oracle
> Clinical -> SAS -> S-PLUS, and the auditors were really worried about
> the first and last links of that chain. Finally, they suggested using
> only SAS... And in this particular case, Part 11 is really a non-issue
> since physical records exist (i.e. case report forms) and all final
> S-PLUS output and code will also be stored physically (i.e.
> print-outs) -- no need for electronic signatures here!)
>
> There is also a general guidance document for computer systems used in
> clinical trials here:
> http://www.fda.gov/ora/compliance_ref/bimo/ffinalcct.htm
> though it is to be superseded by a draft document here:
> http://www.fda.gov/cder/guidance/6032dft.htm
>
> From the introduction (p. 2): "This document provides guidance about
> computerized systems that are used to create, modify, maintain,
> archive, retrieve, or transmit clinical data required to be maintained
> and/or submitted to the Food and Drug Administration (FDA)." The
> `retrieve' part is certainly applicable. ... Henric

That is not clear. And since FDA allows submissions using Excel, with not even an audit trail, and with known major statistical computing errors in Excel, I am fairly certain that it is not applicable, or at the least is not enforced in any meaningful way.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] SAS or R software
Henric Nilsson wrote:
> Frank E Harrell Jr said the following on 2004-12-18 15:03:
>> That is not clear.
>
> Perhaps. And I think this is the issue. From the clients' perspective,
> not a single FDA document states that you can use software other than
> SAS. They haven't really thought about the fact that there aren't any
> FDA documents encouraging the use of SAS for statistical analyses
> either.

Right. This reminds me of the worst movie of all time, Plan 9 From Outer Space, in which the psychic Creskin closes the movie by saying "Can you prove that this DIDN'T happen?".

> I don't think that the real problem is convincing regulatory
> authorities that R (or any other (open-source) software, for that
> matter) is operating adequately. But clients and auditors seem to
> reason along the lines of rather being safe than sorry, and "nobody's
> ever been criticized for using SAS." From their perspective, when we
> propose using `some other' software, they start thinking that it may
> jeopardize their trial results (and, all too often, "but doesn't FDA
> require SAS?").

Yes, that is the hurdle.

> How to fight this? I don't know. Right now I'm thinking "if you can't
> beat 'em, join 'em," and that the way of proving that `some other'
> software works is through having similar documents and tools as the
> commercial vendors.

With the job market for statisticians being excellent, I've often wondered why clinical statisticians in industry are so often timid. Statisticians need to show strength and stamina, along with good teaching skills, on this issue.

>> And since FDA allows submissions using Excel, with not even an audit
>> trail, and with known major statistical computing errors in Excel, I
>> am fairly certain that it is not applicable or at the least is not
>> enforced in any meaningful way.
>
> The general preconception seems to be that neither SAS nor Excel needs
> validation. E.g. the British guideline referenced in my previous email
> states on p. 12 that "It is generally considered that there is no
> requirement for validation of commercial hardware and established
> operating systems or for packages such as the SAS system, Oracle and
> MS Excel, as entities in their own right. However, most are
> configurable systems and so need adequate control of installation and
> their configuration parameters."

This makes me wonder about the British system. Have they not seen the serious calculation errors documented to be in Excel?

> Luckily for Excel, not a single word about precision and adequacy...

Right. Thanks for your note, Henric.

-Frank

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] SAS or R software
(Ted Harding) wrote: On 19-Dec-04 Tim Churches wrote: Shawn Way wrote: I've seen multiple comments about MS Excel's precision and accuracy. Can you please point me in the right direction in locating information about these? As always, Google is your friend, but see for example http://www.nwpho.org.uk/sadb/Poisson%20CI%20in%20spreadsheets.pdf . . . Also see http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/ExcelProblems -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Producing Editable Graphs for PowerPoint
Paul Hewson wrote: Hello, (apologies, I'm not entirely sure whether this question is about R or my limitations with PowerPoint). I've submitted a paper (which has been accepted) but the journal now requires me to submit graphs that are editable in PowerPoint. I would be grateful for suggestions as to how I should do this. The best route seems to be to copy-and-paste the figures from the windows() device as a metafile. However, on converting the graphs to an editable format and ungrouping them, it appears that every line, tickmark, dot and dash is treated as a separate entity by PowerPoint (for example, even the horizontal and vertical parts of '+' are separated). Alternatively, reading in the files from a saved metafile is even more problematic: having used par(new = TRUE) I have a set of layered graphs, and all but the last one disappear once I open the graphic for editing. Is this normal for a figure editable by PowerPoint or am I doing something horribly wrong somewhere? Many thanks Paul -=-=-=-=-=-=-=-=-=-=-=-= Paul Hewson Lecturer in Statistics School of Mathematics and Statistics University of Plymouth Drake Circus Plymouth PL4 8AA tel (01752) 232778 fax (01752) 232780 email [EMAIL PROTECTED] www.plymouth.ac.uk I had hoped that journals would be engaged in reproducible research by now. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] How to display each symbol in a different color using plot with summary.formula.reverse
Greg Blevins wrote: Dear R Masters, I have searched high and low (the help archives and my various R reference material and help files) for a solution to what appears to me to be quite a simple problem. In the following syntax, variable n10 has three levels. I would like the symbols that appear in the graph for these three levels to be different colors. The best I have been able to do is to have the Key display three colors, but the symbols in the graph only show up in black and white. Any suggestions?
par(cex=.8)
s <- summary(n10 ~ n13, method='reverse', test=TRUE)
plot(s, dotsize=1.2, cex.labels=.7, cex.axis=.5, cex.main=.5, which='categorical')
Key(locator(1))
R version 2.0.1, Windows XP. Thank you, Greg Blevins The Market Solutions Group Greg, These kinds of questions are probably best directed at package maintainers. This would be an enhancement to the dotchart2 function used by plot.summary.formula.reverse, plus the addition of a new argument passed through to it from plot. I have found symbols (esp. circle vs. triangle) to be more effective for this purpose, so I am not motivated to work on this very soon, but I would consider it. -Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] GAM: Overfitting
Jean G. Orelien wrote: I am analyzing particulate matter data (PM10) on a small data set (147 observations). I fitted a semi-parametric model and am worried about overfitting. How can one check for model fit in GAM? Jean G. Orelien It's good to separate 'model fit' (or lack of fit) from 'overfitting'. Overfitting can cause the model fit to appear to be excellent, but there is still a huge problem. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
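A quick way to see Frank's point (a sketch with invented data; the choice of 20 df is arbitrary): an overly flexible fit to pure noise still produces a seemingly good apparent fit, which a naive fit check would not flag.
set.seed(1)
x <- runif(147)             # same n as in the post above
y <- rnorm(147)             # response is pure noise, unrelated to x
fit <- lm(y ~ poly(x, 20))  # overly flexible 20-parameter fit
summary(fit)$r.squared      # a 'respectable' apparent R-squared from noise alone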
Re: [R] Developing R classes
Gabor Grothendieck wrote: Chambers' book, Programming with Data, has a lot about S4. Note that S4 has more features but S3 is simpler and has higher performance, so it's not all one way. Also, they are related, so if you learn the simpler S3 first it will make it easier to learn S4 later. I suggest you download and read the source code for package `zoo' for S3 and package `its' for S4. Both packages define irregular time series classes. I agree with Gabor. S3 is much easier and faster to implement and has flexibility advantages (e.g., adding new elements to fit objects for debugging). Frank Date: Tue, 28 Dec 2004 13:46:10 -0300 (ART) From: Gilvan Justino [EMAIL PROTECTED] To: r-help@stat.math.ethz.ch Subject: [R] Developing R classes Hi, I'm trying to write some R classes but I didn't find enough documentation to develop them. I read there are 2 ways to write classes, using the S3 or S4 models, and it seems that S4 is the better model, so I think I should use that one. I'm a new user of R and I've searched the net for information about creating new classes. I found this document: http://www.biostat.harvard.edu/courses/individual/bio271/lectures/L11/S4Objects.pdf If someone knows some docs about creating our own classes, could you please post their URLs here? One more question... It seems that some documents about S can be applied to R. How do you know if you can use this documentation with R? I'm asking this because some of the stuff I was looking for I found explained using S and not R. Am I right or am I looking in the wrong place? Thanks a lot! __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
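As a minimal illustration of the S3 route Gabor and Frank recommend starting with (the class name 'person' and its fields are made up for the example):
person <- function(name, age)
  structure(list(name = name, age = age), class = 'person')
print.person <- function(x, ...) {
  cat(x$name, 'is', x$age, 'years old\n')
  invisible(x)       # print methods conventionally return the object invisibly
}
p <- person('Gilvan', 30)
p                    # automatically dispatches to print.person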
[R] Black and white graphics and transparent strip panels with lattice under Sweave
What is the most elegant way to specify that strip panels are to have transparent backgrounds and graphs are to be in black and white when lattice is being used with Sweave? I would prefer a global option that stays in effect for multiple plots. If this is best done with a theme, does anyone have a lattice theme like col.whitebg but that is for black and white? I'm using the following with lattice 0.10-16, grid 2.0.0: platform i586-mandrake-linux-gnu; arch i586; os linux-gnu; system i586, linux-gnu; major 2; minor 0.0; year 2004; month 10; day 04; language R (i.e., R 2.0.0). Thanks, Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Black and white graphics and transparent strip panels with lattice under Sweave
Deepayan Sarkar wrote: On Sunday 02 January 2005 19:40, Frank E Harrell Jr wrote: What is the most elegant way to specify that strip panels are to have transparent backgrounds and graphs are to be in black and white when lattice is being used with Sweave? I would prefer a global option that stays in effect for multiple plots. If this is best done with a theme, does anyone have a lattice theme like col.whitebg but that is for black and white? I'd do something like this as part of the initialization: ...
library(lattice)
ltheme <- canonical.theme(color = FALSE)       ## in-built B&W theme
ltheme$strip.background$col <- 'transparent'   ## change strip bg
lattice.options(default.theme = ltheme)        ## set as default
@ Deepayan That worked perfectly. Thank you very much Deepayan. -Frank __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] different DF in package nlme and lme4
Douglas Bates wrote: Christoph Buser wrote: Hi all I tried to reproduce an example with lme and used the Orthodont dataset.
library(nlme)
fm2a.1 <- lme(distance ~ age + Sex, data = Orthodont, random = ~ 1 | Subject)
anova(fm2a.1)
... Regards, Christoph Buser No. The calculation of denominator degrees of freedom in lme4 is bogus and I believe this is documented. Note that for all practical purposes there is very little difference between 25 and 100 denominator degrees of freedom. lme4 is under development (and has been for a seemingly interminable period of time). Getting the denominator degrees of freedom calculation right is way down the list of priorities. Many people express dismay about the calculation of denominator degrees of freedom in all versions of lme4. IIRC Frank Harrell characterizes this as one of the foremost deficiencies in R relative to SAS. I don't agree that this is a glaring deficiency. In fact I believe that there is no correct answer. The F statistics in a mixed model do not have an F distribution under the null hypothesis. It's all an approximation, which is why I don't stay up nights worrying about the exact details of the approximation. Doug - the main concern is accurate P-values; I don't really care which approximations are best, just that the ones used are at least as good as those in SAS. Without being an expert, I have come to believe that at the moment SAS is better than R in 2 areas: accurate P-values from mixed models and handling massive databases. On the former point I could easily be swayed by some type I error simulations. My plan for lme4 is that one slot in the summary object for an lme model will be an incidence table of terms in the fixed effects versus grouping factors for the random effects. This table will indicate whether a given term varies within groups defined by the grouping factor. Anyone who wants to implement their personal favorite calculation of denominator degrees of freedom based on this table will be welcome to do so. I will be interested also to see timings of lme4 (using S4) vs nlme (using S3) for the same model. Cheers, Frank I personally think that tests on the fixed-effects terms will be better implemented using the restricted likelihood-ratio tests defined by Reinsel and Ahn rather than the Wald tests, and the whole issue of denominator degrees of freedom may be moot. My apologies if I seem to be peeved. I am not upset by your question - it is an entirely reasonable question. It is just that I have discussed the issue of denominator degrees of freedom too many times. To me a more important objective of lme4 is to be able to handle random effects associated with crossed or partially crossed grouping factors. I believe that in those cases the calculation of denominator degrees of freedom will be very complicated and even more of an approximation than in the case of nested grouping factors. This is why I would rather finesse the whole issue by using the Reinsel and Ahn approach. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] JOBS: Vanderbilt University Department of Biostatistics
The well-supported and rapidly-growing Department of Biostatistics in the Vanderbilt University School of Medicine, Nashville, Tennessee, has multiple openings for persons with an MS or PhD in biostatistics, at all levels. We are especially recruiting for the following positions. - A senior faculty member to lead a multi-center clinical trial data coordinating center. The individual must have extensive experience in NIH-sponsored data coordinating centers. - A senior faculty member with experience in health services or outcomes research or clinical epidemiology to lead a biostatistics unit for surgical sciences. - Senior MS biostatisticians with 5 or more years of experience. At Vanderbilt, such persons have faculty positions. - Associate professors - Junior MS biostatisticians We are also seeking assistant and full professors. The department currently has 16 PhD faculty biostatisticians, 6 senior and 5 junior MS biostatisticians, and 10 computer systems analysts and plans to grow significantly in size over the next three years, continuing to add a significant number of faculty and staff biostatisticians after that. In general we seek individuals with expertise in clinical trials, health services and outcomes research, Bayesian methods, multicenter clinical trial data management, imaging, and modern statistical computing (especially R or S-Plus). See http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/JobOpenings for more information. All candidates must enjoy collaborative research and possess excellent written and verbal communication skills. Send CV to Search Committee, Biostatistics, S-2323 MCN, Vanderbilt University, Nashville, TN 37232-2158. Vanderbilt University is an equal opportunity/affirmative action employer; all qualified persons are encouraged to apply. CVs may be sent electronically to both [EMAIL PROTECTED] and [EMAIL PROTECTED] Please state in the subject of the E-mail the position for which you are applying or inquiring. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] labels attached to variable names
Denis Chabot wrote: Hi, Can we attach a more descriptive label (I may use the wrong terminology, which would explain why I found nothing in the FAQ) to variable names, and later have an easy way to switch to these labels in plots? I fear this is not possible and one must enter this by hand as ylab and xlab when making plots. Thanks in advance, Denis Chabot The Hmisc package supports this:
label(x) <- 'Some Label'
describe(x)   # uses variable name and label
plot(x, y, xlab=label(x))
Better:
xYplot(x, y)  # label used automatically
And if you do
units(x) <- 'whatever units of measurement'
then xYplot, describe, and other Hmisc functions will include the units (in a different font on graphs or when using latex()). Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] R-etiquette
Anne wrote: I'm about to present a report (for internal use of a governmental agency). I used R extensively, along with contributed packages, as well as communications on the R list. As well as citing R, I would like to know how to cite the contributed packages (it is not so easy, as some have been used extensively, others marginally, some are called from another package, and some were not used as software but gave me insight into how to proceed), as well as thank the persons who generously responded to my requests on the list... Is there an established way of doing that? Thanks I take the occasion to thank all of you! The R project is really a wonderful idea and realisation. Anne Good question Anne. I don't know of an established way, but here are examples based on the way I prefer people to cite my packages. For Hmisc, one of the two following ways works (BibTeX format): @MISC{Hmisc, author = {Harrell, Frank E.}, year = 2005, title = {{\tt Hmisc}: {A} library of miscellaneous \textsc{S} functions}, howpublished = {Available from {\tt biostat.\-mc.\-vanderbilt.\-edu/\-s/Hmisc}} } or @MISC{alzsplus, author = {Alzola, Carlos F. and Harrell, Frank E.}, year = 2004, title = {An {Introduction} to {S} and the {Hmisc} and {Design} {Libraries}. {Available} from {\tt http://biostat.mc.vanderbilt.edu/\-twiki/pub/Main/\-RS/sintro.pdf}.}, note = {Electronic book, 308 pages} } For Design, one of the following two: @book{rms, author = {Harrell, Frank E.}, year = 2001, title = {Regression Modeling Strategies, with Applications to Linear Models, Survival Analysis and Logistic Regression}, publisher = {Springer}, address = {New York} } or @MISC{Design, author = {Harrell, Frank E.}, year = 2005, title = {{\tt Design}: {S} functions for biostatistical/epidemiologic modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit. Available from {\tt biostat.\-mc.\-vanderbilt.\-edu/s/\-{Design}}} } I am never clear on the proper year to use for referencing the packages by themselves, as the packages are constantly being improved. Frank Anne Piotet Tel: +41 79 359 83 32 (mobile) Email: [EMAIL PROTECTED] --- M-TD Modelling and Technology Development PSE-C CH-1015 Lausanne Switzerland Tel: +41 21 693 83 98 Fax: +41 21 646 41 33 -- [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] R-etiquette
Prof Brian Ripley wrote: On Sun, 9 Jan 2005, Anne wrote: I'm about to present a report (for internal use of governmental agency). I used extensively R , contibuted packages, as well as communications on the R-list As well as citing R, I would like to know how to cite the contributed packages (it is not so easy, as some have been used exensively, other marginally, some are called from another package and some were not used as softwares but gave me insight of how to proceed), as well as thank the persons who generously responded to my demands on the list... Is there an established way of doing that? See the citation() function, which works for packages too and a few authors have taken advantage of the means of customizing it. I keep learning - thanks Brian. I will add CITATION files in the parent directory of my packages. The defaults generated from citation('packagename') for them are not bad. -Frank __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
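For reference, the calls are simply the following (output varies by package and version, and reflects a package's CITATION file if it has one):
citation()         # how to cite R itself
citation('Hmisc')  # citation info for a contributed package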
Re: [R] how to do by-processing using weighted.mean
Denis Chabot wrote: Hi, I'd like to compute the weighted mean of some variables, all using the same weight variable, for each combination of 3 factor variables. I found how I could use summarize (from Hmisc) to do normal means for combinations of 3 factors, but I cannot find a way of doing weighted means. Is it possible in R? Thanks in advance, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html It's in one of the summarize examples! -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
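A sketch of the pattern used in the summarize examples, with an invented data frame: pass the weight variable alongside the analysis variable and apply wtd.mean inside FUN.
library(Hmisc)
set.seed(1)
d <- data.frame(y = rnorm(60), w = runif(60),
                g1 = gl(2, 30), g2 = gl(3, 10, 60))
with(d, summarize(cbind(y, w), llist(g1, g2),
                  function(z) wtd.mean(z[, 1], weights = z[, 2]),
                  stat.name = 'weighted.mean'))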
Re: [R] R-etiquette
Liaw, Andy wrote: I'd thank them in the acknowledgement section. I think some (most?) journal will allow such a section, and most people use that to thank their research funding sources and/or collaborators who had not made the list of authors. Andy You do need to get permission from people to include their names in an Ack. section, usually. Referencing as a personal communication is perhaps better. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Standard error for the area under a smoothed ROC curve?
Dan Bolser wrote: Hello, I am making some use of ROC curve analysis. I find much help on the mailing list, and I have used the Area Under the Curve (AUC) functions from the ROC package in the Bioconductor project... http://www.bioconductor.org/repository/release1.5/package/Source/ROC_1.0.13.tar.gz However, I read here... http://www.medcalc.be/manual/mpage06-13b.php The 95% confidence interval for the area can be used to test the hypothesis that the theoretical area is 0.5. If the confidence interval does not include the 0.5 value, then there is evidence that the laboratory test does have an ability to distinguish between the two groups (Hanley & McNeil, 1982; Zweig & Campbell, 1993). But aside from early on, the above article is short on details. Can anyone tell me how to calculate the CI of the AUC calculation? I read this... http://www.bioconductor.org/repository/devel/vignette/ROCnotes.pdf which talks about resampling (by showing R code), but I can't understand what is going on, or what is calculated (the example given is specific to microarray analysis, I think). I think a general AUC CI function would be a good addition to the ROC package. One more thing: in calculating the AUC I see the splines function is recommended over the approx function. Here... http://tolstoy.newcastle.edu.au/R/help/04/10/6138.html How would I rewrite the following AUC functions (adapted from the Bioconductor source) to use splines (or approxfun or splinefun)?
spe # Specificity
 [1] 0.02173913 0.13043478 0.21739130 0.32608696 0.43478261 0.54347826
 [7] 0.65217391 0.76086957 0.89130435 1.00000000 1.00000000 1.00000000
[13] 1.00000000
sen # Sensitivity
 [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9302326 0.8139535
 [8] 0.6976744 0.5581395 0.4418605 0.3488372 0.2325581 0.1162791
trapezint(1-spe, sen)
my.integrate(1-spe, sen)
## Functions ## Nicked (and modified) from the ROC package in Bioconductor.
trapezint <- function (x, y, a = 0, b = 1) {
  if (x[1] > x[length(x)]) {
    x <- rev(x)
    y <- rev(y)
  }
  y <- y[x >= a & x <= b]
  x <- x[x >= a & x <= b]
  if (length(unique(x)) < 2) return(NA)
  ya <- approx(x, y, a, ties = max, rule = 2)$y
  yb <- approx(x, y, b, ties = max, rule = 2)$y
  x <- c(a, x, b)
  y <- c(ya, y, yb)
  h <- diff(x)
  lx <- length(x)
  0.5 * sum(h * (y[-1] + y[-lx]))
}
my.integrate <- function (x, y, t0 = 1) {
  f <- function(j) approx(x, y, j, rule = 2, ties = max)$y
  integrate(f, 0, t0)$value
}
Thanks for any pointers, Dan. I don't see why the above formulas are being used. The Bamber-Hanley-McNeil-Wilcoxon-Mann-Whitney nonparametric method works great. Just get the U statistic (concordance probability) used in the Wilcoxon test. As Somers' Dxy rank correlation coefficient is 2*(C - 0.5) where C is the concordance probability or ROC area, the Hmisc package function rcorr.cens uses U statistic methods to get the standard error of Dxy. You can easily translate this to a standard error of C. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
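A sketch of the translation Frank describes, with assumed inputs (pred a numeric score, y a 0/1 outcome): rcorr.cens reports Dxy and its standard deviation, and since C = (Dxy + 1)/2, SE(C) = SE(Dxy)/2.
library(Hmisc)
r    <- rcorr.cens(pred, y)
C    <- r['C Index']
se.C <- r['S.D.'] / 2          # the reported S.D. refers to Dxy
C + c(-1.96, 1.96) * se.C      # approximate 95% CI for the ROC area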
Re: [R] transcan() from Hmisc package for imputing data
avneet singh wrote: Hello: I have been trying to impute missing values of a data frame which has both numerical and categorical values, using the function transcan(), with little luck. Would you be able to give me a simple example where a data frame is fed to transcan and it spits out a new data frame with the NA values filled in? Or is there any other function that I could use? Thank you avneet It's in the help file for transcan. But multiple imputation is much better, and transcan does not do multiple imputation as well as the newer Hmisc function aregImpute. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
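A sketch of the aregImpute workflow Frank points to (data frame and variable names are invented; lrm is in the Design package):
library(Hmisc)
library(Design)
a <- aregImpute(~ y + age + sex + chol, data = mydata, n.impute = 5)
f <- fit.mult.impute(y ~ age + sex + chol, lrm, a, data = mydata)
# f combines the 5 completed-data fits, with variances adjusted for imputation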
Re: [R] Kolmogorov-Smirnof test for lognormal distribution with estimated parameters
Christoph Buser wrote: Hi Kwabena I once did a simulation, generating normally distributed values (500 values) and calculating a KS test with estimated parameters. Repeating this test many times, I got far fewer significant tests than the roughly 5% expected by chance at level alpha=0.05. So I think that if you estimate the parameters from the data, you fit too well, and the distribution used for the test statistic is not adequate, as is indicated in the help page you cited. There (in the help page) is some literature, but it is not easy stuff to read. Furthermore, I know of no implementation of a KS test which accounts for this estimation of the parameters. I recommend a graphical tool instead of a test:
x <- rlnorm(100)
qqnorm(log(x))
See also ?qqnorm and ?qqplot. If you insist on testing a theoretical distribution, be aware that a non-significant test does not mean that your data have the tested distribution (especially if you have few data: there is no power in the test to detect deviations from the theoretical distribution, and the conclusion that the data fit well is treacherous). If there are enough data I'd prefer a chi-square test to the KS test (but even there I use graphical tools instead). See ?chisq.test. For this test you have to specify classes, and this is subjective (you can't avoid it). You can reduce the DF of the expected chi-square distribution (under H_0) by the number of parameters estimated from the data and will get better results: DF = number of classes - 1 - number of estimated parameters. I think this test is more powerful than the KS test, particularly if you must estimate the parameters from the data. Regards, Christoph It is also a good idea to ask why one compares against a known distributional form. If you use the empirical CDF to select a parametric distribution, the final estimate of the distribution will inherit the variance of the ECDF. The main reason statisticians think that parametric curve fits are far more efficient than nonparametric ones is that they don't account for model uncertainty in their final confidence intervals. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
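A compact version of the simulation Christoph describes (a sketch; the conservatism arises because the test is applied against parameters estimated from the same data):
set.seed(1)
p <- replicate(10000, {
  x <- rnorm(500)
  ks.test(x, 'pnorm', mean(x), sd(x))$p.value
})
mean(p < 0.05)   # far below the nominal 0.05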
Re: [R] Standard error for the area under a smoothed ROC curve?
Dan Bolser wrote: On Wed, 12 Jan 2005, Frank E Harrell Jr wrote: Dan Bolser wrote: [... original message quoted in full above; snipped ...] I don't see why the above formulas are being used. The Bamber-Hanley-McNeil-Wilcoxon-Mann-Whitney nonparametric method works great. Just get the U statistic (concordance probability) used in the Wilcoxon test. As Somers' Dxy rank correlation coefficient is 2*(C - 0.5) where C is the concordance probability or ROC area, the Hmisc package function rcorr.cens uses U statistic methods to get the standard error of Dxy. You can easily translate this to a standard error of C. I am sure I could do this easily, except I can't. The good thing about ROC is that I understand it (I can see it). I know why the area means what it means, and I could even imagine how sampling the data could give a CI on the area. However, I don't know why the area under the ROC curve is well known to be equivalent to the numerator of the Mann-Whitney U statistic - from http://www.bioconductor.org/repository/devel/vignette/ROCnotes.pdf Nor do I know how to calculate the numerator of the Mann-Whitney U statistic. This is clear in the original Bamber or Hanley-McNeil articles.
The ROC area is a linear translation of the mean rank of predicted values in one of the two outcome groups. The little somers2 function in Hmisc shows this:
##S function somers2
##
##Calculates concordance probability and Somers' Dxy rank correlation
##between a variable X (for which ties are counted) and a binary
##variable Y (having values 0 and 1, for which ties are not counted).
##Uses short cut method based on average ranks in two groups.
##
##Usage:
##
## somers2(X,Y)
##
##Returns vector whose elements are C Index, Dxy, n and missing, where
##C Index is the concordance probability and Dxy=2(C Index-.5).
##
##F. Harrell 28 Nov 90 6 Apr 98: added weights
somers2 <- function(x, y, weights=NULL, normwt=FALSE, na.rm=TRUE)
{
  if(length(y) != length(x)) stop('y must have same length as x')
  y <- as.integer(y)
  wtpres <- length(weights)
  if(wtpres && (wtpres != length(x)))
    stop('weights must have same length as x')
  if(na.rm) {
    miss <- if(wtpres) is.na(x + y + weights) else is.na(x + y)
    nmiss <- sum(miss)
    if(nmiss > 0) {
      miss <- !miss
      x <- x[miss]
      y <- y[miss]
      if(wtpres) weights <- weights[miss]
    }
  }
  else nmiss <- 0
  u <- sort(unique(y))
  if(any(y %nin% 0:1)) stop('y must be binary')  ## 7dec02
  if(wtpres) {
    if(normwt) weights <- length(x
Re: [R] empirical (sandwich) SE estimate in lme ()?
A.J. Rossini wrote: Right, but you still can sandwich them if you want. (I recently did that in Proc MIXED, but Michael, I'm not sure how to do it using lme). best, -tony The sandwich estimator can help if the model is misspecified but at a cost of worse precision in estimating variances. Frank On Fri, 14 Jan 2005 11:16:10 -0800, Berton Gunter [EMAIL PROTECTED] wrote: ??? correlated within group errors are explicitly modeled by corStruct classes. See ?lme and Chapter 5.3 in Bates and Pinheiro. -- Bert Gunter Genentech Non-Clinical Statistics South San Francisco, CA The business of the statistician is to catalyze the scientific learning process. - George E. P. Box -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Na Li Sent: Friday, January 14, 2005 8:42 AM To: r-help@stat.math.ethz.ch Subject: [R] empirical (sandwich) SE estimate in lme ()? Is it possible to get the empirical (sandwich) S.E. estimates for the fixed effects in lme () (thus allowing possibly correlated errors within the group)? In SAS you can get it by the 'empirical' option to PROC MIXED. Cheers, Michael -- Na (Michael) Li, Ph.D. Division of Biostatistics A443 Mayo Building, MMC 303 School of Public Health420 Delaware St SE University of MinnesotaMinneapolis, MN 55455 Phone: (612) 626-4765 Email: [EMAIL PROTECTED] Fax: (612) 626-0660 http://www.biostat.umn.edu/~nali __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Omitting constant in ols() from Design
Tobias Muhlhofer wrote: Hi! I need to run OLS regressions with Huber-White sandwich estimators and the corresponding standard errors, without an intercept. What I'm trying to do is create an ols object and then use the robcov() function, on the order of:
f <- ols(depvar ~ ind1 + ind2, x=TRUE)
robcov(f)
However, when I go
f <- ols(depvar ~ ind1 + ind2 - 1, x=TRUE)
I get the following error: Error in ols(nareit ~ SnP500 + d3yrtr - 1) : length of dimnames [2] not equal to array extent Same with +0 instead of -1. Is there a different way to create an ols object without a constant? I can't use lm(), because robcov() needs an object from the Design series. Or is there a different way to go about this? Tobias Muhlhofer ols does not support this. Sorry. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] bwplot: how not to draw outliers
Sundar Dorai-Raj wrote: RenE J.V. Bertin wrote: Hello, and (somewhat belated) best wishes for 2005. Can one tell bwplot not to draw outliers, or at least to exclude them from the vertical axis scaling? If so, how (or what doc do I need to consult)? The options that have this effect in boxplot() do not appear to have any effect with bwplot (although outline=FALSE in boxplot does *not* change the scaling). Thanks, RenE Bertin RenE, There may be other solutions, but you can do this using the prepanel option to set the ylim:
library(lattice)
set.seed(1)
z <- data.frame(x = rt(100, 1), g = rep(letters[1:4], each = 25))
bwplot(x ~ g, z, prepanel = function(x, y) {
  bp <- boxplot(split(y, x), plot = FALSE)
  ylim <- range(bp$stats)
  list(ylim = ylim)
})
If you actually want to exclude the points (rather than just prevent outliers from affecting the scale), you will have to modify the panel.bwplot function in addition to using the above. --sundar You may also want to try
library(Hmisc)
library(lattice)
bwplot(..., panel=panel.bpplot)
?panel.bpplot
-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] (no subject)
Virginie Rondeau wrote: Hello I would like to compare the results obtained with a classical semiparametric proportional hazards model with a parametric proportional hazards model using a Weibull distribution. How can we obtain the equivalence of the parameters using coxph (semiparametric model) and survreg (parametric model)? Thanks Virginie In the Design package look at the pphsm function, which converts a survreg Weibull fit (fitted by the psm function, an adaptation of survreg) to PH form. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
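A sketch of the route Frank suggests (variable names assumed; psm and pphsm are in Design):
library(Design)
f.wei <- psm(Surv(time, status) ~ x1 + x2, dist = 'weibull')  # parametric AFT fit
pphsm(f.wei)  # re-expresses the Weibull fit on the proportional hazards
              # scale, so coefficients are comparable to a coxph/cph fit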
Re: [R] Cross-validation accuracy in SVM
Ton van Daelen wrote: Hi all - I am trying to tune an SVM model by optimizing the cross-validation accuracy. Maximizing this value doesn't necessarily seem to minimize the number of misclassifications. Can anyone tell me how the cross-validation accuracy is defined? In the output below, for example, cross-validation accuracy is 92.2%, while the fraction of correctly classified samples is (1476+170)/(1476+170+4) = 99.8%!? Thanks for any help. Regards - Ton Percent correctly classified is an improper scoring rule. The percent is maximized when the predicted values are bogus. In addition, one can add a very important predictor and have the % actually decrease. Frank Harrell --- Parameters: SVM-Type: C-classification SVM-Kernel: radial cost: 8 gamma: 0.007 Number of Support Vectors: 1015 ( 148 867 ) Number of Classes: 2 Levels: false true 5-fold cross-validation on training data: Total Accuracy: 92.24242 Single Accuracies: 90 93.3 94.84848 92.72727 90.30303 Contingency Table:
            predclasses
origclasses false true
      false  1476    0
      true      4  170
__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
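To illustrate Frank's point with the class frequencies from the table above: a constant prediction that ignores the predictors entirely already scores 89.5% accuracy (1476 of the 1650 samples are 'false'), so high accuracy by itself says little about the quality of the predicted probabilities.
y <- rep(c('false', 'true'), c(1476, 174))  # outcome frequencies from the table
mean(y == 'false')                          # 0.895 'accuracy' for always predicting 'false'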
Re: [R] logistic regression 3D-plot
[EMAIL PROTECTED] wrote: Dear R-helpers, I tried to create a 3D surface showing the interaction between two continuous explanatory variables; the response variable is binary (0/1). The model is:
model <- glm(incidence ~ sun*trees, binomial)
Then I used wireframe to create a 3D plot:
xyz <- expand.grid(sun=seq(30,180,1), trees=seq(0,4000,10))
xyz$incidence <- as.vector(predict(model, xyz))
wireframe(incidence ~ sun*trees, xyz, scales=list(arrows=FALSE))
which gives me a 3D plot, but the scaling of the y-axis is wrong: the range is not from 0 to 1. So my question: is there a way to plot these kinds of models, with binary response variables? Thanks for your help, Heike
library(Design)
d <- datadist(mydata); options(datadist='d')
f <- lrm(incidence ~ sun*trees)  # lrm is for binary or ordinal responses
plot(f, sun=NA, trees=NA)  # add method='image' or 'contour' to get other types of graphs
plot(f, sun=NA, trees=NA, fun='plogis')  # probability scale
-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
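A side note on the glm/wireframe code above: predict.glm defaults to the link (logit) scale, which is why the vertical axis does not run from 0 to 1; asking for response-scale predictions fixes that directly:
xyz$incidence <- as.vector(predict(model, xyz, type = 'response'))
wireframe(incidence ~ sun * trees, xyz, scales = list(arrows = FALSE))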
Re: [R] logistic regression 3D-plot CORRECTION
library(Design)
d <- datadist(mydata); options(datadist='d')
f <- lrm(incidence ~ sun*trees)  # lrm is for binary or ordinal responses
plot(f, sun=NA, trees=NA)  # add method='image' or 'contour' to get other types of graphs
plot(f, sun=NA, trees=NA, fun='plogis')  # probability scale
Correction: fun=plogis, not 'plogis'. Sorry about that -FH __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Reading Dates in a csv File
Charles and Kimberly Maner wrote: Hi all. I'm reading in a flat, comma-delimited file using read.csv. It works marvelously for the most part. I am using the colClasses argument to, basically, create numeric, factor and character classes for the columns I'm reading in. However, a couple of the fields in the file are date fields. I'm fully aware that POSIXct can be used as a class, however the field must obey (I think) the standard/default POSIXct format. Hence the following question: does anyone have a method they can share to read in a date in a non-standard format and convert it to POSIXct? I can read it in and then convert it, but that's a two-pass approach and not as elegant as a single pass through read.csv. I've read, from the documentation, that [o]therwise there needs to be an as method (from package methods) for conversion from character to the specified formal class, but I do not know and have not figured out how to do that. Any suggestion(s) would be greatly appreciated. Thanks, Charles The csv.get function in the Hmisc package may do most of what you want. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
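A sketch of the csv.get route (file and column names are invented; see ?csv.get for the date-handling arguments):
library(Hmisc)
d <- csv.get('mydata.csv',
             datevars = c('visitdate', 'dob'),  # columns holding dates
             dateformat = '%m/%d/%Y')           # their non-default format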
Re: [R] no. at risk in survfit()
array chip wrote: Hi, when I generate a survfit() object, I can get the number of patients at risk at various time points by using summary():
fit <- survfit(Surv(time,status) ~ class, data=mtdata)
summary(fit)
class=1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  9.9     78       1    0.987  0.0127        0.963            1
 41.5     77       1    0.974  0.0179        0.940            1
 54.0     76       1    0.962  0.0218        0.920            1
 99.1     38       1    0.936  0.0328        0.874            1
class=2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  6.9    102       1    0.990  0.00976       0.971        1.000
  8.0    101       1    0.980  0.01373       0.954        1.000
 14.4    100       1    0.971  0.01673       0.938        1.000
 16.1     99       1    0.961  0.01922       0.924        0.999
 16.6     98       1    0.951  0.02138       0.910        0.994
 18.7     97       1    0.941  0.02330       0.897        0.988
 :      :      :
I have many censored observations in the dataset, and I would like to know the number of patients at risk (n.risk in the above output) at certain time points, for example at 60, 72, etc., which are not available from the above printout for class=1. Is there any way I can get them? Thanks The Design package's survplot function can print n.risk over equally spaced time points. You might see an easy way to print this by looking at the code. -Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
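Note also that summary.survfit accepts a times argument, which gives the number at risk at arbitrary time points directly:
summary(fit, times = c(60, 72))  # n.risk, n.event and survival at t = 60 and 72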
Re: [R] Programming/scripting with expressions - variables
Gorjanc Gregor wrote: Hello to R users! I am puzzled with R and I really do not know where to look for my problem. I am moving from SAS and I have difficulties in translating SAS to the R world. I hope I will get some hints or pointers so I can study from there on. I would like to do something like this. In SAS I can write a macro as in the example below, which is of course a silly one but shows what I don't know how to do in R.
%macro test(data, colname, colvalue);
  data &data;
    &colname = &colvalue;
    other_&colname = other_&colvalue;
  run;
%mend;
And if I run it with this call: %test(Gregor, Gorjanc, 25); I get a table with name 'Gregor' and columns 'Gorjanc' and 'other_Gorjanc' with values: Gorjanc other_Gorjanc 25 other_25 So can someone show me the way to do the same thing in R? Thanks! -- Lep pozdrav / With regards, Gregor GORJANC Greetings, Gregor, from a fan of Slovenia. Here is a start. Multiple variables are often put together in data frames. Your example did not include an input dataset name. I took data to be an input data frame, and added two new variables. I assume that colvalue and colname are character values. If colvalue has length one it will be duplicated for each row of data.
test <- function(data, colname, colvalue) {
  data[[colname]] <- colvalue
  data[[paste('other', colname, sep='_')]] <- paste('other', colvalue, sep='_')
  data
}
Example usage:
data <- data.frame(x=1:5, y=11:15)
data2 <- test(data, 'a', 'b')
data <- test(data, 'a', 'b')       # over-write
test(data, 'a', 'b')               # create data but don't store
with(test(data, 'a', 'b'), {...})  # reference variables in a temporary dataset
If we know what your ultimate goal is, there may be a much better approach than the test function. In many cases, you do not need to create new variables at all. Franc -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Re:logistic regression
Vito Ricci wrote: Hi, I don't know if a pseudo R-squared for glm exists in any R package, but I found some interesting functions on the S mailing list: It is included in lrm in the Design package. But note that this is not for checking fit but rather for quantifying predictive discrimination. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Re:logistic regression
Joe Nocera wrote: Helene - In addition to some of the excellent suggestions already posited (e.g. examining AIC, pseudo R^2 in the Design package), you might want to consider another tool to assess logistic regression model accuracy: the area under the curve (AUC) of a receiver operating characteristic (ROC). The ROC curve describes the relationship between the rate of true positives observed (sensitivity) and the rate of false positives (1 - specificity). The AUC is the probability that the model can correctly distinguish between the two outcome classes. This is an appealing alternative to some of the known issues of citing only a pseudo-R^2 (like Nagelkerke's, for instance) to describe 'fit'. That's also standard output in Design's lrm function (the C index). But unlike R^2, comparing two models on the basis of ROC area is not very sensitive. -Frank Check out the ROC functions available at the Bioconductor website. There was also some code sent around on the list a few months back for calculating trapezoidal AUC, SEs from ROC, and comparing two ROC curves... search the archives if interested, or I can probably dig them out for you offline... Cheers, Joe Quoting Frank E Harrell Jr [EMAIL PROTECTED]: Vito Ricci wrote: Hi, I don't know if a pseudo R-squared for glm exists in any R package, but I found some interesting functions on the S mailing list: It is included in lrm in the Design package. But note that this is not for checking fit but rather for quantifying predictive discrimination. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Joseph J. Nocera Ph.D. Candidate NB Coop. Fish & Wildlife Research Unit Biology Department - Univ. New Brunswick Fredericton, NB Canada E3B 6E1 tel: (902) 679-5733 Why does it have to be spiders? Why can't it be 'follow the butterflies'?! Ron Weasley, Harry Potter & The Chamber of Secrets __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Forward Stepwise regression based on partial F test
Smit, Robin wrote: I am hoping to get some advice on the following: I am looking for an automatic variable selection procedure to reduce the number of potential predictor variables (~50) in a multiple regression model. I would be interested in using forward stepwise regression based on the partial F test. I have looked into possible R functions but could not find this particular approach. There is a function (stepAIC) that uses the Akaike criterion or Mallow's Cp criterion. In addition, the drop1 and add1 functions came closest to what I want, but with them I cannot perform the required procedure. Do you have any ideas? Kind regards, Robin Smit Business Unit TNO Automotive Environmental Studies & Testing PO Box 6033, 2600 JA Delft THE NETHERLANDS Robin, If you are looking for a method that does not offer the best predictive accuracy and that violates every aspect of statistical inference, you are on the right track. See http://www.stata.com/support/faqs/stat/stepwise.html for details. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Teaching R in 40 minutes. What should be included?
Spencer Graves wrote: I agree with Thomas and Georg: A 40-minute intro should be mostly marketing and very little how-to. I think you'll have a more effective sales job if you target, say, 4 examples averaging 5 slides each plus some general overview, max 25-30 slides. If I had sufficient prep time and a few collaborators among physicists, geographers, etc., I might get their help in preparing examples, showing how they would do something in Matlab or Scilab or something else vs. R. And I'd end with a discussion of technical support via an R site search and r-help, and showing a list of available contributed packages. I'd do a couple of searches for physics and geography questions: odesolve, maps, etc. Maybe pick examples that are part of the help files. Then show: here is how I find X, here is the vignette, help or whatever. Is it fair to say that R is rapidly becoming (if it is not already) the primary platform of choice for new statistical algorithm development? I think they might be interested in a brief overview of the contributed software. If this is an academic audience, they might like to know how easy it is to contribute software, plus the journals on statistical computing and graphics, etc. Hope this helps. Good luck! spencer graves I often give talks like that. The thing that has impressed audiences the most is a multi-panel lattice graphic with 2 classification variables and in each panel a scatterplot and a lowess trend line. A single page with 24 small high-resolution histograms also seems to impress people. The nomogram function in the Design package seems to also connect with non-statisticians, as does latex(describe(mydataframe)) using Hmisc. People like seeing, in the LaTeX previewer, output that mixes tabular summaries and graphics. Frank Harrell Thomas Schönhoff wrote: Hello, On Friday, 25 February 2005 at 22:37, Dr Carbon wrote: If _you_ were asked to give a 40-minute dog and pony show about R for a group of scientists ranging from physicists to geographers, what would you put in? These people want to know what R can do. I'm thinking about something like: A. Overview B. data structures C. arithmetic and manipulation D. reading data E. linear models using glm F. graphics G. programming H. other tricks like rpart or time series analysis? If your audience is well known, I would be inclined to target some (simple) examples derived from physics and geography to demonstrate basic ideas of working with R, similar to the ones listed above. Well, 40 minutes is not very long, so I recommend simplifying your presentation as much as you can. You want to teach them R in 40 minutes but will rather tend to confuse them if you don't shorten your plan a bit. I.e., teaching programming in R in a few minutes to scientists who are not at all accustomed to programming is too much overhead, I think. Well, it's up to your estimation of what is expected to follow your presentation. If you are sure that most of them know enough programming to understand the basic concepts of R programming, everything will be fine! If not, I'd recommend concentrating on basic operations (data structures, arithmetic and manipulation, importing/exporting data) and some often-used default statistical procedures demonstrating common tasks (is time series analysis important in physics or geography? I don't know), including some remarks on differences from widespread statistical packages like SPSS or SAS, maybe LispStat.
Finally, there should be some extended view of available resources (manuals, FAQ, community) as a starter to learn, use and program R by themselves. I think this would do for a 40-minute presentation without running the risk of deterring people with overcomplexity. regards Thomas -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
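For readers wanting to reproduce the display Frank mentions above, a sketch with invented data and variable names:
library(lattice)
set.seed(1)
d <- data.frame(x = rnorm(400), g1 = gl(4, 100), g2 = gl(2, 200))
d$y <- d$x + rnorm(400)
xyplot(y ~ x | g1 * g2, data = d,
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)  # scatterplot in each panel
         panel.loess(x, y, ...)   # smooth trend line in each panel
       })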
Re: [R] subset selection for logistic regression
Wittner, Ben wrote: R packages leaps and subselect implement various methods of selecting best or good subsets of predictor variables for linear regression models, but they do not seem to be applicable to logistic regression models. Does anyone know of software for finding good subsets of predictor variables for logistic regression models? Thanks. -Ben Why are these procedures still being used? The performance is known to be bad in almost every sense (see the r-help archives). Frank Harrell p.s. The leaps package references Subset Selection in Regression by Alan Miller. On page 2 of the 2nd edition of that text it states the following: All of the models which will be considered in this monograph will be linear; that is, they will be linear in the regression coefficients. Though most of the ideas and problems carry over to the fitting of nonlinear models and generalized linear models (particularly the fitting of logistic relationships), the complexity is greatly increased. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] subset selection for logistic regression
dr mike wrote: -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Wittner, Ben Sent: 02 March 2005 11:33 To: [EMAIL PROTECTED] Subject: [R] subset selection for logistic regression [... original message quoted in full above; snipped ...] The LASSO method and the Least Angle Regression method are two such that have both been implemented (efficiently IMHO - only one least-squares-sized computation for all levels of shrinkage, IIRC) in the lars package for R by Hastie and Efron. There is a paper by Madigan and Ridgeway that discusses the use of the Least Angle Regression approach in the context of logistic regression - available for download from Madigan's space at Rutgers: www.stat.rutgers.edu/~madigan/PAPERS/lars3.pdf HTH Mike Yes, things like the lasso can help a lot. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
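A sketch of basic lars usage (x is an assumed numeric predictor matrix, y the response; lars itself fits the linear-model path, and the Madigan and Ridgeway paper covers the logistic extension):
library(lars)
fit <- lars(x, y, type = 'lasso')  # whole lasso path in one pass
plot(fit)                          # coefficient paths versus shrinkage
cv.lars(x, y)                      # cross-validation to choose the shrinkage level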
Re: [R] subset selection for logistic regression
Christian Hennig wrote: Perhaps I should not write it because I will discredit myself with this but... Suppose I have a setup with 100 variables and some 1000 cases and I want to boil down the number of variables to a maximum of 10 for practical reasons even if I lose 10% prediction quality by this (for example because it is expensive to measure all variables on new cases). Is it really so wrong to use a stepwise method? Yes. Read about model uncertainty and bias in models developed using stepwise methods. One exception: if there is a large number of variables with truly zero regression coefficients, and the rest are not very weak, stepwise can sort things out fairly well. But you never know this in advance. Let's say I divide the sample into three parts and do variable selection on the first part, estimation on the second and testing on the third part (this solves almost all problems Frank is talking about on p. 56/57 in his excellent book). Is there always a tractable alternative? That's a good way to find out how bad the method is, not to fix the problems inherent in it. Of course it is wrong to interpret the selected variables as the true influences and all others as unrelated, but if I don't do that? If it should really be taboo to do stepwise variable selection, why are p. 58/59 of Regression Modeling Strategies devoted to how to do it if you must? Stress on if. And note that if you ask what is the optimum alpha for variables to be kept in the model when doing backwards stepdown, it's alpha=1.0. A good compromise is alpha=0.5. See @Article{ste01pro, author = {Steyerberg, Ewout W. and Eijkemans, Marinus J. C. and Harrell, Frank E. and Habbema, J. Dik F.}, title = {Prognostic modeling with logistic regression analysis: {In} search of a sensible strategy in small data sets}, journal = Medical Decision Making, year = 2001, volume = 21, pages = {45-56}, annote = {shrinkage; variable selection; dichotomization of continuous variables; sign of regression coefficient; calibration; validation} } And on Bert's excellent question about why shrinkage is not used more often, here is our attempt at a remedy: @Article{moo04pen, author = {Moons, K. G. M. and Donders, A. Rogier T. and Steyerberg, E. W. and Harrell, F. E.}, title = {Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example}, journal = J Clinical Epidemiology, year = 2004, volume = 57, pages = {1262-1270}, annote = {prediction research; overoptimism; overfitting; penalization; bootstrapping; shrinkage} } Frank Please forget my name;-) Christian On Wed, 2 Mar 2005, Berton Gunter wrote: To clarify Frank's remark ... A prominent theme in statistical research over at least the last 25 years (with roots that go back 50 or more, probably) has been the superiority of shrinkage methods over variable selection. I also find it distressing that these ideas have apparently not penetrated much (at all?) into the wider scientific community (but I suppose I shouldn't be surprised -- most scientists still do one-factor-at-a-time experiments 80 years after Fisher). Specific incarnations can be found in anything Bayesian, mixed effects models for repeated measures, ridge regression, and the R packages lars and lasso, among others.
I would speculate that aside from the usual statistics/science cultural issues, part of the reason for this is that the estimators don't generally come with neat, classical inference procedures: like it or not, many scientists have been conditioned by their Stat 101 courses to expect P values, so in some sense, we are hoisted by our own petard. Just my $.02 -- contrary (and more knowledgeable) opinions welcome. -- Bert Gunter
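If stepdown must be done, the alpha = 0.5 compromise discussed above can be requested directly; a minimal sketch with the Design package (fit and variable names hypothetical):
library(Design)
f <- lrm(y ~ x1 + x2 + x3, x = TRUE, y = TRUE)
fastbw(f, rule = "p", sls = 0.5)  # backward stepdown; variables stay while P < 0.5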
[R] Quantian
Congratulations Dirk on the article in linuxtoday.com today about Quantian. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Logistic regression goodness of fit tests
Trevor Wiens wrote: I was unsure of what suitable goodness-of-fit tests existed in R for logistic regression. After searching the R-help archive I found that the Design models and resid could be used to calculate this as follows: d <- datadist(mydataframe) options(datadist = 'd') fit <- lrm(response ~ predictor1 + predictor2..., data=mydataframe, x=T, y=T) resid(fit, 'gof'). I set up a script to first use glm to create models, then use stepAIC to determine the optimal model. I used this instead of fastbw because I found the AIC values to be completely different and the final models didn't always match. Then my script takes the reduced model formula and recreates it using lrm as above. Now the problem is that for some models I run into an error to which I can find no reference whatsoever on the mailing list or on the web. It is as follows: test.lrm <- lrm(cclo ~ elev + aspect + cti_var + planar + feat_div + loamy + sands + sandy + wet + slr_mean, data=datamatrix, x = T, y = T) singular information matrix in lrm.fit (rank= 10 ). Offending variable(s): slr_mean Error in j:(j + params[i] - 1) : NA/NaN argument Now if I set the singularity criterion smaller than the default of 1E-7, to 1E-9 or 1E-12 (the default in calibrate), it works. Why is that? Not being a statistician but a biogeographer using regression as a tool, I don't really understand what is happening here. Does changing the tol variable change how I should interpret goodness-of-fit results or other evaluations of the models created? I've included a summary of the data below (in case it might be helpful) with all variables in the data frame, as it was easier than selecting out the ones used in the model. Thanks in advance. T The goodness of fit test only works on prespecified models. It is not valid when stepwise variable selection is used (unless perhaps you use alpha=0.5). -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Logistic regression goodness of fit tests
Trevor Wiens wrote: On Thu, 10 Mar 2005 16:19:41 -0600 Frank E Harrell Jr [EMAIL PROTECTED] wrote: The goodness of fit test only works on prespecified models. It is not valid when stepwise variable selection is used (unless perhaps you use alpha=0.5). Perhaps I'm blind, but I can't find any reference to an alpha parameter on the help page for stepAIC. It runs fine however when I set the parameter, and produces a model with the same right-hand variables as without. Can you tell me what it is? T What I mean is the effective significance level for keeping a variable in the model. Using AIC for one-degree-of-freedom variables is effectively using an alpha of 0.16 if I recall properly. But I hope you got the point that resid(fit,'gof'), like most goodness of fit tests, assumes prespecified models. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Confidence interval for Tau-a or c-index to compare logistic lrm (binary) models with each other.
Jan Verbesselt wrote: Dear R list, How can a confidence interval be derived for e.g. the Tau-a coefficient or the c index (area under the ROC curve) such that I can compare the fitted lrm (logistic) models with each other? Is this possible? The aim is to conclude that one model is significantly better than another model (a, b or c). y ~ a (continuous) + d (categorical) y ~ b (continuous) + d (categorical) y ~ c (continuous) + d (categorical) I will look at residual deviance, the AIC, c-index and Tau-a to compare different logistic models (lrm, Design package). All extra advice is most welcome! Regards, Jan You can only do this if you have an independent test dataset that was never used in model development or coefficient estimation. Given that, look at the rcorrp.cens function in Hmisc. Frank Harrell ___ ir. Jan Verbesselt Research Associate Lab of Geomatics Engineering K.U. Leuven Vital Decosterstraat 102. B-3000 Leuven Belgium Tel: +32-16-329750 Fax: +32-16-329760 http://gloveg.kuleuven.ac.be/ __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
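A sketch of that comparison on an independent test set (object names hypothetical: p.a and p.b are the two models' predicted probabilities for the test observations, y the observed binary outcome; an uncensored outcome vector should be accepted in place of a Surv object):
library(Hmisc)
rcorrp.cens(p.a, p.b, y)  # tests which set of predictions is more concordant with y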
Re: [R] R-squared in Logistic Regression
Johan Stenberg wrote: Dear all, How do I make R show the R-squared (deviance explained by the model) in a logistic regression? Below is how I write my syntax. Basically I want to investigate density-dependence in parasitism of larvae. Note that in the end I perform an F-test because the dispersion factor (residual deviance / residual df) is significantly higher than 1. But how do I make R show the R-squared? Best wishes Johan The proportion of deviance explained has been shown to not be such a good measure. You can use the lrm function in the Design package to get various measures including ROC area (C index), Somers' Dxy and Kendall tau rank correlation, Nagelkerke generalization of R-squared for maximum likelihood-based models (related to Maddala and others). Frank Harrell y <- cbind(para, unpara) model <- glm(y ~ log(larvae), binomial) summary(model) Call: glm(formula = y ~ log(larvae), family = binomial) Deviance Residuals: Min 1Q Median 3Q Max -2.0633 -1.6218 -0.1871 0.7907 2.7670 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.0025 0.7049 1.422 0.15499 log(larvae) -1.0640 0.3870 -2.749 0.00597 ** (Dispersion parameter for binomial family taken to be 1) Null deviance: 35.981 on 12 degrees of freedom Residual deviance: 27.298 on 11 degrees of freedom AIC: 40.949 Number of Fisher Scoring iterations: 4 anova(model, test="F") Analysis of Deviance Table Model: binomial, link: logit Response: y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev F Pr(>F) NULL 12 35.981 log(larvae) 1 8.683 11 27.298 8.6828 0.003212 ** __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
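To see the measures Frank mentions, a sketch with a plain binary response (hypothetical data):
library(Design)
f <- lrm(y ~ log(larvae))
f$stats[c("R2", "C", "Dxy", "Tau-a")]  # Nagelkerke R2, ROC area, rank correlations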
Re: [R] fastbw question
Peter Flom wrote: Hello I am running R 2.0.1 on Windows, I am attempting to use Frank Harrell's 'fastbw' function (from the Design library), but I get an error that the fit was not created with a Design library fitting function; yet when I go to the help for fastbw (and also look in Frank's book Regression Modeling Strategies) it appears that fastbw should work with a model created with lm. Relevant code model.borrow.logols <- lm(logborrow~age + sex + racgp + yrseduc + needlchg + gallery + totni + inject + poly(year.of.int,3) + druginj + inj.years + HTLV3) fastbw(model.borrow.logols) The error message was exactly correct. lm is not a Design fitting function. Try ols. -Frank Thanks in advance Peter Peter L. Flom, PhD Assistant Director, Statistics and Data Analysis Core Center for Drug Use and HIV Research National Development and Research Institutes 71 W. 23rd St www.peterflom.com New York, NY 10010 (212) 845-4485 (voice) (917) 438-0894 (fax) __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
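So the fix is one word; a sketch of the suggested change (right-hand side shortened here):
library(Design)
model.borrow.logols <- ols(logborrow ~ age + sex + racgp + yrseduc)  # ols, not lm
fastbw(model.borrow.logols)  # now dispatches on a Design fit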
Re: [R] question
wim van baarle wrote: Sir, I found your description of the dataset about nodal involvement in prostate cancer. It comes from the book Biostatistics Casebook. I'd like to use the dataset for doing logistic regression. Can you tell me where I can find the dataset? Thanks and greetings Wim van Baarle [EMAIL PROTECTED] You didn't say which prostate cancer study that represented. One famous one has data in R form on our web site http://biostat.mc.vanderbilt.edu. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] (no answer)
I wish to perform brain surgery this afternoon at 4pm and don't know where to start. My background is the history of great statistician sports legends but I am willing to learn. I know there are courses and numerous books on brain surgery but I don't have the time for those. Please direct me to the appropriate HowTos, and be on standby for solving any problem I may encounter while in the operating room. Some of you might ask for specifics of the case, but that would require my following the posting guide and spending even more time than I am already taking to write this note. I will be out of the office from 1:15pm to 1:25pm today. This information should be valuable to many. I. Ben Fooled Technical University of Nonparametric Multivariate Statistics Slapout, Alabama __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] exclusion rules for propensity score matchng (pattern rec)
[EMAIL PROTECTED] wrote: Dear R-list, i have 6 different sets of samples. Each sample has about 5000 observations, with each observation comprised of 150 baseline covariates (X), 125 of which are dichotomous. Roughly 20% of the observations in each sample are treatment and the rest are control units. i am doing propensity score matching, i have already estimated propensity scores (predicted probabilities) using logistic regression, and in each sample i am going to have to exclude approximately 100 treated observations for which I cannot find matching control observations (because the scores for these treated units are outside the support of the scores for control units). in each sample, i must identify an exclusion rule that is interpretable on the scale of the X's that excludes these unmatchable treated observations and excludes as FEW of the remaining treated observations as possible. (the reason is that i want to be able to explain, in terms of the Xs, who the individuals are that I am making causal inference about.) i've tried some simple stuff over the past few days and nothing's worked. is there an R-package or algorithm, or even estimation strategy that anyone could recommend? (i am really hoping so!) thank you, alexis diamond Exclusion can be based on the non-overlap regions from the propensity. It should not be done in the individual covariate space. I tend to look at the 10th smallest and largest values of propensity for each of the two treatment groups for making the decision. You will need to exclude non-overlap regions whether you use matching or covariate adjustment of propensity, but covariate adjustment (using e.g. regression splines in the logit of propensity) is often a better approach once you've been careful about non-overlap. Frank Harrell __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
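A sketch of that rule of thumb (hypothetical names: ps = estimated propensity score, tx = 1 for treated, 0 for control):
lo <- max(sort(ps[tx == 0])[10], sort(ps[tx == 1])[10])
hi <- min(sort(ps[tx == 0], decreasing = TRUE)[10],
          sort(ps[tx == 1], decreasing = TRUE)[10])
keep <- ps >= lo & ps <= hi  # restrict the analysis to the overlap region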
Re: [R] exclusion rules for propensity score matchng (pattern rec)
Alexis J. Diamond wrote: hi, thanks for the reply to my query about exclusion rules for propensity score matching. Exclusion can be based on the non-overlap regions from the propensity. It should not be done in the individual covariate space. i want a rule inspired by non-overlap in propensity score space, but that binds in the space of the Xs. because i don't really know how to interpret the fact that i've excluded, say, people with scores above .87, but i DO know what it means to say that i've excluded people from country XYZ over age Q because i can't find good matches for them. if i make my rule based on Xs, i know who i can and cannot make inference for, and i can explain to other people who are the units that i can and cannot make inference for. after posting to the list last night, i thought of using the RGENOUD package (genetic algorithm) to search over the space of exclusion rules (e.g., var 1 = 1, var 2 = 0, var 3 = 1 or 0, var 4 = 0); the loss function associated with a rule should be increasing in # of tr units w/ support excluded and decreasing in # of tr units w/out support excluded. it might be tricky to get the right loss function, and i know this idea is kind of nutty, but it's the only automated search method i could think of. any comments? alexis Using the X space directly will not result in optimum exclusions unless you use a distance function, but that will make assumptions. My advice is to use rpart to make a classification rule that approximates the exclusion criteria to some desired degree of accuracy. I.e., use rpart to predict the propensity lower cutoff and separately to predict the propensity upper cutoff. This just assists in interpretation. Frank I tend to look at the 10th smallest and largest values of propensity for each of the two treatment groups for making the decision. You will need to exclude non-overlap regions whether you use matching or covariate adjustment of propensity but covariate adjustment (using e.g. regression splines in the logit of propensity) is often a better approach once you've been careful about non-overlap. Frank Harrell On Tue, 5 Apr 2005, Frank E Harrell Jr wrote: [EMAIL PROTECTED] wrote: Dear R-list, i have 6 different sets of samples. Each sample has about 5000 observations, with each observation comprised of 150 baseline covariates (X), 125 of which are dichotomous. Roughly 20% of the observations in each sample are treatment and the rest are control units. i am doing propensity score matching, i have already estimated propensity scores (predicted probabilities) using logistic regression, and in each sample i am going to have to exclude approximately 100 treated observations for which I cannot find matching control observations (because the scores for these treated units are outside the support of the scores for control units). in each sample, i must identify an exclusion rule that is interpretable on the scale of the X's that excludes these unmatchable treated observations and excludes as FEW of the remaining treated observations as possible. (the reason is that i want to be able to explain, in terms of the Xs, who the individuals are that I am making causal inference about.) i've tried some simple stuff over the past few days and nothing's worked. is there an R-package or algorithm, or even estimation strategy that anyone could recommend? (i am really hoping so!)
thank you, alexis diamond -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
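A sketch of the rpart suggestion above (hypothetical names: covs = data frame of the Xs, excluded = logical flag derived from the propensity overlap rule):
library(rpart)
covs$excluded <- factor(excluded)
fit <- rpart(excluded ~ ., data = covs, method = "class")
plot(fit); text(fit)  # the splits describe, in terms of the Xs, who is excluded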
Re: [R] rcorrp.cens
Stefano Mazzuco wrote: Hi R-users, I'm having some problems in using the Hmisc package. I'm estimating a Cox PH model and want to test whether the drop in concordance index due to omitting one covariate is significant. I think (but I'm not sure) there are two ways to do that: 1) predict two Cox models (the full model and the model without the covariate of interest) and estimate the concordance index (i.e. area under the ROC curve) with rcorr.cens for both models, then compute the difference 2) predict the two Cox models and estimate directly the difference between the two c-indices using rcorrp.cens. But it seems that rcorrp.cens gives me the drop in the Dxy index. Do you have any hint? Thanks Stefano First of all, any method based on comparing rank concordances loses power and is discouraged. Likelihood ratio tests (e.g., by embedding a smaller model in a bigger one) are much more powerful. If you must base comparisons on rank concordance (e.g., ROC area=C, Dxy) then rcorrp.cens can work if the sample size is large enough so that uncertainty about regression coefficient estimates may be ignored. rcorrp.cens doesn't give the drop in C; it gives the probability that one model is more concordant with the outcome than another, among pairs of paired predictions. The bootcov function in the Design package has a new version that will output bootstrap replicates of C for a model, and its help file tells you how to use that to compare C for two models. This should only be done to show how low a power such a procedure has. rcorrp.cens is likely to be more powerful than that, but likelihood ratio is what you want. You will find many cases where one model increases C by only 0.02 but has many more useful (more extreme) predictions. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Hmisc + summarize + quantile: Why only quantiles for first variable in data frame?
Kim Milferstedt wrote: Hi Frank Harrell, thanks for the response. I understand your comment but I wasn't able to find (or recognize) an answer on how to tell FUN explicitly to use matrix operations. Would you be so kind and give me an example? Thanks so much, Kim See http://biostat.mc.vanderbilt.edu/SasByMeansExample plus an example in the help file for summary.formula in Hmisc which uses the apply function. summary.formula and summarize are similar in the use of FUN (which summary.formula unfortunately calls 'fun'). Frank Please read the documentation and see the examples. The first argument to summarize is a matrix or vector and if a matrix, FUN must use matrix operations if you want column-by-column results. FH -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University Kim Milferstedt wrote: Hi, I'm working on a data set that contains a couple of factors and a number of dependent variables. From all of these dependent variables I would like to calculate the mean, standard deviation and quantiles. With the function FUN I get all the means and stdevs that I want, but quantiles are only calculated for the first of the dependent variables (column 8 in the summarize command). What do I have to do differently in order to get all the quantiles that I want? Thanks, Kim sgldm2 <- read.table("E:/analysistemp/060412_test_data2.txt", header=T) attach(sgldm2) names(sgldm2) FUN <- function(x) c(Mean=mean(x,na.rm=TRUE), STDEV=sd(x,na.rm=TRUE), Quantile=quantile(x, probs=c(0.25,0.50,0.75), na.rm=TRUE)) ordering <- llist(time_h_f, Distance_f) resALL <- summarize(sgldm2[,8:10], ordering, FUN) __ Kim Milferstedt University of Illinois at Urbana-Champaign Department of Civil and Environmental Engineering 4125 Newmark Civil Engineering Building 205 North Mathews Avenue MC 250 Urbana, IL 61801 USA phone: (001) 217 333-9663 fax: (001) 217 333-6968 email: [EMAIL PROTECTED] http://cee.uiuc.edu/research/morgenroth/index.asp ___ -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
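The key point is that the summary function must compute each statistic column by column, e.g. with apply; a self-contained illustration of the idea (made-up data):
m <- cbind(a = rnorm(20), b = rnorm(20), c = rnorm(20))
apply(m, 2, function(w)
  c(Mean = mean(w, na.rm = TRUE),
    STDEV = sd(w, na.rm = TRUE),
    quantile(w, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)))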
Re: [R] Creat new column based on condition
Duncan Murdoch wrote: On 4/21/2006 4:05 PM, Sachin J wrote: Hi, How can I accomplish this task in R? V1 10 20 30 10 10 20 Create a new column V2 such that: If V1 = 10 then V2 = 4 If V1 = 20 then V2 = 6 If V1 = 30 then V2 = 10 Gabor's solution is fine; something that looks a little bit more like your code is this: V2 <- NA V2 <- ifelse( V1 == 10, 4, V2) V2 <- ifelse( V1 == 20, 6, V2) V2 <- ifelse( V1 == 30, 10, V2) or V2 <- 4*(V1==10) + 6*(V1==20) + 10*(V1==30) V2[V2==0] <- NA Frank or V2 <- ifelse( V1 == 10, 4, ifelse( V1 == 20, 6, ifelse( V1 == 30, 10, NA ))) (where the NA is to handle any unexpected case where V1 isn't 10, 20 or 30). My preference would be to use just one assignment, and if I was sure 10, 20 and 30 were the only possibilities, would use V2 <- ifelse( V1 == 10, 4, ifelse( V1 == 20, 6, 10 )) Duncan Murdoch So the output looks like this V1 V2 10 4 20 6 30 10 10 4 10 4 20 6 Thanks in advance. Sachin -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
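Another compact alternative, not in the thread above, is a named lookup vector; values of V1 outside the table come out as NA automatically:
map <- c("10" = 4, "20" = 6, "30" = 10)
V2 <- unname(map[as.character(V1)])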
Re: [R] regression modeling
Berton Gunter wrote: May I offer a perhaps contrary perspective on this. Statistical **theory** tells us that the precision of estimates improves as sample size increases. However, in practice, this is not always the case. The reason is that it can take time to collect that extra data, and things change over time. So the very definition of what one is measuring, the measurement technology by which it is measured (think about estimating tumor size or disease incidence or underemployment, for example), the presence or absence of known or unknown large systematic effects, and so forth may change in unknown ways. This defeats, or at least complicates, the fundamental assumption that one is sampling from a (fixed) population or stable (e.g. homogeneous, stationary) process, so it's no wonder that all statistical bets are off. Of course, sometimes the necessary information to account for these issues is present, and appropriate (but often complex) statistical analyses can be performed. But not always. Thus, I am suspicious, cynical even, about those who advocate collecting all the data and subjecting the whole vast heterogeneous mess to arcane and ever more computer intensive (and adjustable parameter ridden) data mining algorithms to detect trends or discover knowledge. To me, it sounds like a prescription for turning on all the equipment and waiting to see what happens in the science lab instead of performing careful, well-designed experiments. I realize, of course, that there are many perfectly legitimate areas of scientific research, from geophysics to evolutionary biology to sociology, where one cannot (easily) perform planned experiments. But my point is that good science demands that in all circumstances, and especially when one accumulates and attempts to aggregate data taken over spans of time and space, one needs to beware of oversimplification, including statistical oversimplification. So interrogate the measurement, be skeptical of stability, expect inconsistency. While "all models are wrong but some are useful" (George Box), the second law tells us that entropy still rules. (Needless to say, public or private contrary views are welcome). -- Bert Gunter Genentech Non-Clinical Statistics South San Francisco, CA Bert raises some great points. Ignoring the important issues of doing good research and stability in the meaning of data as time marches on, it is generally true that the larger the sample size the greater the complexity of the model we can afford to fit, and the better the fit of the model. This is the AIC school. The BIC school assumes there is an actual model out there waiting for us, of finite dimension, and the complexity of our models should not grow very fast as N increases. I find the AIC approach gives me more accurate predictions. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] stepwise regression
Jinsong Zhao wrote: Dear all, I have encountered a problem when performing stepwise regression. You have more problems than you know. The dataset has 9 independent variables, but only 7 observations. Why collect any data? You can get great fits using random numbers with this procedure. Frank In R, before performing stepwise, an lm object should be given. fm <- lm(y ~ X1 + X2 + X3 + X11 + X22 + X33 + X12 + X13 + X23) However, summary(fm) will give: Residual standard error: NaN on 0 degrees of freedom Multiple R-Squared: 1, Adjusted R-squared: NaN F-statistic: NaN on 6 and 0 DF, p-value: NA In this situation, step() or stepAIC() will not give any useful information. I don't know why SAS can deal with this situation: PROC REG; MODEL y=X1 X2 X3 X11 X22 X33 X12 X13 X23/SELECTION=STEPWISE; RUN; Any help will be really appreciated. Wishes, Jinsong Zhao -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] function for linear regression with White std. errors
Thomas Lumley wrote: On Thu, 27 Apr 2006, Brian Quinif wrote: John, Thanks for the suggestion, but tomorrow I am teaching a little seminar for my department trying to convince people about how wonderful R is. These people are all Stata users, and they really like the idea that they only have to type , robust to get het.-consistent std. errors. This really is a unique feature of Stata. It's not hard to add the standard errors to glm(), but it is harder to make other functions such as predict() use them properly. The survey package gives these standard errors by default, but that might also be regarded as too much work. The Design package also does this: f <- fittingfunction( ) g <- robcov(f, ...) plot(g); summary(g) etc. use robust s.e. Frank -thomas Thomas Lumley Assoc. Professor, Biostatistics [EMAIL PROTECTED] University of Washington, Seattle __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
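A minimal sketch of the robcov route (hypothetical data; x = TRUE, y = TRUE are needed so robcov can reach the design matrix and residuals):
library(Design)
dd <- datadist(dat); options(datadist = 'dd')
f <- ols(y ~ x1 + x2, data = dat, x = TRUE, y = TRUE)
g <- robcov(f)  # Huber-White sandwich covariance estimates
anova(g)        # Wald tests now use the robust standard errors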
Re: [R] New-user support package - suggestions?
Søren Højsgaard wrote: Dear Andrew, I tend to agree with Uwe. A perhaps more useful approach for achieving your goal would be to create a video introduction to R. On http://www.learnvisualstudio.net/ you can find examples of such introductory videos for programming, for example in C#. I've experimented a little myself with creating such videos for getting started with R, i.e. installing R, installing the Tinn-R editor, using the help facilities, finding things on CRAN. It is not as easy to do as I had thought... (If anyone is interested in the result, please drop me a line and I'll send you a link.) There are various relevant pieces of software around, for example camtasia (commercial) and camstudio (freeware). Such a set of videos could make it onto CRAN... Best Søren But please choose an editor that works on all platforms. Frank From: [EMAIL PROTECTED] on behalf of Uwe Ligges Sent: Thu 04-05-2006 09:13 To: Andrew Robinson Cc: R-Help Discussion Subject: Re: [R] New-user support package - suggestions? Andrew Robinson wrote: Dear Community, This is largely a repost with some new information. I'm interested in developing a package that could ease the command-line learning curve for new users. It would provide more detailed syntax checking and commentary as feedback. It would try to anticipate common new-user errors, and provide feedback to help correct them. Re. anticipate, see install.packages("fortunes") library(fortunes) fortune("anticipate") As a trivial example, instead of mean(c(1,2,NA)) [1] NA we might have mean(c(1,2,NA)) [1] NA Warning: your data contains missing values. If you wish to ignore these for the purposes of computing the mean, use the argument: na.rm=TRUE. Attention. If you give help like this, you will implicitly teach users not to read the manuals but to trust R's warning/error messages and suggestions. Some students won't ask the question "why do my data contain NAs" but will start using na.rm=TRUE, even if the error was in a prior step and no NAs were expected at all. I'm interested in any thoughts that people have about this idea - what errors do you commonly see, and how can they be dealt with? I have funding for 6 weeks of programming support for this idea. All suggestions are welcome. Cheers, Andrew Your project sounds very ambitious. I anticipate you will be disappointed after 6 weeks of programming, because you won't have achieved very much. I'd rather spend 6 weeks of time on some more promising projects... Just my first thoughts, just go on with your project if you are convinced. Best, Uwe Ligges __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] data manipulation docs
Federico Calboli wrote: Hi All, Is there some document/manual about data manipulation within R that I could use as a reference (obviously, aside from the R manuals)? The reason I am asking is that I have a number of data frames/matrices containing genetic data. The data is in character form, as in: V1 V2 V3 V4 V5 1 AA AG AA GG AG 2 AC AA AA GG AG 3 AA AG AA GG AG 4 AA AA AA GG AG 5 AA AA AA GG AA I need to chop, subset, and variously manipulate this kind of data, sometimes keeping the data in its character format, sometimes converting it to numeric form (i.e. substituting each data point with the equivalent factor value). Since the data is often quite big, I have to keep things memory efficient. This whole game is getting exceedingly time consuming and frustrating, because I end up with random pieces of code that I save, patching a particular problem, but difficult to 'abstract' for a new task, so I get back close to square one annoyingly often. Cheers, Federico Calboli There is a large data manipulation section in the Alzola and Harrell document available on CRAN under contributed docs, or a slightly more up-to-date version at biostat.mc.vanderbilt.edu -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] devide data into decile
Uwe Ligges wrote: Guojun Zhu wrote: I guess this is really basic. But I have not found an answer yet. I have a big data.frame. I would like to divide it into 10 deciles according to one of its members. Then I need a number for each decile with some computation within each group. How do I divide it? For example, add the result of cut() as a new variable to the data.frame and afterwards split() the data.frame by the resulting factor. Or library(Hmisc) xg <- cut2(x, g=10)  # labels deciles nicely Frank Uwe Ligges -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
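A short self-contained example of the cut2 approach (made-up data):
library(Hmisc)
set.seed(2)
x <- rnorm(100)
xg <- cut2(x, g = 10)  # factor with ten roughly equal-sized groups
tapply(x, xg, mean)    # a per-decile computation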
Re: [R] lmer, p-values and all that
Douglas Bates wrote: Users are often surprised and alarmed that the summary of a linear . . . . Doug, I have been needing this kind of explanation. That is very helpful. Thank you. I do a lot with penalized MLEs for ordinary regression and logistic models and know that getting sensible P-values is not straightforward even in that far simpler situation. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] odd feature
ivo welch wrote: Dear R wizards: I just got stung by the ifelse() feature. a <- 10:15 b <- 20:300 test <- 1 ifelse(test,a,b) [1] 10 I had not realized that this was the default behavior---I had expected 10:15. mea culpa. however, I wonder whether it would make sense to replace ifelse with a different semantic, where if test is a single scalar, it means what a stupid user like me would imagine. Aside, I like the flexibility of R, but I am not thrilled by all the recycling rules. I either mean I want a scalar or a vector of equal/appropriate dimension. I never want a recycle of a smaller vector. (I do often use a recycle of a scalar.) regards, /iaw The current behavior is logical to me. Are you looking for if(test) a else b? Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Can lmer() fit a multilevel model embedded in a regression?
I would like to fit a hierarchical regression model from Witte et al. (1994; see reference below). It's a logistic regression of a health outcome on quantities of food intake; the linear predictor has the form X*beta + W*gamma, where X is a matrix of consumption of 82 foods (i.e., the rows of X represent people in the study, the columns represent different foods, and X_ij is the amount of food j eaten by person i); and W is a matrix of some other predictors (sex, age, ...). The second stage of the model is a regression of X on some food-level predictors. Is it possible to fit this model in (the current version of) lmer()? The challenge is that the persons are _not_ nested within food items, so it is not a simple multilevel structure. We're planning to write a Gibbs sampler and fit the model directly, but it would be convenient to be able to fit it in lmer() as well to check. Andrew --- Reference: Witte, J. S., Greenland, S., Hale, R. W., and Bird, C. L. (1994). Hierarchical regression analysis applied to a study of multiple dietary exposures and breast cancer. Epidemiology 5, 612-621. -- Andrew Gelman Professor, Department of Statistics Professor, Department of Political Science [EMAIL PROTECTED] www.stat.columbia.edu/~gelman Statistics department office: Social Work Bldg (Amsterdam Ave at 122 St), Room 1016 212-851-2142 Political Science department office: International Affairs Bldg (Amsterdam Ave at 118 St), Room 731 212-854-7075 Mailing address: 1255 Amsterdam Ave, Room 1016 Columbia University New York, NY 10027-5904 212-851-2142 (fax) 212-851-2164 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Can lmer() fit a multilevel model embedded in a regression?
Andrew Gelman wrote: Frank, Just to check: are you saying that this model can be fit in lmer()? Thanks. Andrew Not sure about that. Greenland probably used BUGS, but that doesn't mean you can't do it with lmer. Frank Frank E Harrell Jr wrote: A great reference for that kind of model, plus a way to 'connect' foods via a composition matrix, is author = {Greenland, Sander}, title = {When should epidemiologic regressions use random coefficients?}, journal = Biometrics, year = 2000, volume = 56, pages = {915-921}, -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] odd feature
Duncan Murdoch wrote: On 5/22/2006 3:26 AM, Martin Maechler wrote: Gabor == Gabor Grothendieck [EMAIL PROTECTED] on Sun, 21 May 2006 09:47:07 -0400 writes: Gabor If you know that test is a scalar Gabor result <- if (test) a else b Gabor will do it. Yes, indeed! IMO, ifelse(test, a, b) is much overused whereas if(test) a else b is much UNDER used. From some e-mail postings, and even some documents (even printed books?), I get the impression that too many people think that ifelse(.,.,.) is to be used as an expression / function and if(.) . else . only for program flow control. This leads to quite suboptimal code, and I personally use if(.) . else . __as expression__ much more frequently than ifelse(.,.,.) For overuse of ifelse(), do you mean cases where test is length 1, so if() would work? Or are you thinking of something else? I'd also be interested in what you mean by quite suboptimal code. Are you thinking of things like if (test) temp <- a else temp <- b result <- f(temp) versus result <- f( if (test) a else b ) ? I would generally use the former, because it's easier to get the formatting right, and I find it easier to read. It's suboptimal in speed and memory use because of creating the temp variable, but in most cases I think that would be such a small difference that the small increase in readability is worthwhile. IMHO that approach is too verbose and not more readable. Frank Duncan Murdoch Martin Maechler, ETH Zurich. Gabor Here is another approach: Gabor as.vector(test * ts(a) + (!test) * ts(b)) Gabor On 5/21/06, ivo welch [EMAIL PROTECTED] wrote: Dear R wizards: I just got stung by the ifelse() feature. a <- 10:15 b <- 20:300 test <- 1 ifelse(test,a,b) [1] 10 I had not realized that this was the default behavior---I had expected 10:15. mea culpa. however, I wonder whether it would make sense to replace ifelse with a different semantic, where if test is a single scalar, it means what a stupid user like me would imagine. Aside, I like the flexibility of R, but I am not thrilled by all the recycling rules. I either mean I want a scalar or a vector of equal/appropriate dimension. I never want a recycle of a smaller vector. (I do often use a recycle of a scalar.) regards, /iaw __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
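The contrast in one self-contained example:
test <- TRUE
a <- 10:15
b <- 20:30
ifelse(test, a, b)  # result has the length of test, so this returns only 10
if (test) a else b  # returns the chosen branch whole: 10 11 12 13 14 15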
Re: [R] Ordinal Independent Variables
Rick Bilonick wrote: When I run lrm from the Design package, I get a warning about contrasts when I include an ordinal variable: Warning message: Variable ordfac is an ordered factor. You should set options(contrasts=c("contr.treatment","contr.treatment")) or Design will not work properly. in: Design(eval(m, sys.parent())) I don't get this message if I use glm with family=binomial. It produces linear and quadratic contrasts. If it's improper to do this for an ordinal variable, why does glm not balk? Rick B. Standard regression methods don't make good use of ordinal predictors and just have to treat them as categorical. Design is a bit picky about this. If the predictor has numeric scores for the categories, you can get a test of adequacy of the scores (with k-2 d.f. for k categories) by using scored(predictor) in the formula. Or just create a factor( ) variable to hand to Design. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
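A sketch of the scored() approach (hypothetical data; ordfac an ordered factor whose levels carry numeric scores):
library(Design)
options(contrasts = c("contr.treatment", "contr.treatment"))
dd <- datadist(dat); options(datadist = 'dd')
f <- lrm(y ~ scored(ordfac), data = dat)
anova(f)  # the nonlinear component tests adequacy of the scores (k-2 d.f.)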
Re: [R] Ordinal Independent Variables
Prof Brian Ripley wrote: On Mon, 22 May 2006, Frank E Harrell Jr wrote: Rick Bilonick wrote: When I run lrm from the Design package, I get a warning about contrasts when I include an ordinal variable: Warning message: Variable ordfac is an ordered factor. You should set options(contrasts=c("contr.treatment","contr.treatment")) or Design will not work properly. in: Design(eval(m, sys.parent())) I don't get this message if I use glm with family=binomial. It produces linear and quadratic contrasts. If it's improper to do this for an ordinal variable, why does glm not balk? Rick B. Standard regression methods don't make good use of ordinal predictors and just have to treat them as categorical. Design is a bit picky about this. If the predictor has numeric scores for the categories, you can get a test of adequacy of the scores (with k-2 d.f. for k categories) by using scored(predictor) in the formula. Or just create a factor( ) variable to hand to Design. Contrasts in S/R are used to set the coding of factors, and model.matrix() does IMO 'make good use of ordinal predictors'. I don't know what is meant by 'Standard regression methods': the charitable interpretation is that these are the overly restrictive methods used by certain statistical packages. (I first learnt of the use of polynomial codings for ordinal factors in the late 1970s, when I first learnt anything about ANOVA, so to me they are 'standard'.) So are you saying this is a design deficiency in package Design, or that the authors of S ca 1991 were wrong to allow arbitrary contrasts? Brian, What I meant was that unlike the case of ordinal response variables, where multiple intercepts in logistic models do not cost degrees of freedom because the ordering constraint is fully utilized, ordinal predictors require k-1 degrees of freedom for k levels using any standard contrast. Special methods (e.g. pool adjacent violators to impose a monotonicity constraint) would have to be used to get a lot out of the ordinal nature of the predictor. There's nothing wrong with allowing arbitrary contrasts; more progress has been made in statistics for ordinal responses than ordinal predictors. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] position of number at risk in survplot() graphs
Osman Al-Radi wrote: Dear R-help How can one get survplot() to place the number at risk just below the survival curve as opposed to the default, which is just above the x-axis? I tried the code below but the result is not satisfactory, as some numbers are repeated several times at different y coordinates and the position of the n.risk numbers corresponds to the x-axis tick marks, not the survival curve time of censoring. n <- 20 set.seed(731) cens <- 15*runif(n) h <- .02*exp(2) dt <- -log(runif(n))/h label(dt) <- 'Follow-up Time' e <- ifelse(dt <= cens, 1, 0) dt <- pmin(dt, cens) units(dt) <- 'Year' S <- Surv(dt,e) km <- survfit(S ~ 1) survplot(km, n.risk=T, conf='none', y.n.risk=unique(summary(km)$surv)) Any suggestion on addressing this problem would be appreciated. Also, is there a way to add a tick mark to the survival curve at times of censoring, similar to the mark.time=T argument in plot.survfit()? Thanks Osman Osman, y.n.risk has to be a scalar and gives the y-coordinate of the bottom line of numbers at risk. I take it that you want the numbers not all at the same height. This will require a customization of survplot or fetching information from the km object and using text( ). Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
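A sketch of the text( ) customization Frank suggests, building on the km object above (the vertical offset of 0.05 is arbitrary):
s <- summary(km)                       # survival and n.risk at event times
survplot(km, conf = 'none')
text(s$time, s$surv - 0.05, s$n.risk)  # place the numbers just below the curve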
Re: [R] SPSS variable lables import
Thomas Lumley wrote: On Tue, 6 Jun 2006, Frank Thomas wrote: Hi, I try to get the variable labels of an SPSS data file into R but don't find this mentioned in the help file for foreign. Is there another way to get them? BTW: An SPSS variable name is like VAR001, whereas the variable label might be 'Identification no.' mydata <- read.spss("mydata.sav") attr(mydata, "variable.labels") -thomas Or: library(Hmisc) mydata <- spss.get('mydata.sav') describe(mydata) contents(mydata) xYplot( ) etc. use the variable labels attached to individual variables. You can retrieve them also this way: with(mydata, plot(x, y, xlab=label(x), ylab=label(y))) Frank Harrell Thomas Lumley Assoc. Professor, Biostatistics [EMAIL PROTECTED] University of Washington, Seattle __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] modeling logit(y/n) using lrm
Hamilton, Cody wrote: After a little digging, it turns out that fit.mult.impute will allow fitter = glm, so previous suggestions regarding modeling cbind(y,n) as an outcome will work fine. Thanks! Also lrm can easily handle your setup using the weights argument. Frank Cody Hamilton, Ph.D Institute for Health Care Research and Improvement Baylor Health Care System (214) 265-3618 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
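A sketch of the weights device, expanding y events out of n trials into one success row and one failure row per group (hypothetical names; dat has columns y, n, and a predictor x):
library(Design)
long <- with(dat, data.frame(x = rep(x, 2),
                             out = rep(1:0, each = nrow(dat)),
                             w = c(y, n - y)))
f <- lrm(out ~ x, weights = w, data = long)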
[R] useR! Thanks
After attending my first useR! conference I want to thank the organizers for doing a wonderful job and the presenters for their high quality presentations and stimulating ideas. The conference venue was excellent and of course Vienna is one of the greatest cities in the world to visit. useR! is one of the most fun conferences I've attended. Thanks again! -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] latex function with lm
Erin Hodgess wrote: Dear R People: I have used the latex function from the Hmisc package and it is just great! However, I have a new question regarding that function: is there an option for summary(lm(y~x)), please? There are options for different types of objects, but I didn't see one for that. Maybe I just missed it. There is no latex method for summary(lm); contributions welcomed. You might also look at latex.ols, latex.anova.Design, latex.summary.Design. Frank Thanks in advance! R for Windows Version 2.3.1 Sincerely, Erin Hodgess Associate Professor Department of Computer and Mathematical Sciences University of Houston - Downtown mailto: [EMAIL PROTECTED] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
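Until such a method exists, latex's default method will render the coefficient matrix of a summary directly; a sketch (hypothetical data):
library(Hmisc)
f <- lm(y ~ x, data = dat)
latex(round(summary(f)$coefficients, 3), file = '')  # latex.default handles matrices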
Re: [R] effect size
Matthew Bridgman wrote: Does anyone know a simple way of calculating effect sizes? Thanks MB Yes - the following formula is simple and fairly universal: 2 :-) -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] As.Factor with Logistic Regression
Justin Rapp wrote: I am modeling the probability of a player succeeding in the NFL with a binomial logistic regression, with 1 signifying success and 0 signifying no success. I performed the regression of the binomial variable against overall draft position, using the college conference in which each player played as a factor via the as.factor(Conference) command. My question is: how do I plot specific factors against the curve for the entire set? There are only a few factors that have significant coefficients and I would like to plot those against the logistic curve for the entire set. Any help would be greatly appreciated. It will bias the analysis to omit insignificant factors. To get the plots you want: library(Hmisc); library(Design) dd <- datadist(mydata); options(datadist='dd') f <- lrm(y ~ x1 + x2 + ...) plot(f, x1=NA, fun=plogis) plot(f, x2=NA, fun=plogis) plot(f, fun=plogis) # plot partial effects of all predictors, probability scale -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] R Reporting - PDF/HTML mature presentation quality package?
zubin wrote: Hello, searching for the key packages so I can output text, tables, and graphics into an HTML or PDF report - have R create these reports in an easy and efficient way without LaTeX - I have searched the R pages but don't see any mature packages - does anyone have any advice on an easy-to-use R package for generating publication-quality reports, outputting HTML or PDF? Doing this without LaTeX is like doing statistical analysis without linear models and the Wilcoxon test. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html